Skip to article frontmatterSkip to article content

1Your ICLR Recommendation list

There are 5000 papers for you in ICLR 2025

score_cdf

2Distribution Backtracking Builds A Faster Convergence Trajectory for Diffusion Distillation

[openreview] [pdf]

Abstract Accelerating the sampling speed of diffusion models remains a significant challenge. Recent score distillation methods distill a heavy teacher model into a student generator to achieve one-step generation, which is optimized by calculating the difference between two score functions on the samples generated by the student model. However, there is a score mismatch issue in the early stage of the score distillation process, since existing methods mainly focus on using the endpoint of pre-trained diffusion models as teacher models, overlooking the importance of the convergence trajectory between the student generator and the teacher model. To address this issue, we extend the score distillation process by introducing the entire convergence trajectory of the teacher model and propose Dis\textbf{Dis}tribution Back\textbf{Back}tracking Distillation (DisBack\textbf{DisBack}). DisBask is composed of two stages: Degradation Recording\textit{Degradation Recording} and Distribution Backtracking\textit{Distribution Backtracking}. Degradation Recording\textit{Degradation Recording} is designed to obtain the convergence trajectory by recording the degradation path from the pre-trained teacher model to the untrained student generator. The degradation path implicitly represents the intermediate distributions between the teacher and the student, and its reverse can be viewed as the convergence trajectory from the student generator to the teacher model. Then Distribution Backtracking\textit{Distribution Backtracking} trains the student generator to backtrack the intermediate distributions along the path to approximate the convergence trajectory of the teacher model. Extensive experiments show that DisBack achieves faster and better convergence than the existing distillation method and achieves comparable or better generation performance, with an FID score of 1.38 on the ImageNet 64×\times64 dataset. DisBack is easy to implement and can be generalized to existing distillation methods to boost performance.

3Diffusion Transformer Policy

[openreview] [pdf]

Abstract Recent large visual-language action models pretrained on diverse robot datasets have demonstrated the potential for generalizing to new environments with a few in-domain data. However, those approaches usually predict discretized or continuous actions by a small action head, which limits the ability in handling diverse action spaces. In contrast, we model the continuous action with a large multi-modal diffusion transformer, dubbed as Diffusion Transformer Policy, in which we directly denoise action chunks by a large transformer model rather than a small action head. By leveraging the scaling capability of transformers, the proposed approach can effectively model continuous end-effector actions across large diverse robot datasets, and achieve better generalization performance. Extensive experiments demonstrate Diffusion Transformer Policy pre-trained on diverse robot data can generalize to different embodiments, including simulation environments like Maniskill2 and Calvin, as well as the real-world Franka arm. Specifically, without bells and whistles, the proposed approach achieves state-of-the-art performance in the Calvin novel task setting, and the pre-training stage significantly facilitates the success sequence length on the Calvin by over 1.2. The code will be publicly available.

4Diffusion Policy Policy Optimization

[openreview] [pdf]

Abstract We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG methods are ubiquitous in training RL policies with other policy parameterizations; nevertheless, they had been conjectured to be less efficient for diffusion-based policies. Surprisingly, we show that DPPO achieves the strongest overall performance and efficiency for fine-tuning in common benchmarks compared to other RL methods for diffusion-based policies and also compared to PG fine-tuning of other policy parameterizations. Through experimental investigation, we find that DPPO takes advantage of unique synergies between RL fine-tuning and the diffusion parameterization, leading to structured and on-manifold exploration, stable training, and strong policy robustness. We further demonstrate the strengths of DPPO in a range of realistic settings, including simulated robotic tasks with pixel observations, and via zero-shot deployment of simulation-trained policies on robot hardware in a long-horizon, multi-stage manipulation task.

5Diffusion Models for 4D Novel View Synthesis

[openreview] [pdf]

Abstract We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works which generally operate in limited domains (e.g., object centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models both in terms of image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. Seehttps://anonymous-4d-diffusion.github.iofor video samples.

6The Deficit of New Information in Diffusion Models: A Focus on Diverse Samples

[openreview] [pdf]

Abstract Diffusion models are renowned for their state-of-the-art performance in generating high-quality images. Identifying samples with new information beyond the training data is essential for data augmentation, especially for enhancing model performance in diverse and unforeseen real-world scenarios. However, the investigation of new information in the generated samples has not been well explored. Our investigation through the lens of information theory reveals that diffusion models do not produce new information beyond what exists in the training data. Next, we introduce the concept of diverse samples (DS) to prove that generated images could contain information not present in the training data for diffusion models. Furthermore, we propose a method for identifying diverse samples among generated images by extracting deep features and detecting images that fall outside the boundary of real images. We demonstrate that diverse samples exist in the generated data of diffusion models, attributed to the estimation of forward and backward processes, but it can only produce a limited number of diverse samples, underscoring a notable gap in their capabilities in generating diverse samples. In addition, our experiment on the Chest X-ray dataset demonstrates that the diverse samples are more useful in improving classification accuracy than vanilla-generated samples. The source code is available at \url{https://github.com/lypz12024/diffusion-diverse-samples}.

7What Makes a Good Diffusion Planner for Decision Making?

[openreview] [pdf]

Abstract Diffusion models have recently shown significant potential in solving decision-making problems, particularly in generating behavior plans -- also known as diffusion planning. While numerous studies have demonstrated the impressive performance of diffusion planning, the mechanisms behind the key components of a good diffusion planner remain unclear and the design choices are highly inconsistent in existing studies. In this work, we address this issue through systematic empirical experiments on diffusion planning in an offline reinforcement learning (RL) setting, providing practical insights into the essential components of diffusion planning. We trained and evaluated over 6,000 diffusion models, identifying the critical components such as guided sampling, network architecture, action generation and planning strategy. We revealed that some design choices opposite to the common practice in previous work in diffusion planning actually lead to better performance, e.g., unconditional sampling with selection can be better than guided sampling and Transformer outperforms U-Net as denoising network. Based on these insights, we suggest a simple yet strong diffusion planning baseline that achieves state-of-the-art results on standard offline RL benchmarks.

8Diffusion-NPO: Negative Preference Optimization for Better Preference Aligned Generation of Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have made substantial advances in image generation, yet models trained on large, unfiltered datasets often yield outputs misaligned with human preferences. Numerous methods have already been proposed to fine-tune pre-trained diffusion models, achieving notable improvements in aligning generated outputs with human preferences. However, we point out that existing preference alignment methods neglect the critical role of handling unconditional/negative-conditional outputs, leading to a diminished capacity to avoid generating undesirable outcomes. This oversight limits the efficacy of classifier-free guidance (CFG), which relies on the contrast between conditional generation and unconditional/negative-conditional generation to optimize output quality. In response, we propose a straightforward but versatily effective approach that involves training a model specifically attuned to negative preferences. This method does not require new training strategies or datasets but rather involves minor modifications to existing techniques. Our approach integrates seamlessly with models such as SD15, SDXL, video diffusion models and models that have undergone preference optimization, consistently enhancing their ability to produce more human preferences aligned outputs.

9Discovery and Expansion of New Domains within Diffusion Models

[openreview] [pdf]

Abstract In this work, we study the generalization properties of diffusion models in a few-shot setup, introduce a novel tuning-free paradigm to synthesize the target out-of-domain (OOD) data, showcase multiple applications of those generalization properties, and demonstrate the advantages compared to existing tuning-based methods in data-sparse scientific scenarios with large domain gaps. Our work resides on the observation and premise that the theoretical formulation of denoising diffusion implicit models (DDIMs), a non-Markovian inference technique, exhibits latent Gaussian priors independent from the parameters of trained denoising diffusion probabilistic models (DDPMs). This brings two practical benefits: the latent Gaussian priors generalize to OOD data domains that have never been used in the training stage; existing DDIMs offer the flexibility to traverse the denoising chain bidirectionally for a pre-trained DDPM. We then demonstrate through theoretical and empirical studies that such established OOD Gaussian priors are practically separable from the originally trained ones after inversion. The above analytical findings allow us to introduce our novel tuning-free paradigm to synthesize new images of the target unseen domain by discovering qualified OOD latent encodings within the inverted noisy latent spaces, which is fundamentally different from most existing paradigms that seek to modify the denoising trajectory to achieve the same goal by tuning the model parameters. Extensive cross-model and domain experiments show that our proposed method can expand the latent space and synthesize images in new domains via frozen DDPMs without impairing the generation quality of their original domains.

10Iterative DPO with An Improvement Model for Fine-tuning Diffusion Models

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has been proven as an effective solution in aligning generative models with human preferences. However, as shown in recent works, DPO could suffer from constraints from the offline preference dataset. This paper introduces a novel improvement approach for online iterative optimization of the diffusion models without introducing extra annotation of the online data. We propose to learn a preference improvement model to extract the implicit preference from the preference dataset. The learned improvement model is then used to generate winning images from the images generated by the current diffusion model. We can construct new pairs of preference data by using images generated by the current diffusion model as losing images, and its corresponding improved images as winning images. The diffusion model can therefore be optimized via iteratively applying online preference datasets. This method enables online improvement beyond offline DPO training without requiring additional human labeling or risking overfitting the reward model. Results demonstrate improvements in preference alignment with higher diversity compared with other fine-tuning methods. Our work bridges the gap between offline preference learning and online improvement, offering a promising direction for enhancing diffusion models in image generation tasks with limited preference data.

11Adding Conditional Control to Diffusion Models with Reinforcement Learning

[openreview] [pdf]

Abstract Diffusion models are powerful generative models that allow for precise control over the characteristics of the generated samples. While these diffusion models trained on large datasets have achieved success, there is often a need to introduce additional controls in downstream fine-tuning processes, treating these powerful models as pre-trained diffusion models. This work presents a novel method based on reinforcement learning (RL) to add such controls using an offline dataset comprising inputs and labels. We formulate this task as an RL problem, with the classifier learned from the offline dataset and the KL divergence against pre-trained models serving as the reward functions. Our method,CTRL(Conditioning pre-Trained diffusion models withReinforcementLearning), produces soft-optimal policies that maximize the abovementioned reward functions. We formally demonstrate that our method enables sampling from the conditional distribution with additional controls during inference. Our RL-based approach offers several advantages over existing methods. Compared to classifier-free guidance, it improves sample efficiency and can greatly simplify dataset construction by leveraging conditional independence between the inputs and additional controls. Additionally, unlike classifier guidance, it eliminates the need to train classifiers from intermediate states to additional controls.

12Diffusion Models Meet Contextual Bandits

[openreview] [pdf]

Abstract Efficient exploration in contextual bandits is crucial due to their large action space, where uninformed exploration can lead to computational and statistical inefficiencies. However, the rewards of actions are often correlated, which can be leveraged for more efficient exploration. In this work, we use pre-trained diffusion model priors to capture these correlations and develop diffusion Thompson sampling (dTS). We establish both theoretical and algorithmic foundations for dTS. Specifically, we derive efficient posterior approximations (required by dTS) under a diffusion model prior, which are of independent interest beyond bandits and reinforcement learning. We analyze dTS in linear instances and provide a Bayes regret bound. Our experiments validate our theory and demonstrate dTS’s favorable performance.

13Influence-Guided Diffusion for Dataset Distillation

[openreview] [pdf]

Abstract Dataset distillation aims to streamline the training process by creating a compact yet effective dataset for a much larger original dataset. However, existing methods often struggle with distilling large, high-resolution datasets due to prohibitive resource costs and limited performance, primarily stemming from sample-wise optimizations in the pixel space. Motivated by the remarkable capabilities of diffusion generative models in learning target dataset distributions and controllably sampling high-quality data tailored to user needs, we propose framing dataset distillation as a controlled diffusion generation task aimed at generating data specifically tailored for effective training purposes. By establishing a correlation between the overarching objective of dataset distillation and the trajectory influence function, we introduce the Influence-Guided Diffusion (IGD) sampling framework to generate training-effective data without the need to retrain diffusion models. An efficient guided function is designed by leveraging the trajectory influence function as an indicator to steer diffusions to produce data with influence promotion and diversity enhancement. Extensive experiments show that the training performance of distilled datasets generated by diffusions can be significantly improved by integrating with our IGD method and achieving state-of-the-art performance in distilling ImageNet datasets. Particularly, an exceptional result is achieved on the ImageNet-1K, reaching 60.3% at IPC=50.

14Inverse Engineering Diffusion: Deriving Variance Schedules with Rationale

[openreview] [pdf]

Abstract A fundamental aspect of diffusion models is the variance schedule, which governs the evolution of variance throughout the diffusion process. Despite numerous studies exploring variance schedules, little effort has been made to understand the variance distributions implied by sampling from these schedules and how it benefits both training and data generation. We introduce a novel perspective on score-based diffusion models, bridging the gap between the variance schedule and its underlying variance distribution. Specifically, we propose the notion of sampling variance according to a probabilistic rationale, which induces a density. Our approach views the inverse of the variance schedule as a cumulative distribution function (CDF) and its first derivative as a probability density function (PDF) of the variance distribution. This formulation not only offers a unified view of variance schedules but also allows for the direct engineering of a variance schedule from the probabilistic rationale of its inverse function. Additionally, our framework is not limited to CDFs with closed-form inverse solutions, enabling the exploration of variance schedules that are unattainable through conventional methods. We present the tools required to obtain a diverse array of novel variance schedules tailored to specific rationales, such as separability metrics or prior beliefs. These schedules may exhibit varied dynamics, ranging from rapid convergence towards zero to prolonged periods in high-variance regions. Through comprehensive empirical evaluation, we demonstrate the efficacy of enhancing the performance of diffusion models with schedules distinct from those encountered during training. We provide a principled and unified approach to variance schedules in diffusion models, revealing the relationship between variance schedules and their underlying probabilistic rationales, which yields notable improvements in image generation performance, as measured by FID.

15Progressive distillation induces an implicit curriculum

[openreview] [pdf]

Abstract Knowledge distillation leverages a teacher model to improve the training of a student model. A persistent challenge is that a better teacher does not always yield a better student, to which a common mitigation is to use additional supervision from several “intermediate” teachers. One empirically validated variant of this principle is progressive distillation, where the student learns from successive intermediate checkpoints of the teacher. Using sparse parity as a sandbox, we identify an implicit curriculum as one mechanism through which progressive distillation accelerates the student’s learning. This curriculum is available only through the intermediate checkpoints but not the final converged one, and imparts both empirical acceleration and a provable sample complexity benefit to the student. We then extend our investigation to Transformers trained on probabilistic context-free grammars (PCFGs) and real-world pre-training datasets (Wikipedia and Books). Through probing the teacher model, we identify an analogous implicit curriculum where the model progressively learns features that capture longer context. Our theoretical and empirical findings on sparse parity, complemented by empirical observations on more complex tasks, highlight the benefit of progressive distillation via implicit curriculum across setups.

16One Step Diffusion via Shortcut Models

[openreview] [pdf]

Abstract Diffusion models and flow matching models have enabled generating diverse and realistic images by learning to transfer noise to data. However, sampling from these models involves iterative denoising over many neural network passes, making generation slow and expensive. Previous approaches for speeding up sampling require complex training regimes, such as multiple training phases, multiple networks, or fragile scheduling. We introduce Shortcut Models, a family of generative models that use a single network and training phase to produce high-quality samples in a single or multiple sampling steps. Shortcut models condition the network not only on the current noise level but also on the desired step size, allowing the model to skip ahead in the generation process. Across a wide range of sampling step budgets, shortcut models consistently produce higher quality samples than previous approaches, such as consistency models and reflow. Compared to distillation, shortcut models reduce complexity to a single network and training phase and additionally allow varying step budgets at inference time.

17Unveiling Concept Attribution in Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains black-box; little do we know about the role of its components in exhibiting a concept such as object or style. Recent works employ causal tracing to localize layers storing knowledge in generative models. In this work, we approach from a more general perspective and pose a question: \textit{``How do model components work jointly to demonstrate knowledge?‘’}. We adapt component attribution to decompose diffusion models, unveiling how a component contributes to a concept. Our framework allows effective model editing, in particular, we can erase a concept from diffusion models by removing positive components while remaining knowledge of other concepts. Surprisingly, we also show that there exist components that contribute negatively to a concept that has not been discovered in the knowledge localization approach. Experimental results confirm the role of positive and negative components pinpointed by our framework, depicting a complete view of interpreting generative models.

[openreview] [pdf]

Abstract Time-series forecasting finds broad applications in real-world scenarios. Due to the dynamic nature of time series data, it is crucial for time-series forecasting models to produce robust predictions under potential distribution shifts. In this paper, we initially identify two types of distribution shifts in time series: concept drift and temporal shift. We acknowledge that while existing studies primarily focus on addressing temporal shift issues in time series, designing proper concept drift methods for time series data received comparatively less attention.Motivated by the need to mitigate potential concept drift issues in time-series forecasting, this work proposes a novel soft attention mechanism that effectively leverages and ensemble information from the horizon time series. Furthermore, recognizing that both concept drift and temporal shift could occur concurrently in time-series forecasting scenarios while an integrated solution remains missing, this paper introduces ShifTS, a model-agnostic framework seamlessly addressing both concept drift and temporal shift issues in time-series forecasting. Extensive experiments demonstrate the efficacy of ShifTS in consistently enhancing the forecasting accuracy of agnostic models across multiple datasets, and consistently outperforming existing concept drift, temporal shift, and combined baselines.

19Can Diffusion Models Disentangle? A Theoretical Perspective

[openreview] [pdf]

Abstract This paper introduces a novel theoretical framework to understand how diffusion models can learn disentangled representations under the assumption of an \normltwo score approximation. We also provide sufficient conditions under which such representations are beneficial for domain adaptation. Our theory offers new insights into how existing diffusion models disentangle latent variables across general distributions and suggests strategies to enhance their disentanglement capabilities. To validate our theory, we perform experiments using both synthetic data generated from latent subspace models and real speech data for non-parallel voice conversion - a canonical disentanglement problem. Across various classification tasks, we found voice conversion-based adaptation methods achieve significant improvements in classification accuracy, demonstrating their effectiveness as domain adaptors. Code will be released upon acceptance.

20Fast Multi-Mode Adaptive Generative Distillation for Continually Learning Diffusion Models

[openreview] [pdf]

Abstract Diffusion models are powerful generative models, but their computational demands, vulnerability to catastrophic forgetting, and class imbalance in generated data pose significant challenges in continual learning scenarios. In this paper, we introduce Fast Multi-Mode Adaptive Generative Distillation (MAGD), a novel approach designed to address these three core challenges. MAGD combines generative replay and knowledge distillation, enhancing the continual training of diffusion models through three key innovations: (1) Noisy Intermediate Generative Distillation (NIGD), which leverages intermediate noisy images during the reverse diffusion process to improve data utility and preserve image quality without additional computational costs; (2) Class-guided generative distillation (CGGD), which uses classifier guidance to ensure balanced class representation in generated images, addressing the issue of class imbalance in traditional methods; and (3) Signal-Guided Generative Distillation (SGGD), which reduces computational overhead while maintaining image clarity through the reuse of the model’s denoising capabilities across tasks. Our experimental results on Fashion-MNIST, CIFAR-10, and CIFAR-100 demonstrate that MAGD significantly outperforms existing methods in both image quality, measured by Fréchet Inception Distance (FID), and class balance, measured by Kullback-Leibler Divergence (KLD). Moreover, MAGD achieves competitive results with far fewer generation steps compared to traditional methods, making it a practical solution for real-life continual learning applications.

21A Tailored Framework for Aligning Diffusion Models with Human Preference

[openreview] [pdf]

Abstract The direct preference optimization (DPO) method has shown success in aligning text-to-image diffusion models with human preference. Previous approaches typically assume a consistent preference label between final generated images and their corresponding noisy samples at intermediate steps, and directly apply DPO to these noisy samples for fine-tuning. However, we identify a significant issue with this consistency assumption, as directly applying DPO to noisy samples from different generation trajectories based on final preference order may disrupt the optimization process. We first demonstrate the issues inherent in previous methods from two perspectives:gradient directionandpreference order, and then propose aTailoredPreferenceOptimization (TailorPO) framework for aligning diffusion models with human preference, underpinned by some theoretical insights. Our approach directly ranks the preference order of intermediate noisy samples based on their step-wise reward, and effectively resolves the optimization direction issues through a simple yet efficient design. Additionally, to the best of our knowledge, we are the first to consider the distinct structure of diffusion models and leverage the gradient guidance in preference aligning to enhance the optimization effectiveness. Experimental results demonstrate that our method significantly improves the model’s ability to generate aesthetically pleasing and human-preferred images.

22Optimal Targets for Concept Erasure in Diffusion Models and Where To Find Them

[openreview] [pdf]

Abstract Concept erasure has emerged as a promising technique for mitigating the risk of harmful content generation in diffusion models by selectively unlearning undesirable concepts. The common principle of previous works to remove a specific concept is to map it to a fixed generic concept, such as a neutral concept or just an empty text prompt. In this paper, we demonstrate that this fixed-target strategy is suboptimal, as it fails to account for the impact of erasing one concept on the others. To address this limitation, we model the concept space as a graph and empirically analyze the effects of erasing one concept on the remaining concepts. Our analysis uncovers intriguing geometric properties of the concept space, where the influence of erasing a concept is confined to a local region. Building on this insight, we propose the Adaptive Guided Erasure (AGE) method, which \emph{dynamically} selects neutral concepts tailored to each undesirable concept, minimizing unintended side effects. Experimental results show that AGE significantly outperforms state-of-the-art erasure methods on preserving unrelated concepts while maintaining effective erasure performance.

23Protecting Minorities in Diffusion Models via Capacity Allocation

[openreview] [pdf]

Abstract Diffusion models have advanced quickly in image generation. However, their performance declines significantly on the imbalanced data commonly encountered in real-world scenarios. Current research on imbalanced diffusion models focuses on improving the objective function to facilitate knowledge transfer between majorities and minorities, thereby enhancing the generation of minority samples. In this paper, we make the first attempt to address the imbalanced data challenges in diffusion models from the perspective of model capacity. Specifically, majorities occupy most of the model capacity because of their larger representation, consequently restricting the capacity available for minority classes. To tackle this challenge, we propose Protecting Minorities via Capacity ALLocation (CALL). We reserve capacity for minority expertise by low-rank decomposing the model parameters and allocate the corresponding knowledge to the reserved model capacity through a capacity allocation loss function. Extensive experiments demonstrate that our method, which is orthogonal to existing methods, consistently and significantly improves the robustness of diffusion models on imbalanced data.

24DC-DPM: A Divide-and-Conquer Approach for Diffusion Reverse Process

[openreview] [pdf]

Abstract Diffusion models have achieved great success in generative tasks. However, previous approaches typically approximate the reversed transition kernel with a Gaussian distribution. This approximation can diverge from real scenarios, necessitating multiple iterative steps for high-quality sample generation and limiting the real-time inference performance of diffusion models. In this paper, we propose a \textbf{D}ivide-and-\textbf{C}onquer strategy to improve the traditional single Gaussian transition kernel representation in each denoising step of \textbf{D}iffusion \textbf{P}robabilistic \textbf{M}odels (DC-DPM), thus enhancing generation quality particularly over a limited number of timesteps. By dividing the data into clusters, our DC-DPM learns specific kernels for each partition. We design two merging strategies for these cluster-specific kernels along with corresponding training and sampling methods. We provide theoretical proof of DC-DPM’s convergence to the true data distribution from a novel perspective. Experimental results demonstrate the superior generation quality of our method compared to the traditional single Gaussian kernel. Furthermore, our DC-DPM can synergize with previous kernel optimization methods, enhancing their generation quality, especially with a small number of timesteps.

25Adaptive Concept Bottleneck for Foundation Models Under Distribution Shifts

[openreview] [pdf]

Abstract Advancements in foundation models (FMs) have led to a paradigm shift in machine learning. The rich, expressive feature representations from these pre-trained, large- scale FMs are leveraged for multiple downstream tasks, usually via lightweight fine-tuning of a shallow fully-connected network following the representation. However, the non-interpretable, black-box nature of this prediction pipeline can be a challenge, especially in critical domains, such as healthcare, finance, and security. In this paper, we explore the potential of Concept Bottleneck Models (CBMs) for transforming complex, non-interpretable foundation models into interpretable decision-making pipelines using high-level concept vectors. Specifically, we focus on the test-time deployment of such an interpretable CBM pipeline “in the wild”, where the distribution of inputs often shifts from the original training distribution. We first identify the potential failure modes of such pipelines under different types of distribution shifts. Then we propose an adaptive concept bottleneck framework to address these failure modes, that dynamically adapts the concept-vector bank and the prediction layer based solely on unlabeled data from the target domain, without access to the source dataset. Empirical evaluations with various real-world distribution shifts show our framework produces concept-based interpretations better aligned with the test data and boosts post-deployment accuracy by up to 28%, aligning CBM performance with that of non-interpretable classification.

26Latent Weight Diffusion: Generating policies from trajectories

[openreview] [pdf]

Abstract With the increasing availability of open-source robotic data, imitation learning has emerged as a viable approach for both robot manipulation and locomotion. Currently, large generalized policies are trained to predict controls or trajectories using diffusion models, which have the desirable property of learning multimodal action distributions. However, generalizability comes with a cost — namely, larger model size and slower inference. Further, there is a known trade-off between performance and action horizon for Diffusion Policy (i.e., diffusing trajectories): fewer diffusion queries accumulate greater trajectory tracking errors. Thus, it is common practice to run these models at high inference frequency, subject to robot computational constraints.To address these limitations, we propose Latent Weight Diffusion (LWD), a method that uses diffusion to learn a distribution over policies for robotic tasks, rather than over trajectories. Our approach encodes demonstration trajectories into a latent space and then decodes them into policies using a hypernetwork. We employ a diffusion denoising model within this latent space to learn its distribution. We demonstrate that LWD can reconstruct the behaviors of the original policies that generated the trajectory dataset. LWD offers the benefits of considerably smaller policy networks during inference and requires fewer diffusion model queries. When tested on the Metaworld MT10 benchmark, LWD achieves a higher success rate compared to a vanilla multi-task policy, while using models up to ∼18x smaller during inference. Additionally, since LWD generates closed-loop policies, we show that it outperforms Diffusion Policy in long action horizon settings, with reduced diffusion queries during rollout.

27Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning

[openreview] [pdf]

Abstract Controllable generation through Stable Diffusion (SD) fine-tuning aims to improve fidelity, safety, and alignment with human guidance. Existing reinforcement learning from human feedback methods usually rely on predefined heuristic reward functions or pretrained reward models built on large-scale datasets, limiting their applicability to scenarios where collecting such data is costly or difficult. To effectively and efficiently utilize human feedback, we develop a framework, HERO, which leverages online human feedback collected on the fly during model learning. Specifically, HERO features two key mechanisms: (1) Feedback-Aligned Representation Learning, an online training method that captures human feedback and provides informative learning signals for fine-tuning, and (2) Feedback-Guided Image Generation, which involves generating images from SD’s refined initialization samples, enabling faster convergence towards the evaluator’s intent. We demonstrate that HERO is 4x more efficient in online feedback for body part anomaly correction compared to the best existing method. Additionally, experiments show that HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback.

28How do diffusion models learn and generalize on abstract rules for reasoning?

[openreview] [pdf]

Abstract Diffusion models excel in generating and completing patterns in images. But how good is their ability to learn hidden rules from samples and to generate and reason according to such rules or even generalize to similar rules? We trained a wide family of unconditional diffusion models on Raven’s progression matrix task to precisely study this. We quantified their capability to generate structurally consistent samples and complete missing parts according to hidden rules. We found diffusion models can synthesize novel samples consistent with rules without memorizing the training set, much better than GPT2 trained on the same data. They memorized and recombined local parts of the training samples to create new rule-conforming samples. When tasked to complete the missing panel with inpainting techniques, advanced sampling techniques were needed to perform well. Further, their pattern completion capability can generalize to rules unseen during training. Further, through generative training on rule data, a robust rule representation rapidly emerged in the diffusion model, which could linearly classify rules at 99.8% test accuracy. Our results suggest diffusion training is a useful paradigm for reasoning and learning representations for downstream tasks even for abstract rules data.

29O(d/T) Convergence Theory for Diffusion Probabilistic Models under Minimal Assumptions

[openreview] [pdf]

Abstract Score-based diffusion models, which generate new data by learning to reverse a diffusion process that perturbs data from the target distribution into noise, have achieved remarkable success across various generative tasks. Despite their superior empirical performance, existing theoretical guarantees are often constrained by stringent assumptions or suboptimal convergence rates. In this paper, we establish a fast convergence theory for a popular SDE-based sampler under minimal assumptions. Our analysis shows that, provided 2\ell_{2}-accurate estimates of the score functions, the total variation distance between the target and generated distributions is upper bounded by O(d/T)O(d/T) (ignoring logarithmic factors), where dd is the data dimensionality and TT is the number of steps. This result holds for any target distribution with finite first-order moment. To our knowledge, this improves upon existing convergence theory for both the SDE-based sampler and another ODE-based sampler, while imposing minimal assumptions on the target data distribution and score estimates. This is achieved through a novel set of analytical tools that provides a fine-grained characterization of how the error propagates at each step of the reverse process.

30Diffusion Models are Evolutionary Algorithms

[openreview] [pdf]

Abstract In a convergence of machine learning and biology, we reveal that diffusion models are evolutionary algorithms. By considering evolution as a denoising process and reversed evolution as diffusion, we mathematically demonstrate that diffusion models inherently perform evolutionary algorithms, naturally encompassing selection, mutation, and reproductive isolation. Building on this equivalence, we propose the Diffusion Evolution method: an evolutionary algorithm utilizing iterative denoising -- as originally introduced in the context of diffusion models -- to heuristically refine solutions in parameter spaces. Unlike traditional approaches, Diffusion Evolution efficiently identifies multiple optimal solutions and outperforms prominent mainstream evolutionary algorithms. Furthermore, leveraging advanced concepts from diffusion models, namely latent space diffusion and accelerated sampling, we introduce Latent Space Diffusion Evolution, which finds solutions for evolutionary tasks in high-dimensional complex parameter space while significantly reducing computational steps. This parallel between diffusion and evolution not only bridges two different fields but also opens new avenues for mutual enhancement, raising questions about open-ended evolution and potentially utilizing non-Gaussian or discrete diffusion models in the context of Diffusion Evolution.

31How and how well do diffusion models improve adversarial robustness?

[openreview] [pdf]

Abstract Recent findings suggest that diffusion models significantly enhance empirical adversarial robustness. While some intuitive explanations have been proposed, the precise mechanisms underlying these improvements remain unclear. In this work, we systematically investigate how and how well do diffusion models improve adversarial robustness. First, we observe that diffusion models intriguingly increase—rather than decrease—the p\ell_p distances to clean samples. This is the opposite of what was believed previously. Second, we find that the purified images are heavily influenced by the internal randomness of diffusion models. To properly evaluate the robustness of systems with inherent randomness, we introduce the concept of fuzzy adversarial robustness, and find that empirically a substantial fraction of adversarial examples are fuzzy in nature. Finally, by leveraging a hyperspherical cap model of adversarial regions, we show that diffusion models increase robustness by dramatically compressing the image space. Our findings provide novel insights into the mechanisms behind the robustness improvements of diffusion-model-based purification and offer guidance for the development of more efficient adversarial purification systems.

32Distributionally Robust Policy Learning under Concept Drifts

[openreview] [pdf]

Abstract Distributionally robust policy learning aims to find a policy that performs well under the worst-case distributional shift, and yet most existing methods for robust policy learning consider the worst-case joint distribution of the covariate and the outcome. The joint-modeling strategy can be unnecessarily conservative when we have more information on the source of distributional shifts. This paper studies a more nuanced problem --- robust policy learning under the concept drift, when only the conditional relationship between the outcome and the covariate changes. To this end, we first provide a doubly-robust estimator for evaluating the worst-case average reward of a given policy under a set of perturbed conditional distributions. We show that the policy value estimator enjoys asymptotic normality even if the nuisance parameters are estimated with a slower-than-root-nn rate. We then propose a learning algorithm that outputs the policy maximizing the estimated policy value within a given policy class Π\Pi, and show that the sub-optimality gap of the proposed algorithm is of the order κ(Π)n1/2\kappa(\Pi)n^{-1/2}, with κ(Π)\kappa(\Pi) is the entropy integral of Π\Pi under the Hamming distance and nn is the sample size. The proposed methods are implemented and evaluated in numerical studies, demonstrating substantial improvement compared with existing benchmarks.

33Representative Guidance: Diffusion Model Sampling with Consistency

[openreview] [pdf]

Abstract The diffusion sampling process faces a persistent challenge stemming from its incoherence, attributable to varying noise directions across different time steps. Our Representative Guidance (RepG) offers a new perspective to handle this issue by reformulating the sampling process with a coherent direction towards a representative target. In this formulation, while the classic classifier guidance improves feature discernment by steering the model away from ambiguous features, it fails to provide a favorable representative target, since the class label is overly compact and leads to sacrificed diversity and the adversarial generation problem. In contrast, we leverage self-supervised representations as the coherent target and treat sampling as a downstream task, which refines image details and corrects errors rather than settling for simpler samples. Our representative guidance achieves superior performance and also illustrates the potential of pre-trained self-supervised models in image sampling. Our findings demonstrate that RepG not only substantially enhances vanilla diffusion sampling but also surpasses state-of-the-art benchmarks when combined with the classifier-free guidance. Our code will be released.

34Diffusion Transportation Cost for Domain Adaptation

[openreview] [pdf]

Abstract In recent years, there has been considerable interest in leveraging the Optimal Transport (OT) problem for domain adaptation, a strategy shown to be highly effective. However, a less explored aspect is the choice of the transportation cost function, as most existing methods rely on the pairwise squared Euclidean distances for the transportation cost, potentially overlooking important intra-domain geometries. This paper presents Diffusion-OT, a new transport cost for the OT problem, designed specifically for domain adaptation. By utilizing concepts and tools from the field of manifold learning, specifically diffusion geometry, we derive an operator that accounts for the intra-domain relationships, thereby extending beyond the conventional inter-domain distances. This operator, which quantifies the probability of transporting between source and target samples, forms the basis for our transportation cost. We provide proof that the proposed operator is in fact a diffusion operator, demonstrating that the cost function is defined by an anisotropic diffusion process between the domains. In addition, to enhance performance, we integrate source labels into the operator, thereby guiding the anisotropic diffusion according to the classes. We showcase the effectiveness of Diffusion-OT through comprehensive experiments, demonstrating its superior performance compared to recent methods across various benchmarks and datasets.

35Improved Convergence Rate for Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Score-based diffusion models have achieved remarkable empirical performance in the field of machine learning and artificial intelligence for their ability to generate high-quality new data instances from complex distributions. Improving our understanding of diffusion models, including mainly convergence analysis for such models, has attracted a lot of interests. Despite a lot of theoretical attempts, there still exists significant gap between theory and practice. Towards to close this gap, we establish an iteration complexity at the order of d1/3ε2/3d^{1/3}\varepsilon^{-2/3}, which is better than d5/12ε1d^{5/12}\varepsilon^{-1}, the best known complexity achieved before our work. This convergence analysis is based on a randomized midpoint method, which is first proposed for log-concave sampling \citep{Shen2019TheRandomized}, and then extended to diffusion models by \citet{Gupta2024Faster}. Our theory accommodates ε\varepsilon-accurate score estimates, and does not require log-concavity on the target distribution. Moreover, the algorithm can also be parallelized to run in only O(log2(d/ε))O(\log^2(d/\varepsilon)) parallel rounds in a similar way to prior works.

36Exploration by Running Away from the Past

[openreview] [pdf]

Abstract The ability to explore efficiently and effectively is a central challenge of reinforcement learning. In this work, we consider exploration through the lens of information theory. Specifically, we cast exploration as a problem of maximizing the Shannon entropy of the state occupation measure. This is done by maximizing a sequence of divergences between distributions representing an agent’s past behavior and its current behavior. Intuitively, this encourages the agent to explore new behaviors that are distinct from past behaviors. Hence, we call our method RAMP, for ``R\textbf{R}unning A\textbf{A}way from\textbf{m} the P\textbf{P}ast.‘’ A fundamental question of this method is the quantification of the distribution change over time. We consider both the Kullback-Leibler divergence and the Wasserstein distance to quantify divergence between successive state occupation measures, and explain why the former might lead to undesirable exploratory behaviors in some tasks. We demonstrate that by encouraging the agent to explore by actively distancing itself from past experiences, it can effectively explore mazes and a wide range of behaviors on robotic manipulation and locomotion tasks.

37Broadening Target Distributions for Accelerated Diffusion Models via a Novel Analysis Approach

[openreview] [pdf]

Abstract Accelerated diffusion models hold the potential to significantly enhance the efficiency of standard diffusion processes. Theoretically, these models have been shown to achieve faster convergence rates than the standard O(1/ϵ2)\mathcal O(1/\epsilon^2) rate of vanilla diffusion models, where ϵ\epsilon denotes the target accuracy. However, current theoretical studies have established the acceleration advantage only for restrictive target distribution classes, such as those with smoothness conditions imposed along the entire sampling path or with bounded support. In this work, we significantly broaden the target distribution classes with a new accelerated stochastic DDPM sampler. In particular, we show that it achieves accelerated performance for three broad distribution classes not considered before. Our first class relies on the smoothness condition posed only to the target density q0q_0, which is far more relaxed than the existing smoothness conditions posed to all qtq_t along the entire sampling path. Our second class requires only a finite second moment condition, allowing for a much wider class of target distributions than the existing finite-support condition. Our third class is Gaussian mixture, for which our result establishes the first acceleration guarantee. Moreover, among accelerated DDPM type samplers, our results specialized for bounded-support distributions show an improved dependency on the data dimension dd. Our analysis introduces a novel technique for establishing performance guarantees via constructing a tilting factor representation of the convergence error and utilizing Tweedie’s formula to handle Taylor expansion terms. This new analytical framework may be of independent interest.

38Unstable Unlearning: The Hidden Risk of Concept Resurgence in Diffusion Models

[openreview] [pdf]

Abstract Text-to-image diffusion models rely on massive, web-scale datasets. Training them from scratch is computationally expensive, and as a result, developers often prefer to make incremental updates to existing models. These updates often compose fine-tuning steps (to learn new concepts or improve model performance) with “unlearning” steps (to “forget” existing concepts, such as copyrighted data or the ability to generate explicit content). In this work, we demonstrate a critical and previously unknown vulnerability that arises in this paradigm: even under benign, non-adversarial conditions, fine-tuning a text-to-image diffusion model on seemingly unrelated images can cause it to “relearn” concepts that were previously “unlearned.” We comprehensively investigate the causes and scope of this phenomenon, which we term concept resurgence, by performing a series of experiments based on fine-tuning Stable Diffusion v1.4 alongside “mass concept erasure”, the current state of the art for unlearning in text-to-image diffusion models (Lu et al., 2024). Our findings underscore the fragility of composing incremental model updates, and raise new serious concerns about current approaches to ensuring the safety and alignment of text-to-image diffusion models.

39How to Find the Exact Pareto Front for Multi-Objective MDPs?

[openreview] [pdf]

Abstract Multi-objective Markov Decision Processes (MDPs) are receiving increasing attention, as real-world decision-making problems often involve conflicting objectives that cannot be addressed by a single-objective MDP. The Pareto front identifies the set of policies that cannot be dominated, providing a foundation for finding Pareto optimal solutions that can efficiently adapt to various preferences. However, finding the Pareto front is a highly challenging problem. Most existing methods either (i) rely on traversing the continuous preference space, which is impractical and results in approximations that are difficult to evaluate against the true Pareto front, or (ii) focus solely on deterministic Pareto optimal policies, from which there are no known techniques to characterize the full Pareto front. Moreover, finding the structure of the Pareto front itself remains unclear even in the context of dynamic programming, where the MDP is fully known in advance. In this work, we address the challenge of efficiently discovering the Pareto front. By investigating the geometric structure of the Pareto front in MO-MDP, we uncover a key property: the Pareto front is on the boundary of a convex polytope whose vertices all correspond to deterministic policies, and neighboring vertices of the Pareto front differ by only one state-action pair of the deterministic policy, almost surely. This insight transforms the global comparison across all policies into a localized search among deterministic policies that differ by only one state-action pair, drastically reducing the complexity of searching for the exact Pareto front. We develop an efficient algorithm that identifies the vertices of the Pareto front by solving a single-objective MDP only once and then traversing the edges of the Pareto front, making it more efficient than existing methods. Furthermore, the entire Pareto front can be found in VV iterations, where VV represents the number of vertices on the Pareto front. Our empirical studies demonstrate the effectiveness of our theoretical strategy in discovering the Pareto front efficiently.

40APCtrl: Adding Conditional Control to Diffusion Models by Alternative Projection

[openreview] [pdf]

Abstract Enhancing the versatility of pretrained diffusion models through advanced conditioning techniques is crucial for improving their applicability. We present APCtrl, a novel conditional image generation approach that formulates the latent ( \dmrv{z}\dms{t} ) at timestep ( t ) as the projection ( \dmrv{z}\dms{t} = \text{Proj}{\bmfrakD\dms{t}} (\dmrv{z}{ \dms{t} + \dms{1} }) ) onto the denosing set ( \bmfrakD\dms{t} ). For conditional control, APCtrl integrates the condition set ( \bmfrakC_\dms{t} ), defined by a latent control network (\bmcalA_{\dmv{theta}}(\cdot, \cdot)). Our method simplifies conditional sampling to recursive projections ( \dmrv{z}\dms{t} = \text{Proj}{\bmfrakI_\dms{t}} \circ \text{Proj}{\bmfrakD\dms{t}} (\dmrv{z}_{ \dms{t} + \dms{1} }) ), where each projection step integrates both the diffusion and condition priors. By employing Alternative Projection, our approach offers several key advantages: 1. Multi-Condition Generation: easily expandable with additional conditional sets; 2. Model and Sampling Agnosticism: works with any model or sampling method; 3. Unified Control Loss: simplifies the management of diverse control applications; 4. Efficiency: delivers comparable control with reduced training and sampling times. Extensive experiments demonstrate the superior performance of our method.

41Choose Your Anchor Wisely: Effective Unlearning Diffusion Models via Concept Reconditioning

[openreview] [pdf]

Abstract Large-scale conditional diffusion models (DMs) have demonstrated exceptional ability in generating high-quality images from textual descriptions, gaining widespread use across various domains. However, these models also carry the risk of producing harmful, sensitive, or copyrighted content, creating a pressing need to remove such information from their generation capabilities. While retraining from scratch is prohibitively expensive, machine unlearning provides a more efficient solution by selectively removing undesirable knowledge while preserving utility. In this paper, we introduce \textbf{COncept REconditioning (CORE)}, a simple yet effective approach for unlearning diffusion models. Similar to some existing approaches, CORE guides the noise predictor conditioned on forget concepts towards an anchor generated from alternative concepts. However, CORE introduces key differences in the choice of anchor and retain loss, which contribute to its enhanced performance. We evaluate the unlearning effectiveness and retainability of CORE on UnlearnCanvas. Extensive experiments demonstrate that CORE surpasses state-of-the-art methods including its close variants and achieves near-perfect performance, especially when we aim to forget multiple concepts. More ablation studies show that CORE’s careful selection of the anchor and retain loss is critical to its superior performance.

42Enhancing Dataset Distillation with Concurrent Learning: Addressing Negative Correlations and Catastrophic Forgetting in Trajectory Matching

[openreview] [pdf]

Abstract Dataset distillation generates a small synthetic dataset on which a model is trained to achieve performance comparable to that obtained on a complete dataset. Current state-of-the-art methods primarily focus on Trajectory Matching (TM), which optimizes the synthetic dataset by matching its training trajectory with that from the real dataset. Due to convergence issues and numerical stability, it is impractical to match the entire trajectory in one go; typically, a segment is sampled for matching at each iteration. However, previous TM-based methods overlook the potential interactions between matching different segments, particularly the presence of negative correlations. To study this problem, we conduct a quantitative analysis of the correlation between matching different segments and discover varying degrees of negative correlation depending on the image per class (IPC). Such negative correlation could lead to an increase in accumulated trajectory error and transform trajectory matching into a continual learning paradigm, potentially causing catastrophic forgetting. To tackle this issue, we propose a concurrent learning-based trajectory matching that simultaneously matches multiple segments. Extensive experiments demonstrate that our method consistently surpasses previous TM-based methods on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet-1K.

43Discrete Distribution Networks

[openreview] [pdf]

Abstract We introduce a novel generative model, the Discrete Distribution Networks (DDN), that approximates data distribution using hierarchical discrete distributions. We posit that since the features within a network inherently capture distributional information, enabling the network to generate multiple samples simultaneously, rather than a single output, may offer an effective way to represent distributions. Therefore, DDN fits the target distribution, including continuous ones, by generating multiple discrete sample points. To capture finer details of the target data, DDN selects the output that is closest to the Ground Truth (GT) from the coarse results generated in the first layer. This selected output is then fed back into the network as a condition for the second layer, thereby generating new outputs more similar to the GT. As the number of DDN layers increases, the representational space of the outputs expands exponentially, and the generated samples become increasingly similar to the GT. This hierarchical output pattern of discrete distributions endows DDN with unique property: more general zero-shot conditional generation. We demonstrate the efficacy of DDN and its intriguing properties through experiments on CIFAR-10 and FFHQ.

44Anti-Exposure Bias in Diffusion Models via Prompt Learning

[openreview] [pdf]

Abstract Diffusion models (DMs) have achieved record-breaking performance in image generation tasks. Nevertheless, in practice, the training-sampling discrepancy, caused by score estimation error and discretization error, limits the modeling ability of DMs, a phenomenon known as exposure bias. To alleviate such exposure bias and further improve the generative performance, we put forward a prompt learning framework built upon a lightweight prompt prediction model. Concretely, our model learns an anti-bias prompt for the generated sample at each sampling step, aiming to compensate for the exposure bias that arises. Following this design philosophy, our framework rectifies the sampling trajectory to match the training trajectory, thereby reducing the divergence between the target data distribution and the modeling distribution. To train the prompt prediction model, we simulate exposure bias by constructing training data and introduce a time-dependent weighting function for optimization. Empirical results on various DMs demonstrate the superiority of our prompt learning framework across three benchmark datasets. Importantly, the optimized prompt prediction model effectively improves image quality with only a 5% increase in sampling overhead, which remains negligible.

45Direct Distributional Optimization for Provable Alignment of Diffusion Models

[openreview] [pdf]

Abstract We introduce a novel alignment method for diffusion models from distribution optimization perspectives while providing rigorous convergence guarantees. We first formulate the problem as a generic regularized loss minimization over probability distributions and directly optimize the distribution using the Dual Averaging method. Next, we enable sampling from the learned distribution by approximating its score function via Doob’s hh-transform technique. The proposed framework is supported by rigorous convergence guarantees and an end-to-end bound on the sampling error, which imply that when the original distribution’s score is known accurately, the complexity of sampling from shifted distributions is independent of isoperimetric conditions. This framework is broadly applicable to general distribution optimization problems, including alignment tasks in Reinforcement Learning with Human Feedback (RLHF), Direct Preference Optimization (DPO), and Kahneman-Tversky Optimization (KTO). We empirically validate its performance on synthetic and image datasets using the DPO objective.

46Stabilizing the Kumaraswamy Distribution

[openreview] [pdf]

Abstract Large-scale latent variable models require expressive continuous distributions that support efficient sampling and low-variance differentiation, achievable through the reparameterization trick. The Kumaraswamy (KS) distribution is both expressive and supports the reparameterization trick with a simple closed-form inverse CDF. Yet, its adoption remains limited. We identify and resolve numerical instabilities in the inverse CDF and log-pdf, exposing issues in libraries like PyTorch and TensorFlow. We then introduce simple and scalable latent variable models based on the KS, improving exploration-exploitation trade-offs in contextual multi-armed bandits and enhancing uncertainty quantification for link prediction with graph neural networks. Our results support the stabilized KS distribution as a core component in scalable variational models for bounded latent variables.

47Dynamic Negative Guidance of Diffusion Models

[openreview] [pdf]

Abstract Negative Prompting (NP) is widely utilized in diffusion models, particularly in text-to-image applications, to prevent the generation of undesired features. In this paper, we show that conventional NP is limited by the assumption of a constant guidance scale, which may lead to highly suboptimal results, or even complete failure, due to the non-stationarity and state-dependence of the reverse process. Based on this analysis, we derive a principled technique calledDynamicNegativeGuidance, which relies on a near-optimal time and state dependent modulation of the guidance without requiring additional training. Unlike NP, negative guidance requires estimating the posterior class probability during the denoising process, which is achieved with limited additional computational overhead by tracking the discrete Markov Chain during the generative process. We evaluate the performance of DNG class-removal on MNIST and CIFAR10, where we show that DNG leads to higher safety, preservation of class balance and image quality when compared with baseline methods. Furthermore, we show that it is possible to use DNG with Stable Diffusion to obtain more accurate and less invasive guidance than NP.

48One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

[openreview] [pdf]

Abstract Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process stemming from iterative denoising steps poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy (OneDP), a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only 2%-10% additional pre-training cost for convergence. We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided athttps://drive.google.com/file/d/1eIa11gw6DwYKG9CKERy41bjE1ruklRtT/view?usp=sharing, and the code will be publicly available soon.

49Dynamic Diffusion Transformer

[openreview] [pdf]

Abstract Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its compu- tation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial- wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning it- erations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet.

50Data Unlearning in Diffusion Models

[openreview] [pdf]

Abstract Recent work has shown that diffusion models memorize and reproduce training data examples. At the same time, large copyright lawsuits and legislation such as GDPR have highlighted the need for erasing datapoints from diffusion models. However, retraining from scratch is often too expensive. This motivates the setting of data unlearning, i.e., the study of efficient techniques for unlearning specific datapoints from the training set. Existing concept unlearning techniques require an anchor prompt/class/distribution to guide unlearning, which is not available in the data unlearning setting. General-purpose machine unlearning techniques were found to be either unstable or failed to unlearn data. We therefore propose a family of new loss functions called Subtracted Importance Sampled Scores (SISS) that utilize importance sampling and are the first method to unlearn data with theoretical guarantees. SISS is constructed as a weighted combination between simpler objectives that are responsible for preserving model quality and unlearning the targeted datapoints. When evaluated on CelebA-HQ and MNIST, SISS achieved Pareto optimality along the quality and unlearning strength dimensions. On Stable Diffusion, SISS successfully mitigated memorization on nearly 90% of the prompts we tested. We release our code online.

51Towards a Theoretical Understanding of Memorization in Diffusion Models

[openreview] [pdf]

Abstract As diffusion probabilistic models (DPMs) are being employed as mainstream models for Generative Artificial Intelligence (GenAI), the study of their memorization of training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn via memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for trustworthy application of GenAI. Existing works revealed that conditional DPMs are more prone to training data memorization than unconditional DPMs, and the motivated data extraction methods are mostly for conditional DPMs. However, these understandings are primarily empirical, and extracting training data from unconditional models has been found to be extremely challenging. In this work, we provide a theoretical understanding of memorization in both conditional and unconditional DPMs under the assumption of model convergence. Our theoretical analysis indicates that extracting data from unconditional models can also be effective by constructing a proper surrogate condition. Based on this result, we propose a novel data extraction method named \textbf{Surrogate condItional Data Extraction (SIDE)} that leverages a time-dependent classifier trained on the generated data as a surrogate condition to extract training data from unconditional DPMs. Empirical results demonstrate that our SIDE can extract training data in challenging scenarios where previous methods fail, and it is, on average, over 50% more effective across different scales of the CelebA dataset.

52RETHINK MAXIMUM STATE ENTROPY

[openreview] [pdf]

Abstract In the absence of specific tasks or extrinsic reward signals, a key objective for an agent is the efficient exploration of its environment. A widely adopted strategy to achieve this is maximizing state entropy, which encourages the agent to uniformly explore the entire state space. Most existing approaches for maximum state entropy (MaxEnt) are rooted in two foundational approaches, which were proposed by Hazan and Liu & Abbeel, respectively. However, a unified perspective on these methods is lacking within the community.In this paper, we analyze these two foundational approaches within a unified framework and demonstrate that both methods share the same reward function when employing the kkNN density estimator. We also show that the η\eta-based policy sampling method proposed by Hazan is unnecessary and that the primary distinction between the two lies in the frequency with which the locally stationary reward function is updated. Building on this analysis, we introduce MaxEnt-(V)eritas, which combines the most effective components of both methods: iteratively updating the reward function as defined by Liu & Abbeel, and training the agent until convergence before updating the reward functions, akin to the procedure used by Hazan. We prove that MaxEnt-V is an efficient ε\varepsilon-optimal algorithm for maximizing state entropy, where the tolerance ε\varepsilon decreases as the number of iterations increases. Empirical validation in three Mujoco environments shows that MaxEnt-Veritas significantly outperforms the two MaxEnt frameworks in terms of both state coverage and state entropy maximization, with sound explanations for these results.

53Conditional Information Bottleneck Approach for Out-of-Distribution Sequential Recommendation

[openreview] [pdf]

Abstract Sequential recommendation (SR) aims to suggest items users are most likely to engage with next based on their past interactions. However, in practice, SR systems often face the out-of-distribution (OOD) problem due to dynamic environmental factors (e.g., seasonal changes), leading to significant performance degradation in the testing phase. Some methods incorporate distributionally robust optimization (DRO) into SR to alleviate OOD, but the sparsity of SR data challenges this. Other approaches use random data augmentations to explore the OOD, potentially distorting important information, as user behavior is personalized rather than random. Additionally, they often overlook users’ varying sensitivity to distribution shifts during the exploration, which is crucial for capturing the evolution of user preferences in OOD contexts. In this work, inspired by information bottleneck theory (IB), we propose the Conditional Distribution Information Bottleneck (CDIB), a novel objective that creates diverse OOD distributions while preserving minimal sufficient information regarding the origin distribution conditioned on the user. Building on this, we introduce a framework with a learnable, personalized data augmentation method using a mask-then-generate paradigm to craft diverse and reliable OOD distributions optimized with CDIB. Experiments on four real-world datasets show our model consistently outperforms baselines. The code is available athttps://anonymous.4open.science/r/CDIB-51C8.

54Backtracking Improves Generation Safety

[openreview] [pdf]

Abstract Text generation has a fundamental limitation almost by definition: there is no taking back tokens that have been generated, even when they are clearly problematic. In the context of language model safety, when a partial unsafe generation is produced, language models by their nature tend to happily keep on generating similarly unsafe additional text. This is in fact how safety alignment of frontier models gets circumvented in the wild, despite great efforts in improving their safety. Deviating from the paradigm of approaching safety alignment as prevention (decreasing the probability of harmful responses), we propose backtracking, a technique that allows language models to “undo” and recover from their own unsafe generation through the introduction of a special [RESET] token. Our method can be incorporated into either SFT or DPO training to optimize helpfulness and harmlessness. We show that models trained to backtrack are consistently safer than baseline models: backtracking Llama-3-8B is four times more safe than the baseline model (6.1% \to 1.5%) in our evaluations without regression in helpfulness. Our method additionally provides protection against four adversarial attacks including an adaptive attack, despite not being trained to do so.

55Optimizing Latent Goal by Learning from Trajectory Preference

[openreview] [pdf]

Abstract A glowing body of work has emerged focusing on instruction-following policies for open-world agents, aiming to better align the agent’s behavior with human intentions. However, the performance of these policies is highly susceptible to the initial prompt, which leads to extra efforts in selecting the best instructions. We propose a framework named \emph{\textbf{P}reference \textbf{G}oal \textbf{T}uning} (PGT). PGT allows policies to interact with the environment to collect several trajectories, which will be categorized into positive and negative examples based on preference. A preference optimization algorithm is used to fine-tune the initial goal latent representation using the collected trajectories while keeping the policy backbone frozen. The experiment result shows that with minimal data and training, PGT achieves an average relative improvement of 72.072.0% and 81.681.6% over 17 tasks in 2 different foundation policies respectively, and outperforms the best human-selected instructions. Moreover, PGT surpasses full fine-tuning in the out-of-distribution (OOD) task-execution environments by 13.413.4%, indicating that our approach retains strong generalization capabilities. Since our approach stores a single latent representation for each task independently, it can be viewed as an efficient method for Continual Learning, without the risk of catastrophic forgetting or task interference. In short, PGT enhances the performance of agents across nearly all tasks in the Minecraft Skillforge benchmark and demonstrates robustness to the execution environment.

56Domain Guidance: A Simple Transfer Approach for a Pre-trained Diffusion Model

[openreview] [pdf]

Abstract Recent advancements in diffusion models have revolutionized generative modeling. However, the impressive and vivid outputs they produce often come at the cost of significant model scaling and increased computational demands. Consequently, building personalized diffusion models based on off-the-shelf models has emerged as an appealing alternative. In this paper, we introduce a novel perspective on conditional generation for transferring a pre-trained model. From this viewpoint, we propose \textit{Domain Guidance}, a straightforward transfer approach that leverages pre-trained knowledge to guide the sampling process toward the target domain. Domain Guidance shares a formulation similar to advanced classifier-free guidance, facilitating better domain alignment and higher-quality generations. We provide both empirical and theoretical analyses of the mechanisms behind Domain Guidance. Our experimental results demonstrate its substantial effectiveness across various transfer benchmarks, achieving over a 19.6% improvement in FID and a 20.6% improvement in FDDINOv2_\text{DINOv2} compared to standard fine-tuning. Notably, existing fine-tuned models can seamlessly integrate Domain Guidance to leverage these benefits, without additional training.

57Distilled Diffusion Language Models

[openreview] [pdf]

Abstract Transformer-based Large Language Models (LLMs) have demonstrated remarkable capa- bilities, yet their autoregressive nature forces sequential token-by-token decoding, leading to inefficiencies during inference. Furthermore, autoregressive language models lack in- herent self-correction abilities, which hinders their capacity to refine and improve gener- ated content without relying on external prompting or retraining techniques. In contrast, diffusion-based models offer the advantage of fast parallel generation through iterative refinement, while leveraging bi-directional attention to utilize full context at once. How- ever, diffusion models are unable to match their autoregressive counterparts. This moti- vates us to explore the possibility of distilling a pre-trained autoregressive (AR) language model (teacher) into a non-autoregressive diffusion (non-AR) language model (student), combining the best of both worlds. In this work, we present Target Concrete Score (TCS) distillation, a theoretically grounded framework that bridges autoregressive and diffusion paradigms. TCS distillation is broadly applicable to both discrete and continuous diffu- sion models, with any pre-trained autoregressive teacher model. We propose techniques to make TCS distillation scalable and efficient for transformer-based models, and show how it can both improve pre-trained diffusion language models and also train new mod- els from scratch. Through comprehensive experiments on language modeling tasks, we demonstrate the effectiveness of our proposed methods.

58Revamping Diffusion Guidance for Conditional and Unconditional Generation

[openreview] [pdf]

Abstract Classifier-free guidance (CFG) has become the standard method for enhancing the quality of conditional diffusion models. However, employing CFG requires either training an unconditional model alongside the main diffusion model or modifying the training procedure by periodically inserting a null condition. There is also no clear extension of CFG to unconditional models. In this paper, we revisit the core principles of CFG and introduce a new method, independent condition guidance (ICG), which provides the benefits of CFG without the need for any special training procedures. Our approach streamlines the training process of conditional diffusion models and can also be applied during inference on any pre-trained conditional model. Additionally, by leveraging the time-step information encoded in all diffusion networks, we propose an extension of CFG, called time-step guidance (TSG), which can be applied toanydiffusion model, including unconditional ones. Our guidance techniques are easy to implement and have the same sampling cost as CFG. Through extensive experiments, we demonstrate that ICG matches the performance of standard CFG across various conditional diffusion models. Moreover, we show that TSG improves generation quality in a manner similar to CFG, without relying on any conditional information.

59Task-agnostic Pre-training and Task-guided Fine-tuning for Versatile Diffusion Planner

[openreview] [pdf]

Abstract Diffusion models have demonstrated their capabilities in modeling trajectories of multi-tasks. However, existing multi-task planners or policies typically rely on task-specific demonstrations via multi-task imitation, or require task-specific reward labels to facilitate policy optimization via Reinforcement Learning (RL). They heavily rely on the task-specific labeled data which can be difficult to acquire. To address these challenges, we aim to develop a versatile diffusion planner that can leverage large-scale inferior data that contains task-agnostic sub-optimal trajectories, with the ability to fast adapt to specific tasks. In this paper, we propose SODP, a two-stage framework that leverages Sub-Optimal data to learn a Diffusion Planner, which is generalizable for various downstream tasks. Specifically, in the pre-training stage, we train a foundation diffusion planner that extracts general planning capabilities by modeling the versatile distribution of multi-task trajectories, which can be sub-optimal and has wide data coverage. Then for downstream tasks, we adopt RL-based fine-tuning with task-specific rewards to fast refine the diffusion planner, which aims to generate action sequences with higher task-specific returns. Experimental results from multi-task domains including Meta-World and Adroit demonstrate that SODP outperforms state-of-the-art methods with only a small amount of data for reward-guided fine-tuning.

60Can the Training Loss be Predictive for Out-of-Distribution Generalization?

[openreview] [pdf]

Abstract Traditional model selection in deep learning relies on carefully tuning several hyper-parameters (HPs) controlling regularization strength on held-out validation data, which can be challenging to obtain in scarce-data scenarios or may not accurately reflect real-world deployment conditions due to distribution shifts. Motivated by such issues, this paper investigates the potential of using solely the training loss to predict the generalization performance of neural networks on out-of-distribution (OOD) test scenarios. Our analysis reveals that preserving consistent prediction variance across training and testing distributions is essential for establishing a correlation between training loss and OOD generalization. We propose architectural adjustments to ensure variance preservation\textit{variance preservation}, enabling reliable model selection based on training loss alone, even in over-parameterized settings with a sample-to-parameter ratio exceeding four orders of magnitude. We extensively assess the model-selection capabilities of variance-preserving\textit{variance-preserving} architectures on several scarce data, domain-shift, and corruption benchmarks by optimizing HPs such as learning rate, weight decay, batch size, and data augmentation strength.

61Balancing Domain-Invariant and Domain-Specific Knowledge for Domain Generalization with Online Knowledge Distillation

[openreview] [pdf]

Abstract Deep learning models often experience performance degradation when the distribution of testing data differs from that of training data. Domain generalization addresses this problem by leveraging knowledge from multiple source domains to enhance model generalizability. Recent studies have shown that distilling knowledge from large pretrained models effectively improves a model’s ability to generalize to unseen domains. However, current knowledge distillation-based domain generalization approaches overlook the importance of domain-specific knowledge and rely on a two-stage training process, which limits the effectiveness of knowledge transfer. To overcome these limitations, we propose the Balanced Online knowLedge Distillation (BOLD) framework for domain generalization. BOLD employs a multi-domain expert teacher model, with each expert specializing in specific source domains to preserve domain-specific knowledge. This approach enables the student to distil both domain-invariant and domain-specific knowledge from the teacher. Additionally, BOLD adopts an online knowledge distillation strategy where the teacher and students learn simultaneously, allowing the teacher to adapt based on the student’s feedback, thereby enhancing knowledge transfer and improving the student’s generalizability. Extensive experiments conducted with state-of-the-art baselines on seven domain generalization benchmarks demonstrate the effectiveness of the BOLD framework. We also provide a theoretical analysis that underscores the effectiveness of domain-specific knowledge and the online knowledge distillation strategy in domain generalization.

62Heavy-Tailed Diffusion Models

[openreview] [pdf]

Abstract Diffusion models achieve state-of-the-art generation quality across many applications, but their ability to capture rare or extreme events in heavy-tailed distributions remains unclear. In this work, we show that traditional diffusion and flow-matching models with standard Gaussian priors fail to accurately capture heavy-tailed behavior. We address this by repurposing the diffusion framework for heavy-tail estimation using multivariate Student-t distributions. We develop a tailored perturbation kernel and derive the denoising posterior based on the conditional Student-t distribution for the backward process. Inspired by γ\gamma-divergence for heavy-tailed distributions, we derive a training objective for heavy-tailed denoisers. The resulting framework introduces controllable tail generation using only a single scalar hyperparameter, making it easily tunable for diverse real-world distributions. As specific instantiations of our framework, we introduce t-EDM and t-Flow, extensions of existing diffusion and flow models that employ a Student-t prior. Remarkably, our approach is readily compatible with standard Gaussian diffusion models and requires only minimal code changes. Empirically, we show that our t-EDM and t-Flow outperform standard diffusion models in heavy-tail estimation on high-resolution weather datasets in which generating rare and extreme events is crucial.

63Variational Search Distributions

[openreview] [pdf]

Abstract We develop variational search distributions (VSD), a method for finding discrete, combinatorial designs of a rare desired class in a batch sequential manner with a fixed experimental budget. We formalize the requirements and desiderata for this problem and formulate a solution via variational inference. In particular, VSD uses off-the-shelf gradient based optimization routines, can learn powerful generative models for designs, and can take advantage of scalable predictive models. We derive asymptotic convergence rates for learning the true conditional generative distribution of designs with certain configurations of our method. After illustrating the generative model on images, we empirically demonstrate that VSD can outperform existing baseline methods on a set of real sequence-design problems in various biological systems.

64Diffusion Modulation via Environment Mechanism Modeling for Planning

[openreview] [pdf]

Abstract Diffusion models have shown promising capabilities in trajectory generation for planning in offline reinforcement learning (RL). However, conventional diffusion-based planning methods often fail to account for the fact that generating trajectories in RL requires unique consistency between transitions to ensure coherence in real environments. This oversight can result in considerable discrepancies between the generated trajectories and the underlying mechanisms of a real environment. To address this problem, we propose a novel diffusion-based planning method, termed as Diffusion Modulation via Environment Mechanism Modeling (DMEMM). DMEMM modulates diffusion model training by incorporating key RL environment mechanisms, particularly transition dynamics and reward functions. Experimental results demonstrate that DMEMM achieves state-of-the-art performance for planning with offline reinforcement learning.

65Convergence of Score-Based Discrete Diffusion Models: A Discrete-Time Analysis

[openreview] [pdf]

Abstract Diffusion models have achieved great success in generating high-dimensional samples across various applications. While the theoretical guarantees for continuous-state diffusion models have been extensively studied, the convergence analysis of the discrete-state counterparts remains under-explored. In this paper, we study the theoretical aspects of score-based discrete diffusion models under the Continuous Time Markov Chain (CTMC) framework. We introduce a discrete-time sampling algorithm in the general state space [S]d[S]^d that utilizes score estimators at predefined time points. We derive convergence bounds for the Kullback-Leibler (KL) divergence and total variation (TV) distance between the generated sample distribution and the data distribution, considering both scenarios with and without early stopping under specific assumptions. Notably, our KL divergence bounds are nearly linear in dimension dd, aligning with state-of-the-art results for diffusion models. Our convergence analysis employs a Girsanov-based method and establishes key properties of the discrete score function, which are essential for characterizing the discrete-time sampling process.

66Energy-Based Conceptual Diffusion Model

[openreview] [pdf]

Abstract Diffusion models have shown impressive sample generation capabilities across various domains. However, current methods are still lacking in human-understandable explanations and interpretable control: (1) they do not provide a probabilistic framework for systematic interpretation. For example, when tasked with generating an image of a “Nighthawk”, they cannot quantify the probability of specific concepts (e.g., “black bill” and “brown crown” usually seen in Nighthawks) or verify whether the generated concepts align with the instruction. This limits explanations of the generative process; (2) they do not naturally support control mechanisms based on concept probabilities, such as correcting errors (e.g., correcting “black crown” to “brown crown” in a generated “Nighthawk” image) or performing imputations using these concepts, therefore falling short in interpretable editing capabilities. To address these limitations, we propose Energy-based Conceptual Diffusion Models (ECDMs). ECDMs integrate diffusion models and Concept Bottleneck Models (CBMs) within the framework of Energy-Based Models to provide unified interpretations. Unlike conventional CBMs, which are typically discriminative, our approach extends CBMs to the generative process. ECDMs use a set of energy networks and pretrained diffusion models to define the joint energy estimation of the input instructions, concept vectors, and generated images. This unified framework enables concept-based generation, interpretation, debugging, intervention, and imputation through conditional probabilities derived from energy estimates. Our experiments on various real-world datasets demonstrate that ECDMs offer both strong generative performance and rich concept-based interpretability.

67Diversity-Rewarded CFG Distillation

[openreview] [pdf]

Abstract Generative models are transforming creative domains such as music generation, with inference-time strategies like Classifier-Free Guidance (CFG) playing a crucial role. However, CFG doubles inference cost while limiting originality and diversity across generated contents. In this paper, we introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations. Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt. By finetuning, we learn model weights with the ability to generate high-quality and diverse outputs, without any inference overhead. This also unlocks the potential of weight-based model merging strategies: by interpolating between the weights of two models (the first focusing on quality, the second on diversity), we can control the quality-diversity trade-off at deployment time, and even further boost performance. We conduct extensive experiments on the MusicLM text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. Explore our generations athttps://musicdiversity.github.io/.

68EC-DIT: Scaling Diffusion Transformers with Adaptive Expert-Choice Routing

[openreview] [pdf]

Abstract Diffusion transformers have been widely adopted for text-to-image synthesis. While scaling these models up to billions of parameters shows promise, the effectiveness of scaling beyond current sizes remains underexplored and challenging. By explicitly exploiting the computational heterogeneity of image generations, we develop a new family of Mixture-of-Experts (MoE) models (EC-DIT) for diffusion transformers with expert-choice routing. EC-DIT learns to adaptively optimize the compute allocated to understand the input texts and generate the respective image patches, enabling heterogeneous computation aligned with varying text-image complexities. This heterogeneity provides an efficient way of scaling EC-DIT up to 97 billion parameters and achieving significant improvements in training convergence, text-to-image alignment, and overall generation quality over dense models and conventional MoE models. Through extensive ablations, we show that EC-DIT demonstrates superior scalability and adaptive compute allocation by recognizing varying textual importance through end-to-end training. Notably, in text-to-image alignment evaluation, our largest models achieve a state-of-the-art GenEval score of 71.68% and still maintain competitive inference speed with intuitive interpretability.

69Diffusion-Based Planning for Autonomous Driving with Flexible Guidance

[openreview] [pdf]

Abstract Achieving human-like driving behaviors in complex open-world environments is a critical challenge in autonomous driving. Contemporary learning-based planning approaches such as imitation learning methods often struggle to balance competing objectives and lack of safety assurance,due to limited adaptability and inadequacy in learning complex multi-modal behaviors commonly exhibited in human planning, not to mention their strong reliance on the fallback strategy with predefined rules. We propose a novel transformer-based Diffusion Planner for closed-loop planning, which can effectively model multi-modal driving behavior and ensure trajectory quality without any rule-based refinement. Our model supports joint modeling of both prediction and planning tasks under the same architecture, enabling cooperative behaviors between vehicles. Moreover, by learning the gradient of the trajectory score function and employing a flexible classifier guidance mechanism, Diffusion Planner effectively achieves safe and adaptable planning behaviors. Evaluations on the large-scale real-world autonomous planning benchmark nuPlan and our newly collected 200-hour delivery-vehicle driving dataset demonstrate that Diffusion Planner achieves state-of-the-art closed-loop performance with robust transferability in diverse driving styles.

70Satisficing Exploration in Bandit Optimization

[openreview] [pdf]

Abstract Motivated by the concept of satisficing in decision-making, we consider the problem of satisficing exploration in bandit optimization. In this setting, the learner aims at finding a satisficing arm whose mean reward exceeds a certain threshold. The performance is measured by satisficing regret, which is the cumulative deficit of the chosen arm’s mean reward compared to the threshold. We propose SELECT, a general algorithmic template for Satisficing Exploration via LowEr Confidence bound Testing, that attains constant satisficing regret for a wide variety of bandit optimization problems in the realizable case (i.e., whenever a satisficing arm exists). Specifically, given a class of bandit optimization problems and a corresponding learning oracle with sub-linear (standard) regret upper bound, SELECT iteratively makes use of the oracle to identify a potential satisficing arm. Then, it collects data samples from this arm, and continuously compares the lower confidence bound of the identified arm’s mean reward against the threshold value to determine if it is a satisficing arm. As a complement, SELECT also enjoys the same (standard) regret guarantee as the oracle in the non-realizable case. Finally, we conduct numerical experiments to validate the performance of SELECT for several popular bandit optimization settings.

71Longitudinal Latent Diffusion Models

[openreview] [pdf]

Abstract Longitudinal data are crucial in several fields, but collecting them is a challenging process, often hindered by concerns such as individual privacy. Extrapolating in time initial trajectories or generating fully synthetic sequences could address these issues and prove valuable in clinical trials, drug design, and even public policy evaluation. We propose a generative statistical model for longitudinal data that links the temporal dependence of a sequence to a latent diffusion model and leverages the geometry of the autoencoder latent space. This versatile method can be used for several tasks - prediction, generation, oversampling - effectively handling high-dimensional data such as images and irregularly-measured sequences, needing only relatively few training samples. Thanks to its ability to generate sequences with controlled variability, it outperforms previously proposed methods on datasets of varying complexity, while remaining interpretable.

72Understanding and Mitigating Memorization in Diffusion Models for Tabular Data

[openreview] [pdf]

Abstract Tabular data generation has attracted significant research interest in recent years, with the tabular diffusion models greatly improving the quality of synthetic data. However, while memorization—where models inadvertently replicate exact or near-identical training data—has been thoroughly investigated in image and text generation, its effects on tabular data remain largely unexplored. In this paper, we conduct the first comprehensive investigation of memorization phenomena in diffusion models for tabular data. Our empirical analysis reveals that memorization appears in tabular diffusion models and increases with larger training epochs. We further examine the influence of factors such as dataset sizes, feature dimensions, and different diffusion models on memorization. Additionally, we provide a theoretical explanation for why memorization occurs in tabular diffusion models. To address this issue, we propose TabCutMix, a simple yet effective data augmentation technique that exchanges randomly selected feature segments between random training sample pairs. Experimental results across various datasets and diffusion models demonstrate that TabCutMix effectively mitigates memorization while maintaining high-quality data generation. Our code is available at \url{https://anonymous.4open.science/r/TabCutMix-3F7B}.

73Principle Counterfactual Fairness

[openreview] [pdf]

Abstract Fairness in human and algorithmic decision-making is crucial in areas such as criminal justice, education, and social welfare. Recently, counterfactual fairness has drawn increasing research interest, suggesting that decision-making for individuals should remain the same when intervening with different values on the protected attributes. Nevertheless, the question of “which attributes and individuals should be protected” is rarely discussed in the existing counterfactual fairness literature. For example, when considering leg disability as a protected attribute, the algorithms should not treat individuals with leg disabilities differently in college admissions, but one may naturally take into this factor for the purpose of selecting runner athletes. In other words, when and how to enforce fairness is expected to depend on the causal relation between the protected attribute and the outcome of interest. Formally, this paper proposes principal counterfactual fairness using the concept of principal stratification from the causal inference literature, focusing on whether an algorithm is counterfactually fair for individuals whose protected attribute has no individual causal effect on the outcome of interest. To examine whether an algorithm satisfies principal counterfactual fairness, we derive the statistical bounds, and propose a post-processing approach to achieving principal counterfactual fairness with minimal individual decision changes. Experiments are conducted using synthetic and real-world datasets to verify the effectiveness of our methods.

74Unified Convergence Analysis for Score-Based Diffusion Models with Deterministic Samplers

[openreview] [pdf]

Abstract Score-based diffusion models have emerged as powerful techniques for generating samples from high-dimensional data distributions. These models involve a two-phase process: first, injecting noise to transform the data distribution into a known prior distribution, and second, sampling to recover the original data distribution from noises. Among the various sampling methods, deterministic samplers stand out for their enhanced efficiency. However, analyzing these deterministic samplers presents unique challenges, as they preclude the use of established techniques such as Girsanov’s theorem, which are only applicable to stochastic samplers. Furthermore, existing analysis for deterministic samplers usually focuses on some specific examples, lacking a generalized approach for general forward processes and various deterministic samplers. Our paper addresses these limitations by introducing a unified convergence analysis framework. To demonstrate the power of our framework, we analyze the variance-preserving (VP) forward process with the exponential integrator (EI) scheme, and achieved iteration complexity of O~(d2/ϵ)\tilde{O}(d^2/\epsilon). Additionally, we provide a detailed analysis of DDIM-type samplers, which have been underexplored in previous research, achieving polynomial iteration complexity.

75Improving Discrete Diffusion with Schedule-Conditioning

[openreview] [pdf]

Abstract In research on discrete diffusion generative models, one long-standing mystery is the dominance of the masking state corruption process. In masking diffusion, all data points collapse to a sequence of mask tokens without any transitions between non-mask tokens, ruling out small edits from one unmasked token to another. By contrast, in image modeling, the dominant corruption process is Gaussian noise, which encourages gradual movements in pixel space. In this paper, we propose that masking diffusion dominates due to knowledge of when corruptions occurred. When it makes predictions, it does so conditional on the schedule of previous corruptions; this allows it to devote less capacity to inferring whether a corruption has occurred and more capacity to modeling relationships between tokens. We use this insight to build knowledge of corruptions into other discrete diffusion models; we call our method schedule-conditioned diffusion (SCUD). We show that SCUD generalizes classical discrete diffusion and masking diffusion. We show that applying SCUD to models with different corruption processes leads to improved perplexities on images, text, and protein sequences; Finally, by applying SCUD to models with corruption processes with ``gradual’’ structure, we build diffusion models that outperform masking.

76Preference Diffusion for Recommendation

[openreview] [pdf]

Abstract Recommender systems predict personalized item rankings based on user preference distributions derived from historical behavior data. Recently, diffusion models (DMs) have gained attention in recommendation for their ability to model complex distributions, yet current DM-based recommenders often rely on traditional objectives like mean squared error (MSE) or recommendation objectives, which are not optimized for personalized ranking tasks or fail to fully leverage DM’s generative potential. To address this, we propose PreferDiff, a tailored optimization objective for DM-based recommenders. PreferDiff transforms BPR into a log-likelihood ranking objective and integrates multiple negative samples to better capture user preferences. Specifically, we employ variational inference to handle the intractability through minimizing the variational upper bound and replaces MSE with cosine error to improve alignment with recommendation tasks. Finally, we balance learning generation and preference to enhance the training stability of DMs. PreferDiff offers three key benefits: it is the first personalized ranking loss designed specifically for DM-based recommenders and it improves ranking and faster convergence by addressing hard negatives. We also prove that it is theoretically connected to Direct Preference Optimization which indicates that it has the potential to align user preferences in DM-based recommenders via generative modeling. Extensive experiments across three benchmarks validate its superior recommendation performance and commendable general sequential recommendation capabilities. Our codes are available at \url{https://anonymous.4open.science/r/PreferDiff}.

77Mitigating Shortcut Learning with Diffusion Counterfactuals and Diverse Ensembles

[openreview] [pdf]

Abstract Spurious correlations in the data, where multiple cues are predictive of the target labels, often lead to a phenomenon known as shortcut learning, where a model relies on erroneous, easy-to-learn cues while ignoring reliable ones. In this work, we propose DiffDiv an ensemble diversification framework exploiting Diffusion Probabilistic Models (DPMs) to mitigate this form of bias. We show that at particular training intervals, DPMs can generate images with novel feature combinations, even when trained on samples displaying correlated input features. We leverage this crucial property to generate synthetic counterfactuals to increase model diversity via ensemble disagreement. We show that DPM-guided diversification is sufficient to remove dependence on shortcut cues, without a need for additional supervised signals. We further empirically quantify its efficacy on several diversification objectives, and finally show improved generalization and diversification on par with prior work that relies on auxiliary data collection.

78HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?

[openreview] [pdf]

Abstract Accurately forecasting multiple future events within a given time horizon is crucial for applications in finance, retail, social networks, and healthcare. Event timing and labels are typically modeled using Marked Temporal Point Processes (MTPP), with evaluations often focused on next-event prediction quality. While some studies have extended evaluations to a fixed number of future events, we demonstrate that this approach leads to inaccuracies in handling false positives and false negatives. To address these issues, we propose a novel evaluation method inspired by object detection techniques from computer vision. Specifically, we introduce Temporal mean Average Precision (T-mAP), a temporal variant of mAP, which overcomes the limitations of existing long-horizon evaluation metrics. Our extensive experiments demonstrate that models with strong next-event prediction accuracy can yield poor long-horizon forecasts, and vice versa, indicating that specialized methods are needed for each task. To support further research, we release HoTPP, the first benchmark specifically designed for evaluating long-horizon MTPP predictions. HoTPP includes large-scale datasets with up to 43 million events and provides optimized procedures for both autoregressive and parallel inference, paving the way for future advancements in the field.

79Zigzag Diffusion Sampling: The Path to Success ls Zigzag

[openreview] [pdf]

Abstract Diffusion models, the most popular generative paradigm so far, can inject conditional information into the generation path to guide the latent towards desired directions. However, existing text-to-image diffusion models often fail to maintain high image quality and high prompt-image alignment for those challenging prompts. To mitigate this issue and enhance existing pretrained diffusion models, we mainly made three contributions in this paper. First, we theoretically and empirically demonstrate that the conditional guidance gap between the denoising and inversion processes captures prompt-related semantic information. Second, motivated by theoretical analysis, we derive Zigzag Diffusion Sampling (Z-Sampling), a novel sampling method that leverages the guidance gap to accumulate semantic information step-by-step throughout the entire generation process, leading to improved sampling results. Moreover, as a plug-and-play method, Z-Sampling can be generally applied to various diffusion models (e.g., accelerated ones and Transformer-based ones) with very limited coding costs. Third, extensive experiments demonstrate that Z-Sampling can generally and significantly enhance generation quality across various benchmark datasets, diffusion models, and performance evaluation metrics. Particularly, Z-Sampling is good at handling those challenging fine-grained prompts, such as style, position, counting, and multiple objects, due to its guidance-gap-based information gain. Moreover, Z-Sampling can even further enhance existing diffusion models combined with other orthogonal methods, including Diffusion-DPO.

80RTDiff: Reverse Trajectory Synthesis via Diffusion for Offline Reinforcement Learning

[openreview] [pdf]

Abstract In offline reinforcement learning (RL), managing the distribution shift between the learned policy and the static offline dataset is a persistent challenge that can result in overestimated values and suboptimal policies. Traditional offline RL methods address this by introducing conservative biases that limit exploration to well-understood regions, but they often overly restrict the agent’s generalization capabilities. Recent work has sought to generate trajectories using generative models to augment the offline dataset, yet these methods still struggle with overestimating synthesized data, especially when out-of-distribution samples are produced. To overcome this issue, we propose RTDiff, a novel diffusion-based data augmentation technique that synthesizes trajectoriesin reverse, moving from unknown to known states. Such reverse generation naturally mitigates the risk of overestimation by ensuring that the agent avoids planning through unknown states. Additionally, reverse trajectory synthesis allows us to generate longer, more informative trajectories that take full advantage of diffusion models’ generative strengths while ensuring reliability. We further enhance RTDiff by introducing flexible trajectory length control and improving the efficiency of the generation process through noise management. Our empirical results show that RTDiff significantly improves the performance of several state-of-the-art offline RL algorithms across diverse environments, achieving consistent and superior results by effectively overcoming distribution shift.

81Revealing the Unseen: Guiding Personalized Diffusion Models to Expose Training Data

[openreview] [pdf]

Abstract Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask:“Can training data be extracted from these fine-tuned DMs shared online?”A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model’s learned distribution---from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20% of fine-tuning data in most cases, significantly surpassing baseline performance. The code is available at an anonymous link.

82CONCORD: Concept-informed Diffusion for Dataset Distillation

[openreview] [pdf]

Abstract Dataset distillation has witnessed significant progress in synthesizing small-scale datasets that encapsulate rich information from large-scale original ones. Particularly, methods based on generative priors show promising performance, while maintaining computational efficiency and cross-architecture generalization. However, the generation process lacks explicit controllability for each sample. Previous distillation methods primarily match the real distribution from the perspective of the entire dataset, whereas overlooking conceptual completeness at the instance level. This oversight can result in missing or incorrectly represented object details and compromised dataset quality. To this end, we propose to incorporate the conceptual understanding of large language models (LLMs) to perform a CONCept-infORmed Diffusion process for dataset distillation, in short as CONCORD. Specifically, distinguishable and fine-grained concepts are retrieved based on category labels to explicitly inform the denoising process and refine essential object details. By integrating these concepts, the proposed method significantly enhances both the controllability and interpretability of the distilled image generation, without replying on pre-trained classifiers. We demonstrate the efficacy of CONCORD by achieving state-of-the-art performance on ImageNet-1K and its subsets. It further advances the practical application of dataset distillation methods. The code implementation is attached in the supplementary material.

83Cohesion: Coherence-Based Diffusion for Long-Range Dynamics Forecasting

[openreview] [pdf]

Abstract We recast existing works on probabilistic dynamics forecasting through a unified framework connecting turbulence and diffusion principles: Cohesion. Specifically, we relate the coherent part of nonlinear dynamics as a conditioning prior in a denoising process, which can be efficiently estimated using reduced-order models. This fast generation of long prior sequences allows us to reframe forecasting as trajectory planning, a common task in RL. This reformulation is beneficial because we can perform a single conditional denoising pass for an entire sequence, rather than autoregressively over long lead time, gaining orders-of-magnitude speedups with little performance loss. Nonetheless, Cohesion supports flexibility through temporal composition that allows iterations to be performed over smaller subsequences, with autoregressive being a special case. To ensure temporal consistency within and between subsequences, we incorporate a model-free, small receptive window via temporal convolution that leverages large NFEs during denoising. Finally, we perform our guidance in a classifier-free manner to handle a broad range of conditioning scenarios for zero-shot forecasts. Our experiments demonstrate that Cohesion outperforms state-of-the-art probabilistic emulators for chaotic systems over long lead time, including in Kolmogorov Flow and Shallow Water Equation. Its low spectral divergence highlights Cohesion’s ability to resolve multi-scale physical structures, even in partially-observed cases, and are thus essential for long-range, high-fidelity, physically-realistic emulation.

[openreview] [pdf]

Abstract Machine Learning (ML) has advanced Combinatorial Optimization (CO), especially for one of the most focused problems, the Travelling Salesman Problem (TSP). While certain methods demonstrate promising performance, they still fall short compared to mathematical solvers. This study utilizes TSP as a case study, dissecting established mainstream learning-based solvers to outline a comprehensive design space. It advances a unified modular streamline incorporating existing technologies in both learning and search for transparent ablation, aiming to reassess the role of learning and to discern which parts of existing techniques are genuinely beneficial and which are not. This further leads to the investigation of desirable principles of learning designs and the exploration of concepts guiding method designs. We demonstrate the desirability of principles such as joint probability estimation, symmetry solution representation, and online optimization for learning-based designs. Leveraging the findings, we propose enhancements to existing methods to compensate for their missing attributes, thereby advancing performance and enriching the technique library. From a higher viewpoint, we also uncover a performance advantage in non-autoregressive and supervised paradigms compared to their counterparts. The strategic decoupling and organic recompositions yield a factory of new TSP solvers, where we investigate synergies across various method combinations and pinpoint the optimal design choices to create more powerful ML4TSP solvers, thereby facilitating and offering a reference for future research and engineering endeavors. Source code will be made publicly available.

85Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved remarkable success at tasks like summarization that involve asingle turnof interaction. However, they can still struggle withmulti-turntasks like dialogue that require long-term planning. Previous works on multi-turn dialogue extend single-turn reinforcement learning from human feedback (RLHF) methods to the multi-turn setting by treating all prior dialogue turns as a long context. Such approaches suffer fromcovariate shift: the conversations in the training set have previous turns generated by some reference policy, which means that low training error may not necessarily correspond to good performance when the learner is actually in the conversation loop. In response, we introduce REgressing the RELative FUture (REFUEL), an efficient policy optimization approach designed to address multi-turn RLHF in LLMs. REFUEL employs a single model to estimate Q-values and trains on self-generated data, addressing the covariate shift issue. REFUEL frames the multi-turn RLHF problem as a sequence of regression tasks on iteratively collected datasets, enabling ease of implementation. Theoretically, we prove that REFUEL can match the performance of any policy covered by the training set. Empirically, we evaluate our algorithm by using Llama-3.1-70B-it to simulate a user in conversation with our model. REFUEL consistently outperforms state-of-the-art methods such as DPO and REBEL across various settings. Furthermore, despite having only 8 billion parameters, Llama-3-8B-it fine-tuned with REFUEL, outperforms Llama-3.1-70B-it on long multi-turn dialogues.

86Beyond Predefined Depots: A Dual-Mode Generative DRL Framework for Proactive Depot Generation in Location-Routing Problem

[openreview] [pdf]

Abstract The Location-Routing Problem (LRP), which combines the challenges of facility (depot) locating and vehicle route planning, is critically constrained by the reliance on predefined depot candidates, limiting the solution space and potentially leading to suboptimal outcomes. Previous research on LRP without predefined depots is scant and predominantly relies on heuristic algorithms that iteratively attempt depot placements across a planar area. Such approaches lack the ability to proactively generate depot locations that meet specific geographic requirements, revealing a notable gap in current research landscape. To bridge this gap, we propose a data-driven generative DRL framework, designed to proactively generate depots for LRP without predefined depot candidates, solely based on customer requests data which include geographic and demand information. It can operate in two distinct modes: direct generation of exact depot locations, and the creation of a multivariate Gaussian distribution for flexible depots sampling. By extracting depots’ geographic pattern from customer requests data, our approach can dynamically respond to logistical needs, identifying high-quality depot locations that further reduce total routing costs compared to traditional methods. Extensive experiments demonstrate that, for a same group of customer requests, compared with those depots identified through random attempts, our framework can proactively generate depots that lead to superior solution routes with lower routing cost. The implications of our framework potentially extend into real-world applications, particularly in emergency medical rescue and disaster relief logistics, where rapid establishment and adjustment of depot locations are paramount, showcasing its potential in addressing LRP for dynamic and unpredictable environments.

87Distributional Reinforcement Learning Based On Historical Information For Option Hedging

[openreview] [pdf]

Abstract Options are widely used financial derivatives for risk management and corporate operations. Option hedging aims to mitigate investment risks from asset price fluctuations by buying and selling other financial products. Traditional hedging strategies based on the Black-Scholes model face practical limitations due to the assumptions of constant volatility and the neglect of transaction costs. Recently, reinforcement learning(RL) has gained attention in the study of option hedging strategies, but several challenges remain: current methods rely on real-time market data (e.g., underlying asset prices, holdings, remaining option term) to determine optimal positions, underutilizing the potential value of historical data; existing approaches focus on the expected hedging cost, overlooking the comprehensive distribution of costs; In the aspect of training data generation, commonly used single simulation methods perform well under specific conditions but struggle to ensure the robustness of the model across diverse datasets. To address these issues, we propose a novel distributional RL option hedging method that incorporates historical information. Historical states are included in the state variables, with a gated recurrent unit (GRU) network layer extracting historical information. This is then combined with current information from fully connected layers to inform subsequent network layers, ensuring the agent considers both current and historical market information when learning hedging strategies. The output of the value network is set as a series of quantiles, with the Quantile Huber Loss function fitting their distribution to evaluate strategies based on distribution rather than expected value. To diversify data sources, we use a combination of the Black-Scholes model, the Binomial model, and the Heston model to simulate a large volume of option data. Experimental results show that our method significantly reduces hedging costs and demonstrates strong adaptability and practicality under various market conditions.

88Dual-Head Knowledge Distillation: Enhancing Logits Utilization with an Auxiliary Head

[openreview] [pdf]

Abstract Traditional knowledge distillation focuses on aligning the student’s predicted probabilities with both ground-truth labels and the teacher’s predicted probabilities. However, the transition to predicted probabilities from logits would obscure certain indispensable information. To address this issue, it is intuitive to additionally introduce a logit-level loss function as a supplement to the widely used probability-level loss function, for exploiting the latent information of logits. Unfortunately, we empirically find that the amalgamation of the newly introduced logit-level loss and the previous probability-level loss will lead to performance degeneration, even trailing behind the performance of employing either loss in isolation. We attribute this phenomenon to the collapse of the classification head, which is verified by our theoretical analysis based on the neural collapse theory. Specifically, the gradients of the two loss functions exhibit contradictions in the linear classifier yet display no such conflict within the backbone. Drawing from the theoretical analysis, we propose a novel method called dual-head knowledge distillation, which partitions the linear classifier into two classification heads responsible for different losses, thereby preserving the beneficial effects of both losses on the backbone while eliminating adverse influences on the classification head. Extensive experiments validate that our method can effectively exploit the information inside the logits and achieve superior performance against state-of-the-art counterparts.

89Generalizing to any diverse distribution: uniformity, gentle finetuning & rebalancing

[openreview] [pdf]

Abstract As training datasets grow larger, we aspire to develop models that generalize well to any diverse test distribution, even if the latter deviates significantly from the training data. Various approaches like domain adaptation, domain generalization, and robust optimization attempt to address the out-of-distribution challenge by posing assumptions about the relation between training and test distribution. Differently, we adopt a more conservative perspective by accounting for the worst-case error across all sufficiently diverse test distributions within a known domain. Our first finding is that training on a uniform distribution over this domain is optimal. We also interrogate practical remedies when uniform samples are unavailable by considering methods for mitigating non-uniformity through finetuning and rebalancing. Our theory provides a mathematical grounding for previous observations on the role of entropy and rebalancing for o.o.d. generalization and foundation model training. We also provide new empirical evidence across tasks involving o.o.d. shifts which illustrate the broad applicability of our perspective.

90Counterfactual Techniques for Enhancing Customer Retention

[openreview] [pdf]

Abstract In this paper, we introduce a novel counterfactual reasoning method using eBERT embeddings to convert customers from an e-commerce company who frequently add items to their cart but don’t proceed to checkout. We demonstrate that our method i) outperforms existing techniques such as DiCE, GANs, and CFRL in key metrics such as coverage, while also maintaining a low latency; ii) balances high coverage and low latency by adjusting the number of nearest unlike neighbors, highlighting a trade-off between these competing goals; and iii) allows customization of mutable features, improving the practical applicability of our counterfactual explanations.

91Effectively Steer LLM To Follow Preference via Building Confident Directions

[openreview] [pdf]

Abstract Having an LLM that aligns with human preference is essential for accommodating individual needs, such as maintaining writing style or generating specific topics of interest.The majority of current alignment methods rely on fine-tuning or prompting, which can be either costly or difficult to control. Model steering algorithms, which construct certain steering directions used to modify the model output}, are typically easy to implement and optimization-free. {However, their capabilities are typically limited to steering the model into one of the two directions (i.e., bidreictional steering), and that there has been no theoretical understanding to guarantee their performance. In this work, we propose a theoretical framework to understand and quantify the model steering methods. Inspired by the framework, we propose a confident direction steering method (CONFST) that steers LLMs via modifying their activations in inference time. More specifically, CONFST builds a {\it confident direction} that is closely aligned with users’ preferences, and then this direction is added to the activations of the LLMs to effectively steer the model output. Our approach offers three key advantages over popular bidirectional model steering methods: 1) {It is more powerful, since multiple (i.e. more than two) users’ preferences can be aligned simultaneously; 2) It is very simple to implement, since there is no need to determine which layer the steering vector should be added to; 3) No explicit user instruction is required. We validate our method on GPT-2 XL (1.5B), Mistral (7B) and Gemma-it (9B) models for tasks that require shifting the output of LLMs across a number of different topics and styles.

92Learning mirror maps in policy mirror descent

[openreview] [pdf]

Abstract Policy Mirror Descent (PMD) is a popular framework in reinforcement learning, serving as a unifying perspective that encompasses numerous algorithms. These algorithms are derived through the selection of a mirror map and enjoy finite-time convergence guarantees. Despite its popularity, the exploration of PMD’s full potential is limited, with the majority of research focusing on a particular mirror map---namely, the negative entropy---which gives rise to the renowned Natural Policy Gradient (NPG) method. It remains uncertain from existing theoretical studies whether the choice of mirror map significantly influences PMD’s efficacy. In our work, we conduct empirical investigations to show that the conventional mirror map choice (NPG) often yields less-than-optimal outcomes across several standard benchmark environments. Using evolutionary strategies, we identify more efficient mirror maps that enhance the performance of PMD. We first focus on a tabular environment, i.e.\ Grid-World, where we relate existing theoretical bounds with the performance of PMD for a few standard mirror maps and the learned one. We then show that it is possible to learn a mirror map that outperforms the negative entropy in more complex environments, such as the MinAtar suite. Additionally, we demonstrate that the learned mirror maps generalize effectively to different tasks by testing each map across various other environments.

93Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts

[openreview] [pdf]

Abstract With the rapid progress of diffusion-based content generation, significant efforts are being made to unlearn harmful or copyrighted concepts from pretrained diffusion models (DMs) to prevent potential model misuse. However, it is observed that even when DMs are properly unlearned before release, malicious finetuning can compromise this process, causing DMs torelearn the unlearned concepts. This occurs partly because certain benign concepts (e.g., “skin”) retained in DMs are related to the unlearned ones (e.g., “nudity”), facilitating their relearning via finetuning. To address this, we proposemeta-unlearningon DMs. Intuitively, a meta-unlearned DM should behave like an unlearned DM when used as is; moreover, if the meta-unlearned DM undergoes malicious finetuning on unlearned concepts, the related benign concepts retained within it will be triggered toself-destruct, hindering the relearning of unlearned concepts. Our meta-unlearning framework is compatible with most existing unlearning methods, requiring only the addition of an easy-to-implement meta objective. We validate our approach through empirical experiments on meta-unlearning concepts from Stable Diffusion models (SD-v1-4 and SDXL), supported by extensive ablation studies.

94On Statistical Rates of Conditional Diffusion Transformer: Approximation and Estimation

[openreview] [pdf]

Abstract We investigate the approximation and estimation rates of conditional diffusion transformers (DiTs) with classifier-free guidance. We present a comprehensive analysis for ``in-context’’ conditional DiTs under four common data assumptions. We show that both conditional DiTs and their latent variants lead to the minimax optimality of unconditional DiTs under identified settings. Specifically, we discretize the input domains into infinitesimal grids and then perform a term-by-term Taylor expansion on the conditional diffusion score function under Hölder smooth data assumption. This enables fine-grained use of transformers’ universal approximation through a more detailed piecewise constant approximation, and hence obtains tighter bounds. Additionally, we extend our analysis to the latent setting under the linear latent subspace assumption. We not only show that latent conditional DiTs achieve lower bounds than conditional DiTs both in approximation and estimation, but also show the minimax optimality of latent unconditional DiTs. Our findings establish statistical limits for conditional and unconditional DiTs, and offer practical guidance toward developing more efficient and accurate DiT models.

95Counterfactual Concept Bottleneck Models

[openreview] [pdf]

Abstract Current deep learning models are not designed to simultaneously address three fundamental questions: predict class labels to solve a given classification task (the “What?”), simulate changes in the situation to evaluate how this impacts class predictions (the “How?”), and imagine how the scenario should change to result in different class predictions (the “Why not?”). The inability to answer these questions represents a crucial gap in deploying reliable AI agents, calibrating human trust, and improving human-machine interaction. To bridge this gap, we introduce CounterFactual Concept Bottleneck Models (CF-CBMs), a class of models designed to efficiently address the above queries all at once without the need to run post-hoc searches. Our experimental results demonstrate that CF-CBMs: achieve classification accuracy comparable to black-box models and existing CBMs (“What?”), rely on fewer important concepts leading to simpler explanations (“How?”), and produce interpretable, concept-based counterfactuals (“Why not?”). Additionally, we show that training the counterfactual generator jointly with the CBM leads to two key improvements: (i) it alters the model’s decision-making process, making the model rely on fewer important concepts (leading to simpler explanations), and (ii) it significantly increases the causal effect of concept interventions on class predictions, making the model more responsive to these changes.

96Sampling from Energy-based Policies using Diffusion

[openreview] [pdf]

Abstract Energy-based policies offer a flexible framework for modeling complex, multimodal behaviors in reinforcement learning (RL). In maximum entropy RL, the optimal policy is a Boltzmann distribution derived from the soft Q-function, but direct sampling from this distribution in continuous action spaces is computationally intractable. As a result, existing methods typically use simpler parametric distributions, like Gaussians, for policy representation — limiting their ability to capture the full complexity of multimodal action distributions. In this paper, we introduce a diffusion-based approach for sampling from energy-based policies, where the negative Q-function defines the energy function. Based on this approach, we propose an actor-critic method called Diffusion Q-Sampling (DQS) that enables more expressive policy representations, allowing stable learning in diverse environments. We show that our approach enhances exploration and captures multimodal behavior in continuous control tasks, addressing key limitations of existing methods.

97Adaptive backtracking for fast optimization

[openreview] [pdf]

Abstract Backtracking line search is foundational in numerical optimization. The basic idea is to adjust the step size of an algorithm by a {\em constant} factor until some chosen criterion (e.g. Armijo, Goldstein, Descent Lemma) is satisfied. We propose a new way for adjusting step sizes, replacing the constant factor used in regular backtracking with one that takes into account the degree to which the chosen criterion is violated, without additional computational burden. We perform a variety of experiments on over fifteen real world datasets, which confirm that adaptive backtracking often leads to significantly faster optimization. For convex problems, we prove adaptive backtracking requires fewer adjustments to produce a feasible step size than regular backtracking does for two popular line search criteria: the Armijo condition and the descent lemma. For nonconvex smooth problems, we prove adaptive backtracking enjoys the same guarantees of regular backtracking.

98On the Byzantine-Resilience of Distillation-Based Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) algorithms using Knowledge Distillation (KD) have received increasing attention due to their favorable properties with respect to privacy, non-i.i.d. data and communication cost. These methods depart from transmitting model parameters and instead communicate information about a learning task by sharing predictions on a public dataset. In this work, we study the performance of such approaches in the byzantine setting, where a subset of the clients act in an adversarial manner aiming to disrupt the learning process. We show that KD-based FL algorithms are remarkably resilient and analyze how byzantine clients can influence the learning process. Based on these insights, we introduce two new byzantine attacks and demonstrate their ability to break existing byzantine-resilient methods. Additionally, we propose a novel defence method which enhances the byzantine resilience of KD-based FL algorithms. Finally, we provide a general framework to obfuscate attacks, making them significantly harder to detect, thereby improving their effectiveness. Our findings serve as an important building block in the analysis of byzantine FL, contributing through the development of new attacks and new defence mechanisms, further advancing the robustness of KD-based FL algorithms.

99Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

[openreview] [pdf]

Abstract State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by the Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-tofrequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with the region-guided diffusion approaches. Extensive experiments on three datasets, including our newly proposed benchmark, RareBench, containing various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment.

100Continual Learning After Model Deployment

[openreview] [pdf]

Abstract This paper studies continual learning after model deployment. A real-world application environment is often an open world filled with novel or out-of-distribution (OOD) objects that have not been seen before. We can call continual learning in such an environmentopen-world continual learning(OWCL). OWCL incrementally performs two main tasks: (1) detecting OOD objects, and (2) continually learning the OOD or new objects on the fly. Although OOD detection and continual learning have been extensively studied separately, their combination for OWCL has barely been attempted. This is perhaps because in addition to the existing challenges of OOD detection and continual learning such ascatastrophic forgetting(CF), OWCL also faces the challenge of data scarcity. As novel objects appear sporadically, when an object from a new/novel class is detected, it is difficult to learn it from one or a few samples to give good accuracy. This paper proposes a novel method called OpenLD to deal with these problems based onlinear discriminant analysis(LDA) and a pre-trained model. This method enables OOD detection and incremental learning of the detected samples on the fly with no CF. Experimental evaluation demonstrates the effectiveness of OpenLD.

101Transition Path Sampling with Improved Off-Policy Training of Diffusion Path Samplers

[openreview] [pdf]

Abstract Understanding transition pathways between meta-stable states in molecular systems is crucial to advance material design and drug discovery. However, unbiased molecular dynamics simulations are computationally infeasible due to the high energy barriers separating these states. Although recent machine learning techniques offer potential solutions, they are often limited to simple systems or rely on collective variables (CVs) derived from costly domain expertise. In this paper, we introduce a novel approach that trains diffusion path samplers (DPS) for transition path sampling (TPS) without the need for CVs. We recast the problem as an amortized sampling of the target path measure, minimizing the log-variance divergence between the path measure induced by our DPS and the target path measure. To ensure scalability for high-dimensional tasks, we introduce (1) a new off-policy training objective based on learning control variates with replay buffers and (2) a scale-based equivariant parameterization of the bias forces. We evaluate our approach, coined TPS-DPS, on a synthetic double-well potential and three peptides: Alanine Dipeptide, Polyproline Helix, and Chignolin. Results show that our approach produces more realistic and diverse transition pathways compared to existing baselines. We also provide links toproject pageandcode.

102One Model to Train Them All: A Unified Diffusion Framework for Multi-Context Neural Population Forecasting

[openreview] [pdf]

Abstract Recent research has revealed shared neural patterns among animals performing similar tasks and within individual animals across different tasks. This has led to a growing interest in replacing single-session latent variable models with a unified model that allows us to align recordings across different animals, sessions, and tasks, despite the challenge of distinct neuron identities in each recording. In this work, we present a conditioned diffusion framework to model population dynamics of neural activity across multiple contexts. The quality of the learned dynamics is evaluated through the model’s forecasting ability, which predicts multiple timesteps of both neural activity and behavior. Additionally, we introduce a benchmark dataset spanning six electrophysiology datasets, seven tasks, 19 animals, and 261 sessions, providing a standardized framework for multi-task neural population models. Our results demonstrate that the pretrained model can be efficiently adapted to novel, unseen sessions without requiring explicit neuron correspondence. This enables few-shot learning with minimal labeled data, as well as competitive performance in zero-shot learning.

103Knowledge Lift Alignment Fine Tuning

[openreview] [pdf]

Abstract We present a visual tuning framework, \textbf{K}nowledge \textbf{L}ift \textbf{A}lignment \textbf{F}ine \textbf{T}uning (KLAFT), which enhances the expressive image captioning capabilities of Pre-trained Language Models (PLMs), including LLMs and VLMs. As this task involves generating more detailed and comprehensive captions than basic image descriptions, the core idea behind KLAFT is that fine-grained alignment could exploit the capabilities of PLMs and a given target domain dataset. This idea motivates and challenges us to explore the framework that deeply understands both given images and text for this alignment and tuning PLMs towards expressive image captioning. This direction modifies the attention mechanism (Modified Attention Mechanism, MAM) and develops both a Topic Control Mechanism (TCM) and their training objectives. The innovation of KLAFT lies in its approach to addressing the disparities in knowledge - visual versus textual via MAM and source versus target domain via TCM. As these hidden spaces are conceptualized as distinct sub-networks within the PLM, each possessing specific knowledge, KLAFT’s unique contribution is in aligning and adjusting the weights of these sub-networks in a fine-grained manner, and fine-tuning this PLM. Our empirical studies demonstrate that KLAFT significantly improves expressive captioning tasks by aligning and amplifying target knowledge, with the potential for Parameter-Efficient Fine-Tuning (PEFT) at low computational cost.

104When do GFlowNets learn the right distribution?

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are an emerging class of sampling methods for distributions over discrete and compositional objects, e.g., graphs. In spite of their remarkable success in problems such as drug discovery and phylogenetic inference, the question of when and whether GFlowNets learn to sample from the target distribution remains underexplored. To tackle this issue, we first assess the extent to which a violation of the detailed balance of the underlying flow network might hamper the correctness of GFlowNet’s sampling distribution. In particular, we demonstrate that the impact of an imbalanced edge on the model’s accuracy is influenced by the total amount of flow passing through it and, as a consequence, is unevenly distributed across the network. We also argue that, depending on the parameterization, imbalance may be inevitable. In this regard, we consider the problem of sampling from distributions over graphs with GFlowNets parameterized by graph neural networks (GNNs) and show that the representation limits of GNNs delineate which distributions these GFlowNets can approximate. Lastly, we address these limitations by proposing a theoretically sound and computationally tractable metric for assessing GFlowNets, experimentally showing it is a better proxy for correctness than popular evaluation protocols.

105Efficient Fairness-Performance Pareto Front Computation

[openreview] [pdf]

Abstract There is a well known intrinsic trade-off between the fairness of a representation and the performance of classifiers derived from the representation. Due to the complexity of optimisation algorithms in most modern representation learning approaches, for a given method it may be non-trivial to decide whether the obtained fairness-performance curve of the method is optimal, i.e., whether it is close to the true Pareto front for these quantities for the underlying data distribution.In this paper we propose a new method to compute the optimal Pareto front, which does not require the training of complex representation models. We show that optimal fair representations possess several useful structural properties, and that these properties enable a reduction of the computation of the Pareto Front to a compact discrete problem. We then also show that these compact approximating problems can be efficiently solved via off-the shelf concave-convex programming methods. Finally, in addition to representations, we show that the new methods may also be used to directly compute the Pareto front of fair classification problems.Since our approach is independent of the specific model of representations, it may be used as the benchmark to which representation learning algorithms, or classifiers, may be compared. We experimentally evaluate the approach on a number of real world benchmark datasets.

106Combating Dual Noise Effect in Spatial-temporal Forecasting via Information Bottleneck Principle

[openreview] [pdf]

Abstract Spatial-temporal forecasting plays a pivotal role in urban planning and computing. Although Spatial-Temporal Graph Neural Networks (STGNNs) excel in modeling spatial-temporal dynamics, they often suffer from relatively poor computational efficiency. Recently, Multi-Layer Perceptrons (MLPs) have gained popularity in spatial-temporal forecasting for their simplified architecture and better efficiency. However, existing MLP-based models can be susceptible to noise interference, especially when the noise can affect both input and target sequences in spatial-temporal forecasting on noisy data. To alleviate this impact, we proposeRobust Spatial-Temporal Information Bottleneck (RSTIB)principle. The RSTIB extends previous Information Bottleneck (IB) approaches by lifting the specific Markov assumption without impairing the IB nature. Then, by explicitly minimizing the irrelevant noisy information, the representation learning guided by RSTIB can be more robust against noise interference. Furthermore, the instantiation, RSTIB-MLP, can be seamlessly implemented with MLPs, thereby achieving efficient and robust spatial-temporal modeling. Moreover, a training regime is designed to handle the dynamic nature of spatial-temporal relationships by incorporating a knowledge distillation module to alleviate feature collapse and enhance model robustness under noisy conditions. Our extensive experimental results on six intrinsically noisy benchmark datasets from various domains show that the RSTIB-MLP runs much faster than state-of-the-art STGNNs and delivers superior forecasting accuracy across noisy environments, substantiating its robustness and efficiency.

107Leveraging Knowledge Distillation to Mitigate Model Collapse

[openreview] [pdf]

Abstract Since the amount of data generated by neural networks on the Internet is growing rapidly due to widespread access to corresponding models, it is logical to inquire about the impact of this surge in synthetic data on the training of subsequent models that will utilize it during training. Previous work has demonstrated a concerning trend: models trained predominantly on synthetic data often experience a decline in performance, which can escalate to a complete loss of the ability to reproduce the initial distribution of real-world data. This phenomenon, now referred to as model collapse, highlights the potential pitfalls of over-reliance on synthetic datasets, which may lack the diversity and complexity inherent in genuine data. To address this issue, we propose a novel method that leverages the well-established technique of knowledge distillation. Our approach aims to mitigate the adverse effects of synthetic data by facilitating a more effective transfer of knowledge from high-performing teacher models to student model. By doing so, we seek to enhance not only the qualitative aspects—such as the richness and variability of the generated outputs—but also the quantitative metrics that gauge model performance. Through extensive experimentation, we demonstrate that our method improves the robustness and generalization capabilities of models trained on synthetic data, for instance, for DDPM enhancement is 68.8%, in terms of the FID metric, contributing to a more sustainable and effective use of synthetic datasets in machine learning applications.

108Robust Root Cause Diagnosis using In-Distribution Interventions

[openreview] [pdf]

Abstract Diagnosing the root cause of an anomaly in a complex interconnected system is a pressing problem in today’s cloud services and industrial operations. Effective root cause diagnosis calls for identifying nodes whose disrupted local mechanismscauseanomalous behavior at a target node. We propose IDI, a novel algorithm that predicts root cause as nodes that meet two criteria: 1)Anomaly:root cause nodes should take on anomalous values; 2)Fix:had the root cause nodes assumed usual values, the target node would not have been anomalous. Prior methods of assessing the fix condition rely on counterfactuals inferred from a Structural Causal Model (SCM) trained on historical data. But since anomalies are rare and fall outside the training distribution, the fitted SCMs yield unreliable counterfactual estimates. IDI overcomes this by relying on interventional estimates obtained by solely probing the fitted SCM at in-distribution inputs. Our theoretical analysis demonstrates that IDI’s in-distribution intervention approach outperforms other counterfactual estimation methods under mild assumptions about the data-generating process. Experiments on both synthetic and Petshop RCD benchmark datasets demonstrate that IDI consistently identifies true root causes more accurately and robustly than nine existing state-of-the-art RCD baselines. We release the anonymized code athttps://anonymous.4open.science/r/petshop-BB8A/.

109The Inductive Bias of Minimum-Norm Shallow Diffusion Models That Perfectly Fit the Data

[openreview] [pdf]

Abstract While diffusion models can generate high-quality images through the probability flow process, the theoretical understanding of this process is incomplete. A key open question is determining when the probability flow converges to the training samples used for denoiser training and when it converges to more general points on the data manifold. To address this, we analyze the probability flow of shallow ReLU neural network denoisers which interpolate the training data and have a minimal 2\ell^2 norm of the weights. For intuition, we also examine a simpler dynamics which we call the score flow, and demonstrate that, in the case of orthogonal datasets, the score flow and probability flow follow similar trajectories. Both flows converge to a training point or a sum of training points. However, due to early stopping induced by the scheduler, the probability flow can also converge to a general point on the data manifold. This result aligns with empirical observations that diffusion models tend to memorize individual training examples and reproduce them during testing. Moreover, diffusion models can combine memorized foreground and background objects, indicating they can learn a “semantic sum” of training points. We generalize these results from the orthogonal dataset case to scenarios where the clean data points lie on an obtuse simplex. Simulations further confirm that the probability flow converges to one of the following: a training point, a sum of training points, or a point on the data manifold.

110Mitigating Distribution Shifts: Uncertainty-Aware Offline-to-Online Reinforcement Learning

[openreview] [pdf]

Abstract Deploying reinforcement learning (RL) policies in real-world scenarios faces challenges due to distribution shifts from training environments. Past approaches have shown limitations such as poor generalization to out-of-distribution (OOD) variations or requiring extensive retraining on new data. We propose Uncertainty-aware Adaptive RL, UARL, a novel RL pipeline that enhances policy generalization across diverse variations of a given environment. UARL frames distribution shifts as OOD problems and incorporates a new OOD detection method to quantify uncertainty. This approach enables iterative policy fine-tuning, starting with offline training on a limited state space and progressively expanding to more diverse variations of the same environment through online interactions. We demonstrate the effectiveness and robustness of UARL through extensive experiments on continuous control tasks, showing improved performance and sample efficiency as well as reliability in OOD detection compared to existing methods.

111Diffusion Bridge Implicit Models

[openreview] [pdf]

Abstract Denoising diffusion bridge models (DDBMs) are a powerful variant of diffusion models for interpolating between two arbitrary paired distributions given as endpoints. Despite their promising performance in tasks like image translation, DDBMs require a computationally intensive sampling process that involves the simulation of a (stochastic) differential equation through hundreds of network evaluations. In this work, we take the first step in fast sampling of DDBMs without extra training, motivated by the well-established recipes in diffusion models. We generalize DDBMs via a class of non-Markovian diffusion bridges defined on the discretized timesteps concerning sampling, which share the same marginal distributions and training objectives, and give rise to generative processes ranging from stochastic to deterministic, resulting in diffusion bridge implicit models (DBIMs). DBIMs are not only up to 25×\times faster than the vanilla sampler of DDBMs but also induce a novel, simple, and insightful form of ordinary differential equation (ODE) which inspires high-order numerical solvers. Moreover, DBIMs maintain the generation diversity in a distinguished way, by using a booting noise in the initial sampling step, which enables faithful encoding, reconstruction, and semantic interpolation in image translation tasks.

112Scaling Diffusion Models for Downstream Prediction

[openreview] [pdf]

Abstract In this paper, we argue that iterative computation, as exemplified by diffusion models, offers a powerful paradigm for not only image generation but also for visual perception tasks. First, we unify few of the mid-level vision tasks as image to image translations tasks ranging from depth estimation to optical flow to segmentation. Then, through extensive experiments across these tasks, we demonstrate how diffusion models scale with increased compute during both training and inference. Notably, we train various dense and Mixture of Expert models up to 2.8 billion parameters, and we utilize increased sampling steps, use various ensembling methods to increase compute at test time. Our work provides compelling evidence for the benefits of scaling compute at train and test time for diffusion models for visual perception, and by studying the scaling properties carefully, we were able to archive same performance of the state-of-the-art models with less compute.

113On the onset of memorization to generalization transition in diffusion models

[openreview] [pdf]

Abstract As the training set size increases, diffusion models have been observed to transition from memorizing the training dataset to generalizing to and sampling from the underlying data distribution. To study this phenomenon more closely, here, we first present a mathematically principled definition of this transition: the model is said to be in the generalization regime if the generated distribution is closer to the sampling distribution compared to the probability distribution associated with a Gaussian kernel approximation to the training dataset. Then, we develop an analytically tractable diffusion model that features this transition when the training data is sampled from an isotropic Gaussian distribution. Our study reveals that this transition occurs when the distance between the generated and underlying sampling distribution begins to decrease rapidly with the addition of more training samples. This is to be contrasted with an alternative scenario, where the model’s memorization performance degrades, but generalization performance doesn’t improve. We also provide empirical evidence indicating that realistic diffusion models exhibit the same alignment of scales.

114Pareto Prompt Optimization

[openreview] [pdf]

Abstract Natural language prompt optimization, or prompt engineering, has emerged as a powerful technique to unlock the potential of Large Language Models (LLMs) for various tasks. While existing methods primarily focus on maximizing a single task-specific performance metric for LLM outputs, real-world applications often require considering trade-offs between multiple objectives. In this work, we address this limitation by proposing an effective technique for multi-objective prompt optimization for LLMs. Specifically, we proposeParetoPrompt, a reinforcement learning~(RL) method that leverages dominance relationships between prompts to derive a policy model for prompts optimization using preference-based loss functions. By leveraging multi-objective dominance relationships, ParetoPrompt enables efficient exploration of the entire Pareto front without the need for a predefined scalarization of multiple objectives. Our experimental results show that ParetoPrompt consistently outperforms existing algorithms that use specific objective values. ParetoPrompt also yields robust performances when the objective metrics differ between training and testing.

115Decouple-Then-Merge: Towards Better Training for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a De\textbf{De}couple-then-Me\textbf{Me}rge (DeMe\textbf{DeMe}) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in the parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements upon 6 benchmarks including Stable Diffusion on COCO30K, ImageNet1K, PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10. Code is included in the supplementary material and will be released on Github.

116Tra-MoE: Scaling Trajectory Prediction Models for Adaptive Policy Conditioning

[openreview] [pdf]

Abstract Scale is a primary factor that influences the performance and generalization of a robot learning system. In this paper, we aim to scale up the trajectory prediction model by using broad out-of-domain data to improve its robustness and generalization ability. Trajectory model is designed to predict any-point trajectories in the current frame given an instruction and can provide detailed control guidance for robotic policy learning. To handle the diverse out-of-domain data distribution, we propose a sparsely-gated MoE (\textbf{Top-1} gating strategy) architecture for trajectory model, coined as \textbf{Tra-MoE}. The sparse activation design enables good balance between parameter cooperation and specialization, effectively benefiting from large-scale out-of-domain data while maintaining constant FLOPs per token. In addition, we further introduce an adaptive policy conditioning technique by learning 2D mask representations for predicted trajectories, which is explicitly aligned with image observations to guide policy prediction more flexibly. We perform experiments on both simulation and real-world scenarios to verify the effectiveness of our Tra-MoE and adaptive policy conditioning technique. We jointly train the Tra-MoE model on all 130 tasks in the LIBERO benchmark and conduct a comprehensive empirical analysis, demonstrating that our Tra-MoE consistently exhibits superior performance compared to the dense baseline model, even when the latter is scaled to match Tra-MoE’s parameter count.

117AutoLoRA: AutoGuidance Meets Low-Rank Adaptation for Diffusion Models

[openreview] [pdf]

Abstract Low-rank adaptation (LoRA) is a fine-tuning technique that can be applied to conditional generative diffusion models. LoRA utilizes a small number of context examples to adapt the model to a specific domain, character, style, or concept. However, due to the limited data utilized during training, the fine-tuned model performance is often characterized by strong context bias and a low degree of variability in the generated images. To solve this issue, we introduce AutoLoRA, a novel guidance technique for diffusion models fine-tuned with the LoRA approach. Inspired by other guidance techniques, AutoLoRA searches for a trade-off between consistency in the domain represented by LoRA weights and sample diversity from the base conditional diffusion model. Moreover, we show that incorporating classifier-free guidance for both LoRA fine-tuned and base models leads to generating samples with higher diversity and better quality. The experimental results for several fine-tuned LoRA domains show superiority over existing guidance techniques on selected metrics.

118Amortized Posterior Sampling with Diffusion Prior Distillation

[openreview] [pdf]

Abstract We propose Amortized Posterior Sampling (APS), a novel variational inference approach for efficient posterior sampling in inverse problems. Our method trains a conditional flow model to minimize the divergence between the variational distribution and the posterior distribution implicitly defined by the diffusion model. This results in a powerful, amortized sampler capable of generating diverse posterior samples with a single neural function evaluation, generalizing across various measurements. Unlike existing methods, our approach is unsupervised, requires no paired training data, and is applicable to both Euclidean and non-Euclidean domains. We demonstrate its effectiveness on a range of tasks, including image restoration, manifold signal reconstruction, and climate data imputation. APS significantly outperforms existing approaches in computational efficiency while maintaining competitive reconstruction quality, enabling real-time, high-quality solutions to inverse problems across diverse domains.

119GUIDE: Guidance-based Incremental Learning with Diffusion Models

[openreview] [pdf]

Abstract We introduce GUIDE, a novel continual learning approach that directs diffusion models to rehearse samples at risk of being forgotten. Existing generative strategies combat catastrophic forgetting by randomly sampling rehearsal examples from a generative model. Such an approach contradicts buffer-based approaches where sampling strategy plays an important role. We propose to bridge this gap by incorporating classifier guidance into the diffusion process to produce rehearsal examples specifically targeting information forgotten by a continuously trained model. This approach enables the generation of samples from preceding task distributions, which are more likely to be misclassified in the context of recently encountered classes. Our experimental results show that GUIDE significantly reduces catastrophic forgetting, outperforming conventional random sampling approaches and surpassing recent state-of-the-art methods in continual learning with generative replay.

120Learning-Augmented Frequent Directions

[openreview] [pdf]

Abstract An influential paper of Hsu et al. (ICLR’19) introduced the study of learning-augmented streaming algorithms in the context of frequency estimation. A fundamental problem in the streaming literature, the goal of frequency estimation is to approximate the number of occurrences of items appearing in a long stream of data using only a small amount of memory. Hsu et al. develop a natural framework to combine the worst-case guarantees of popular solutions such as CountMin and CountSketch with learned predictions of high frequency elements. They demonstrate that learning the underlying structure of data can be used to yield better streaming algorithms, both in theory and practice.We simplify and generalize past work on learning-augmented frequency estimation. Our first contribution is a learning-augmented variant of the Misra-Gries algorithm which improves upon the error of learned CountMin and learned CountSketch and achieves the state-of-the-art performance of randomized algorithms (Aamand et al., NeurIPS’23) with a simpler, deterministic algorithm. Our second contribution is to adapt learning-augmentation to a high-dimensional generalization of frequency estimation corresponding to finding important directions (top singular vectors) of a matrix given its rows one-by-one in a stream. We analyze a learning-augmented variant of the Frequent Directions algorithm, extending the theoretical and empirical understanding of learned predictions to matrix streaming.

121GUARANTEED USER FAIRNESS IN RECOMMENDATION

[openreview] [pdf]

Abstract Although recommender systems (RS) have been well-developed for various fields of applications, they suffer from the crisis of platform credibility with respect to RS confidence and fairness, which may drive users away from the platform and result in the failure of the platform’s long-term success. In recent years, a few works have tried to solve either the model confidence or fairness issue, while there is no statistical guarantee for these methods. It is therefore an urgent need to solve both issues with a unifying framework with statistical guarantee. In this paper, we propose a novel and reliable framework called Guaranteed User Fairness in Recommendation (GUFR) to dynamically generate prediction sets for users across various groups, which are guaranteed 1) to include the ground-truth items with user-predefined high confidence/probability (e.g., 90%); 2) to ensure user fairness across different groups; 3) to have the minimum average set size. We further design an efficient algorithm named Guaranteed User Fairness Algorithm (GUFA) to optimize the proposed method, and upper bounds of the risk and fairness metric are derived to help speed up the optimization process. Moreover, we provide rigorous theoretical analysis with respect to risk and fairness control as well as the minimum set size. Extensive experiments also validate the effectiveness of the proposed framework, which aligns with our theoretical analysis. The code is publicly available athttps://anonymous.4open.science/r/GUFR-76EC.

122SafeDiffuser: Safe Planning with Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Diffusion models have shown promise in data-driven planning. While these planners are commonly employed in applications where decisions are critical, they still lack established safety guarantees. In this paper, we address this limitation by introducing SafeDiffuser, a method to equip diffusion models with safety guarantees via control barrier functions. The key idea of our approach is to embed finite-time diffusion invariance, i.e., a form of specification consisting of safety constraints, into the denoising diffusion procedure. This way we enable data generation under safety constraints. We show that SafeDiffusers maintain the generative performance of diffusion models while also providing robustness in safe data generation. We evaluate our method on a series of tasks, including maze path generation, legged robot locomotion, and 3D space manipulation, and demonstrate the advantages of robustness over vanilla diffusion models.

123Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding

[openreview] [pdf]

Abstract Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences. However, rather than merely generating designs that are natural, we often aim to optimize downstream reward functions while preserving the naturalness of these design spaces. Existing methods for achieving this goal often require differentiable proxy models (e.g., classifier guidance or DPS) or involve computationally expensive fine-tuning of diffusion models (e.g., classifier-free guidance, RL-based fine-tuning). In our work, we propose a new method to address these challenges. Our algorithm is an iterative sampling method that integrates soft value functions, which looks ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models. Notably, our approach avoids fine-tuning generative models and eliminates the need to construct differentiable models. This enables us to (1) directly utilize non-differentiable features/reward feedback, commonly used in many scientific domains, and (2) apply our method to recent discrete diffusion models in a principled way. Finally, we demonstrate the effectiveness of our algorithm across several domains, including image generation, molecule generation, and DNA/RNA sequence generation.

124Prompt-Agnostic Erasure for Diffusion Models Using Task Vectors

[openreview] [pdf]

Abstract With the rapid growth of text-to-image models, a variety of techniques have been suggested to prevent undesirable image generations. Yet, these methods often only protect against specific user prompts and have been shown to allow undesirable generations with other inputs. Here we focus on \textit{unconditionally} erasing a concept from a text-to-image model rather than conditioning the erasure on the user’s prompt. We first show that compared to input-dependent erasure methods, concept erasure that uses Task Vectors (TV) is more robust to unexpected user inputs, not seen during training. However, TV-based erasure can also affect the core performance of the edited model, particularly when the required edit strength is unknown. To this end, we propose a method called \textit{Diverse Inversion}, which we use to estimate the required strength of the TV edit. Diverse Inversion finds within the model input space a large set of word embeddings, each of which induces the generation of the target concept. We find that encouraging diversity in the set makes our estimation more robust to unexpected prompts. Finally, we show that Diverse Inversion enables us to apply a TV edit only to a subset of the model weights, enhancing the erasure capabilities while better maintaining model utility.

125SePPO: Semi-Policy Preference Optimization for Diffusion Alignment

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) methods are emerging as a way to fine-tune diffusion models (DMs) for visual generation. However, commonly used on-policy strategies are limited by the generalization capability of the reward model, while off-policy approaches require large amounts of difficult-to-obtain paired human-annotated data, particularly in visual generation tasks. To address the limitations of both on- and off-policy RLHF, we propose a preference optimization method that aligns DMs with preferences without relying on reward models or paired human-annotated data. Specifically, we introduce a Semi-Policy Preference Optimization (SePPO) method. SePPO leverages previous checkpoints as reference models while using them to generate on-policy reference samples, which replace “losing images” in preference pairs. This approach allows us to optimize using only off-policy “winning images”. Furthermore, we design a strategy for reference model selection that expands the exploration in the policy space. Notably, we do not simply treat reference samples as negative examples for learning. Instead, we design an anchor-based criterion to assess whether the reference samples are likely to be winning or losing images, allowing the model to selectively learn from the generated reference samples. This approach mitigates performance degradation caused by the uncertainty in reference sample quality. We validate SePPO across both text-to-image and text-to-video benchmarks. SePPO surpasses all previous approaches on the text-to-image benchmarks and also demonstrates outstanding performance on the text-to-video benchmarks.

126Optimizing Backward Policies in GFlowNets via Trajectory Likelihood Maximization

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are a family of generative models that learn to sample objects with probabilities proportional to a given reward function. The key concept behind GFlowNets is the use of two stochastic policies: a forward policy, which incrementally constructs compositional objects, and a backward policy, which sequentially deconstructs them. Recent results show a close relationship between GFlowNet training and entropy-regularized reinforcement learning (RL) problems with a particular reward design. However, this connection applies only in the setting of a fixed backward policy, which might be a significant limitation. As a remedy to this problem, we introduce a simple backward policy optimization algorithm that involves direct maximization of the value function in an entropy-regularized Markov Decision Process (MDP) over intermediate rewards. We provide an extensive experimental evaluation of the proposed approach across various benchmarks in combination with both RL and GFlowNet algorithms and demonstrate its faster convergence and mode discovery in complex environments.

127DEALING WITH OUT OF DISTRIBUTION IN PREDICTION PROBLEM

[openreview] [pdf]

Abstract Open world assumption in model development means that a model may not have enough information to effectively handle data that is completely different or out of distribution (OOD). When a model encounters OOD data, it may suffer a significant decrease in performance. Addressing OOD data requires extensive fine-tuning and experimental trials, which in turn require substantial computational resources. Deep learning has been suggested as a solution and has shown significant improvements, but it often requires high-specification hardware, particularly GPUs, which may not always be readily available to general users. Additionally, there is a lack of clear guidance for common users on how to select and evaluate OOD data. This study delves into detection, evaluation, and prediction tasks within the context of OOD on tabular datasets. It demonstrates how common users can identify OOD data from real datasets and provides guidance on evaluating the OOD selection through experiments and visualizations. Furthermore, the study introduces tabular contrast learning (TCL), an enhanced technique specifically designed for tabular prediction tasks. TCL is more efficient compared to other baseline models, making it useful for general machine learning user with computational limitation on dealing with OOD problems. The study also includes a comprehensive comparison with existing approaches, focusing on both accuracy and computational efficiency.

128Statistical Test on Diffusion Model-based Anomaly Detection by Selective Inference

[openreview] [pdf]

Abstract Advancements in AI image generation, particularly diffusion models, have progressed rapidly. However, the absence of an established framework for quantifying the reliability of AI-generated images hinders their use in critical decision-making tasks, such as medical image diagnosis. In this study, we address the task of detecting anomalous regions in medical images using diffusion models and propose a statistical method to quantify the reliability of the detected anomalies. The core concept of our method involves a selective inference framework, wherein statistical tests are conducted under the condition that the images are produced by a diffusion model. With our approach, the statistical significance of anomaly detection results can be quantified in the form of a pp-value, enabling decision-making with controlled error rates, as is standard in medical practice. We demonstrate the theoretical soundness and practical effectiveness of our statistical test through numerical experiments on both synthetic and brain image datasets.

129Episodic Novelty Through Temporal Distance

[openreview] [pdf]

Abstract Exploration in sparse reward environments remains a significant challenge in reinforcement learning, particularly in Contextual Markov Decision Processes (CMDPs), where environments differ across episodes. Existing episodic intrinsic motivation methods for CMDPs primarily rely on count-based approaches, which are ineffective in large state spaces, or on similarity-based methods that lack appropriate metrics for state comparison. To address these shortcomings, we propose Episodic Novelty Through Temporal Distance (ETD), a novel approach that introduces temporal distance as a robust metric for state similarity and intrinsic reward computation. By employing contrastive learning, ETD accurately estimates temporal distances and derives intrinsic rewards based on the novelty of states within the current episode. Extensive experiments on various benchmark tasks demonstrate that ETD significantly outperforms state-of-the-art methods, highlighting its effectiveness in enhancing exploration in sparse reward CMDPs.

130On Inductive Biases That Enable Generalization in Diffusion Transformers

[openreview] [pdf]

Abstract Recent work studying the generalization of diffusion models with UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. However, in practice, more recent denoising networks are often based on transformers, e.g., the diffusion transformer (DiT). This raises the question: do transformer-based denoising networks exhibit inductive biases that can also be expressed via geometry-adaptive harmonic bases? To our surprise, we find that this is not the case. This discrepancy motivates our search for the inductive bias that can lead to good generalization in DiT models. Investigating a DiT’s pivotal attention modules, we find that locality of attention maps are closely associated with generalization. To verify this finding, we modify the generalization of a DiT by restricting its attention windows. We inject local attention windows to a DiT and observe an improvement in generalization. Furthermore, we empirically find that both the placement and the effective attention size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code will be released publicly upon paper publication.

131CURIOSITY IS THE PATH TO OPTIMIZATION

[openreview] [pdf]

Abstract In PAC theory, it is posited that larger hypothesis spaces necessitate more independently and identically distributed (i.i.d) data to maintain the accuracy of model performance. PAC-MDP theory defines curiosity by assigning higher rewards for visiting states that are far from the previously visited trajectory, which supports more independent and i.i.d data collection. Recently, this field has witnessed attempts to narrow the hypothesis space by developing additional mechanisms that train multiple skills and facilitate the sharing of information among them, thereby discovering commonalities. However, one might wonder: What if curiosity could not only enhance the efficiency of data collection but also significantly reduce the hypothesis space, thereby driving optimal outcomes independently without additional mechanism used in PAC-MDP? Significant discussion has been devoted to the reduction of hypothesis spaces and the utilization of curiosity. Within this context, contrastive multi-skill reinforcement learning (RL) exhibits both traits. Previous research in contrastive multi-skill RL has utilized this technique primarily as a form of pretraining, However, there has been scant investigation into whether the technique itself can reduce the hypothesis space to optimize the outcomes. We have mathematically proven that curiosity provides bounds to guarantee optimality in contrastive multi-skill reinforcement learning (RL). Additionally, we have leveraged these findings to develop an algorithm that is applicable in real-world scenarios, which has been demonstrated to surpass other prominent algorithms. Furthermore, our experiments have shown that different skills are actually reducing the hypothesis space of the policy by being hierarchically grouped.

132Latent Abstractions in Generative Diffusion Models

[openreview] [pdf]

Abstract In this work we study how diffusion-based generative models produce high-dimensional data, such as an image, by implicitly relying on a manifestation of a low-dimensional set of latent abstractions, that guide the generative process. We present a novel theoretical framework that extends Nonlinear Filtering (NLF), and that offers a unique perspective on SDE-based generative models. The development of our theory relies on NLF, including a novel formulation of the joint (state and measurement) dynamics, and an information-theoretic measure of the influence of the system state on the measurement process. According to our theory, diffusion models can be cast as a system of SDE, describing a non-linear filter in which the evolution of unobservable latent abstractions steers the dynamics of an observable measurement process (corresponding to the generative pathways). In addition, we present an empirical study to validate our theory and previous empirical results on the emergence of latent abstractions at different stages of the generative process.

133Fast Diversity-Preserving Reward Finetuning of Diffusion Models via Nabla-GFlowNets

[openreview] [pdf]

Abstract While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to finetune pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer either 1) lack of diversity in generated samples, or 2) costly finetuning and slow convergence. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as \nabla-GFlowNet), together with an objective called \nabla-DB, plus its variant residual \nabla-DB for finetuning pretrained diffusion models. These objectives leverage the rich signal in reward gradients for diversity-aware finetuning. We empirically show that our proposed residual \nabla-DB achieves fast yet diversity- & prior-preserving finetuning of StableDiffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.

134Constrained Diffusion Implicit Models

[openreview] [pdf]

Abstract This paper describes an efficient algorithm for solving noisy linear inverse problems using pretrained diffusion models. Extending the paradigm of denoising diffusion implicit models (DDIM), we propose conditional diffusion implicit models (CDIM) that modify the diffusion updates to enforce a constraint upon the final output. For noiseless inverse problems, CDIM exactly satisfies the constraints; in the noisy case, we generalize CDIM to satisfy an exact constraint on the residual distribution of the noise. Experiments across a variety of tasks and metrics show strong performance of CDIM, with analogous inference acceleration to unconditional DDIM: 10 to 50 times faster than previous conditional diffusion methods. We demonstrate the versatility of our approach on many problems including super-resolution, denoising, inpainting, deblurring, and 3D point cloud reconstruction.

135Uncertainty Prioritized Experience Replay

[openreview] [pdf]

Abstract Prioritized experience replay, which improves sample efficiency by selecting relevant transitions to update parameter estimates, is a crucial component of contemporary deep reinforcement learning models. Typically, transitions are prioritized based on their temporal difference error. However, this approach is prone to favoring noisy transitions, even when the value estimation closely approximates the target mean. This phenomenon resembles thenoisy TVproblem postulated in the exploration literature, in which exploration-guided agents get stuck by mistaking noise for novelty. To mitigate the disruptive effects of noise in value estimation, we propose using epistemic uncertainty to guide the prioritization of transitions from the replay buffer. Epistemic uncertainty quantifies the uncertainty that can be reduced by learning, hence reducing transitions sampled from the buffer generated by unpredictable random processes. We first illustrate the benefits of epistemic uncertainty prioritized replay in two tabular toy models: a simple multi-arm bandit task, and a noisy gridworld. Subsequently, we evaluate our prioritization scheme on the Atari suite, outperforming quantile regression deep Q-learning benchmarks; thus forging a path for the use of epistemic uncertainty prioritized replay in reinforcement learning agents.

136Replay can provably increase forgetting

[openreview] [pdf]

Abstract Continual learning seeks to enable machine learning systems to solve an increasing corpus of tasks sequentially. A critical challenge for continual learning is forgetting, where the performance on previously learned tasks decreases as new tasks are introduced. One of the commonly used techniques to mitigate forgetting, sample replay, has been shown empirically to reduce forgetting by retaining some examples from old tasks and including them in new training episodes. In this work, we provide a theoretical analysis of sample replay in an over-parameterized continual linear regression setting, where given enough replay samples, one would be able to eliminate forgetting. Our analysis focuses on replaying a few examples and highlights the role of the replay samples and task subspaces. Surprisingly, we find that forgetting can be non-monotonic with respect to the number of replay samples. We construct tasks where replay of a single example can increase forgetting and even distributions where replay of a randomly selected sample increases forgetting on average. We provide empirical evidence that this is a property of the tasks rather than the model used to train on them, by showing a similar behavior for a neural net equipped with SGD. Through experiments on a commonly used benchmark, we provide additional evidence that performance of the replay heavily depends on the choice of replay samples and the relationship between tasks.

137PFDiff: Training-free Acceleration of Diffusion Models through the Gradient Guidance of Past and Future

[openreview] [pdf]

Abstract Diffusion Probabilistic Models (DPMs) have shown remarkable potential in image generation, but their sampling efficiency is hindered by the need for numerous denoising steps. Most existing solutions accelerate the sampling process by proposing fast ODE solvers. However, the inevitable discretization errors of the ODE solvers are significantly magnified when the number of function evaluations (NFE) is fewer. In this work, we propose PFDiff, a novel training-free and orthogonal timestep-skipping strategy, which enables existing fast ODE solvers to operate with fewer NFE. Specifically, PFDiff initially utilizes gradient replacement from past time steps to predict a “springboard”. Subsequently, it employs this “springboard” along with foresight updates inspired by Nesterov momentum to rapidly update current intermediate states. This approach effectively reduces unnecessary NFE while correcting for discretization errors inherent in first-order ODE solvers. Experimental results demonstrate that PFDiff exhibits flexible applicability across various pre-trained DPMs, particularly excelling in conditional DPMs and surpassing previous state-of-the-art training-free methods. For instance, using DDIM as a baseline, we achieved 16.46 FID (4 NFE) compared to 138.81 FID with DDIM on ImageNet 64x64 with classifier guidance, and 13.06 FID (10 NFE) on Stable Diffusion with 7.5 guidance scale.

138Efficient Discovery of Pareto Front for Multi-Objective Reinforcement Learning

[openreview] [pdf]

Abstract Multi-objective reinforcement learning (MORL) excels at handling rapidly changing preferences in tasks that involve multiple criteria, even for unseen preferences. However, previous dominating MORL methods typically generate a fixed policy set or preference-conditioned policy through multiple training iterations exclusively for sampled preference vectors, and cannot ensure the efficient discovery of the Pareto front. Furthermore, integrating preferences into the input of policy or value functions presents scalability challenges, in particular as the dimension of the state and preference space grow, which can complicate the learning process and hinder the algorithm’s performance on more complex tasks. To address these issues, we propose a two-stage Pareto front discovery algorithm called Constrained MORL (C-MORL), which serves as a seamless bridge between constrained policy optimization and MORL. Concretely, a set of policies is trained in parallel in the initialization stage, with each optimized towards its individual preference over the multiple objectives. Then, to fill the remaining vacancies in the Pareto front, the constrained optimization steps are employed to maximize one objective while constraining the other objectives to exceed a predefined threshold. Empirically, compared to recent advancements in MORL methods, our algorithm achieves more consistent and superior performances in terms of hypervolume, expected utility, and sparsity on both discrete and continuous control tasks, especially with numerous objectives (up to nine objectives in our experiments).

139A Study of Posterior Stability for Time-Series Latent Diffusion

[openreview] [pdf]

Abstract Latent diffusion has demonstrated promising results in image generation and permits efficient sampling. However, this framework might suffer from the problem of posterior collapse when applied to time series. In this paper, we first show that posterior collapse will reduce latent diffusion to a variational autoencoder (VAE), making it less expressive. This highlights the importance of addressing this issue. We then introduce a principled method: dependency measure, that quantifies the sensitivity of a recurrent decoder to input variables. Using this tool, we confirm that posterior collapse significantly affects time-series latent diffusion on real datasets, and a phenomenon termed dependency illusion is also discovered in the case of shuffled time series. Finally, building on our theoretical and empirical studies, we introduce a new framework that extends latent diffusion and has a stable posterior. Extensive experiments on multiple real time-series datasets show that our new framework is free from posterior collapse and significantly outperforms previous baselines in time series synthesis.

140Diffusion-PINN Sampler

[openreview] [pdf]

Abstract Recent success of diffusion models has inspired a surge of interest in developing sampling techniques using reverse diffusion processes. However, accurately estimating the drift term in the reverse stochastic differential equation (SDE) solely from the unnormalized target density poses significant challenges, hindering existing methods from achieving state-of-the-art performance. In this paper, we introduce the Diffusion-PINN Sampler (DPS), a novel diffusion-based sampling algorithm that estimates the drift term by solving the governing partial differential equation of the log-density of the underlying SDE marginals via physics-informed neural networks (PINN). We prove that the error of log-density approximation can be controlled by the PINN residual loss, enabling us to establish convergence guarantees of DPS. Experiments on a variety of sampling tasks demonstrate the effectiveness of our approach, particularly in accurately identifying mixing proportions when the target contains isolated components.

141LLM Pruning and Distillation in Practice

[openreview] [pdf]

Abstract Structured pruning with knowledge distillation is a potent combination for obtaining small language models (SLMs) with significantly fewer training tokens and compute resources compared to training from scratch. In this work, we investigate how this strategy can be effectively applied in instances where access to the the original pretraining dataset is restricted. We introduce a newteacher correctionphase before distillation which lets the teacher model adjust to our specific data distribution using a lightweight fine-tuning phase. We apply this strategy to compress the Mistral NeMo 12B and Llama 3.1 8B models to 8B and 4B parameters, respectively, using pruning and distillation. We explore two distinct pruning strategies: (1) depth pruning and (2) joint hidden/attention/MLP (width) pruning, and evaluate the results on common benchmarks from the LM Evaluation Harness. The models are then aligned with NeMo Aligner and further tested for instruction following, role-play, math, coding and function calling capabilities. This approach produces the state-of-the-art Mistral-NeMo-Compressed-8B (\MNMinitron for brevity) model from Mistral NeMo 12B, and a compelling 4B model from Llama 3.1 8B.

142Constant Rate Schedule: Constant-Rate Distributional Change for Efficient Training and Sampling in Diffusion Models

[openreview] [pdf]

Abstract We propose a noise schedule that ensures a constant rate of change in the probability distribution of diffused data throughout the diffusion process. To obtain this noise schedule, we measure the rate of change in the probability distribution of the forward process and use it to determine the noise schedule before training diffusion models. The functional form of the noise schedule is automatically determined and tailored to each dataset and type of diffusion model. We evaluate the effectiveness of our noise schedule on unconditional and class-conditional image generation tasks using the LSUN (bedroom/church/cat/horse), ImageNet, and FFHQ datasets. Through extensive experiments, we confirmed that our noise schedule broadly improves the performance of the diffusion models regardless of the dataset, sampler, number of function evaluations, or type of diffusion model.

143Agential AI for integrated continual learning, deliberative behavior, and comprehensible models

[openreview] [pdf]

Abstract Contemporary machine learning paradigm excels in statistical data analysis, solving problems that classical AI couldn’t. However, it faces key limitations, such as lack of integration with planning, incomprehensible internal structures, and inability to learn continually without erasing prior knowledge. We present initial design for an AI system, Agential AI (AAI), in principle operating independently or on top of statistical methods, that overcomes all these issues. AAI’s core is a learning method that models temporal dynamics with guarantees of completeness, minimality, and continual learning. It integrates this with a behavior algorithm that plans on a learned model and encapsulate high-level behavior patterns. Preliminary experiments on a simple abstract environment show AAI’s effectiveness and future potential.

144Moonwalk: Inverse-Forward Differentiation

[openreview] [pdf]

Abstract Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible and right-invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible and right-invertible networks in computation time comparable to backpropagation and using significantly less memory.

145Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning

[openreview] [pdf]

Abstract In multi-domain learning, a single model is trained on diverse data domains to leverage shared knowledge and improve generalization. The order in which the data from these domains is used for training can significantly affect the model’s performance on each domain. However, this dependence is under-studied. In this paper, we investigate the influence of training order (or data mixing) in multi-domain learning using the concept of Lie bracket of gradient vector fields. By analyzing the infinitesimal effects of changing the training order, we identify regions in the parameter space where altering the order between two training domains can benefit the target loss. We validate the predictions of our theoretical framework on the influence of training order (or data mixing) both on a toy example and bilingual LLM pre-training.

146Last-Iterate Convergence of Smooth Regret Matching+Variants in Learning Nash Equilibria

[openreview] [pdf]

Abstract Regret Matching+^+ (RM+^+) variants have been widely developed to superhuman Poker AIs, yet few studies investigate their last-iterate convergence. Their last-iterate convergence has been demonstrated only for games with strong monotonicity or two-player zero-sum matrix games. A primary obstacle in proving the last-iterate convergence for these algorithms is that their feedback is not the loss gradient of the vanilla games. This deviation results in the absence of crucial properties, \eg, monotonicity or the weak Minty variation inequality (MVI), which are pivotal for establishing the last-iterate convergence. To address the absence of these properties, we propose a remarkably succinct yet novel proof paradigm that consists of: (i) recovering these key properties through the equivalence between RM+^+ and Online Mirror Descent (OMD), and (ii) measuring the the distance to Nash equilibrium (NE) via the tangent residual to show this distance is related to the distance between accumulated regrets. To show the practical applicability of our proof paradigm, we use it to prove the last-iterate convergence of two existing smooth RM+^+ variants, Smooth Extra-gradient RM+^+ (SExRM+^+) and Smooth Predictive RM+^+ (SPRM+^+). We show that they achieve last-iterate convergence in learning an NE of games satisfying monotonicity, a weaker condition than the one used in existing proofs for both variants. Then, inspired by our proof paradigm, we propose Smooth Optimistic Gradient RM+^+ (SOGRM+^+). We show that SOGRM+^+ achieves last-iterate convergence in learning an NE of games satisfying the weak MVI, the weakest condition in all known proofs for RM+^+ variants. The experimental results show that SOGRM+^+ significantly outperforms other algorithms.

147Guided Reinforcement Learning with Roll-Back

[openreview] [pdf]

Abstract Reinforcement learning-based solutions are increasingly being considered as strong alternatives to classical system controllers, despite their significant sample inefficiency when learning controller tasks from scratch. Many methods that address this issue use prior task knowledge to guide the agent’s learning, with several recent algorithms providing a guide policy that is sometimes chosen to execute actions instead of the learner policy. While this approach lends excellent flexibility as it allows the guide knowledge to be provided in any format, it can be challenging to decide when and for how long to use the guide agent. Current guide policy-based approaches typically choose a static guide sampling rate empirically, and do not vary it. Approaches that transfer control use simple methods like linear decay, or require hyperparameter choices that strongly impact the performance. We show that under certain assumptions, the sampling rate of the guide policy can be calculated to guarantee that the mean return of the learning policy will surpass a user-defined performance degradation threshold. To the best of our knowledge, this is the first time a performance guarantee has been established for a guided RL method. We then implement a guided RL (GRL) algorithm that can make use of this sample rate, and additionally introduce a roll-back feature in guided RL with roll-back (GRL-RB) to adaptively balance the trade-off between performance degradation and rapid transfer of control to the learner. Our approach is simple to implement on top of existing algorithms, robust to hyperparameter choices, and effective in warm-starting online learning.

148On Rollouts in Model-Based Reinforcement Learning

[openreview] [pdf]

Abstract Model-based reinforcement learning (MBRL) seeks to enhance data efficiency by learning a model of the environment and generating synthetic rollouts from it. However, accumulated model errors during these rollouts can distort the data distribution, negatively impacting policy learning and hindering long-term planning. Thus, the accumulation of model errors is a key bottleneck in current MBRL methods. We propose Infoprop, a model-based rollout mechanism that separates aleatoric from epistemic model uncertainty and reduces the influence of the latter on the data distribution. Further, Infoprop keeps track of accumulated model errors along a model rollout and provides termination criteria to limit data corruption. We demonstrate the capabilities of Infoprop in the Infoprop-Dyna algorithm, reporting state-of-the-art performance in Dyna-style MBRL on common MuJoCo benchmark tasks while substantially increasing rollout length and data quality.

149Multi-Student Diffusion Distillation for Better One-Step Generators

[openreview] [pdf]

Abstract Diffusion models achieve high-quality sample generation at the cost of a lengthy multistep inference procedure. To overcome this, diffusion distillation techniques produce student generators capable of matching or surpassing the teacher in a single step. However, the student model’s inference speed is limited by the size of the teacher architecture, preventing real-time generation for computationally heavy applications. In this work, we introduce Multi-Student Distillation (MSD), a framework to distill a conditional teacher diffusion model into multiple single-step generators. Each student generator is responsible for a subset of possible conditioning data, thereby obtaining higher generation quality for the same capacity. MSD trains multiple distilled students allowing smaller sizes and, therefore, faster inference. Also, MSD offers a lightweight quality boost over single-student distillation with the same architecture. We demonstrate MSD is effective by training multiple same-sized or smaller students on single-step distillation using distribution matching and adversarial distillation techniques. With smaller students, MSD obtains competitive results with a faster inference time for single-step generation. Using same-sized students, MSD with 4 students sets new state-of-the-art results for one-step image generation: FID 1.20 on ImageNet-64×64 and 8.20 on zero-shot COCO2014.

150The Superposition of Diffusion Models

[openreview] [pdf]

Abstract The undeniable success of deep generative models for learning complex and high-dimensional data distributions has led to the proliferation of large-scale diffusion models across the entire machine-learning application spectrum. This Cambrian explosion of easily accessible pre-trained models, including fine-tuned open-source models on user-specific data, suggests a demand for methods that combine multiple different pre-trained models without incurring the significant computational burden of re-training a larger combined model. In this paper, we cast the problem of combining multiple pre-trained diffusion models at the generation stage under a novel proposed framework termed superposition. Theoretically, we derive superposition from rigorous first principles stemming from the celebrated continuity equation and design two novel algorithms tailor-made for combining diffusion models in SuperDiff. We demonstrate that SuperDiff is scalable to large pre-trained diffusion models as superposition is performedsolely through composition during inference, and also enjoys painless implementation as it combines different pre-trained vector fields through an automated re-weighting scheme. Notably, we show that SuperDiffis efficient during inference time, and mimics traditional composition operators such as the logical OR\texttt{OR} and the logical AND\texttt{AND}. We empirically demonstrate the utility of using SuperDiff for generating more diverse images on CIFAR-10, more faithful prompt conditioned image editing using Stable Diffusion, and improved unconditionalde novostructure design of proteins.

151Expected Return Symmetries

[openreview] [pdf]

Abstract Symmetry is an important inductive bias that can improve model robustness and generalization across many deep learning domains. In multi-agent settings, a priori known symmetries have been shown to address a fundamental coordination failure mode known as mutually incompatible symmetry breaking; e.g. in a game where two independent agents can choose to move “left” or “right”, and where a reward of +1 or -1 is received when the agents choose the same action or different actions, respectively. However, the efficient and automatic discovery of environment symmetries, in particular for decentralized partially observable Markov decision processes, remains an open problem. Furthermore, environmental symmetry breaking constitutes only one type of coordination failure, which motivates the search for a more accessible and broader symmetry class. In this paper, we introduce such a broader group of previously unexplored symmetries, which we call expected return symmetries, which contains environment symmetries as a subgroup. We show that agents trained to be compatible under the group of expected return symmetries achieve better zero-shot coordination results than those using environment symmetries. As an additional benefit, our method makes minimal a priori assumptions about the structure of their environment and does not require access to ground truth symmetries.

152Avoiding mode collapse in diffusion models fine-tuned with reinforcement learning

[openreview] [pdf]

Abstract Fine-tuning foundation models via reinforcement learning (RL) has proven promising for aligning to downstream objectives. In the case of diffusion models (DMs), though RL training improves alignment from early timesteps, critical issues such as training instability and mode collapse arise. We address these drawbacks by exploiting the hierarchical nature of DMs: we train them dynamically at each epoch with a tailored RL method, allowing for continual evaluation and step-by-step refinement of the model performance (or alignment). Furthermore, we find that not every denoising step needs to be fine-tuned to align DMs to downstream tasks. Consequently, in addition to clipping, we regularise model parameters at distinct learning phases via a sliding-window approach. Our approach, termed Hierarchical Reward Fine-tuning (HRF), is validated on the Denoising Diffusion Policy Optimisation method, where we show that models trained with HRF achieve better preservation of diversity in downstream tasks, thus enhancing the fine-tuning robustness and at uncompromising mean rewards.

153Latent Diffusion Planning for Imitation Learning

[openreview] [pdf]

Abstract Recent progress in robotic imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods rely on supervised learning of actions from expert demonstrations, which can be challenging to scale. We propose Latent Diffusion Planning, which forecasts future states as well as actions via diffusion. This objective can scalably leverage heterogeneous data sources and provides a denser supervision signal for learning. To plan over images, we learn a compact latent space through a variational autoencoder. We then train a planner to forecast future latent states, and an inverse dynamics model to extract actions from the plans. As planning is separated from action prediction, LDP can leverage suboptimal or action-free data to improve performance in low demonstration regimes. On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches as they cannot leverage such additional data.

154DuRND: Rewarding from Novelty to Contribution for Reinforcement Learning via Dual Random Networks Distillation

[openreview] [pdf]

Abstract Existing reward shaping techniques for sparse-reward tasks in reinforcement learning generally fall into two categories: novelty-based exploration bonuses and value-based rewards. The former encourages agents to explore less visited areas but can divert them from their main objectives, while the latter promotes stable late-stage convergence but often lacks sufficient early exploration. To combine the benefits of both, we propose Dual Random Networks Distillation (DuRND), a novel framework integrating two lightweight random network modules. These modules jointly generate two rewards: a novelty reward to drive exploration and a contribution reward to evaluate progress toward desired behaviors, achieving an efficient balance between exploration and exploitation. With low computational overhead, DuRND excels in high-dimensional environments like Atari, VizDoom, and MiniWorld, outperforming several benchmarks.

155Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?

[openreview] [pdf]

Abstract Predictable behavior from scaling advanced AI systems is an extremely desirable property for engineers, companies, economists and governments alike, and while a well-established literature exists on how pretraining performance scales, predictable scaling behavior on downstream capabilities remains elusive. While many factors are certainly responsible, this paper shines a light on a significant factor that makes predicting scaling behavior on widely used multiple-choice question answering benchmarks challenging and illuminates a path towards making such downstream evaluations predictable with scale. Using five model families and twelve well-established multiple-choice benchmarks, we show that downstream performance is computed from negative log likelihoods via a sequence of transformations that progressively degrades the statistical relationship between performance and scale. We then reveal the mechanism causing this degradation: downstream metrics require comparing the correct choice against a small number of specific incorrect choices, meaning accurately predicting downstream capabilities requires predicting not just how probability mass concentrates on the correct choice with scale, but also how probability mass fluctuates on specific incorrect choices with scale. We empirically study how probability mass on the correct choice co-varies with probability mass on incorrect choices with increasing compute, suggesting that scaling laws for \textit{incorrect} choices might be achievable. Our work also explains why pretraining scaling laws are commonly regarded as more predictable than downstream capabilities and contributes towards establishing scaling-predictable evaluations of frontier AI models.

156State Combinatorial Generalization In Decision Making With Conditional Diffusion Models

[openreview] [pdf]

Abstract Many real-world decision-making problems are combinatorial in nature, where states (e.g., surrounding traffic of a self-driving car) can be seen as a combination of basic elements (e.g., pedestrians, trees, and other cars). Due to combinatorial complexity, observing all combinations of basic elements in the training set is infeasible, which leads to an essential yet understudied problem of zero-shot generalization to states that are unseen combinations of previously seen elements.\textit{zero-shot generalization to states that are unseen combinations of previously seen elements.} In this work, we first formalize this problem and then demonstrate how existing value-based reinforcement learning (RL) algorithms struggle due to unreliable value predictions in unseen states. We argue that this problem cannot be addressed with exploration alone, but requires more expressive and generalizable models. We demonstrate that behavior cloning with a conditioned diffusion model trained on expert trajectory generalizes better to states formed by new combinations of seen elements than traditional RL methods. Through experiments in maze, driving, and multiagent environments, we show that conditioned diffusion models outperform traditional RL techniques and highlight the broad applicability of our problem formulation.

157Scaling Concept With Text-Guided Diffusion Models

[openreview] [pdf]

Abstract Text-guided diffusion models have revolutionized generative tasks by producing high-fidelity content based on text descriptions. Additionally, they have enabled an editing paradigm where concepts can be replaced through text conditioning. In this work, we explore a novel paradigm: instead of replacing a concept, can we scale it? We conduct an empirical study to investigate concept decomposition trends in text-guided diffusion models. Leveraging these insights, we propose a simple yet effective method, ScalingConcept, designed to enhance or suppress existing concepts in real input without introducing new ones. To systematically evaluate our method, we introduce the WeakConcept-10 dataset. More importantly, ScalingConcept enables a range of novel zero-shot applications across both image and audio domains, including but not limited to canonical pose generation and generative sound highlighting/removal.

158Distributional Sobolev reinforcement learning

[openreview] [pdf]

Abstract Distributional reinforcement learning (DRL) is a framework for learning a complete distribution over returns, rather than merely estimating expectations. In this paper, we further expand DRL by estimating a distribution over the gradient of the state-action value function, in addition to its scalar value. We refer to this method as Distributional Sobolev training. Inspired by Stochastic Value Gradients (SVG), we achieve this by leveraging a one-step world model of the reward and transition distributions implemented using a conditional Variational Autoencoder (cVAE). Our approach is sampled-based and relies on Maximum Mean Discrepancy (MMD) to instantiate the distributional Bellman operator. We first showcase the method on a toy supervised learning problem. We then validate our algorithm in several Mujoco/Brax environments.

159Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) aligns Large Language Models (LLMs) with human preferences. However, these preferences can often change over time due to external factors (e.g. environment change and societal influence). Consequently, what was wrong then might be right now. Current preference optimization algorithms do not account for temporal preference drift in their modeling, which can lead to severe misalignment. To address this limitation, we use a Dynamic Bradley-Terry model that models preferences via time-dependent reward functions, and propose Non-Stationary Direct Preference Optimisation (NS-DPO). By introducing a discount parameter in the loss function, NS-DPO applies exponential weighting, which proportionally focuses learning on more time-relevant datapoints. We theoretically analyse the convergence of NS-DPO in the offline setting, providing upper bounds on the estimation error caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO1 for fine-tuning LLMs in scenarios with drifting preferences. By simulating preference drift using renowned reward models and modifying popular LLM datasets accordingly, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.

160On the Generalization of Preference Learning with DPO

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable capabilities but often struggle to align with human preferences, leading to harmful or undesirable outputs. Preference learning, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for ensuring that LLMs align with human values. Despite the widespread adoption in real-world systems, a thorough theoretical understanding of the generalization guarantees for these models remains lacking. This paper bridges that gap by introducing a new theoretical framework to analyze the generalization guarantees of models trained with direct preference optimization. While existing generalization theory often focuses on overparameterized models achieving near-optimal loss or models independent of the training process, our framework rigorously assesses how well models generalize after a finite number of gradient steps, reflecting real-world LLM training practices. By analyzing the reward margin associated with each sample and its trajectory throughout training, we can effectively bound the generalization error. We derive learning guarantees showing that, under specific conditions, models trained with DPO can correctly discern preferred responses on unseen data with high probability. These insights are empirically validated on contemporary LLMs, underscoring the practical relevance of our theory.

161Dynamical Diffusion: Learning Temporal Dynamics with Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have emerged as powerful generative frameworks by progressively adding noise to data through a forward process and then reversing this process to generate realistic samples. While these models have achieved strong performance across various tasks and modalities, their application to temporal predictive learning remains underexplored. Existing approaches treat predictive learning as a conditional generation problem, but often fail to fully exploit the temporal dynamics inherent in the data, leading to challenges in generating temporally coherent sequences. To address this, we introduce Dynamical Diffusion (DyDiff), a theoretically sound framework that incorporates temporally aware forward and reverse processes. Dynamical Diffusion explicitly models temporal transitions at each diffusion step, establishing dependencies on preceding states to better capture temporal dynamics. Through the reparameterization trick, Dynamical Diffusion achieves efficient training and inference similar to any standard diffusion model. Extensive experiments across scientific spatiotemporal forecasting, video prediction, and time series forecasting demonstrate that Dynamical Diffusion consistently improves performance in temporal predictive tasks, filling a crucial gap in existing methodologies.

162Rethinking Knowledge Distillation: A Mixture-of-Experts Perspective

[openreview] [pdf]

Abstract Knowledge distillation (KD) aims to transfer useful information from a large-scale model (teacher) to a lightweight model (student). Classical KD focuses on leveraging the teacher’s predictions as soft labels to regularize student training. However, the exact match of predictions in Kullback-Leibler (KL) divergence could be somewhat in conflict with the classification objective, given that the distribution discrepancies between teacher-generated predictions and ground-truth annotations tend to be fairly severe. In this paper, we rethink the role of teacher predictions from a Mixture-of-Experts (MoE) perspective and transfer knowledge by introducing teacher predictions as latent variables to reformulate the classification objective. This MoE strategy results in breaking down the vanilla classification task into a mixture of easier subtasks with the teacher classifier as a gating function to weigh the importance of subtasks. Each subtask is efficiently conquered by distinct experts that are effectively implemented by resorting to multi-level teacher outputs. We further develop a theoretical framework to formulate our method, termed MoE-KD, as an Expectation-Maximization (EM) algorithm and provide proof of the convergence. Extensive experiments manifest that MoE-KD outperforms advanced knowledge distillers on mainstream benchmarks.

163A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations

[openreview] [pdf]

Abstract Designing reward functions in Reinforcement Learning (RL) often demands significant task-specific expertise. Offline preference-based Reinforcement Learning (PbRL) provides an effective alternative to address the complexity of reward design by learning policies from offline datasets that contain human preferences between trajectory pairs. Existing offline PbRL studies typically model a reward function by maximizing its likelihood of generating the observed human preferences. However, due to the varying number of samples within the limited dataset, less frequently compared trajectories exhibit greater uncertainty, which potentially leads to unrelible behaviors during reward and policy updates. To solve this issue, in this work, we introduce Uncertainty-Aware PbRL (UA-PbRL) to learn a distributional reward model and a risk-sensitive policy from an offline preference dataset. Our approach employs a Maximum A Posteriori (MAP) objective to update trajectory rewards and incorporates an informative prior to account for the uncertainties. Building upon this reward update, we propose a generative reward model to capture the reward distribution, utilizing the offline distributional Bellman operator and the Conditional Value-at-Risk (CVaR) metric to train a risk-sensitive policy. Experimental results demonstrate that UA-PbRL effectively identifies and avoids states with high uncertainty, facilitating risk-averse behaviors across various tasks, including robot control and language model alignment.

164Counterfactual History Distillation on Continuous-time Event Sequences

[openreview] [pdf]

Abstract This study aims to distill history events that have essential information for predicting subsequent events with counterfactual analysis. The problem is named Counterfactual History Distillation (CHD). CHD distills a minimum set of events from history, based on which the distribution provided by a trained MTPP model fits the events observed later, and the distribution based on the remaining events in history cannot. It can help understand what event marks may have more influence on the occurrence of future events and what events in history may have a causal relationship with the events observed later. This study proposes a robust solution for CHD, called MTPP-based Counterfactual History Distiller (MTPP-CHD). MTPP-CHD learns to select the optimal event combination from history for the events observed later. Experiment results demonstrate the superiority of MTPP-CHD by outperforming baselines in terms of distillation quality and processing speed.

165Orient Anything

[openreview] [pdf]

Abstract Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape’s orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state of the art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of Shapenet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.

166Diversifying Spurious Subgraphs for Graph Out-of-Distribution Generalization

[openreview] [pdf]

Abstract Environment augmentation methods have gained some success in overcoming the out-of-distribution (OOD) generalization challenge in Graph Neural Networks (GNNs). Yet, there exists a challenging trade-off in the augmentation: On one hand, it requires the generated graphs as diverse as possible to extrapolate to unseen environments. On the other hand, it requires the generated graphs to preserve the invariant substructures causally related to the targets. Existing approaches have proposed various environment augmentation strategies to enrich spurious patterns for OOD generalization. However, we argue that these methods remain limited in diversity and precision of the generated environments for two reasons: i) the deterministic nature of the graph composition strategy used for environment augmentation may limit the diversity of the generated environments, and ii) the presence of spurious correlations may lead to the exclusion of invariant subgraphs and reduce the precision of the generated environments. To address this trade-off, we propose a novel paradigm that accurately identifies spurious subgraphs, and an environment augmentation strategy called spurious subgraph diversification, which extrapolates to maximally diversified spurious subgraphs by randomizing the spurious subgraph generation, while preserving the invariant substructures. Our method is theoretically sound and demonstrates strong empirical performance on both synthetic and real-world datasets, outperforming the second-best method by up to 24.19% across 17 baseline methods, underscoring its superiority in graph OOD generalization.

167Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment

[openreview] [pdf]

Abstract Aligning to human preferences and/or intentions is an important requirement for contemporary foundation models. To ensure alignment, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into three stages: (i) a model is computed with supervised fine-tuning (SFT) based upon large demonstrations data, (ii) a reward model (RM) is estimated based upon human feedback data, and (iii) reinforcement learning (RL) is used to further refine the SFT model by optimizing the estimated reward model. Typically, the number of parameters in the reward model greatly exceeds the number of preference observations in the human feedback data. As a result, the reward model is likely inaccurate and the resulting policy model (fine-tuned with RL) may exhibit poor alignment performance. In this paper, we introduce a new approach AIHF in which reward and policy models are {\em jointly} trained by simultaneously leveraging demonstration and human feedback data.We introduce a tractable algorithm for finding the AIHF reward and policy models and provide a finite time performance guarantee.Additionally, we demonstrate the efficiency of the proposed solution with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo. We observe that the proposed solutions outperform the existing alignment algorithms such as RLHF and DPO by large margins, especially when the data is unbalanced.

168Training-Free Diffusion Model Alignment with Sampling Demons

[openreview] [pdf]

Abstract Aligning diffusion models with user preferences has been a key challenge. Existing methods for aligning diffusion models either require retraining or are limited to differentiable reward functions. To address these limitations, we propose a stochastic optimization approach, dubbed Demon, to guide the denoising process at inference time without backpropagation through reward functions or model retraining. Our approach works by controlling noise distribution in denoising steps to concentrate density on regions corresponding to high rewards through stochastic optimization. We provide comprehensive theoretical and empirical evidence to support and validate our approach, including experiments that use non-differentiable sources of rewards such as Visual-Language Model (VLM) APIs and human judgments. To the best of our knowledge, the proposed approach is the first inference-time, backpropagation-free preference alignment method for diffusion models. Our method can be easily integrated with existing diffusion models without further training. Our experiments show that the proposed approach significantly improves the average aesthetics scores for text-to-image generation.

169DDIL: Improved Diffusion Distillation with Imitation Learning

[openreview] [pdf]

Abstract Diffusion models excel at generative modeling (e.g., text-to-image) but sampling requires multiple denoising network passes, limiting practicality. Diffusion distillation methods have shown promise by reducing the number of passes at the expense of quality of the generated samples but suffer from lack of diversity, quality, etc. . In this work we identify co-variate shift as one of reason for poor performance of multi-step distilled models from compounding error at inference time. To address co-variate shift, we formulate diffusion distillation within imitation learningDDILframework and enhance training distribution for distilling diffusion models on both data distribution (forward diffusion) and student induced distributions (backward diffusion). Training on data distribution helps to diversify the generations bypreserving marginal data distributionand training on student distribution addresses compounding error bycorrecting covariate shift. In addition, we adopt reflected diffusion formulation for distillation and demonstrate improved performance, stable training across different distillation methods. We show that DDIL and reflected diffusion formulation consistency improves on baseline algorithms of progressive distillation(PD), Latent consistency models(LCM)and Distribution Matching Distillation(DMD2)

170Medium-Difficulty Samples Constitute Smoothed Decision Boundary for Knowledge Distillation on Pruned Datasets

[openreview] [pdf]

Abstract This paper tackles a new problem of dataset pruning for Knowledge Distillation (KD), from a fresh perspective of Decision Boundary (DB) preservation and drifts. Existing dataset pruning methods generally assume that the post-pruning DB formed by the selected samples can be well-captured by future networks that use those samples for training. Therefore, they tend to preserve hard samples since hard samples are closer to the DB and better characterize the nuances in the distribution of the entire dataset. However, in KD, the limited learning capacity from the student network leads to imperfect preservation of the teacher’s feature distribution, resulting in the drift of DB in the student space. Specifically, hard samples worsen such drifts as they are difficult for the student to learn, creating a situation where the student’s DB can drift deeper into other classes and make incorrect classifications. Motivated by these findings, our method selects medium-difficulty samples for KD-based dataset pruning. We show that these samples constitute a smoothed version of the teacher’s DB and are easier for the student to learn, obtaining a general feature distribution preservation for a class of samples and reasonable DB between different classes for the student. In addition, to reduce the distributional shift due to dataset pruning, we leverage the class-wise distributional information of the teacher’s outputs to reshape the logits of the preserved samples. Experiments show that the proposed static pruning method can even perform better than the state-of-the-art dynamic pruning method which needs access to the entire dataset. In addition, our method halves the training times of KD and improves the student’s accuracy by 0.4% on ImageNet with a 50% keep ratio. When the ratio further increases to 70%, our method achieves higher accuracy over the vanilla KD while reducing the training times by 30%.

171RAGDP: Retrieve-Augmented Generative Diffusion Policy

[openreview] [pdf]

Abstract Diffusion Policy has attracted attention for its ability to achieve significant accuracy gains in a variety of imitation learning tasks. However, since Diffusion Policy relies on the Diffusion Model, it requires multiple denoising steps to generate a single action leading to long generation times. To address this issue, methods like DDIM and Consistency Models have been introduced to speed up the process. While these methods reduce computation time, this often comes at the cost of accuracy. In this paper, we propose RAGDP, a technique designed to improve the efficiency of learned Diffusion Policies without sacrificing accuracy. RAGDP builds upon the Retrieval-Augmented Generation (RAG) technique, which is commonly used in large language models to store and retrieve data from a vector database based on encoded embeddings. In RAGDP, pairs of expert observation and actions data are stored in a vector database. The system then searches the database using encoded observation data to retrieve expert action data with high similarity. This retrieved expert data is subsequently used by the RAGDP algorithm to generate actions tailored to the current environment. We introduce two action generation algorithms, RAGDP-VP and RAGDP-VE, which correspond to different types of Diffusion Models. Our results demonstrate that RAGDP can significantly improve the speed of Diffusion Policy without compromising accuracy. Furthermore, RAGDP can be integrated with existing speed-up methods to enhance their performance.

172Provable Causal State Representation under Asynchronous Diffusion Model for POMDPs

[openreview] [pdf]

Abstract A major challenge in applying reinforcement learning (RL) to real-world scenarios is managing high-dimensional, noisy perception input signals. Identifying and utilizing representations that contain sufficient and essential information for decision-making tasks is key to computational efficiency and generalization of RL by reducing bias in decision-making processes. In this paper, we present a new RL framework, namedCausal State Representation under Asynchronous Diffusion Model (CSR-ADM), which accommodates and enhances any RL algorithm for partially observable Markov decision processes (POMDPs) with perturbed inputs. A new asynchronous diffusion model is proposed to denoise both reward and observation spaces, and integrated with the bisimulation technology to capture causal state representations in POMDPs. Notably, the causal state is the coarsest partition of the denoised observations. We link the causal state to a causal feature set and provide theoretical guarantees by deriving the upper bound on value function approximation between the noisy observation space and the causal state space, demonstrating equivalence to bisimulation under the Lipschitz assumption. To the best of our knowledge, CSR-ADM is the first framework to approximate causal states with diffusion models, substantiated by a comprehensive theoretical foundation. Extensive experiments on Roboschool tasks show that CSR-ADM outperforms state-of-the-art methods, significantly improving the robustness of existing RL algorithms under varying scales of random noise.

173Model predictive control is almost optimal for restless bandits

[openreview] [pdf]

Abstract We consider the discrete time infinite horizon average reward restless markovian bandit (RMAB) problem. We propose a model predictive control based non-stationary policy with a rolling computational horizon τ\tau. At each time-slot, this policy solves a τ\tau horizon linear program whose first control value is kept as a control for the RMAB. Our solution requires minimal assumptions and quantifies the loss in optimality in terms of τ\tau and the number of arms, NN. We show that its sub-optimality gap is O(1/N)O(1/\sqrt{N}) in general, and exp(ΩN)\exp(-\Omega{N}) under a local-stability condition. Our proof is based on a framework from dynamic control known as dissipativity. Not only is our solution easy to implement but performs very well in practice when compared to the state of the art. Further, both our solution and our proof methodology can easily be generalized to more general constrained MDP settings and should thus, be of great interest to the burgeoning RMAB community.

174Score Forgetting Distillation: A Swift, Data-Free Method for Machine Unlearning in Diffusion Models

[openreview] [pdf]

Abstract The machine learning community is increasingly recognizing the importance of fostering trust and safety in modern generative AI (GenAI) models. We posit machine unlearning (MU) as a crucial foundation for developing safe, secure, and trustworthy GenAI models. Traditional MU methods often rely on stringent assumptions and require access to real data. This paper introduces Score Forgetting Distillation (SFD), an innovative MU approach that promotes the forgetting of undesirable information in diffusion models by aligning the conditional scores of “unsafe” classes or concepts with those of “safe” ones. To eliminate the need for real data, our SFD framework incorporates a score-based MU loss into the score distillation objective of a pretrained diffusion model. This serves as a regularization term that preserves desired generation capabilities while enabling the production of synthetic data through a one-step generator. Our experiments on pretrained label-conditional and text-to-image diffusion models demonstrate that our method effectively accelerates the forgetting of target classes or concepts during generation, while preserving the quality of other classes or concepts. This unlearned and distilled diffusion not only pioneers a novel concept in MU but also accelerates the generation speed of diffusion models. Our experiments and studies on a range of diffusion models and datasets confirm that our approach is generalizable, effective, and advantageous for MU in diffusion models.

175Generate explorative goals with large language model guidance

[openreview] [pdf]

Abstract Reinforcement learning (RL) struggles with sparse reward environments. Recent developments in intrinsic motivation have revealed the potential of language models to guide agents in exploring the environment. However, the mismatch between the granularity of environment transitions and natural language descriptions hinders effective exploration for current methods. To address this problem, we introduce a model-based RL method named Language-Guided Explorative Goal Generation (LanGoal), which combines large language model (LLM) guidance with intrinsic exploration reward by learning to propose meaningful goals. LanGoal learns a hierarchical policy together with a world model. The high-level policy learns to propose goals based on LLM guidance to explore the environment, and the low-level policy learns to achieve the goals. Extensive results on Crafter demonstrate the effectiveness of LanGoal compared to recent methods.

176Stability and Sharper Risk Bounds with Convergence RateO(1/n2)

[openreview] [pdf]

Abstract The sharpest known high probability excess risk bounds are up to O(1/n)O\left( 1/n \right) for empirical risk minimization and projected gradient descent via algorithmic stability (Klochkov & Zhivotovskiy, 2021). In this paper, we show that high probability excess risk bounds of order up to O(1/n2)O(1/n^2) are possible. We discuss how high probability excess risk bounds reach O(1/n2)O(1/n^2) under strongly convexity, smoothness and Lipschitz continuity assumptions for empirical risk minimization, projected gradient descent and stochastic gradient descent. Besides, to the best of our knowledge, our high probability results on the generalization gap measured by gradients for nonconvex problems are also the sharpest.

177Learning Conditionally Independent Marginals Enables Logical Compositions in Conditional Diffusion Models

[openreview] [pdf]

Abstract How can we learn generative models to sample data with arbitrary logical compositions of statistically independent attributes? The prevailing solution is to sample from distributions expressed as a composition of attributes’ conditional marginal distributions under the assumption that they are statistically independent. This paper shows that standard conditional diffusion models violate this assumption, even when all attribute compositions are observed during training. And, this violation is significantly more severe when only a subset of the compositions is observed. We propose CoInD to address this problem. It explicitly enforces statistical independence between the conditional marginal distributions by minimizing Fisher’s divergence between the joint and marginal distributions. The theoretical advantages of CoInD are reflected in both qualitative and quantitative experiments, demonstrating a significantly more faithful and controlled generation of samples for arbitrary logical compositions of attributes. The benefit is more pronounced for scenarios that current solutions relying on the assumption of conditionally independent marginals struggle with, namely, logical compositions involving the NOT operation and when only a subset of compositions are observed during training.

178Diffusion-Guided Safe Policy Optimization From Cost-Label-Free Offline Dataset

[openreview] [pdf]

Abstract Offline safe reinforcement learning (RL) aims to guarantee the safety of decision-making in both training and deployment phases by learning the safe policy entirely from offline data without further interaction with the environment, which pushes the RL towards real-world applications. Previous efforts in offline safe RL typically presume the presence of Markovian costs within the dataset. However, the design of a Markovian cost function involves rehearsal of all potentially unsafe cases, which is inefficient and even unfeasible in many practical tasks. In this work, we take a further step forward by learning a safe policy from an offline dataset without any cost labels, but with a small number of safe demonstrations included. To solve this problem, we propose a two-stage optimization method calledDiffusion-guidedSafePolicyOptimization (DSPO). Initially, we derive trajectory-wise safety signals by training a return-agnostic discriminator. Subsequently, we train a conditional diffusion model that generates trajectories conditioned both on the trajectory return and the safety signal. Remarkably, the trajectories generated by our diffusion model not only yield high returns but also comply with the safety signals, from which we can derive a desirable policy through behavior cloning (BC). The evaluation experiments conducted across tasks from the SafetyGym, BulletGym, and MetaDrive environments demonstrate that our approach can achieve a safe policy with high returns, significantly outperforming various established baselines.

179KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies in Curiosity-driven Exploration

[openreview] [pdf]

Abstract In continuous control tasks, Soft Actor-Critic (SAC) has achieved notable success by balancing exploration and exploitation. However, SAC struggles in sparse reward environments, where infrequent rewards hinder efficient exploration. While curiosity-driven exploration methods help address this issue by encouraging the agent to explore novel states, they introduce challenges, such as the difficulty of setting an optimal reward scale and managing the interaction between curiosity-based exploration and SAC’s stochastic policy. These complexities often lead to inefficient exploration or premature convergence and make balancing exploration-exploitation challenging. In this paper, we propose KEA (Keeping Exploration Alive) to tackle the inefficiencies in balancing the exploration-exploitation trade-off when combining SAC with curiosity-based methods. KEA introduces an additional co-behavior agent that works alongside SAC and a switching mechanism to facilitate proactive coordination between exploration strategies from the co-behavior agent and the SAC agent with curiosity-based exploration. This coordination allows the agent to maintain stochasticity in high-novelty regions, preventing premature convergence and enhancing exploration efficiency. We first analyze the difficulty of balancing exploration-exploitation when combining SAC with curiosity-based methods in a 2D grid environment. We then evaluate KEA on sparse reward control tasks from the DeepMind Control Suite and compare against two state-of-the-art curiosity-based exploration baselines — Random Network Distillation (RND) and NovelD. KEA improves episodic rewards by up to 119% over RND and 28% over NovelD, significantly improving learning efficiency and robustness in sparse reward environments.

180Practical alignment requires more than learning from human feedback

[openreview] [pdf]

Abstract Ensuring the alignment of artificial intelligence (AI) systems with human objectives is a critical challenge in the development of safe and effective AI technologies. Reinforcement learning from human feedback (RLHF) has been a predominant method to tackle this challenge. However, this framework operates under the unrealistic assumptions that human preferences are accurate reflections of their desires and that they remain constant over time. This paper identifies and challenges these assumptions by illustrating how they can lead to undesirable consequences, particularly when human beliefs about the environment are incorrect or mutate over time. To address these challenges, we introduce a novel framework termed practical alignment. This framework redefines the alignment objective to accommodate the variability and irrationality of human beliefs, emphasizing the need for AI systems not only to learn from but also to teach humans about the world. We discuss the theoretical underpinnings of practical alignment and introduce MindGrid, a toolkit designed to simulate and evaluate alignment scenarios. Our experimental results using large language models in teaching scenarios underscore the importance of teaching skills as a requisite capability to achieve alignment.

181TerDiT: Ternary Diffusion Models with Transformers

[openreview] [pdf]

Abstract Recent developments in large-scale pre-trained text-to-image diffusion models have significantly improved the generation of high-fidelity images, particularly with the emergence of diffusion transformer models (DiTs). Among diffusion models, diffusion transformers have demonstrated superior image generation capabilities, boosting lower FID scores and higher scalability. However, deploying large-scale DiT models can be expensive due to their excessive parameter numbers. Although existing research has explored efficient deployment techniques for diffusion models such as model quantization, there is still little work concerning DiT-based models. To tackle this research gap, in this paper, we proposeTerDiT, a quantization-aware training (QAT) and efficient deployment scheme for ternary diffusion transformer models. We focus on the ternarization of DiT networks, with model sizes ranging from 600M to 4.2B, and image resolution from 256×\times256 to 512×\times512. Our work contributes to the exploration of efficient deployment of large-scale DiT models, demonstrating the feasibility of training extremely low-bit DiT models from scratch while maintaining competitive image generation capacities compared to full-precision models. Code has been uploaded in the supplemental materials.

182UniCon: Unidirectional Information Flow for Effective Control of Large-Scale Diffusion Models

[openreview] [pdf]

Abstract We introduce UniCon, a novel architecture designed to enhance control and efficiency in training adapters for large-scale diffusion models like the Diffusion transformer. Unlike existing methods that rely on bidirectional interaction between the diffusion model and control adapter, UniCon implements a unidirectional flow from the diffusion network to the adapter, allowing the adapter alone to generate the final output. UniCon reduces computational demands by eliminating the need for the diffusion model to compute and store gradients during adapter training. UniCon is free from the constrains of encoder-focused designs and is able to utilize all parameters of the diffusion model, making it highly effective for transformer-based architectures. Our results indicate that UniCon reduces GPU memory usage by one-third and increases training speed by 2.3 times, while all maintaining the same adapter parameter size. Additionally, without requiring extra computational resources, UniCon enables the training of adapters with double the parameter volume of existing ControlNets. In a series of image condition generation tasks, UniCon has demonstrated precise response to control information and excellent generation capabilities. UniCon makes the control of large-scale diffusion models feasible and provides a basis for further scaling up of diffusion models.

183Understanding Scale Shift in Domain Generalization for Crowd Localization

[openreview] [pdf]

Abstract Crowd localization plays a crucial role in visual scene understanding towards predicting each pedestrian location in a crowd, thus being applicable to various downstream tasks. However, existing approaches suffer from significant performance degradation due to differences in head scale distributions (scale shift) between training and testing data, a challenge known as domain generalization (DG). This paper aims to comprehend the nature of scale shift within the context of domain generalization for crowd localization models. To this end, we address three key questions: (i) how to quantify the scale shift influence on DG task, (ii) why does this influence occur, (iii) how to mitigate the influence. Specifically, we first establish a benchmark, ScaleBench, and reproduce 20 advanced DG algorithms, to quantify the influence. Through extensive experiments, we demonstrate the limitations of existing algorithms and highlight the under-explored nature of this issue. To further understand its behind reason, we provide a rigorous theoretical analysis on scale shift. Building on this analysis, we further propose a simple yet effective algorithm called Semantic Hook to mitigate the influence of scale shift on DG, which also serves as a case study revealing three significant insights for future research. Our results emphasize the importance of this novel and applicable research direction, which we term Scale Shift Domain Generalization\textit{Scale Shift Domain Generalization}.

184Bandits with Anytime Knapsacks

[openreview] [pdf]

Abstract We consider bandits with anytime knapsacks (BwAK), a novel version of the BwK problem where there is an anytime cost constraint instead of a total cost budget. This problem setting introduces additional complexities as it mandates adherence to the constraint throughout the decision-making process. We propose SUAK, a novel algorithm that utilizes upper confidence bounds to identify the optimal mixture of arms while maintaining a balance between exploration and exploitation. SUAK is an adaptive algorithm that strategically utilizes the available budget in each round in the decision-making process and skips a round when it is possible to violate the anytime cost constraint. In particular, SUAK slightly under-utilizes the available cost budget to reduce the need for skipping rounds. We show that SUAK attains the same problem-dependent regret upper bound of O(KlogT)O(K \log T) established in prior work under the simpler BwK framework. Finally, we provide simulations to verify the utility of SUAK in practical settings.

185CoLa-DCE – Concept-guided Latent Diffusion Counterfactual Explanations

[openreview] [pdf]

Abstract Recent advancements in generative AI have introduced novel prospects and prac- tical implementations. Especially diffusion models show their strength in gener- ating diverse and, at the same time, realistic features, positioning them well for generating counterfactual explanations for computer vision models. Answering “what if” questions of what needs to change to make an image classifier change its prediction, counterfactual explanations align well with human understanding and consequently help in making model behavior more comprehensible. Current methods succeed in generating authentic counterfactuals, but lack transparency as feature changes are not directly perceivable. To address this limitation, we intro- duce Concept-guided Latent Diffusion Counterfactual Explanations (CoLa-DCE). CoLa-DCE generates concept-guided counterfactuals for any classifier with a high degree of control regarding concept selection and spatial conditioning. The coun- terfactuals comprise an increased granularity through minimal feature changes. The reference feature visualization ensures better comprehensibility, while the feature localization provides increased transparency of “where” changed “what”. We demonstrate the advantages of our approach in minimality and comprehen- sibility across multiple image classification models and datasets and provide in- sights into how our CoLa-DCE explanations help comprehend model errors like misclassification cases.

186Controlling Information Leakage in Concept Bottleneck Models with Trees

[openreview] [pdf]

Abstract As AI models grow larger, the demand for accountability and interpretability has become increasingly critical for understanding their decision-making processes. Concept Bottleneck Models (CBMs) have gained attention for enhancing interpretability by mapping inputs to intermediate concepts before making final predictions. However, CBMs often suffer from information leakage, where additional input data, not captured by the concepts, is used to improve task performance, complicating the interpretation of downstream predictions. In this paper, we introduce a novel approach for training both joint and sequential CBMs that allows us to identify and control leakage using decision trees. Our method quantifies leakage by comparing the decision paths of hard CBMs with their soft, leaky counterparts. Specifically, we show that soft leaky CBMs extend the decision paths of hard CBMs, particularly in cases where concept information is incomplete. Using this insight, we develop a technique to better inspect and manage leakage, isolating the subsets of data most affected by this. Through synthetic and real-world experiments, we demonstrate that controlling leakage in this way not only improves task accuracy but also yields more informative and transparent explanations.

187Diffusion-based Decoupled Deterministic and Uncertain Framework for Probabilistic Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Diffusion-based denoising models have demonstrated impressive performance in probabilistic forecasting for multivariate time series (MTS). Nonetheless, existing approaches often model the entire data distribution, neglecting the variability in uncertainty across different components of the time series. This paper introduces a Diffusion-based Decoupled Deterministic and Uncertain (D3U\mathrm{D^3U}) framework for probabilistic MTS forecasting. The framework integrates non-probabilistic forecasting with conditional diffusion generation, enabling both accurate point predictions and probabilistic forecasting. D3U\mathrm{D^3U} utilizes a point forecasting model to non-probabilistically model high-certainty components in the time series, generating embedded representations that are conditionally injected into a diffusion model. To better model high-uncertainty components, a patch-based denoising network (PatchDN) is designed in the conditional diffusion model. Designed as a plug-and-play framework, D3U\mathrm{D^3U} can be seamlessly integrated into existing point forecasting models to provide probabilistic forecasting capabilities. It can also be applied to other conditional diffusion methods that incorporate point forecasting models. Experiments on six real-world datasets demonstrate that our method achieves over a 20% improvement in both point and probabilistic forecasting performance in MTS long-term forecasting compared to state-of-the-art (SOTA) methods. Additionally, extensive ablation studies further validate the effectiveness of the D3U\mathrm{D^3U} framework.

188On the feature learning in diffusion models

[openreview] [pdf]

Abstract The predominant success of diffusion models in generative modeling has spurred significant interest in understanding their theoretical foundations. In this work, we propose a feature learning framework aimed at analyzing and comparing the training dynamics of diffusion models with those of traditional classification models. Our theoretical analysis demonstrates that, under identical settings, neural networks trained for classification tend to prioritize learning specific patterns in the data, often focusing on easy-to-learn features. In contrast, diffusion models, due to the denoising objective, are encouraged to learn more balanced and comprehensive representations of the data. To support these theoretical insights, we conduct several experiments on both synthetic and real-world datasets, which empirically validate our findings and underscore the distinct feature learning dynamics in diffusion models compared to classification.

189HP3O: Hybrid-Policy Proximal Policy Optimization with Best Trajectory

[openreview] [pdf]

Abstract Proximal policy optimization (PPO) is one of the most popular state-of-the-art on-policy algorithms that has become a standard baseline in modern reinforcement learning with applications in numerous fields. Though it delivers stable performance with theoretical policy improvement guarantees, high variance and high sample complexity still remain critical challenges in on-policy algorithms. To alleviate these issues, we propose Hybrid-Policy Proximal Policy Optimization (HP3O), which utilizes a trajectory replay buffer to make efficient use of trajectories generated by recent policies. Particularly, the buffer applies the “first in, first out” (FIFO) strategy so as to keep only the recent trajectories to attenuate the data distribution drift. A batch consisting of the trajectory with the best return and other randomly sampled ones from the buffer is used for updating the policy networks. The strategy helps the agent to improve its capability on top of the most recent best performance and in turn reduce variance empirically. We theoretically construct the policy improvement guarantees for the proposed algorithm. HP3O is validated and compared against several baseline algorithms using multiple continuous control environments. Our code is available here.

190TopoDiffusionNet: A Topology-aware Diffusion Model

[openreview] [pdf]

Abstract Diffusion models excel at creating visually impressive images but often struggle to generate images with a specified topology. The Betti number, which represents the number of structures in an image, is a fundamental measure in topology. Yet, diffusion models fail to satisfy even this basic constraint. This limitation restricts their utility in applications requiring exact control, like robotics and environmental modeling. To address this, we propose TopoDiffusionNet (TDN), a novel approach that enforces diffusion models to maintain the desired topology. We leverage tools from topological data analysis, particularly persistent homology, to extract the topological structures within an image. We then design a topology-based objective function to guide the denoising process, preserving intended structures while suppressing noisy ones. Our experiments across four datasets demonstrate significant improvements in topological accuracy. TDN is the first to integrate topology with diffusion models, opening new avenues of research in this area.

191Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants

[openreview] [pdf]

Abstract Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we prove that SMOTE (with default parameter) tends to copy the original minority samples asymptotically. We also prove that SMOTE exhibits boundary artifacts, thus justifying existing SMOTE variants. Then we introduce two new SMOTE-related strategies, and compare them with state-of-the-art rebalancing procedures. Surprisingly, for most data sets, we observe that applying no rebalancing strategy is competitive in terms of predictive performances, with tuned random forests, logistic regression or LightGBM. For highly imbalanced data sets, our new methods, named CV-SMOTE and Multivariate Gaussian SMOTE, are competitive. Besides, our analysis sheds some lights on the behavior of common rebalancing strategies, when used in conjunction with random forests.

192q-exponential family for policy optimization

[openreview] [pdf]

Abstract Policy optimization methods benefit from a simple and tractable policy parametrization, usually the Gaussian for continuous action spaces. In this paper, we consider a broader policy family that remains tractable: the qq-exponential family. This family of policies is flexible, allowing the specification of both heavy-tailed policies (q>1q>1) and light-tailed policies (q<1q<1). This paper examines the interplay between qq-exponential policies for several actor-critic algorithms conducted on both online and offline problems. We find that heavy-tailed policies are more effective in general and can consistently improve on Gaussian. In particular, we find the Student’s t-distribution to be more stable than the Gaussian across settings and that a heavy-tailed qq-Gaussian for Tsallis Advantage Weighted Actor-Critic consistently performs well in offline benchmark problems.

193Refining Counterfactual Explanations With Joint-Distribution-Informed Shapley Towards Actionable Minimality

[openreview] [pdf]

Abstract Counterfactual explanations (CE) identify data points that closely resemble the observed data but produce different machine learning (ML) model outputs, offering critical insights into model decisions. Despite the diverse scenarios, goals and tasks to which they are tailored, existing CE methods often lack actionable efficiency because of unnecessary feature changes included within the explanations that are presented to users and stakeholders. We address this problem by proposing a method that minimizes the required feature changes while maintaining the validity of CE, without imposing restrictions on models or CE algorithms, whether instance- or group-based. The key innovation lies in computing a joint distribution between observed and counterfactual data and leveraging it to inform Shapley values for feature attributions (FA). We demonstrate that optimal transport (OT) effectively derives this distribution, especially when the alignment between observed and counterfactual data is unclear in used CE methods. Additionally, a counterintuitive finding is uncovered: it may be misleading to rely on an exact alignment defined by the CE generation mechanism in conducting FA. Our proposed method is validated on extensive experiments across multiple datasets, showcasing its effectiveness in refining CE towards greater actionable efficiency.

194Searching For Robust Point Cloud Distillation

[openreview] [pdf]

Abstract Deep Neural Networks (DNNs) have shown remarkable performance in machine learning; however, their vulnerabilities to adversarial attacks have been exposed, particularly in point cloud data. Neural Architecture Search (NAS) is a technique for discovering new neural architectures with high predictive accuracy, yet its potential for enhancing model robustness against adversarial attacks remains largely unexplored. In this study, we investigate the application of NAS within the framework of knowledge distillation, aiming to generate robust student architectures that inherit resilience from robust teacher models. We introduce RDANAS, an effective NAS method that utilizes cross-layer knowledge distillation from robust teacher models to enhance the robustness of the student model. Unlike previous studies, RDANAS considers the teacher model’s outputs and automatically identifies the optimal teacher layer for each student layer during supervision. Experimental results on ModelNet40, ScanObjectNN and ScanNet datasets demonstrate the efficacy of RDANAS, revealing that the neural architectures it generates are compact and possess adversarial robustness, which shows potential in multiple applications.

195Diffusion Auto-regressive Transformer for Effective Self-supervised Time Series Forecasting

[openreview] [pdf]

Abstract Self-supervised learning has become an essential and popular approach for enhancing time series forecasting, enabling models to learn universal representations from unlabeled data. However, effectively capturing both the global sequence dependence and local detail features within time series data remains challenging. To address this, we propose a novel generative self-supervised method called TimeDART, denoting Diffusion Auto-regressive Transformer for Time series forecasting. In TimeDART, we treat time series patches as basic modeling units. For one thing, we employ an self-attention based Transformer encoder to model the dependencies of inter-patches. For another, we introduce diffusion and denoising mechanisms to capture the locality features of intra-patch. Notably, we design a cross-attention-based denoising decoder that allows for adjustable optimization difficulty in the self-supervised task, facilitating more effective self-supervised pre-training. Extensive experiments demonstrate that TimeDART achieves state-of-the-art fine-tuning performance compared to the most advanced competitive methods in forecasting tasks. Our code is publicly available athttps://anonymous.4open.science/r/TimeDART-2024.

196Concepts’ Information Bottleneck Models

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) offer a self-explainable AI framework by predicting targets based on human-understandable concepts, but they often fail to achieve optimal performance and interpretability due to leakage of irrelevant information into the concept activations. This paper presents an information-theoretic enhancement of CBMs through the integration of the Information Bottleneck (IB) framework, aimed at addressing their issues of concept leakage and reduced performance. Our approach reshapes the way CBMs process and utilize concepts by constraining mutual information between input data and concepts, ensuring that only the most relevant information is preserved for decision-making. This introduces a new paradigm for CBMs that not only enhances performance but also enforces a tighter connection between latent representations and human-understandable concepts, ensuring a more robust and interpretable model. Our experiments on datasets such as CUB, AwA2, and aPY demonstrate that IB-augmented CBMs improve both concept and target prediction accuracy, while also increasing intervenability. Additionally, we propose a novel metric to assess the quality of concept sets based on intervention performance. Unlike traditional task performance metrics, which may obscure the effects of concept leakage, the new metric offers a direct, interpretable evaluation of concept set goodness.

197Online Policy Selection for Inventory Problems

[openreview] [pdf]

Abstract We tackle online inventory problems where at each time period the manager makes a replenishment decision based on partial historical information in order to meet demands and minimize costs. To solve such problems, we build upon recent works in online learning and control, use insights from inventory theory and propose a new algorithm called GAPSI. This algorithm follows a new feature-enhanced base-stock policy and deals with the troublesome question of non-differentiability which occurs in inventory problems. Our method is illustrated in the context of a complex and novel inventory system involving multiple products, lost sales, perishability, warehouse-capacity constraints and lead times. Extensive numerical simulations are conducted to demonstrate the good performances of our algorithm on real-world data.

198Outward Odyssey: Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) heavily depends on well-designed reward functions, which can be challenging to create and may introduce biases, especially for complex behaviors. Preference-based RL (PbRL) addresses this by using human feedback to construct a reward model that reflects human preferences, yet requiring considerable human involvement. To alleviate this, several PbRL methods aim to select queries that need minimal feedback. However, these methods do not directly enhance the data coverage within the preference buffer. In this paper, to emphasize the critical role of preference buffer coverage in determining the quality of the reward model, we first investigate and find that a reward model’s evaluative accuracy is the highest for trajectories within the preference buffer’s distribution and significantly decreases for out-of-distribution trajectories. Against this phenomenon, we introduce theProximal Policy Exploration (PPE)algorithm, which consists of aproximal-policy extensionmethod and amixture distribution querymethod. To achieve higher preference buffer coverage, theproximal-policy extensionmethod encourages active exploration of data within near-policy regions that fall outside the preference buffer’s distribution. To balance the inclusion of in-distribution and out-of-distribution data, themixture distribution querymethod proactively selects a mix of data from both outside and within the preference buffer’s distribution for querying. PPE not only expands the preference buffer’s coverage but also ensures the reward model’s evaluative capability for in-distribution data. Our comprehensive experiments demonstrate that PPE achieves significant improvement in both human feedback efficiency and RL sample efficiency, underscoring the importance of preference buffer coverage in PbRL tasks.

199Conditional Diffusion Models are Minimax-Optimal and Manifold-Adaptive for Conditional Distribution Estimation

[openreview] [pdf]

Abstract We consider a class of conditional forward-backward diffusion models for conditional generative modeling, that is, generating new data given a covariate (or control variable). To formally study the theoretical properties of these conditional generative models, we adopt a statistical framework of distribution regression to characterize the large sample properties of the conditional distribution estimators induced by these conditional forward-backward diffusion models. Here, the conditional distribution of data is assumed to smoothly change over the covariate. In particular, our derived convergence rate is minimax-optimal under the total variation metric within the regimes covered by the existing literature. Additionally, we extend our theory by allowing both the data and the covariate variable to potentially admit a low-dimensional manifold structure. In this scenario, we demonstrate that the conditional forward-backward diffusion model can adapt to both manifold structures, meaning that the derived estimation error bound (under the Wasserstein metric) depends only on the intrinsic dimensionalities of the data and the covariate.

200D3PM: Diffusion Model Responds to the Duty Call from Causal Discovery

[openreview] [pdf]

Abstract Causal discovery (CD) involves inferring cause-and-effect relationships as directed acyclic graphs (DAGs). In this work, we assume that the data is generated by an additive noise model (ANM). Recent work has formulated the problem as a continuous optimization problem, which consists of solving an inverse problem and satisfying an acyclicity constraint. However, solving the inverse problem in CD is often unstable, i.e. high sensitivity of the effects to perturbations in the causes. To address this instability, we formulate the inverse problem as a regularized optimization scheme and propose a novel variation-negotiation regularizer. Compared to traditional regularization techniques for the continuous optimization problem, e.g. 1\ell_1 penalty on graphs, the proposed regularizer exploits the variation variable in ANMs to stabilize the solutions (i.e. DAGs). This regularizer is advantageous as it does not rely on any hypotheses, such as graph sparsity, about true DAGs. The variation-negotiation regularizer regulates the DAG purely based on observed data.Building on the proposed regularizer, a series of improvements to the regularized optimization scheme reveal the connections between solving the regularized optimization problem and learning a diffusion model, as they share comparable objective functions. This insight leads us to develop an equivalent diffusion model called DAG-invariant Denoising Diffusion Probabilistic Model. Extensive empirical experiments on synthetic and real datasets demonstrate that the proposed diffusion model achieves outstanding performance on all datasets.

201Domain Shift Tuning over Knowledge Gap

[openreview] [pdf]

Abstract This paper introduces Domain Shift Tuning (DST), a novel framework designed to guide pre-trained language models (PLMs), including Large Language Models (LLMs), in overcoming domain discrepancies (i.e., source-target). PLMs, pre-trained on extensive and diverse corpora, the source domain, often encounter domain gaps after fine-tuning over the target domain. Unlike conventional adapters or Parameter-Efficient Fine-Tuning (PEFT) methods, DST conceptualizes domain gaps as differences in knowledge encapsulated within multiple subnetworks of PLMs. To bridge this gap, our challenge is to find a subnetwork set that corresponds to these pieces of knowledge and their weight. This direction leads DST to employ a lightweight subnetwork, the Knowledge Steering Layer (KSL), and a training objective, Knowledge Distribution Modeling (KDM). These components enable DST to fine-tune PLMs by aligning the knowledge weights of the source domain with those of the target domain. Experimental results on diverse datasets demonstrate that DST effectively mitigates the domain gap, allowing PLMs to generate text that closely aligns with even a small target corpus, thereby significantly enhancing domain adaptation for PLMs at lower computational cost.

202Deployment Efficient Reward-Free Exploration with Linear Function Approximation

[openreview] [pdf]

Abstract We study deployment efficient reward-free exploration with linear function approximation, where the goal is to explore a linear Markov Decision Process (MDP) without revealing the reward function, while minimizing the number of exploration policies used during the algorithm. We design a new reinforcement learning (RL) algorithm whose sample complexity is polynomial in the feature dimension and horizon length, while achieving nearly optimal deployment efficiency for linear MDPs under the reward-free exploration setting. More specifically, our algorithm explores a linear MDP in a reward-free manner, while using at most HH exploration policies during its execution where HH is the horizon length. Compared to previous algorithms with similar deployment efficiency guarantees, the sample complexity of our algorithm does not depend on the reachability coefficient or the explorability coefficient of the underlying MDP, which can be arbitrarily small for certain MDPs. Our result addresses an open problem proposed in prior work. To achieve such a result, we show how to truncate state-action pairs of the underlying linear MDP in a data-dependent manner, and devise efficient offline policy evaluation and offline policy optimization algorithms in the truncated linear MDP. We further show how to implement reward-free exploration mechanisms in the linear function approximation setting by carefully combines these offline RL algorithms without sacrificing the deployment efficiency.

203Channel-aware Contrastive Conditional Diffusion for Multivariate Probabilistic Time Series Forecasting

[openreview] [pdf]

Abstract Forecasting faithful trajectories of multivariate time series from practical scopes is essential for reasonable decision-making. Recent methods majorly tailor generative conditional diffusion models to estimate the target temporal predictive distribution. However, it remains an obstacle to enhance the exploitation efficiency of given implicit temporal predictive information to bolster conditional diffusion learning. To this end, we propose a generic channel-aware contrastive conditional diffusion model termed CCDM to achieve desirable multivariate probabilistic forecasting, obviating the need for curated temporal conditioning inductive biases. In detail, we first design a channel-centric conditional denoising network to manage intra-variate variations and cross-variate correlations, which can lead to scalability on diverse prediction horizons and channel numbers. Then, we devise an ad-hoc denoising-based temporal contrastive learning to explicitly amplify the predictive mutual information between past observations and future forecasts. It can coherently complement naive step-wise denoising diffusion training and improve the forecasting accuracy and generality on unknown test time series. Besides, we offer theoretic insights on the benefits of such auxiliary contrastive training refinement from both neural mutual information and temporal distribution generalization aspects. The proposed CCDM can exhibit superior forecasting capability compared to current state-of-the-art diffusion forecasters over a comprehensive benchmark, with best MSE and CRPS outcomes on 66.67% and 83.33% cases.

204Dataset Distillation for Domain Generalization

[openreview] [pdf]

Abstract Dataset Distillation (DD) has been applied to various downstream tasks and recently scaled to ImageNet-1k, highlighting its potential for practical applications. However, in real-world scenarios, robustness to unseen domains is essential, and the robustness of models trained on synthetic datasets remains uncertain. To address this, we propose a novel task, Dataset Distillation for Domain Generalization (DD for DG), and evaluate the unseen domain generalization of models trained on synthetic datasets distilled by state-of-the-art DD methods using the DomainBed benchmark. Additionally, we introduce a new method for this task, which interprets DD through the lens of image style transfer, achieving superior performance in unseen domain generalization compared to baseline approaches.

205Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback

[openreview] [pdf]

Abstract Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, directly collecting human preferences can be expensive, time-consuming, and can have high variance. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations as they are more consistent, cheaper, and scale better than human annotation; however, they are also prone to biases and errors. In this work, we introduce a routing framework that combines inputs from humans and LMs to achieve better annotation quality, while reducing the total cost of human annotation. The crux of our approach is to identify preference instances that will benefit from human annotations. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we train a performance prediction model to predict a reward model’s performance on an arbitrary combination of human and LM annotations and employ a routing strategy that selects a combination that maximizes predicted performance. We train the performance prediction model on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the selected hybrid mixture of LM and direct human preferences using our routing framework achieves better reward model performance compared to using either one exclusively. We simulate selective human preference collection on three other datasets and show that our method generalizes well to all three. We analyze features from the routing model to identify characteristics of instances that can benefit from human feedback, e.g., prompts with a moderate safety concern or moderate intent complexity. We release the dataset, annotation platform, and source code used in this study to foster more efficient and accurate preference collection in the future.

206Counterfactual Delayed Feedback Learning

[openreview] [pdf]

Abstract Estimation of heterogeneous treatment effects has gathered much attention in recent years and has been widely adopted in medicine, economics, and marketing. Previous studies assumed that one of the potential outcomes of interest could be observed timely and accurately. However, a more practical scenario is that treatment takes time to produce causal effects on the outcomes. For example, drugs take time to produce medical utility for patients and users take time to purchase items after being recommended, and ignoring such delays in feedback can lead to biased estimates of heterogeneous treatment effects. To address the above problem, we study the impact of observation time on estimating heterogeneous treatment effects by further considering the potential response time that potential outcomes have. We theoretically prove the identifiability results and further propose a principled learning approach, known as CFR-DF (Counterfactual Regression with Delayed Feedback), to simultaneously learn potential response times and potential outcomes of interest. Results on both simulated and real-world datasets demonstrate the effectiveness of our method.

207Influence Functions for Scalable Data Attribution in Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models by extending influence functions. Influence function-based data attribution methods approximate how a model’s output would have changed if some training data were removed. In supervised learning, this is usually used for predicting how the loss on a particular example would change. For diffusion models, we focus on predicting the change in the probability of generating a particular example via several proxy measurements. We show how to formulate influence functions for such quantities and how previously proposed methods can be interpreted as particular design choices in our framework. To ensure scalability of the Hessian computations in influence functions, we use a K-FAC approximation based on generalised Gauss-Newton matrices specifically tailored to diffusion models. We show that our recommended method outperforms previously proposed data attribution methods on common data attribution evaluations, such as the Linear Data-modelling Score (LDS) or retraining without top influences, without the need for method-specific hyperparameter tuning.

208AVID: Adapting Video Diffusion Models to World Models

[openreview] [pdf]

Abstract Large-scale generative models have achieved remarkable success in a number of domains. However, for sequential decision-making problems, such as robotics, action-labelled data is often scarce and therefore scaling-up foundation models for decision-making remains a challenge. A potential solution lies in leveraging widely-available unlabelled videos to train world models that simulate the consequences of actions. If the world model is accurate, it can be used to optimize decision-making in downstream tasks. Image-to-video diffusion models are already capable of generating highly realistic synthetic videos. However, these models are not action-conditioned, and the most powerful models are closed source which means they cannot be finetuned. In this work, we propose to adapt pretrained video diffusion models to action-conditioned world models, without access to the parameters of the pretrained model. Our approach, AVID, trains an adapter on a small domain-specific dataset of action-labelled videos. AVID uses a learnt mask to modify the intermediate outputs of the pretrained model and generate accurate action-conditioned videos. We evaluate AVID on video game and real-world robotics data, and show that it outperforms existing baselines for diffusion model adaptation. Our results demonstrate that if utilized correctly, pretrained video models have the potential to be powerful tools for embodied AI.

209SHIFT-RESILIENT DIFFUSIVE IMPUTATION FOR VARIABLE SUBSET FORECASTING

[openreview] [pdf]

Abstract It is common for sensor failures to result in missing data, leading to training sets being complete while test sets have only a small subset of variables. The challenge lies in utilizing incomplete data for forecasting, which is known as the Variable Subset Forecasting (VSF). In VSF tasks, significant distribution shift is present. One type is inter-series shift, which indicates changes in correlations between different series, and the other type is intra-series shift, which refers to substantial distribution differences within the same series across different time windows. Existing approaches to solving VSF tasks typically involve imputing the missing data first and then making predictions using the completed series. However, these methods do not account for the shift inherent in VSF tasks, resulting in poor model performance. To address these challenges, we propose a Shift-Resilient Diffusive Imputation (SRDI) framework against the shift. Specifically, SRDI integrates divide-conquer strategy with the denoising process, that decomposes the input into invariant patterns and variant patterns, representing the temporally stable parts of inter-series correlation and the highly fluctuating parts, respectively. By extracting spatiotemporal features from each separately and then appropriately combining them, inter-series shift can be effectively mitigated. Then, we innovatively organize SRDI and the forecasting model into a meta-learning paradigm tailored for VSF scenarios. We address the intra-series shift by treating time windows as tasks during training and employing an adaptation process before testing. Extensive experiments on four datasets have demonstrated our superior performance compared with state-of-the-art methods.

210Learning-Guided Rolling Horizon Optimization for Long-Horizon Flexible Job-Shop Scheduling

[openreview] [pdf]

Abstract Long-horizon combinatorial optimization problems, such as the Flexible Job-Shop Scheduling Problem (FJSP), often involve complex, interdependent decisions over extended time frames, posing significant challenges for existing solvers. While Rolling Horizon Optimization (RHO) addresses this by decomposing problems into overlapping shorter-horizon subproblems, such overlap often leads to redundant computations. In this paper, we present L-RHO, the first learning-guided RHO framework for long-horizon FJSP. L-RHO employs a customized attention-based model to intelligently fix variables that in hindsight did not need to be re-optimized, resulting in smaller and thus easier-to-solve subproblems. For FJSP, this means identifying operations with unchanged machine assignments between two consecutive subproblems. Empirically, L-RHO accelerates RHO by up to 54% while showing significantly improved solution quality, enabling it to outperform other heuristic and learning-based baselines. We also provide in-depth discussions and verify the desirable adaptability and generalization of L-RHO across various FJSP settings, distributions, and online scenarios. Moreover, we provide a theoretical analysis to elucidate the conditions under which learning is beneficial.

211Make Interval Bound Propagation great again

[openreview] [pdf]

Abstract In various scenarios motivated by real life, such as medical data analysis, autonomous driving, and adversarial training, we are interested in robust deep networks. A network is robust when a relatively small perturbation of the input cannot lead to drastic changes in output (like change of class, etc.). This falls under the broader scope field of Neural Network Certification (NNC). Two crucial problems in NNC are of profound interest to the scientific community: how to calculate the robustness of a given pre-trained network and how to construct robust networks. The common approach to constructing robust networks is Interval Bound Propagation (IBP). This paper demonstrates that IBP is sub-optimal in the first case due to its susceptibility to the wrapping effect. Even for linear activation, IBP gives strongly sub-optimal bounds. Consequently, one should use strategies immune to the wrapping effect to obtain bounds close to optimal ones. We adapt two classical approaches dedicated to strict computations -- Dubleton Arithmetic and Affine Arithmetic -- to mitigate the wrapping effect in neural networks. These techniques yield precise results for networks with linear activation functions, thus resisting the wrapping effect. As a result, we achieve bounds significantly closer to the optimal level than IBPs.

212Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback

[openreview] [pdf]

Abstract We study the reinforcement learning (RL) problem with trajectory feedback. The trajectory feedback based reinforcement learning problem, where the learner can only observe the accumulative noised reward along the trajectory, is particularly suitable for the practical scenarios where the agent suffers extensively from querying the reward in each single step. For a finite-horizon Markov Decision Process (MDP) with SS states, AA actions and a horizon length of HH, we develop an algorithm that enjoys an optimal regret of O~(SAH3K)\tilde{O}\left(\sqrt{SAH^3K}\right) in KK episodes for sufficiently large KK. To achieve this, our technical contributions are two-fold: (1) we incorporate reinforcement learning with linear bandits problem to construct a tighter confidence region for the reward function; (2) we construct a reference transition model to better guide the exploration process.

213AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials

[openreview] [pdf]

Abstract Graphical User Interface (GUI) agents hold great potential for automating complex tasks across diverse digital environments, from web applications to desktop software. However, the development of such agents is hindered by the lack of high-quality, multi-step trajectory data required for effective training. Existing approaches rely on expensive and labor-intensive human annotation, making them unsustainable at scale. To address this challenge, we propose \ourwork, a scalable data synthesis pipeline that generates high-quality GUI agent trajectories by leveraging web tutorials. Our method automatically gathers tutorial-like texts from the internet, transforms them into task goals with step-by-step instructions, and employs a visual-language model (VLM) agent to simulate their execution in a real digital environment. A VLM-based evaluator ensures the correctness of the generated trajectories. We demonstrate that training GUI agents with these synthesized trajectories significantly improves their grounding and planning performance over the current models. Moreover, our approach is more cost-efficient compared to traditional human annotation methods. This work underscores the potential of guided replay with web tutorials as a viable strategy for large-scale GUI agent training, paving the way for more capable and autonomous digital agents.

214Incorporating continuous dependence implies better generalization ability

[openreview] [pdf]

Abstract When applying deep-learning-based solvers to differential equations, a key challenge is how to improve their generalization ability, so that the pre-trained models could be easily adapted to new scenarios of interest. In this paper, inspired by the well-known mathematical statements on the continuous dependence of solutions to ordinary differential equations on initial values and parameters, we make a non-trivial extension of the physics-informed neural networks by incorporating additional information on the continuous dependence of solutions (abbreviated as cd-PINN). Our cd-PINN integrates the advantages of neural operators and Meta-PINN, requiring only few labeled data while enabling solving ordinary differential equations with respect to new initial values and parameters in a fast and accurate way without fine-tuning. As demonstrated through novel examples like the Logistic model, the Lotka-Volterra model as well as damped harmonic oscillators, the accuracy of cd-PINN under those untrained conditions is usually 1-3 orders of magnitude higher than PINN. Meanwhile, the GPU time cost for training in the two approaches is comparable. Therefore, we expect our cd-PINN would be particularly useful in improving the efficiency and accuracy of deep-learning-based solvers for differential equations.

215Uncertainty-aware Guided Diffusion for Missing Data in Sequential Recommendation

[openreview] [pdf]

Abstract Denoising diffusion models (DDMs) have shown significant potential in generating oracle items that best match user preference with guidance from user historical interaction sequences. However, the quality of guidance is often compromised by the unpredictable missing data in the observed sequence, leading to suboptimal item generation. To tackle this challenge, we propose a novel uncertainty-aware guided diffusion model (DreamMiss) to alleviate the influence of missing data. The core of DreamMiss is the utilization of a dual-side Thompson sampling (DTS) strategy, which simulates the stochastical mechanism of missing data without disrupting preference evolution. Specifically, we first define dual-side probability models to capture user preference evolution, taking into account both local item continuity and global sequence stability. We then strategically remove items based on these two models with DTS, creating uncertainty-aware guidance for DDMs to generate oracle items. This can achieve DDMs’ consistency regularization, enabling them to resile against uncertain missing data. Additionally, to accelerate sampling in the reverse process, DreamMiss is implemented under the framework of denoising diffusion implicit models (DDIM). Extensive experimental results show that DreamMiss significantly outperforms baselines in sequential recommendation.

216DyDiff: Long-Horizon Rollout via Dynamics Diffusion for Offline Reinforcement Learning

[openreview] [pdf]

Abstract With the great success of diffusion models (DMs) in generating realistic synthetic vision data, many researchers have investigated their potential in decision-making and control. Most of these works utilized DMs to sample directly from the trajectory space, where DMs can be viewed as a combination of dynamics models and policies. In this work, we explore how to decouple DMs’ ability as dynamics models in fully offline settings, allowing the learning policy to roll out trajectories. As DMs learn the data distribution from the dataset, their intrinsic policy is actually the behavior policy induced from the dataset, which results in a mismatch between the behavior policy and the learning policy. We propose Dynamics Diffusion, short as DyDiff, which can inject information from the learning policy to DMs iteratively. DyDiff ensures long-horizon rollout accuracy while maintaining policy consistency and can be easily deployed on model-free algorithms. We provide theoretical analysis to show the advantage of DMs on long-horizon rollout over models and demonstrate the effectiveness of DyDiff in the context of offline reinforcement learning, where the rollout dataset is provided but no online environment for interaction. Our code is athttps://anonymous.4open.science/r/DyDiff.

217The Convergence of Second-Order Sampling Methods for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have achieved great success in generating samples from complex distributions, notably in the domains of images and videos. Beyond the experimental success, theoretical insights into their performance have been illuminated, particularly concerning the convergence of diffusion models when applied with discretization methods such as Euler-Maruyama (EM) and Exponential Integrator (EI). This paper embarks on analyzing the convergence of the higher-order discretization method (SDE-DPM-2) under L2L^2-accurate score estimate. Our findings reveal that to attain O~(ϵ02)\tilde{O}(\epsilon_0^2) Kullback-Leibler (KL) divergence between the target and the sampled distributions, the sampling complexity - or the required number of discretization steps - for SDE-DPM-2 is O~(1/ϵ0)\tilde{O}(1/\epsilon_0), which is better than the currently known sample complexity of EI given by O~(1/ϵ02)\tilde{O}(1/\epsilon_0^2). We further extend our analysis to the Runge-Kutta-2 (RK-2) method, which demands a sampling complexity of O~(1/ϵ02)\tilde{O}(1/\epsilon_0^2), indicating that SDE-DPM-2 is more efficient than RK-2. Our study also demonstrates that the convergence of SDE-DPM-2 under Variance Exploding (VE) SDEs aligns with that of Variance Preserving (VP) SDEs, highlighting the adaptability of SDE-DPM-2 across various diffusion models frameworks.

218An Efficient Framework for Crediting Data Contributors of Diffusion Models

[openreview] [pdf]

Abstract As diffusion models are deployed in real-world settings and their performance driven by training data, appraising the contribution of data contributors is crucial to creating incentives for sharing quality data and to implementing policies for data compensation. Depending on the use case, model performance corresponds to various global properties of the distribution learned by a diffusion model (e.g., overall aesthetic quality). Hence, here we address the problem of attributing global properties of diffusion models to data contributors. The Shapley value provides a principled approach to valuation by uniquely satisfying game-theoretic axioms of fairness. However, estimating Shapley values for diffusion models is computationally impractical because it requires retraining and rerunning inference on many subsets of data contributors. We introduce a method to efficiently retrain and rerun inference for Shapley value estimation, by leveraging model pruning and fine-tuning. We evaluate the utility of our method with three use cases: (i) image quality for a DDPM trained on a CIFAR dataset, (ii) demographic diversity for an LDM trained on CelebA-HQ, and (iii) aesthetic quality for a Stable Diffusion model LoRA-finetuned on Post-Impressionist artworks. Our results empirically demonstrate that our framework can identify important data contributors across global properties, outperforming existing attribution methods for diffusion models.

219Policy Gradient with Tree Expansion

[openreview] [pdf]

Abstract Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax---a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.

220DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models

[openreview] [pdf]

Abstract Recent advancements in generative models have sparked a significant interest within the machine learning community. Particularly, diffusion models have demonstrated remarkable capabilities in synthesizing images and speech. Studies such as those by Lee et al. (2023), Black et al. (2023), Wang et al. (2023), and Fan et al. (2024) illustrate that Reinforcement Learning with Human Feedback (RLHF) can enhance diffusion models for image synthesis. However, due to architectural differences between these models and those employed in speech synthesis, it remains uncertain whether RLHF could similarly benefit speech synthesis models. In this paper, we explore the practical application of RLHF to diffusion-based text-to-speech synthesis, leveraging the mean opinion score (MOS) as predicted by UTokyo-SaruLab MOS prediction system (Saeki et al., 2022) as a proxy loss. We introduce diffusion model loss-guided RL policy optimization (DLPO) and compare it against other RLHF approaches, employing the NISQA speech quality and naturalness assessment model (Mittag et al., 2021) and human preference experiments for further evaluation. Our results show that RLHF can enhance diffusion-based text-to-speech synthesis models, and, moreover, DLPO can better improve diffusion models in generating natural and high quality speech audios.

221Learning Loss Landscapes in Preference Optimization

[openreview] [pdf]

Abstract We present an empirical study investigating how specific properties of preference datasets, such as mixed-quality or noisy data, affect the performance of Preference Optimization (PO) algorithms. Our experiments, conducted in MuJoCo environments, reveal several scenarios where state-of-the-art PO methods experience significant drops in performance. To address this issue, we introduce a novel PO framework based on mirror descent, which can recover existing methods like Direct Preference Optimization (DPO) and Odds-Ratio Preference Optimization (ORPO) for specific choices of the mirror map. Within this framework, we employ evolutionary strategies to discover new loss functions capable of handling the identified problematic scenarios. These new loss functions lead to significant performance improvements over DPO and ORPO across several tasks. Additionally, we demonstrate the generalization capability of our approach by applying the discovered loss functions to fine-tuning large language models using mixed-quality data, where they outperform ORPO.

222Time Can Invalidate Algorithmic Recourse

[openreview] [pdf]

Abstract Algorithmic Recourse (AR) aims to provide users with actionable steps to overturn unfavourable decisions made by machine learning predictors. However, these actions often take time to implement (e.g., getting a degree can take years), and their effects may vary as the world evolves. Thus, it is natural to ask for recourse that remains valid in a dynamic environment. In this paper, we study the robustness of algorithmic recourse over time by casting the problem through the lens of causality. We demonstrate theoretically and empirically that (even robust) causal AR methods can fail over time except in the -- unlikely -- case that the world is stationary. Even more critically, unless the world is fully deterministic, counterfactual AR cannot be solved optimally. To account for this, we propose a simple yet effective algorithm for temporal AR that explicitly accounts for time. Our simulations on synthetic and realistic datasets show how considering time produces more resilient solutions to potential trends in the data distribution.

223Rethinking and Defending Protective Perturbation in Personalized Diffusion Models

[openreview] [pdf]

Abstract Personalized diffusion models (PDMs) have become prominent for adapting pretrained text-to-image models to generate images of specific subjects using minimal training data. However, PDMs are susceptible to minor adversarial perturbations, leading to significant degradation when fine-tuned on corrupted datasets. These vulnerabilities are exploited to create protective perturbations that prevent unauthorized image generation. Existing purification methods attempt to mitigate this issue but often over-purify images, resulting in information loss. In this work, we conduct an in-depth analysis of the fine-tuning process of PDMs through the lens of shortcut learning. We hypothesize and empirically demonstrate that adversarial perturbations induce a latent-space misalignment between images and their text prompts in the CLIP embedding space. This misalignment causes the model to erroneously associate noisy patterns with unique identifiers during fine-tuning, resulting in poor generalization. Based on these insights, we propose a systematic defense framework that includes data purification and contrastive decoupling learning. We first employ off-the-shelf image restoration techniques to realign images with their original semantic meanings in latent space. Then, we introduce contrastive decoupling learning with noise tokens to decouple the learning of personalized concepts from spurious noise patterns. Our study not only uncovers fundamental shortcut learning vulnerabilities in PDMs but also provides a comprehensive evaluation framework for developing stronger protection. Our extensive evaluation demonstrates its superiority over existing purification methods and stronger robustness against adaptive perturbation.

224Diverse Policies Recovering via Pointwise Mutual Information Weighted Imitation Learning

[openreview] [pdf]

Abstract Recovering a spectrum of diverse policies from a set of expert trajectories is an important research topic in imitation learning. After determining a latent style for a trajectory, previous diverse polices recovering methods usually employ a vanilla behavioral cloning learning objective conditioned on the latent style, treating each state-action pair in the trajectory with equal importance. Based on an observation that in many scenarios, behavioral styles are often highly relevant with only a subset of state-action pairs, this paper presents a new principled method in diverse polices recovering. In particular, after inferring or assigning a latent style for a trajectory, we enhance the vanilla behavioral cloning by incorporating a weighting mechanism based on pointwise mutual information. This additional weighting reflects the significance of each state-action pair’s contribution to learning the style, thus allowing our method to focus on state-action pairs most representative of that style. We provide theoretical justifications for our new objective, and extensive empirical evaluations confirm the effectiveness of our method in recovering diverse polices from expert data.

225Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering

[openreview] [pdf]

Abstract Recent empirical studies have demonstrated that diffusion models can effectively learn the image distribution and generate new samples. Remarkably, these models can achieve this even with a small number of training samples despite a large image dimension, circumventing the curse of dimensionality. In this work, we provide theoretical insights into this phenomenon by leveraging key empirical observations: (i) the low intrinsic dimensionality of image data, (ii) a union of manifold structure of image data, and (iii) the low-rank property of the denoising autoencoder in trained diffusion models. These observations motivate us to assume the underlying data distribution of image data as a mixture of low-rank Gaussians and to parameterize the denoising autoencoder as a low-rank model according to the score function of the assumed distribution. With these setups, we rigorously show that optimizing the training loss of diffusion models is equivalent to solving the canonical subspace clustering problem over the training samples. Based on this equivalence, we further show that the minimal number of samples required to learn the underlying distribution scales linearly with the intrinsic dimensions under the above data and model assumptions. This insight sheds light on why diffusion models can break the curse of dimensionality and exhibit the phase transition in learning distributions. Moreover, we empirically establish a correspondence between the subspaces and the semantic representations of image data, facilitating image editing. We validate these results with corroborated experimental results on both simulated distributions and image datasets.

226FairGen: controlling fair generations in diffusion models via adaptive latent guidance

[openreview] [pdf]

Abstract Diffusion models have shown remarkable proficiency in generating photorealistic images, but their outputs often exhibit biases toward specific social groups, raising ethical concerns and limiting their wider adoption. This paper tackles the challenge of mitigating generative bias in diffusion models while maintaining image quality. We propose FairGen, an adaptive latent guidance mechanism enhanced by an auxiliary memory module, which operates during inference to control the generation distribution at a desired level. The latent guidance module dynamically adjusts the direction in the latent space to influence specific attributes, while the memory module tracks prior generation statistics and steers the scalar direction to align with the target distribution. To evaluate FairGen comprehensively, we introduce a bias evaluation benchmark tailored for diffusion models, spanning diverse domains such as employment, education, finance, and healthcare, along with complex user-generated prompts. Extensive empirical evaluations demonstrate that FairGen outperforms existing bias mitigation approaches, achieving substantial bias reduction while preserving generation quality. Furthermore, FairGen offers precise and flexible control over various target distributions, enabling nuanced adjustments to the generative process.

227Improving Influence-based Instruction Tuning Data Selection for Balanced Learning of Diverse Capabilities

[openreview] [pdf]

Abstract Selecting appropriate training data is crucial for successful supervised instruction fine-tuning (SFT), which aims to (1) elicit strong capabilities from pretrained large language models (LLMs), and (2) achieve balanced performance across a diverse range of tasks. Algorithms based on influence estimation have shown promise in achieving (1) through estimating the contribution of each training example to model’s prediction on a downstream task, but often struggle with (2). Through systematic experiments, we attribute their underperformance to an inherent bias---certain tasks intrinsically have greater influence than others. Directly comparing influence scores across different tasks would thus bias the selected data towards these tasks, hurting the LM’s performance not only on other capabilities, but also, surprisingly, on the tasks for which the selected data has high influence.We propose BIDS, a novel Data Selection algorithm that targets Influential data in a Balanced way, to address this issue. Aiming to address the biased influence, BIDS first normalizes influence scores of the training data with respect to each downstream task at an instance level. BIDS then applies an iterative optimization process to further balance the selection of influential training data. At each step, BIDS selects the training example that bears the highest influence on the most underrepresented capability by the currently selected data. Experimental results demonstrate that BIDS consistently outperforms state-of-the-art influence-based data selection algorithms under various budgets. Remarkably, training on a 15% subset by BIDS can even outperform full-dataset training with a much more balanced distribution of downstream performance. Our analysis further highlights the importance of both instance-level normalization and iterative optimization of selected data for balanced learning of diverse capabilities.

228Towards Adaptive Time Series Foundation Models Against Distribution Shift

[openreview] [pdf]

Abstract Foundation models have demonstrated remarkable success across diverse machine-learning domains through large-scale pretraining. However, their application to time series data poses challenges due to substantial mismatches in the distributions of pretraining datasets. In this paper, we tackle this issue by proposing a domain-aware adaptive normalization strategy within the Transformer architecture. Specifically, we replace the traditional LayerNorm with a prototype-guided dynamic normalization mechanism, where learned prototypes represent distinct data distributions, and sample-to-prototype similarity determines the appropriate normalization layer. This approach effectively captures the diverse characteristics of time series data, ensuring better alignment between pretrained representations and downstream tasks. Our method significantly improves fine-tuning performance, outperforming vanilla pretraining techniques and reducing the negative impact of distribution shifts. Extensive experiments on various real-world time series datasets demonstrate the efficacy of our approach, paving the way for more robust and generalizable time series foundation models.

229Leveraging Diffusion Transformers for Stock Factor Augmentation in Financial Markets

[openreview] [pdf]

Abstract Data scarcity poses a significant challenge in training machine learning models for stock forecasting, often leading to low signal-to-noise ratio (SNR) and data homogeneity that degrade model performance. To address these issues, we introduce DiffsFormer, a novel approach utilizing artificial intelligence-generated samples (AIGS) with a Transformer-based Diffusion Model. Initially trained on a large-scale source domain with conditional guidance to capture global joint distribution, DiffsFormer augments training by editing existing samples for specific downstream tasks, allowing control over the deviation of generated data from the target domain. We evaluate DiffsFormer on the CSI300 and CSI800 datasets using eight commonly used machine learning models, achieving relative improvements of 7.3% and 22.1% in annualized return ratio, respectively. Extensive experiments provide insights into DiffsFormer’s functionality and its components, illustrating their role in mitigating data scarcity and enhancing model performance. Our findings demonstrate the potential of AIGS and DiffsFormer in addressing data limitations in stock forecasting, with the ability to generate realistic stock factors and control the editing process. These results validate our approach and contribute to a deeper understanding of its underlying mechanisms.

230Transformers Struggle to Learn to Search Without In-context Exploration

[openreview] [pdf]

Abstract Search is an ability fundamental in many important tasks, and recent studies have shown that large-language models (LLMs) struggle to perform search robustly. It is unknown whether this inability is due to a lack of data, insufficient model parameters, or fundamental limitations of the transformer architecture. In this work, we use graph connectivity as a testbed to generate effectively limitless high-coverage data to train small transformers and test whether they can learn to perform search. We find that, under specific conditions on the training distribution, the transformer is able to learn to search.We analyze the algorithm that the transformer has learned through a novel mechanistic interpretability technique that enables us to extract the computation graph from the trained model. We find that for each vertex in the input graph, transformers compute the set of vertices reachable from that vertex. Each layer then progressively expands these sets, allowing the model to search over a number of vertices exponential in the number of layers.However, we find that as the input graph size increases, the transformer has greater difficulty in learning the task. This difficulty is not resolved even as the number of parameters is increased, suggesting that simply increasing the scale of LLMs will not lead to robust search abilities.Finally, we show that by loosening the task to allow the model toexplorethe graphin-context, allowing the model to visit vertices that do not necessarily lead to the goal and backtrack, the transformer is able to more easily learn to search robustly.

231Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel

[openreview] [pdf]

Abstract Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the data pool through the collaboration between two models, the instruction generator and the navigator, without any human-in-the-loop annotation. Specifically, SRDF starts with using a base generator to create an initial data pool for training a base navigator, followed by applying the trained strong navigator to filter the data pool. This leads to higher-fidelity data to train a better generator, which can, in turn, produce higher-quality data for training the next-round navigator. Such a flywheel establishes a data self-refining process, yielding a continuously improved and highly effective dataset for large-scale language-guided navigation learning. Our experiments demonstrate that after several flywheel rounds, the navigator elevates the performance boundary from 70% to 78% SPL on the classic R2R test set, surpassing human performance (76%) for the first time. Meanwhile, this process results in a superior instruction generator, as reflected by the improved SPICE from 23.5 to 25.7, better than all published approaches tailored for VLN instruction generation. Finally, we demonstrate the scalability of our method through increasing environment and instruction diversity, and the generalization ability of our pre-trained navigator across various downstream navigation tasks, surpassing state-of-the-art performance by a large margin in all cases. Code is uploaded as supplementary materials and all our data/code/models will also be publicly released.

232DOME: Taming Diffusion Model into High-Fidelity Controllable Occupancy World Model

[openreview] [pdf]

Abstract We propose DOME, a diffusion-based world model that predicts future occupancy frames based on past occupancy observations. The ability of this world model to capture the evolution of the environment is crucial for planning in autonomous driving. Compared to 2D video-based world models, the occupancy world model utilizes a native 3D representation, which features easily obtainable annotations and is modality-agnostic. This flexibility has the potential to facilitate the development of more advanced world models. Existing occupancy world models either suffer from detail loss due to discrete tokenization or rely on simplistic diffusion architectures, leading to inefficiencies and difficulties in predicting future occupancy with controllability. Our DOME exhibits two key features: (1) High-Fidelity and Long-Duration Generation. We adopt a spatial-temporal diffusion transformer to predict future occupancy frames based on historical context. This architecture efficiently captures spatial-temporal information, enabling high-fidelity details and the ability to generate predictions over long durations. (2)Fine-grained Controllability. We address the challenge of controllability in predictions by introducing a trajectory resampling method, which significantly enhances the model’s ability to generate controlled predictions. Extensive experiments on the widely used nuScenes dataset demonstrate that our method surpasses existing baselines in both qualitative and quantitative evaluations, establishing a new state-of-the-art performance on nuScenes. Specifically, our approach surpasses the baseline by 10.5% in mIoU and 21.2% in IoU for occupancy reconstruction, and by 36.0% in mIoU and 24.6% in IoU for 4D occupancy forecasting.

233Improving Probabilistic Diffusion Models With Optimal Covariance Matching

[openreview] [pdf]

Abstract The probabilistic diffusion model has become highly effective across various domains. Typically, sampling from a diffusion model involves using a denoising distribution characterized by a Gaussian with a learned mean and either fixed or learned covariances. In this paper, we leverage the recently proposed covariance moment matching technique and introduce a novel method for learning the diagonal covariances. Unlike traditional data-driven covariance approximation approaches, our method involves directly regressing the optimal analytic covariance using a new, unbiased objective named Optimal Covariance Matching (OCM). This approach can significantly reduce the approximation error in covariance prediction. We demonstrate how our method can substantially enhance the sampling efficiency, recall rate and likelihood of both diffusion models and latent diffusion models.

234Learning Actionable Counterfactual Explanations in Large State Spaces

[openreview] [pdf]

Abstract An increasing number of high-stakes domains rely on machine learning to make decisions that have significant consequences for individuals, such as in loan approvals and college admissions. The black-box nature of these processes has led to a growing demand for solutions that make individuals aware of potential ways they could improve their qualifications. Counterfactual explanations (CFEs) are one form of feedback commonly used to provide insight into decision-making systems. Specifically, contemporary CFE generators provide explanations in the form of low-level CFEs whose constituent actions precisely describe how much a negatively classified individual should add or subtract from their input features to achieve the desired positive classification. However, the low-level CFE generators have several shortcomings: they are hard to scale, often misaligned with real-world conditions, constrained by information access (e.g., they can not query the classifier), and make inadequate use of available historical data. To address these challenges, we propose three data-driven CFE generators that create generalizable CFEs with desirable characteristics for individuals and decision-makers. Through extensive empirical experiments, we compare the proposed CFE generators with a low-level CFE generator on four real-world (BRFSS, Foods, and two NHANES datasets), five semi-synthetic, and five variants of fully-synthetic datasets. Our problem can also be seen as learning an optimal policy in a family of large but deterministic Markov decision processes.

235DDRL: A DIFFUSION-DRIVEN REINFORCEMENT LEARNING APPROACH FOR ENHANCED TSP SOLUTIONS

[openreview] [pdf]

Abstract The Traveling Salesman Problem (TSP) is a fundamental challenge in combinatorial optimization, known for its NP-hard complexity. Reinforcement Learning (RL) has proven to be effective in managing larger and more complex TSP instances, yet it encounters challenges such as training instability and necessity for a substantial amount of training resources. Diffusion models, known for iteratively refining noisy inputs to generate high-quality solutions, offer scalability and exploration capabilities for TSP but may struggle with optimality in complex cases and require large, resource-intensive training datasets. To address these limitations, we propose DDRL (Diffusion-Driven Reinforcement Learning), which integrates diffusion models with RL. DDRL employs a latent vector to generate an adjacency matrix, merging image and graph learning within a unified RL framework. By utilizing a pre-trained diffusion model as a prior, DDRL exhibits strong scalability and enhanced convergence stability. We also provide theoretical analysis that training DDRL aligns with the diffusion policy gradient in the process of solving the TSP, demonstrating its effectiveness. Additionally, we introduce novel constraint datasets—obstacle, path, and cluster constraints—to evaluate DDRL’s generalization capabilities. We demonstrate that DDRL offers a robust solution that outperforms existing methods in both basic and constrained TSP problems.

236Classroom-Inspired Multi-Mentor Distillation with Adaptive Learning Strategies

[openreview] [pdf]

Abstract We proposeClassroomKD, a novel multi-mentor knowledge distillation framework inspired by classroom environments to enhance knowledge transfer between student and multiple mentors. Unlike traditional methods that rely on fixed mentor-student relationships, our framework dynamically selects and adapts the teaching strategies of diverse mentors based on their effectiveness for each data sample. ClassroomKD comprises two main modules: theKnowledge Filtering (KF)Module and theMentoringModule. The KF Module dynamically ranks mentors based on their performance for each input, activating only high-quality mentors to minimize error accumulation and prevent information loss. The Mentoring Module adjusts the distillation strategy by tuning each mentor’s influence according to the performance gap between the student and mentors, effectively modulating the learning pace. Extensive experiments on image classification (CIFAR-100 and ImageNet) and 2D human pose estimation (COCO Keypoints and MPII Human Pose) demonstrate that ClassroomKD outperforms existing knowledge distillation methods for different network architectures. Our results highlight that a dynamic and adaptive approach to mentor selection and guidance leads to more effective knowledge transfer, paving the way for enhanced model performance through distillation.

237Prompt Optimization with Logged Bandit Data

[openreview] [pdf]

Abstract We study how to use naturally available user feedback, such as clicks, to optimize large language model (LLM) pipelines for generating personalized sentences using prompts. Naive approaches, which estimate the policy gradient in the prompt space, suffer either from variance caused by the large action space of prompts or bias caused by inaccurate reward predictions. To circumvent these challenges, we proposeDirect Sentence Off-policy gradient(DSO), which estimates the policy gradient by leveraging similarity among generated sentences, substantially reducing variance while suppressing the bias. Empirical results on our newly established suite of benchmarks, calledOfflinePrompts, demonstrate the effectiveness of the proposed approach in generating personalized descriptions for movie recommendations, particularly when the number of candidate prompts is large.

238Exploring the Design Space of Diffusion Bridge Models via Stochasticity Control

[openreview] [pdf]

Abstract Diffusion bridge models effectively facilitate image-to-image (I2I) translation by connecting two distributions. However, existing methods overlook the impact of noise in sampling SDEs, transition kernel, and the base distribution on sampling efficiency, image quality and diversity. To address this gap, we propose the Stochasticity-controlled Diffusion Bridge (SDB), a novel theoretical framework that extends the design space of diffusion bridges, and provides strategies to mitigate singularities during both training and sampling. By controlling stochasticity in the sampling SDEs, our sampler achieves speeds up to 5×5 \times faster than the baseline, while also producing lower FID scores. After training, SDB sets new benchmarks in image quality and sampling efficiency via managing stochasticity within the transition kernel. Furthermore, introducing stochasticity into the base distribution significantly improves image diversity, as quantified by a newly introduced metric.

239Consistency Model is an Effective Posterior Sample Approximation for Diffusion Inverse Solvers

[openreview] [pdf]

Abstract Diffusion Inverse Solvers (DIS) are designed to sample from the conditional distribution pθ(X0y)p_{\theta}(X_0|y), with a pre-trained diffusion model pθ(X0)p_{\theta}(X_0), an operator f(.)f(.), and a measurement y=f(x0)y=f(x'_0) derived from an unknown image x0x'_0. Existing DIS estimate the conditional score function by evaluating f(.)f(.) with an approximated posterior sample drawn from pθ(X0Xt)p_{\theta}(X_0|X_t). However, most prior approximations rely on the posterior means, which may not lie in the support of the image distribution and diverge from the appearance of genuine images. Such out-of-support samples may significantly degrade the performance of the operator f(.)f(.), particularly when it is a neural network. In this paper, we introduces a novel approach for posterior approximation that guarantees to generate valid samples within the support of the image distribution, and also enhances the compatibility with neural network-based operators f(.)f(.). We first demonstrate that the solution of the Probability Flow Ordinary Differential Equation (PF-ODE) with an initial value xtx_t yields an effective posterior sample of pθ(X0Xt=xt)p_{\theta}(X_0|X_t=x_t) with high probability. Based on this observation, we adopt the Consistency Model (CM), which is distilled from PF-ODE, for posterior sampling. Through extensive experiments, we show that our proposed method for posterior sample approximation substantially enhance the effectiveness of DIS for neural network operators f(.)f(.) (e.g., in semantic segmentation). The source code is provided in the supplementary material.

240Distilling the Knowledge in Data Pruning

[openreview] [pdf]

Abstract With the increasing size of datasets used for training neural networks, data pruning has gained traction in recent years. However, most current data pruning algorithms are limited in their ability to preserve accuracy compared to models trained on the full data, especially in high pruning regimes. In this paper we explore the application of data pruning while incorporating knowledge distillation (KD) when training on a pruned subset. That is, rather than relying solely on ground-truth labels, we also use the soft predictions from a teacher network pre-trained on the complete data. By integrating KD into training, we demonstrate significant improvement across datasets, pruning methods, and on all pruning fractions. We first establish a theoretical motivation for employing self-distillation to improve training on pruned data. Then, we empirically make a compelling and highly practical observation: using KD, simple random pruning is comparable or superior to sophisticated pruning methods across all pruning regimes. On ImageNet for example, we achieve superior accuracy despite training on a random subset of only 50% of the data. Additionally, we demonstrate a crucial connection between the pruning factor and the optimal knowledge distillation weight. This helps mitigate the impact of samples with noisy labels and low-quality images retained by typical pruning algorithms. Finally, we make an intriguing observation: when using lower pruning fractions, larger teachers lead to accuracy degradation, while surprisingly, employing teachers with a smaller capacity than the student’s may improve results. Our code will be made available.

241Continuous Ensemble Weather Forecasting with Diffusion models

[openreview] [pdf]

Abstract Weather forecasting has seen a shift in methods from numerical simulations to data-driven systems. While initial research in the area focused on deterministic forecasting, recent works have used diffusion models to produce skillful ensemble forecasts. These models are trained on a single forecasting step and rolled out autoregressively. However, they are computationally expensive and accumulate errors for high temporal resolution due to the many rollout steps. We address these limitations with Continuous Ensemble Forecasting, a novel and flexible method for sampling ensemble forecasts in diffusion models. The method can generate temporally consistent ensemble trajectories completely in parallel, with no autoregressive steps. Continuous Ensemble Forecasting can also be combined with autoregressive rollouts to yield forecasts at an arbitrary fine temporal resolution without sacrificing accuracy. We demonstrate that the method achieves competitive results for global weather forecasting with good probabilistic properties.

242Mitigating Goal Misgeneralization via Minimax Regret

[openreview] [pdf]

Abstract Robustness research in reinforcement learning often focuses on ensuring that the policy consistently exhibits capable, goal-driven behavior. However, not every capable behavior is the intended behavior.Goal misgeneralizationcan occur when the policy generalizes capably with respect to a ‘proxy goal’ whose optimal behavior correlates with the intended goal on the training distribution, but not out of distribution. Though the intended goal would be ambiguous if they were perfectly correlated in training, we show progress can be made if the goals are onlynearly ambiguous, with the training distribution containing a small proportion ofdisambiguatinglevels. We observe that the training signal from disambiguating levels could be amplified by regret-based prioritization. We formally show that approximately optimal policies on maximal-regret levels avoid the harmful effects of goal misgeneralization, which may exist without this prioritization. Empirically, we find that current regret-based Unsupervised Environment Design (UED) methods can mitigate the effects of goal misgeneralization, though do not always entirely eliminate it. Our theoretical and empirical results show that as UED methods improve they could further mitigate goal misgeneralization in practice.

243A Closer Look at Time Steps is Worthy of Triple Speed-Up for Diffusion Model Training

[openreview] [pdf]

Abstract Training diffusion models is always a computation-intensive task. In this paper, we introduce a novel speed-up method for diffusion model training, called, which is based on a closer look at time steps. Our key findings are: i) Time steps can be empirically divided into acceleration, deceleration, and convergence areas based on the process increment. ii) These time steps are imbalanced, with many concentrated in the convergence area. iii) The concentrated steps provide limited benefits for diffusion training. To address this, we design an asymmetric sampling strategy that reduces the frequency of steps from the convergence area while increasing the sampling probability for steps from other areas. Additionally, we propose a weighting strategy to emphasize the importance of time steps with rapid-change process increments. As a plug-and-play and architecture-agnostic approach, SpeeD consistently achieves 3-times acceleration across various diffusion architectures, datasets, and tasks. Notably, due to its simple design, our approach significantly reduces the cost of diffusion model training with minimal overhead. Our research enables more researchers to train diffusion models at a lower cost.

244DiffMove: Human Trajectory Recovery via Conditional Diffusion Model

[openreview] [pdf]

Abstract Recovering human trajectories from incomplete or missing data is crucial for many mobility-based urban applications, e.g., urban planning, transportation, and location-based services. Existing methods mainly rely on recurrent neural networks or attention mechanisms. Though promising, they encounter limitations in capturing complex spatial-temporal dependencies in low-sampling trajectories. Recently, diffusion models show potential in content generation. However, most of proposed methods are used to generate contents in continuous numerical representations, which cannot be directly adapted to the human location trajectory recovery. In this paper, we introduce a conditional diffusion-based trajectory recovery method, namely, DiffMove. It first transforms locations in trajectories into the embedding space, in which the embedding denoising is performed, and then missing locations are recovered by an embedding decoder. DiffMove not only improves accuracy by introducing high-quality generative methods in the trajectory recovery, but also carefully models the transition, periodicity, and temporal patterns in human mobility. Extensive experiments based on two representative real-world mobility datasets are conducted, and the results show significant improvements (an average of 11% in recall) over the baselines.

245Earlier Tokens Contribute More: Learning Direct Preference Optimization From Temporal Decay Perspective

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has gained attention as an efficient alternative to reinforcement learning from human feedback (RLHF) for aligning large language models (LLMs) with human preferences. Despite its advantages, DPO suffers from a length bias, generating responses longer than those from the reference model. Existing solutions like SimPO and SamPO address this issue but uniformly treat the contribution of rewards across sequences, overlooking temporal dynamics. To this end, we propose an enhanced preference optimization method that incorporates a temporal decay factor controlled by a gamma parameter. This dynamic weighting mechanism adjusts the influence of each reward based on its position in the sequence, prioritizing earlier tokens that are more critical for alignment. By adaptively focusing on more relevant feedback, our approach mitigates overfitting to less pertinent data and remains responsive to evolving human preferences. Experimental results on several benchmarks show that our approach consistently outperforms vanilla DPO by 5.9-8.8 points on AlpacaEval 2 and 3.3-9.7 points on Arena-Hard across different model architectures and sizes.

246Tighter Performance Theory of FedExProx

[openreview] [pdf]

Abstract We revisit FedExProx -- a recently proposed distributed optimization method designed to enhance convergence properties of parallel proximal algorithms via extrapolation. In the process, we uncover a surprising flaw: its known theoretical guarantees on quadratic optimization tasks are no better than those offered by the vanilla Gradient Descent (GD) method. Motivated by this observation, we develop a novel analysis framework, establishing a tighter linear convergence rate for non-strongly convex quadratic problems. By incorporating both computation and communication costs, we demonstrate that FedExProx can indeed provably outperform GD, in stark contrast to the original analysis. Furthermore, we consider partial participation scenarios and analyze two adaptive extrapolation strategies-based on gradient diversity and Polyak stepsizes --- again significantly outperforming previous results. Moving beyond quadratics, we extend the applicability of our analysis to general functions satisfying the Polyak-Łojasiewicz condition, outperforming the previous strongly convex analysis while operating under weaker assumptions. Backed by empirical results, our findings point to a new and stronger potential of FedExProx, paving the way for further exploration of the benefits of extrapolation in federated learning.

247Regret Bounds and Reinforcement Learning Exploration of EXP-based Algorithms

[openreview] [pdf]

Abstract We study the challenging exploration incentive problem in both bandit and reinforcement learning, where the rewards are scale-free and potentially unbounded, driven by real-world scenarios and differing from existing work. Past works in reinforcement learning either assume costly interactions with an environment or propose algorithms finding potentially low quality local maxima. Motivated by EXP-type methods that integrate multiple agents (experts) for exploration in bandits with the assumption that rewards are bounded, we propose new algorithms, namely EXP4.P and EXP4-RL for exploration in the unbounded reward case, and demonstrate their effectiveness in these new settings. Unbounded rewards introduce challenges as the regret cannot be limited by the number of trials, and selecting suboptimal arms may lead to infinite regret. Specifically, we establish EXP4.P’s regret upper bounds in both bounded and unbounded linear and stochastic contextual bandits. Surprisingly, we also find that by including one sufficiently competent expert, EXP4.P can achieve global optimality in the linear case. This unbounded reward result is also applicable to a revised version of EXP3.P in the Multi-armed Bandit scenario. In EXP4-RL, we extend EXP4.P from bandit scenarios to reinforcement learning to incentivize exploration by multiple agents, including one high-performing agent, for both efficiency and excellence. This algorithm has been tested on difficult-to-explore games and shows significant improvements in exploration compared to state-of-the-art.

248d-Linear Generation Error Bound for Distributed Diffusion Models

[openreview] [pdf]

Abstract The recent rise of distributed diffusion models has been driven by the explosive growth of data and the increasing demand for data generation. However, distributed diffusion models face unique challenges in resource-constrained environments. Existing approaches lack theoretical support, particularly with respect to generation error in such settings. In this paper, we are the first to derive the generation error bound for distributed diffusion models with arbitrary pruning, not assuming perfect score approximation. By analyzing the convergence of the score estimation model trained with arbitrary pruning in a distributed manner, we highlight the impact of complex factors such as model evolution dynamics and arbitrary pruning on the generation performance. This theoretical generation error bound is linear in the data dimension dd, aligning with state-of-the-art results in the single-worker paradigm.

249Dual Caption Preference Optimization for Diffusion Models

[openreview] [pdf]

Abstract Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified a performance issue in previous optimization methods, where using the same prompt for preferred and less preferred images, known as the irrelevant prompt issue, restricts model performance. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT-Chosen, Diffusion-DPO and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.

250Enhancing Adversarial Transferability Through Exploiting Multiple Randomized Trajectories for Better Global Guidance

[openreview] [pdf]

Abstract Deep neural networks are well-known for their vulnerability to adversarial examples, particularly demonstrating poor performance in white-box attack settings. However, most white-box attack methods heavily depend on the target model and often get trapped in local optima, leading to limited adversarial transferability. Techniques such as momentum, variance reduction, and gradient penalty mitigate overfitting by combining historical information with local regions around adversarial examples, but exploration of the global loss landscape remains constrained, hindering further performance improvements.In this work, we find that initialization influences the optimization of adversarial examples, often guiding them toward multiple local optima, providing an opportunity to explore the loss landscape more effectively. Based on this insight, we propose two strategies: randomized global initialization and dual examples. These strategies utilize multiple trajectories from benign samples to capture global optimization directions, enhancing adversarial transferability. Our approach integrates seamlessly with existing adversarial attack methods and significantly improves transferability, as demonstrated by empirical evaluations on the standard ImageNet dataset.

251Looking Backward: Retrospective Backward Synthesis for Goal-Conditioned GFlowNets

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets), a new family of probabilistic samplers, have demonstrated remarkable capabilities to generate diverse sets of high-reward candidates, in contrast to standard return maximization approaches (e.g., reinforcement learning) which often converge to a single optimal solution. Recent works have focused on developing goal-conditioned GFlowNets, which aim to train a single GFlowNet capable of achieving different outcomes as the task specifies. However, training such models is challenging due to extremely sparse rewards, particularly in high-dimensional problems. Moreover, previous methods suffer from the limited coverage of explored trajectories during training, which presents more pronounced challenges when only offline data is available. In this work, we propose a novel method called \textbf{R}etrospective \textbf{B}ackward \textbf{S}ynthesis (\textbf{RBS}) to address these critical problems. Specifically, RBS synthesizes new backward trajectories in goal-conditioned GFlowNets to enrich training trajectories with enhanced quality and diversity, thereby introducing copious learnable signals for effectively tackling the sparse reward problem. Extensive empirical results show that our method improves sample efficiency by a large margin and outperforms strong baselines on various standard evaluation benchmarks.

252α-DPO: Adaptive Reward Margin is What Direct Preference Optimization Needs

[openreview] [pdf]

Abstract Aligning large language models (LLMs) with human values and intentions is crucial for their utility, honesty, and safety. Reinforcement learning from human feedback (RLHF) is a popular approach to achieve this alignment, but it faces challenges in computational efficiency and training stability. Recent methods like Direct Preference Optimization (DPO) and Simple Preference Optimization (SimPO) have proposed offline alternatives to RLHF, simplifying the process by reparameterizing the reward function. However, DPO depends on a potentially suboptimal reference model, and SimPO’s assumption of a fixed target reward margin may lead to suboptimal decisions in diverse data settings. In this work, we propose (\alpha)-DPO, an adaptive preference optimization algorithm designed to address these limitations by introducing a dynamic reward margin. Specifically, (\alpha)-DPO employs an adaptive preference distribution, balancing the policy model and the reference model to achieve personalized reward margins. We provide theoretical guarantees for (\alpha)-DPO, demonstrating its effectiveness as a surrogate optimization objective and its ability to balance alignment and diversity through KL divergence control. Empirical evaluations on AlpacaEval 2 and Arena-Hard show that (\alpha)-DPO consistently outperforms DPO and SimPO across various model settings, establishing it as a robust approach for fine-tuning LLMs. Our method achieves significant improvements in win rates, highlighting its potential as a powerful tool for LLM alignment.

253Soup to go: mitigating forgetting during continual learning with model averaging

[openreview] [pdf]

Abstract In continual learning with pretrained large language models (LLMs), where data from instruction fine-tuning (IFT) tasks arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when the IFT tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the LLM has learned? Inspired by a classical continual learning method, L2 penalty to previous weights, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges models with earlier checkpoints trained on previous tasks during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. However, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms penalty methods like L2 regression and EWC, as well as other common merging techniques such as Task Arithmetic, and TIES Merging. Finally, we show that using our method, a single model can simultaneously perform well on a range of fine-tuning tasks in diverse domains, including Math, Law and Code.

254Adapting Prediction Sets to Distribution Shifts Without Labels

[openreview] [pdf]

Abstract Recently there has been a surge of interest to deploy confidence set predictions rather than point predictions. Unfortunately, the effectiveness of such prediction sets is frequently impaired by distribution shifts in practice, and the challenge is often compounded by the lack of ground truth labels at test time. In this paper, we present a method for improving the quality of outputted prediction sets using only unlabeled data from the test domain. This is achieved by two new methods called ECP\texttt{ECP} and EACP\texttt{E{\small A}CP}, that sit on top of existing set-valued classification methods and adjust their intervals according to the base model’s own uncertainty evaluation on the unlabeled test data. Through extensive experiments on a number of large-scale datasets and neural network architectures, we show that our methods provide consistent improvement over existing conformal prediction based baselines and nearly match the performance of fully supervised methods.

255Offline Safe Policy Optimization From Human Feedback

[openreview] [pdf]

Abstract Offline preference-based reinforcement learning (PbRL) learns rewards and policies aligned with human preferences without the need for extensive reward engineering and direct interaction with human annotators. However, ensuring safety remains a critical challenge across many domains and tasks. Previous works on safe RL from human feedback (RLHF) first learn reward and cost models from offline data, and then use constrained RL to optimize a safe policy. However, inaccuracies in the reward and cost learning can impair performance when used with constrained RL methods. To address these challenges, (a) we introduce a framework that learns a policy based on pairwise preferences regarding the agent’s behavior in terms of rewards, as well as binary labels indicating the safety of trajectory segments, without access to ground-truth rewards or costs; (b) we combine the preference learning module with safety alignment in a constrained optimization problem. This optimization problem is solved using a Lagrangian method that directly learns reward maximizing safe policy without explicitly learning reward and cost models, avoiding the need for constrained RL; (c) to evaluate our approach, we construct new datasets with synthetic human feedback, built upon a well-established offline safe RL benchmark. Empirically, our method successfully learns safe policies with high rewards, outperforming baselines with ground-truth reward and cost, as well as state-of-the-art RLHF approaches.

256Goal Achievement Guided Exploration: Mitigating Premature Convergence in Reinforcement Learning

[openreview] [pdf]

Abstract Premature convergence to suboptimal policies remains a significant challenge in reinforcement learning (RL), particularly in tasks with sparse rewards or non-convex reward landscapes. Existing work usually utilizes reward shaping, such as curiosity-based internal rewards, to encourage exploring promising spaces. However, this may inadvertently introduce new local optima and impair the optimization for the actual target reward. To address this issue, we propose Goal Achievement Guided Exploration (GAGE), a novel approach that incorporates an agent’s goal achievement as a dynamic criterion for balancing exploration and exploitation. GAGE adaptively adjusts the exploitation level based on the agent’s current performance relative to an estimated optimal performance, thereby mitigating premature convergence. Extensive evaluations demonstrate that GAGE substantially improves learning outcomes across various challenging tasks by adapting convergence based on task success. Applicable to both continuous and discrete tasks, GAGE seamlessly integrates into existing RL frameworks, highlighting its potential as a versatile tool for enhancing exploration strategies in RL.

257Elucidating the Preconditioning in Consistency Distillation

[openreview] [pdf]

Abstract Consistency distillation is a prevalent way for accelerating diffusion models adopted in consistency (trajectory) models, in which a student model is trained to traverse backward on the probability flow (PF) ordinary differential equation (ODE) trajectory determined by the teacher model. Preconditioning is a vital technique for stabilizing consistency distillation, by linear combining the input data and the network output with pre-defined coefficients as the consistency function. It imposes the boundary condition of consistency functions without restricting the form and expressiveness of the neural network. However, previous preconditionings are hand-crafted and may be suboptimal choices. In this work, we offer the first theoretical insights into the preconditioning in consistency distillation, by elucidating its design criteria and the connection to the teacher ODE trajectory. Based on these analyses, we further propose a principled way dubbed \textit{Analytic-Precond} to analytically optimize the preconditioning according to the consistency gap (defined as the gap between the teacher denoiser and the optimal student denoiser) on a generalized teacher ODE. We demonstrate that Analytic-Precond can facilitate the learning of trajectory jumpers, enhance the alignment of the student trajectory with the teacher’s, and achieve 2×2\times to 3×3\times training acceleration of consistency trajectory models in multi-step generation across various datasets.

258Unleashing the Potential of Diffusion Models for Incomplete Data Imputation

[openreview] [pdf]

Abstract Generative models play an important role in missing data imputation in that they aim to learn the joint distribution of full data. However, applying advanced deep generative models (such as Diffusion models) to missing data imputation is challenging due to 1) the inherent incompleteness of the training data and 2) the difficulty in performing conditional inference from unconditional generative models. To deal with these challenges, this paper introduces DiffPuter, a tailored diffusion model combined with the Expectation-Maximization (EM) algorithm for missing data imputation. DiffPuter iteratively trains a diffusion model to learn the joint distribution of missing and observed data and performs an accurate conditional sampling to update the missing values using a tailored reversed sampling strategy. Our theoretical analysis shows that DiffPuter’s training step corresponds to the maximum likelihood estimation of data density (M-step), and its sampling step represents the Expected A Posteriori estimation of missing values (E-step). Extensive experiments across ten diverse datasets and comparisons with 17 different imputation methods demonstrate DiffPuter’s superior performance. Notably, DiffPuter achieves an average improvement of 8.10% in MAE and 5.64% in RMSE compared to the most competitive existing method.

259Exploring New Frontiers in Vertical Federated Learning: the Role of Saddle Point Reformulation

[openreview] [pdf]

Abstract Distributed learning problems have gained significant popularity due to the increasing need for cluster training and the emergence of novel paradigms like Federated Learning (FL). One variant of FL, called Vertical Federated Learning (VFL), partitions data based on features across devices. The objective is to collectively train a model using the information available on each user’s device. This paper focuses on solving the VFL problem using the saddle point reformulation via the classical Lagrangian function. We first demonstrate how this formulation can be solved using deterministic methods. But more importantly, the paper explores various stochastic modifications to adapt to practical scenarios, such as employing compression techniques for efficient information transmission, enabling partial participation for asynchronous communication, and utilizing coordinate selection for faster local computation. We show that the saddle point reformulation plays a key role and opens up possibilities to use mentioned extension that seem to be impossible in the standard minimization formulation. Convergence estimates are provided for each algorithm, demonstrating their effectiveness in addressing the VFL problem. Additionally, alternative reformulations of the VFL problem are investigated, and numerical experiments are conducted to validate the proposed methods’ performance and effectiveness.

260Interactive Adjustment for Human Trajectory Prediction with Individual Feedback

[openreview] [pdf]

Abstract Human trajectory prediction is fundamental for autonomous driving and service robot. The research community has studied various important aspects of this task and made remarkable progress recently. However, there is an essential perspective which is not well exploited in previous research all along, namely individual feedback. Individual feedback exists in the sequential nature of trajectory prediction, where earlier predictions of a target can be verified over time by his ground-truth trajectories to obtain feedback which provides valuable experience for subsequent predictions on the same agent. In this paper, we show such feedback can reveal the strengths and weaknesses of the model’s predictions on a specific target and heuristically guide to deliver better predictions on him. We present an interactive adjustment network to effectively model and leverage the feedback. This network first exploits the feedback from previous predictions to dynamically generate an adjuster which then interactively makes appropriate adjustments to current predictions for more accurate ones. We raise a novel displacement expectation loss to train this interactive architecture. Through experiments on representative prediction methods and widely-used benchmarks, we demonstrate the great value of individual feedback and the superior effectiveness of proposed interactive adjustment network. Our code will be made publicly available.

261Best-of-Both-Worlds Policy Optimization for CMDPs with Bandit Feedback

[openreview] [pdf]

Abstract We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024b) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, an highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves O~(T)\widetilde{\mathcal{O}}(\sqrt{T}) regret and constraint violation, while, when they are adversarial, it attains O~(T)\widetilde{\mathcal{O}}(\sqrt{T}) constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.

262A Contextual Online Learning Theory of Brokerage

[openreview] [pdf]

Abstract We study the role ofcontextual informationin the online learning problem of brokerage between traders. At each round, two traders arrive with secret valuations about an asset they wish to trade. The broker suggests a trading price based on contextual data about the asset. Then, the traders decide to buy or sell depending on whether their valuations are higher or lower than the brokerage price. We assume the market value of traded assets is an unknown linear function of a dd-dimensional vector representing the contextual information available to the broker. Additionally, at each time step, we model traders’ valuations as independent bounded zero-mean perturbations of the asset’s current market value, allowing for potentially different unknown distributions across traders and time steps. Consistently with the existing online learning literature, we evaluate the performance of a learning algorithm with the regret with respect to thegain from trade. If the noise distributions admit densities bounded by some constant LL, then, for any time horizon TT:If the agents’ valuations are revealed after each interaction, we provide an algorithm achieving O(LdlnT)O ( L d \ln T ) regret, and show a corresponding matching lower bound of Ω(LdlnT)\Omega( Ld \ln T ).If only their willingness to sell or buy at the proposed price is revealed after each interaction, we provide an algorithm achieving O(LdTlnT)O( \sqrt{LdT \ln T }) regret, and show that this rate is optimal (up to logarithmic factors), via a lower bound of Ω(LdT)\Omega(\sqrt{LdT}).To complete the picture, we show that if the bounded density assumption is lifted, then the problem becomes unlearnable, even with full feedback.

263Diffusion Guided Adversarial State Perturbations in Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) systems, while achieving remarkable success across various domains, are vulnerable to adversarial attacks. This is especially a concern in vision-based environments where minor manipulations of high-dimensional image inputs can easily mislead the agent’s behavior. To this end, various defenses have been proposed recently, with state-of-the-art approaches achieving robust performance even under large state perturbations. Upon closer investigation, however, we found that the effectiveness of the current defenses is due to a fundamental weakness of the existing lpl_p-norm constrained attacks, which can barely alter the semantics of the input even under a relatively large perturbation budget. In this work, we propose SHIFT, a novel diffusion-based state perturbation attack to go beyond this limitation. Specifically, we train a history-conditioned diffusion model, enhanced with policy guidance and realism detection to generate perturbed states that are semantically different from the true states while remaining realistic and history-aligned to avoid detection. Evaluations show that our attack effectively breaks existing defenses, including the most sophisticated ones, and significantly lowers the agent’s cumulative reward in various Atari games by more than 50%. The results highlight the vulnerability of RL agents to semantics-aware adversarial perturbations, indicating the importance of developing more robust policies for safety-critical domains.

264Mitigating Embedding Collapse in Diffusion Models for Categorical Data

[openreview] [pdf]

Abstract Latent diffusion models have enabled continuous-state diffusion models to handle a variety of datasets, including categorical data. However, most methods rely on fixed pretrained embeddings, limiting the benefits of joint training with the diffusion model. While jointly learning the embedding (via reconstruction loss) and the latent diffusion model (via score matching loss) could enhance performance, our analysis shows that end-to-end training risks embedding collapse, degrading generation quality. To address this issue, we introduce CATDM, a continuous diffusion framework within the embedding space that stabilizes training. We propose a novel objective combining the joint embedding-diffusion variational lower bound with a Consistency-Matching (CM) regularizer, alongside a shifted cosine noise schedule and random dropping strategy. The CM regularizer ensures the recovery of the true data distribution. Experiments on benchmarks show that CATDM mitigates embedding collapse, yielding superior results on FFHQ, LSUN Churches, and LSUN Bedrooms. In particular, CATDM achieves an FID of 6.81 on ImageNet 256×256256\times256 with 50 steps. It outperforms non-autoregressive models in machine translation and is on a par with previous methods in text generation.

265Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

[openreview] [pdf]

Abstract Classifier-free guidance (CFG) is crucial for improving both generation quality and alignment between the input condition and final output in diffusion models. While a high guidance scale is generally required to enhance these aspects, it also causes oversaturation and unrealistic artifacts. In this paper, we revisit the CFG update rule and introduce modifications to address this issue. We first decompose the update term in CFG into parallel and orthogonal components with respect to the conditional model prediction and observe that the parallel component primarily causes oversaturation, while the orthogonal component enhances image quality. Accordingly, we propose down-weighting the parallel component to achieve high-quality generations without oversaturation. Additionally, we draw a connection between CFG and gradient ascent and introduce a new rescaling and momentum method for the CFG update rule based on this insight. Our approach, termed adaptive projected guidance (APG), retains the quality-boosting advantages of CFG while enabling the use of higher guidance scales without oversaturation. APG is easy to implement and introduces practically no additional computational overhead to the sampling process. Through extensive experiments, we demonstrate that APG is compatible with various conditional diffusion models and samplers, leading to improved FID, recall, and saturation scores while maintaining precision comparable to CFG, making our method a superior plug-and-play alternative to standard classifier-free guidance.

266Eliminating Oversaturation and Artifacts of High Guidance Scales in Diffusion Models

[openreview] [pdf]

Abstract No absctract

267Leveraging Driver Field-of-View for Multimodal Ego-Trajectory Prediction

[openreview] [pdf]

Abstract Understanding drivers’ decision-making is crucial for road safety. Although predicting the ego-vehicle’s path is valuable for driver-assistance systems, existing methods mainly focus on external factors like other vehicles’ motions, often neglecting the driver’s attention and intent. To address this gap, we infer the ego-trajectory by integrating the driver’s attention and the surrounding scene. We introduce RouteFormer, a novel multimodal ego-trajectory prediction network combining GPS data, environmental context, and driver field-of-view—comprising first-person video and gaze fixations. We also present the Path Complexity Index (PCI), a new metric for trajectory complexity that enables a more nuanced evaluation of challenging scenarios. To tackle data scarcity and enhance diversity, we introduce GEM, a comprehensive dataset of urban driving scenarios enriched with synchronized driver field-of-view and gaze data. Extensive evaluations on GEM and DR(eye)VE demonstrate that RouteFormer significantly outperforms state-of-the-art methods, achieving notable improvements in prediction accuracy across diverse conditions. Ablation studies reveal that incorporating driver field-of-view data yields significantly better average displacement error, especially in challenging scenarios with high PCI scores, underscoring the importance of modeling driver attention. All data, code, and models will be made publicly available.

268Data Exfiltration in Diffusion Models: A Backdoor Attack Approach

[openreview] [pdf]

Abstract As diffusion models (DMs) become increasingly susceptible to adversarial attacks, this paper investigates a novel method of data exfiltration through strategically implanted backdoors. Unlike conventional techniques that directly alter data, we pioneer the use of unique trigger embeddings for each image to enable covert data retrieval. Furthermore, we extend our exploration to text-to-image diffusion models such as Stable Diffusion by introducing the Caption Backdoor Subnet (CBS), which exploits these models for both image and caption extraction. This innovative approach not only reveals an unexplored facet of diffusion model security but also contributes valuable insights toward enhancing the resilience of generative models against sophisticated threats.

269Policy Transfer via Latent Graph Planning

[openreview] [pdf]

Abstract We introduce a transfer learning framework for deep reinforcement learning that integrates graph-based planning with self-supervised representation learning to efficiently transfer knowledge across tasks. While standard reinforcement learning aims to learn policies capable of solving long-horizon tasks, the resulting policies often fail to generalize to novel tasks and environments. Our approach addresses this limitation by decomposing long-horizon tasks into sequences of transferable short-horizon tasks modeled by goal-conditioned policies. We utilize a planning graph to generate fine-grained sub-goals that guide these short-horizon policies to solve novel long-horizon tasks. Experimental results show that our method improves sample efficiency and demonstrates an improved ability to solve sparse-reward and long-horizon tasks compared to baseline methods in challenging single-agent and multi-agent scenarios. In particular, compared to the state-of-the-art, our method achieves the same or better expected policy reward while requiring fewer training samples when learning novel tasks.

270How to distill task-agnostic representations from many teachers?

[openreview] [pdf]

Abstract Casting complex inputs onto tractable representations is a critical step in many fields. Differences in architectures, loss functions, input modalities, and datasets lead to embedding models that capture diverse information of the input. Multi-teacher distillation seeks to exploit this diversity to create richer representations but often remains task-specific. We extend this framework by proposing a task-oriented setting that introduces an objective function based on the “majority vote” principle. We demonstrate that the mutual information between the student and the teachers is an upper bound for this function, providing a task-agnostic loss for our distillation procedure. An extensive evaluation is performed in different domains ---natural language processing, computer vision, and molecular modeling --- indicating that our method effectively leverages teacher diversity to produce more informative representations. Finally, we use our method to train and release new state-of-the-art embedders, enabling improved downstream performance in NLP and molecular modeling.

271Can In-context Learning Really Generalize to Out-of-distribution Tasks?

[openreview] [pdf]

Abstract In this work, we explore the mechanism of in-context learning (ICL) on out-of-distribution (OOD) tasks that were not encountered during training. To achieve this, we conduct synthetic experiments where the objective is to learn OOD mathematical functions through ICL using a GPT-2 model. We reveal that Transformers may struggle to learn OOD task functions through ICL. Specifically, ICL performance resembles implementing a function within the pretraining hypothesis space and optimizing it with gradient descent based on the in-context examples. Additionally, we investigate ICL’s well-documented ability to learn unseen abstract labels in context. We demonstrate that such ability only manifests in the scenarios without distributional shifts and, therefore, may not serve as evidence of new-task-learning ability. Furthermore, we assess ICL’s performance on OOD tasks when the model is pretrained on multiple tasks. Both empirical and theoretical analyses demonstrate the existence of the \textbf{low-test-error preference} of ICL, where it tends to implement the pretraining function that yields low test error in the testing context. We validate this through numerical experiments. This new theoretical result, combined with our empirical findings, elucidates the mechanism of ICL in addressing OOD tasks.

272Performance Control in Early Exiting to Deploy Large Models at the Same Cost of Smaller Ones

[openreview] [pdf]

Abstract Early Exiting (EE) is a promising technique for speeding up inference at the cost of limited performance loss. It adaptively allocates compute budget to data points based on their difficulty by exiting at earlier layers when predictions are confident. In this study, we first present a novel perspective on the EE approach, demonstrating that larger models, when deployed with EE, can achieve higher performance than smaller models while maintaining similar computational costs. As existing EE approaches rely on confidence estimation at each exit point, we further study the impact of overconfidence on the controllability of the compute/performance trade-off. We introduce Performance Control Early Exiting (PCEE), a method that enables accuracy thresholding by basing decisions not on a datapoint’s condfidence but on the average accuracy of samples with similar confidence levels from a held-out validation set. In our experiments with MSDNets and Vision Transformer architectures on CIFAR-10, CIFAR-100, and ImageNet, we show that PCEE offers a simple yet computationally efficient approach that provides better control over performance than standard confidence-based approaches, and allows us to scale up model sizes to yield performance gain while reducing the computational cost.

273Federated Unlearning with Diffusion Models

[openreview] [pdf]

Abstract In recent years, diffusion models are widely adopted by individual users due to their outstanding performance in generation. During usage, individual users develop a need to forget privacy-related contents, making the scenario of using diffusion models on the clients a natural federated unlearning setting. For this scenario, we propose FedDUL, a Federated UnLearning method with Diffusion models, which addresses the unlearn requests from clients using the diffusion models. On one hand, we utilize local data on the clients to perform attention-based unlearning, enabling the local diffusion model to forget the concepts specified by the clients. On the other hand, we filter and group the unlearn requests from clients, gradually aggregating reasonable requests into the global diffusion model on the server, thereby protecting client privacy within the global model. The theoretical analysis further demonstrates the inherent unity between the federated unlearning problem based on diffusion models and federated learning, and extend this unity to traditional federated unlearning methods. Extensive quantitation and visualization experiments are conducted to evaluate the unlearning of both local and global models and discuss the communication and computation costs of our method, demonstrating that our method can satisfy the unlearn requests of multiple clients without compromising the generative capabilities for irrelevant concepts, providing new ideas and methods for the application of diffusion models in federated unlearning.

274Strategic Exploration for Inverse Constraint Inference with Efficiency Guarantee

[openreview] [pdf]

Abstract In many realistic applications, the constraint is not readily available, and we need to infer the constraints respected by the expert agents from their behaviors. The problem is known as Inverse Constraint Inference (ICI). A common solver, Inverse Constrained Reinforcement Learning (ICRL) seeks to recover the optimal constraints in complex environments in a data-driven manner. Existing ICRL algorithms collect training samples from an interactive environment. However, the efficacy and efficiency of these sampling strategies remain unknown. To bridge this gap, we introduce a strategic exploration framework with guaranteed efficiency. Specifically, we define a feasible constraint set for ICRL problems and investigate how expert policy and environmental dynamics influence the optimality of constraints. Motivated by our findings, we propose two exploratory algorithms to achieve efficient constraint inference via 1) dynamically reducing the bounded aggregate error of cost estimation and 2) strategically constraining the exploration policy. Both algorithms are theoretically grounded with tractable sample complexity. We empirically demonstrate the performance of our algorithms under various environments.

275Supervised and Semi-Supervised Diffusion Maps with Label-Driven Diffusion

[openreview] [pdf]

Abstract In this paper, we introduce Supervised Diffusion Maps (SDM) and Semi-Supervised Diffusion Maps (SSDM), which transform the well-known unsupervised dimensionality reduction algorithm, Diffusion Maps, into supervised and semi-supervised learning tools. The proposed methods, SDM and SSDM, are based on our new approach that treats the labels as a second view of the data. This unique framework allows us to incorporate ideas from multi-view learning. Specifically, we propose constructing two affinity kernels corresponding to the data and the labels. We then propose a multiplicative interpolation scheme of the two kernels, whose purpose is twofold. First, our scheme extracts the common structure underlying the data and the labels by defining a diffusion process driven by the data and the labels. This label-driven diffusion produces an embedding that emphasizes the properties relevant to the label-related task. Second, the proposed interpolation scheme balances the influence of the two kernels. We show on multiple benchmark datasets that the embedding learned by SDM and SSDM is more effective in downstream regression and classification tasks than existing unsupervised, supervised, and semi-supervised nonlinear dimension reduction methods.

276Learning to Permute with Discrete Diffusion

[openreview] [pdf]

Abstract The group of permutations SnS_n, also known as the finite symmetric groups, are essential in fields such as combinatorics, physics, and chemistry. However, learning a probability distribution over SnS_n poses significant challenges due to its intractable size and discrete nature. In this paper, we introduceSymmetricDiffusers, a novel discrete diffusion model that simplifies the task of learning a complicated distribution over SnS_n by decomposing it into learning simpler transitions of the reverse diffusion using deep neural networks. We identify the riffle shuffle as an effective forward transition and provide empirical guidelines for selecting the diffusion length based on the theory of random walks on finite groups. Additionally, we propose a generalized Plackett-Luce (PL) distribution for the reverse transition, which is provably more expressive than the PL distribution. We further introduce a theoretically grounded “denoising schedule” to improve sampling and learning efficiency. Extensive experiments show that our model achieves state-of-the-art or comparable performances on solving tasks including sorting 4-digit MNIST images, jigsaw puzzles, and traveling salesman problems.

277PROGRESSIVE KNOWLEDGE DISTILLATION (PKD): A MODULAR APPROACH FOR ARCHITECTURE-AGNOSTIC KNOWLEDGE DISTILLATION

[openreview] [pdf]

Abstract \textbf{Knowledge distillation (KD)} is a key technique for training \textbf{lightweight deep neural networks}, particularly in \textbf{resource-constrained environments}. While existing KD methods utilize intermediate features to improve student models, they often overlook the proper \textbf{alignment between teacher-student layers} and fail to select the most \textbf{informative data} for training each student layer. These limitations are especially pronounced in \textbf{architecture-agnostic scenarios}, where different network architectures complicate knowledge transfer.We propose \textbf{PKD}, a \textbf{Progressive Knowledge Distillation} framework that progressively aligns teacher and student layers through \textbf{feature-based modularization}. Each student module is trained using the most \textbf{representative features} from its corresponding teacher module, starting with the shallowest layers and progressively moving to deeper ones. This training method enables efficient, architecture-agnostic knowledge transfer across a variety of model architectures. \textbf{Experiments on CIFAR-100 and ImageNet-1K} demonstrate that PKD outperforms baseline models, achieving performance improvements of up to \textbf{4.54%} and \textbf{6.46%}, respectively, thereby validating its effectiveness in diverse neural network settings.

278What should an AI assessor optimise for?

[openreview] [pdf]

Abstract An AI assessor is an external, ideally independent system that predicts an indicator, e.g., a loss value, of another AI system. Assessors can leverage information from the test results of many other AI systems and have the flexibility of being trained on any loss function: from squared error to toxicity metrics. Here we address the question: is it always optimal to train the assessor for the target loss? Or could it be better to train for a different loss and then map predictions back to the target loss? Using ten regression problems with tabular data, we experimentally explore this question for regression losses with monotonic and nonmonotonic mappings and find that, contrary to intuition, optimising for more informative losses is not generally better. Surprisingly though, some monotonic transformations, such as the logistic loss used to minimise the absolute or squared error, are promising.

279XXLTraffic: Expanding and Extremely Long Traffic forecasting beyond test adaptation

[openreview] [pdf]

Abstract Traffic forecasting is crucial for smart cities and intelligent transportation initiatives, where deep learning has made significant progress in modeling complex spatio-temporal patterns in recent years. However, current public datasets have limitations in reflecting the distribution shift nature of real-world scenarios, characterized by continuously evolving infrastructures, varying temporal distributions, and long temporal gaps due to sensor downtimes or changes in traffic patterns. These limitations inevitably restrict the practical applicability of existing traffic forecasting datasets. To bridge this gap, we present XXLTraffic, the longest available public traffic dataset with the longest timespan collected from Los Angeles, USA, and New South Wales, Australia, curated to support research in extremely long forecasting beyond test adaptation. Our benchmark includes both typical time-series forecasting settings with hourly and daily aggregated data and novel configurations that introduce gaps and down-sample the training size to better simulate practical constraints. We anticipate the new XXLTraffic will provide a fresh perspective for the time-series and traffic forecasting communities. It would also offer a robust platform for developing and evaluating models designed to tackle the extremely long forecasting problems beyond test adaptation. Our dataset supplements existing spatio-temporal data resources and leads to new research directions in this domain.

280Training Task Experts through Retrieval Based Distillation

[openreview] [pdf]

Abstract One of the most reliable ways to create deployable models for specialized tasks is to obtain an adequate amount of high-quality task-specific data. However, for specialized tasks, often such datasets do not exist. Existing methods address this by creating such data from large language models (LLMs) and then distilling such knowledge into smaller models. However, these methods are limited by the quality of the LLMs output, and tend to generate repetitive or incorrect data. In this work, we present Retrieval Based Distillation (ReBase), a method that first retrieves data from rich online sources and then transforms them into domain-specific data. This method greatly enhances data diversity. Moreover, ReBase generates Chain-of-Thought reasoning and distills the reasoning capacity of LLMs. We test our method on 4 benchmarks and shows that our method significantly improves performance by up to 10.76% on SQuAD, 1.37% on MNLI, and 1.94% on BBH.

281Improving real-world sequence design with a simple meta-heuristic for detecting distribution shift

[openreview] [pdf]

Abstract Biological sequence design is one of the most impactful areas where model-based optimization is applied. A common scenario involves using a fixed training set to train predictive models, with the goal of designing new sequences that outperform those present in the training data. This by definition results in a distribution shift, where the model is applied to samples that are substantially different from those in the training set (or otherwise they wouldn’t have a chance of being much better). While most MBO methods offer some balancing heuristic to control for false positives, finding the right balance of pushing the design distribution while maintaining model accuracy requires deep knowledge of the algorithm and artful application, limiting successful adoption by practitioners. To tackle this issue, we propose a straightforward meta-algorithm for design practitioners that detects distribution shifts when using any MBO. By doing a real-world sequence design experiment, we show that (1) Real world distribution shift is far more severe than observed in simulated settings, where most MBO algorithms are benchmarked (2) Our approach successfully reduces the adverse effects of distribution shift. We believe this method can significantly improve design quality for sequence design tasks and potentially other domain applications where offline optimization faces harsh distribution shifts.

282Algorithm for Concept Extrapolation: Diverse Generalization via Selective Disagreement

[openreview] [pdf]

Abstract Standard deep learning approaches often struggle to handle out-of-distribution data, especially when the distributional shift breaks spurious correlations. While some approaches to handling spurious correlations under distributional shift aim to separate causal and spurious features without access to target distribution data, they rely on labeled data from different domains or contingent assumptions about the nature of neural representations. Existing methods that do make use of unlabeled target data make strict assumptions about the target data distribution. To overcome these limitations, we present the Algorithm for Concept Extrapolation (ACE). Using an exponentially-weighted disagreement loss to maximize disagreement on target instances \textit{that break spurious correlations}, ACE achieves state of the art performance on spurious complete correlation benchmarks. We also show ACE is robust to unlabeled target distributions where spurious and ground truth features are not statistically independent. Finally, we demonstrate the applicability of ACE for handling goal-misgeneralization in deep reinforcement learning, with our ``ACE agent’’ achieving a 16% higher level completion rate in the CoinRun goal misgeneralisation problem when the coin is randomly placed in the level.

283Learning-Augmented Robust Algorithmic Recourse

[openreview] [pdf]

Abstract The widespread use of machine learning models in high-stakes domains can have a major negative impact, especially on individuals who receive undesirable outcomes. Algorithmic recourse provides such individuals with suggestions of minimum-cost improvements they can make to achieve a desirable outcome in the future. However, machine learning models often get updated over time and this can cause a recourse to become invalid (i.e., not lead to the desirable outcome). The robust recourse literature aims to choose recourses less sensitive, even against adversarial model changes, but this comes at a higher cost. To overcome this obstacle, we initiate the study of algorithmic recourse through the learning-augmented framework and evaluate the extent to which a designer equipped with a prediction regarding future model changes can reduce the cost of recourse when the prediction is accurate (consistency) while also limiting the cost even when the prediction is inaccurate (robustness). We propose a novel algorithm for this problem, study the robustness-consistency trade-off, and analyze how prediction accuracy affects performance.

284Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models

[openreview] [pdf]

Abstract Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challenging problems including Atari games and robotic control, but requires manually designing heuristics to guide exploration (i.e. determine which states to save and explore from, and what actions to consider next), which is time-consuming and infeasible in general. To resolve this, we propose Intelligent Go-Explore (IGE) which greatly extends the scope of the original Go-Explore by replacing these handcrafted heuristics with the intelligence and internalized human notions of interestingness captured by giant pretrained foundation models (FMs). This provides IGE with a human-like ability to instinctively identify how interesting or promising any new state is (e.g. discovering new objects, locations, or behaviors), even in complex environments where heuristics are hard to define. Moreover, IGE offers the exciting and previously impossible opportunity to recognize and capitalize on serendipitous discoveries that cannot be predicted ahead of time. We evaluate our algorithm on a diverse range of language and vision-based tasks that require search and exploration. Across these tasks, IGE strongly exceeds classic reinforcement learning and graph search baselines, and also succeeds where prior state-of-the-art FM agents like Reflexion completely fail. Overall, Intelligent Go-Explore combines the tremendous strengths of FMs and the powerful Go-Explore algorithm, opening up a new frontier of research into creating more generally capable agents with impressive exploration capabilities.

285Instant Policy: In-Context Imitation Learning via Graph Diffusion

[openreview] [pdf]

Abstract Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations, achieving ICIL through two key components. First, we introduce inductive biases through a graph representation and model ICIL as a graph generation problem using a learned diffusion process, enabling structured reasoning over demonstrations, observations, and actions. Second, we show that such a model can be trained using pseudo-demonstrations – arbitrary trajectories generated in simulation – as a virtually infinite pool of training data. Our experiments, in both simulation and reality, show that Instant Policy enables rapid learning of various everyday robot tasks. We also show how it can serve as a foundation for cross-embodiment and zero-shot transfer to language-defined tasks.

286Dataset Condensation with Sharpness-Aware Trajectory Matching

[openreview] [pdf]

Abstract Dataset condensation aims to synthesise datasets with a few representative samples that can effectively represent the original datasets. This enables efficient training and produces models with performance close to those trained on the original sets. Most existing dataset condensation methods conduct dataset learning under the bilevel (inner and outer loop) based optimisation. However, due to its notoriously complicated loss landscape and expensive time-space complexity, the preceding methods either develop advanced training protocols so that the learned datasets generalise to unseen tasks or reduce the inner loop learning cost increasing proportionally to the unrolling steps. This phenomenon deteriorates when the datasets are learned via matching the trajectories of networks trained on the real and synthetic datasets with a long horizon inner loop. To address these issues, we introduce Sharpness-Aware Trajectory Matching (SATM), which enhances the generalisation capability of learned synthetic datasets by minimising sharpness in the outer loop of bilevel optimisation. Moreover, our approach is coupled with an efficient hypergradient approximation that is mathematically well-supported and straightforward to implement along with controllable computational overhead. Empirical evaluations of SATM demonstrate its effectiveness across various applications, including standard in-domain benchmarks and out-of-domain settings. Moreover, its easy-to-implement properties afford flexibility, allowing it to integrate with other advanced sharpness-aware minimisers. We will release our code on GitHub.

287Decoupled Offline to Online finetuning via Dynamics Model

[openreview] [pdf]

Abstract Constrained by the sub-optimal dataset in offline reinforcement learning (RL), the offline trained agent should be online finetuned before deployment. Due to the conservative offline algorithms and unbalanced state distribution in offline dataset, offline to online finetuning faces severe distribution shift. This shift will disturb the policy improvement during online interaction, even a performance drop. A natural yet unexplored idea is whether policy improvement can be decoupled from distribution shift. In this work, we propose a decoupled offline to online finetuning framework using the dynamics model from model-based methods. During online interaction, only dynamics model is finetuned to overcome the distribution shift. Then the policy is finetuned in offline manner with finetuned dynamics and without further interaction. As a result, online stage only needs to deal with a simpler supervised dynamics learning, rather than the complex policy improvement with the interference from distribution shift. When finetuning the policy, we adopt the offline approach, which ensures the conservatism of the algorithm and fundamentally avoids the sudden performance crashes. We conduct extensive evaluation on the classical datasets of offline RL, demonstrating the effective elimination of distribution shift, stable and superior policy finetuning performance, and exceptional interaction efficiency within our decouple offline to online finetuning framework.

288Simple Policy Optimization

[openreview] [pdf]

Abstract Model-free reinforcement learning algorithms have seen remarkable progress, but key challenges remain. Trust Region Policy Optimization (TRPO) is known for ensuring monotonic policy improvement through conservative updates within a trust region, backed by strong theoretical guarantees. However, its reliance on complex second-order optimization limits its practical efficiency. Proximal Policy Optimization (PPO) addresses this by simplifying TRPO’s approach using ratio clipping, improving efficiency but sacrificing some theoretical robustness. This raises a natural question: Can we combine the strengths of both methods? In this paper, we introduce Simple Policy Optimization (SPO), a novel unconstrained first-order algorithm. SPO integrates the surrogate objective with Total Variation (TV) divergence instead of Kullback-Leibler (KL) divergence, achieving a balance between the theoretical rigor of TRPO and the efficiency of PPO. Our new objective improves upon ratio clipping, offering stronger theoretical properties and better constraining the probability ratio within the trust region. Empirical results demonstrate that SPO achieves state-of-the-art performance, with a simple implementation and improves sample efficiency, particularly for training large, complex network architectures end-to-end.

289Stabilize continual learning with hyperspherical replay

[openreview] [pdf]

Abstract Neural networks face catastrophic forgetting of previously learned knowledge when training on new task data. While the field of continual learning has made promising progress in reducing this forgetting, recent work has uncovered an interesting phenomenon: existing techniques often exhibit a sharp performance drop on prior tasks during the initial stages of new task training, a phenomenon known as the ”stability gap.” This phenomenon not only raises safety concerns but also challenges the current understanding of neural network behavior in continual learning scenarios. Inspired by this discovery, we revisit two fundamental questions in continual learning: 1) Is the past learned knowledge within deep networks lost abruptly or gradually? and 2) Is past learned knowledge ever completely erased? Our analysis reveals that abrupt forgetting occurs not only in the final fully connected layer but also permeates the feature space and most layers, sparing only the earliest layers. Alarmingly, a single gradient update can severely disrupt the learned class structure. We identify degenerate solutions in the softmax cross-entropy loss as a major contributing factor, with memory samples exhibiting higher feature norms compared to new samples. To address these issues, we pro- pose Adaptive Angular Replay (AAR), a simple yet effective approach that learns features in hyperspherical space using feature and weight normalization. Angular ER demonstrates a strong ability to preserve class structure during task transitions. Additionally, we introduce an adaptive scaling strategy to further mitigate the stability gap and improve overall accuracy.

290Distributed Constrained Optimal Consensus Under a Directed Graph

[openreview] [pdf]

Abstract In this paper, the distributed constrained optimal consensus problem of multi-agent systems under a directed graph is investigated. We propose two projection-based distributed constrained optimal consensus algorithms: one addressing set constraints and the other tailored for general constraints. Only the relative state is exchanged among agents in these two algorithms. In the stability analysis of case with set constraints, we transform the distributed optimization problem into a constrained leaderless consensus problem by adopting a sliding mode approach. Building on this foundational transformation, we further develop a projection-based distributed constrained optimal consensus algorithm to address general constraints. It is shown that the proposed algorithm achieves an ergodic convergence rate of O(1k)O(\frac{1}{k}) with respect to the first-order optimality residuals. Numerical simulations are conducted to validate the effectiveness of our theoretical results.

291How new data pollutes LLM knowledge and how to dilute it

[openreview] [pdf]

Abstract Understanding how the learning of new texts alter the existing knowledge in a large language model is of great importance, because it is through these accumulated changes that the LLM was initially pre-trained, and is also through such changes that continual, new learning in LLMs can proceed. As a result, both desirable alterations (i.e. generalization) and undesirable alterations (i.e. hallucination) can occur. Here, we study the learning of new texts, one at a time, and ask: how does it impact the underlying LLM knowledge? We show that learning new texts induce ‘priming’, an undesirable effect that pollutes existing knowledge where it should not. Centrally, we demonstrate that we can predict how much priming will happen after learning, using token probability before learning. This was empirically robust across models (PALM-2-xs/s, Gemma-2b, Llama-2-7b), of various sizes, and training stages. To show this, we created a new dataset, called “Outlandish” consisting of 1320 different samples with diverse textual characteristics. Finally, we propose two strategies to mitigate the spread of priming: first, a simple text augmentation technique which we call the "stepping-stone’', and second, a novel update pruning technique (“ignore-k”). These decrease priming by a median of 50%-75% and 50%-95% respectively depending on the model architecture, and enhance the specificity of new learning in language models. The dataset and reproducible findings can be found [LINK omitted for double blind review].

292Magnetic Mirror Descent Self-play Preference Optimization

[openreview] [pdf]

Abstract Standard Reinforcement Learning from Human Feedback (RLHF) methods mainly optimize preferences through the Bradley-Terry (BT) reward model, which may misalign with natural human preferences due to the strong transitivity assumption. Recent work has reframed the preference learning problem as a two-player constant-sum game, aiming to learn policies that better reflect human preferences by finding the Nash equilibrium (NE) of this game. However, existing methods under this framework either guarantee only average-iterate convergence or rely on strong first-order approximation assumptions. In this paper, we propose Mirror Descent Self-play Preference Optimization (MDSPO), a novel approach based on Magnetic Mirror Descent (MMD). By introducing an additional magnetic term, MDSPO achieves linear convergence rate to the NE of the regularized game. Furthermore, we establish theoretical guarantees for the convergence of our algorithm to the NE of the original game by periodically updating the reference policy. This approach effectively guarantees that the final policy accurately reflects the true human preferences. To ensure our algorithm is both theoretically sound and practically viable, we provide a simple yet effective implementation that adapts the theoretical insights to the RLHF setting. We demonstrate its effectiveness on a variety of benchmarks.

293Towards Robust Concept Erasure in Diffusion Models: Unlearning Identity, Nudity and Artistic Styles

[openreview] [pdf]

Abstract Diffusion models have achieved remarkable success in generative tasks across various domains. However, the increasing demand for content moderation and the removal of specific concepts from these models has introduced the challenge of \textit{unlearning}. In this work, we present a suite of robust methodologies that significantly enhance the unlearning process by employing advanced loss functions within knowledge distillation frameworks. Specifically, we utilize the Cramer-Wold distance and Jensen-Shannon (JS) divergence to facilitate more efficient and versatile concept removal. Although current non-learning techniques are effective in certain scenarios, they are typically limited to specific categories such as identity, nudity, or artistic style. In contrast, our proposed methods demonstrate robust versatility, seamlessly adapting to and performing effectively across a wide range of concept erasure categories. Our approach outperforms existing techniques, achieving consistent results across different unlearning categories and showcasing its broad applicability. Through extensive experiments, we show that our method not only surpasses previous benchmarks but also addresses key limitations of current unlearning techniques, paving the way for more responsible use of text-to-image diffusion models.

294Continuous Diffusion for Mixed-Type Tabular Data

[openreview] [pdf]

Abstract Score-based generative models (or diffusion models for short) have proven successful for generating text and image data. However, the adaption of this model family to tabular data of mixed-type has fallen short so far. In this paper, we propose CDTD, a Continuous Diffusion model for mixed-type Tabular Data. Specifically, we combine score matching and score interpolation to ensure a common continuous noise distribution for both continuous and categorical features alike. We counteract the high heterogeneity inherent to data of mixed-type with distinct, adaptive noise schedules per feature or per data type. The learnable noise schedules ensure optimally allocated model capacity and balanced generative capability. We homogenize the data types further with model-specific loss calibration and initialization schemes tailored to mixed-type tabular data. Our experimental results show that CDTD consistently outperforms state-of-the-art benchmark models, captures feature correlations exceptionally well, and that heterogeneity in the noise schedule design boosts the sample quality.

295High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

[openreview] [pdf]

Abstract A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings:(i)model shift, where the surrogate model is arbitrary, and(ii)distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that(i)W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but(ii)it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.

296TELEPORTATION WITH NULL SPACE GRADIENT PROJECTION FOR OPTIMIZATION ACCELERATION

[openreview] [pdf]

Abstract Optimization techniques have become increasingly critical due to the ever-growing model complexity and data scale. In particular, teleportation has emerged as a promising approach, which accelerates convergence of gradient descent-based methods by navigating within the loss invariant level set to identify parameters with advantageous geometric properties. Existing teleportation algorithms have primarily demonstrated their effectiveness in optimizing Multi-Layer Perceptrons (MLPs), but their extension to more advanced architectures, such as Convolutional Neural Networks (CNNs) and Transformers, remains challenging. Moreover, they often impose significant computational demands, limiting their applicability to complex architectures. To this end, we introduce an algorithm that projects the gradient of the teleportation objective function onto the input null space, effectively preserving the teleportation within the loss invariant level set and reducing computational cost. Our approach is readily generalizable from MLPs to CNNs, transformers, and potentially other advanced architectures. We validate the effectiveness of our algorithm across various benchmark datasets and optimizers, demonstrating its broad applicability.

297softmax is not enough (for sharp out-of-distribution)

[openreview] [pdf]

Abstract A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier of sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from “circuits” which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.

298Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts

[openreview] [pdf]

Abstract For Mixture-of-Experts (MoE) models, an unbalanced expert load will lead to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss will introduce non-negligible interference gradients into training and thus impair the model performance. In order to control load balance while not producing undesired gradients during training, we proposeLoss-Free Balancing, a new load balancing strategy that operates without auxiliary losses. To be specific, before the top-K routing decision, Loss-Free Balancing will first apply an expert-wise bias to the routing scores of each expert. By dynamically updating the bias of each expert according to its recent load, Loss-Free Balancing can consistently maintain a balanced distribution of expert load. In addition, since Loss-Free Balancing does not produce any interference gradients, it also elevates the upper bound of model performance gained from MoE training. We validate the performance of Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance compared with traditional auxiliary-loss-controlled load balancing strategies.

299Data-Centric Graph Condensation via Diffusion Trajectory Matching

[openreview] [pdf]

Abstract This paper introduces Data Centric Graph Condensation (named DCGC), a data-centric and model-agnostic method for condensing a large graph into a smaller one by matching the distribution between two graphs. DCGC defines the distribution of a graph as the trajectories of its node signals (such as node features and node labels) induced by a diffusion process over the geometric structure, which accommodates multi-order structural information. Built upon this, DCGC compresses the topological knowledge of the original graph into the orders-of-magnitude smaller synthetic one by aligning their distributions in input space. Compared with existing methods that stick to particular GNN architectures and require solving complicated optimization, DCGC can be flexibly applied for arbitrary off-the-shelf GNNs and achieve graph condensation with a much faster speed. Apart from the cross-architecture generalization ability and training efficiency, experiments demonstrate that DCGC yields consistently superior performance than existing methods on datasets with varying scales and condensation ratios.

300LATABLE: TOWARDS LARGE TABULAR MODELS

[openreview] [pdf]

Abstract Tabular data is one of the most ubiquitous data modalities, yet the literature on tabular generative foundation models is lagging behind its text and vision counterparts. Large Tabular Models (LTMs) could revolutionize the way tabular data is used: not as any single dataset analyzed in a vacuum, but contextualized using their metadata and with respect to related datasets. Creating an LTM is difficult, due to the heterogeneous feature spaces of different tabular datasets, metadata, and prior knowledge. In this work, we propose LaTable: a novel tabular diffusion model that addresses these challenges. We show LaTable can be trained across tabular datasets. Through extensive experiments, we find that LaTable displays early signs of scaling laws previously encountered in foundation model regimes. Moreover, LaTable outperform baselines in out-of-distribution few-shot data generation.

301Risk-Sensitive Diffusion: Robustly Optimizing Diffusion Models with Noisy Samples

[openreview] [pdf]

Abstract Diffusion models are mainly studied on image data. However, non-image data (e.g., tabular data) are also prevalent in real applications and tend to be noisy due to some inevitable factors in the stage of data collection, degrading the generation quality of diffusion models. In this paper, we consider a novel problem setting where every collected sample is paired with a vector indicating the data quality: risk vector. This setting applies to many scenarios involving noisy data and we propose risk-sensitive SDE, a type of stochastic differential equation (SDE) parameterized by the risk vector, to address it. With some proper coefficients, risk-sensitive SDE can minimize the negative effect of noisy samples on the optimization of diffusion models. We conduct systematic studies for both Gaussian and non-Gaussian noise distributions, providing analytical forms of risk-sensitive SDE. To verify the effectiveness of our method, we have conducted extensive experiments on multiple tabular and time-series datasets, showing that risk-sensitive SDE permits a robust optimization of diffusion models with noisy samples and significantly outperforms previous baselines.

302CityNav: Language-Goal Aerial Navigation Dataset Using Geographic Information

[openreview] [pdf]

Abstract Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. Despite notable advancements in ground-level navigation, the exploration of aerial navigation using these modalities remains limited. This gap primarily arises from a lack of suitable resources for real-world, city-scale aerial navigation studies. To remedy this gap, we introduce CityNav, a novel dataset explicitly designed for language-guided aerial navigation in photorealistic 3D environments of real cities. CityNav comprises 32k natural language descriptions paired with human demonstration trajectories, collected via a newly developed web-based 3D simulator. Each description identifies a navigation goal, utilizing the names and locations of landmarks within actual cities. As an initial step toward addressing this challenge, we provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We have benchmarked the latest aerial navigation methods alongside our proposed baseline model on the CityNav dataset. The findings are revealing: (i) our aerial agent model trained on human demonstration trajectories, outperform those trained on shortest path trajectories by a large margin; (ii) incorporating 2D spatial map information markedly and robustly enhances navigation performance at a city scale; (iii) despite the use of map information, our challenging CityNav dataset reveals a persistent performance gap between our baseline models and human performance. To foster further research in aerial VLN, we have made the dataset and code available athttps://anonymous.4open.science/w/city-nav-77E3/.

303Discrete Inversion: A Controllable Latent Space for Multinomial Diffusion and Masked Generative Models

[openreview] [pdf]

Abstract Discrete diffusion models have achieved notable success in tasks like image generation and masked language modeling, yet they face limitations in controlled content editing. This paper introduces {\bf Discrete Inversion}, the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the forward diffusion process, Discrete Inversion facilitates accurate reconstruction and controlled edits without the need for predefined masks or attention map manipulation. We demonstrate the effectiveness of our method across both image and text domains, evaluating it on models like VQ-Diffusion, Paella, and RoBERTa. Our results show that Discrete Inversion not only preserves high fidelity in the original data but also enables flexible and user-friendly editing in discrete spaces, significantly advancing the capabilities of discrete generative models.

304Fast Direct: Query-Efficient Online Black-box Guidance for Diffusion-model Target Generation

[openreview] [pdf]

Abstract Guided diffusion-model generation is a promising direction for customizing the generation process of a pre-trained diffusion-model to address the specific downstream tasks. Existing guided diffusion models either rely on training of the guidance model with pre-collected datasets or require the objective functions to be differentiable. However, for most real-world tasks, the offline datasets are often unavailable, and their objective functions are often not differentiable, such as image generation with human preferences, molecular generation for drug discovery, and material design. Thus, we need anonlinealgorithm capable of collecting data during runtime and supporting ablack-boxobjective function. Moreover, thequery efficiencyof the algorithm is also critical because the objective evaluation of the query is often expensive in the real-world scenarios. In this work, we propose a novel and simple algorithm,Fast Direct, for query-efficient online black-box target generation. Our Fast Direct builds a pseudo-target on the data manifold to update the noise sequence of the diffusion model with a universal direction, which is promising to perform query-efficient guided generation. Extensive experiments on twelve high-resolution (1024×1024\small {1024 \times 1024}) image target generation tasks and six 3D-molecule target generation tasks show 6×\textbf{6}\times up to 10×\textbf{10}\times query efficiency improvement and 11×\textbf{11}\times up to 44×\textbf{44}\times query efficiency improvement, respectively.

305Hierarchical Multiscale Diffuser for Extendable Long-Horizon Planning

[openreview] [pdf]

Abstract This paper introduces the Hierarchical Multiscale Diffuser (HM-Diffuser), a novel approach for efficient long-horizon planning. Building on recent advances in diffusion-based planning, our method addresses the challenge of planning over horizons significantly longer than those available in the training data. We decompose the problem into two key subproblems. The first phase, Progressive Trajectory Extension (PTE), involves stitching short trajectories together to create datasets with progressively longer trajectories. In the second phase, we train the HM-Diffuser on these extended datasets, preserving computational efficiency while enhancing long-horizon planning capabilities. The hierarchical structure of the HM-Diffuser allows for subgoal generation at multiple temporal resolutions, enabling a top-down planning approach that aligns high-level, long-term goals with low-level, short-term actions. Experimental results demonstrate that the combined PTE and HM-Diffuser approach effectively generates long-horizon plans, extending far beyond the originally provided trajectories.

306Multi-expert collaboration: Enhancing heterogeneous knowledge independence and alignment in knowledge distillation

[openreview] [pdf]

Abstract Heterogeneous multi-teacher Knowledge distillation attempt to learn a versatile student neural network from multiple pre-trained heterogeneous teachers. But current methods face issues with a lack of independence and alignment in heterogeneous knowledge. To address this issue, we propose a novel method called Multi-Expert Collaboration (MEC). Our approach aggregates multiple expert classifiers within the student model, replacing the conventional single-head architecture. By ensuring that each expert’s independent classifier operates without interfering with others, we enhance the independence of heterogeneous knowledge. Inspired by Helmholtz Free Energy (HFE) theory, we introduce an anchor-based HFE self-normalization strategy to align the heterogeneous knowledge effectively. This method ensures consistent energy levels across all classifiers, allowing the appropriate classifier to achieve the highest confidence for in-distribution data. Extensive experiments on CIFAR-100 and ImageNet-100 datasets demonstrate that MEC significantly outperforms existing heterogeneous multi-teacher knowledge distillation methods, achieving an average accuracy improvement of over 10%.

307Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction

[openreview] [pdf]

Abstract Generative modeling of discrete data underlies important applications spanning text-based agents like ChatGPT to the design of the very building blocks of life in protein sequences. However, application domains need to exert control over the generated data by steering the generative process—typically via RLHF—to satisfy a specified property, reward, or affinity metric. In this paper, we study the problem of steering Masked Diffusion Models (MDMs), a recent class of discrete diffusion models that offer a compelling alternative to traditional autoregressive models. We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pretrained MDMs as a problem of probabilistic inference by learning to sample from a target Bayesian posterior. Our DDPP framework leads to a family of three novel objectives that are all simulation-free, and thus scalable while applying to general non-differentiable reward functions. Empirically, we instantiate DDPP by steering MDMs to perform class-conditional pixel-level image modeling, RLHF-based alignment of MDMs using text based rewards, and finetuning protein language models to generate more diverse secondary structures and shorter proteins. We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.

308Risk Informed Policy Learning for Safer Exploration

[openreview] [pdf]

Abstract Reinforcement learning algorithms typically necessitate extensive exploration of the state space to find optimal policies. However, in safety-critical applications, the risks associated with such exploration can lead to catastrophic consequences. Existing safe exploration methods mitigate this by imposing constraints, but these often result in overly conservative behaviours and inefficient learning. Overfitting on negative experiences hampers the agent’s ability to learn accurate risk representations, limiting its exploration of risky yet potentially high-reward regions of the state space. To address this, we introduce a method that explicitly learns state-conditioned risk representations by incorporating an inductive bias. By augmenting state features with these risk representations, our approach naturally encourages safer exploration without being excessively cautious, resulting in more efficient and safer policy learning. Empirical evaluations across diverse environments show that our method significantly improves task performance while reducing constraint violations during training, underscoring its effectiveness in balancing exploration with safety.

309OccProphet: Pushing the Efficiency Frontier of Camera-Only 4D Occupancy Forecasting with an Observer-Forecaster-Refiner Framework

[openreview] [pdf]

Abstract Predicting variations in complex traffic environments is crucial for the safety of autonomous driving. Recent advancements in occupancy forecasting have enabled forecasting future 3D occupied status in driving environments by observing historical 2D images. However, high computational demands make occupancy forecasting less efficient during training and inference stages, hindering its feasibility for deployment on edge agents. In this paper, we propose a novel framework, \textit{i.e.}, OccProphet, to efficiently and effectively learn occupancy forecasting with significantly lower computational requirements while maintaining forecasting accuracy. OccProphet comprises three lightweight components: Observer, Forecaster, and Refiner. The Observer extracts spatio-temporal features from 3D using the proposed Efficient 4D Aggregation with Tripling-Attention Fusion, while the Forecaster and Refiner conditionally predict and refine future occupancy inferences. Experimental results on nuScenes, Lyft-Level5, and nuScenes-Occupancy datasets demonstrate that OccProphet is both training- and inference-friendly. OccProphet reduces 58%\sim78% of the computational cost with a 2.6×\times speedup compared with the state-of-the-art Cam4DOcc. Moreover, it achieves 4%\sim18% relatively higher forecasting accuracy. The code will be publicly available.

310Enhancing Training Robustness through Influence Measure

[openreview] [pdf]

Abstract In the field of machine learning, the pursuit of robust and accurate models is ongoing. A key aspect of achieving robustness lies in identifying which data points in the training set should be excluded and which high-quality, potentially unlabeled data points outside the training set should be incorporated to improve the model’s performance on unseen data. To accomplish this, an effective metric is needed to evaluate the contribution of each data point toward enhancing overall model performance. This paper proposes the use of an influence measure as a metric to assess the impact of training data on test set performance. Additionally, we introduce a data selection method to optimize the training set as well as a dynamic active learning algorithm driven by the influence measure. The effectiveness of these methods is demonstrated through extensive simulations and real-world datasets.

311PHI-S: Distribution Balancing for Agglomerative Models

[openreview] [pdf]

Abstract Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed “agglomerative models.” We build upon this body of work by studying the effect of the teachers’ activation statistics, particularly the impact of the loss function on the resulting student model quality. We explore a standard toolkit of statistical normalization techniques to better align the different distributions and assess their effects. Further, we examine the impact on downstream teacher-matching metrics, which motivates the use of Hadamard matrices. With these matrices, we demonstrate useful properties, showing how they can be used for isotropic standardization, where each dimension of a multivariate distribution is standardized using the same scale. We call this technique “PHI Standardization” (PHI-S) and empirically demonstrate that it produces the best student model across the suite of methods studied.

312Generative bandit optimization via diffusion posterior sampling

[openreview] [pdf]

Abstract Many real-world discovery problems, including drug and material design, can be modeled within the bandit optimization framework, where an agent selects a sequence of experiments to efficiently optimize an unknown reward function. However, classic bandit algorithms operate on fixed finite or continuous action sets, making discovering novel designs impossible in the former case, and often leading to the curse of dimensionality in the latter, thus rendering these methods impractical. In this work, we first formalize thegenerative banditsetting, where an agent wishes to maximize an unknown reward function over the support of a data distribution, often calleddata manifold, which implicitly encodes complex constraints (e.g., the geometry of valid molecules), and from which (unlabeled) sample data is available (e.g., a dataset of valid molecules). We then propose Diffusion Posterior Sampling (DiffPS), an algorithm that tackles the exploration-exploitation problem directly on the learned data manifold by leveraging a conditional diffusion model. We formally show that the statistical complexity of DiffPS adapts to theintrinsic dimensionalityof the data, overcoming the curse of dimensionality in high-dimensional settings. Our experimental evaluation supports the theoretical claims and demonstrates promising performance in practice.

313CoDiCast: Conditional Diffusion Model for Weather Prediction with Uncertainty Quantification

[openreview] [pdf]

Abstract Accurate weather forecasting is critical for science and society. Yet, existing methods have not demonstrated high accuracy, low uncertainty, and high computational efficiency simultaneously. On one hand, to quantify the uncertainty in weather predictions, the strategy of ensemble forecast (i.e., generating a set of diverse predictions) is often employed. However, traditional ensemble numerical weather prediction (NWP) is computationally intensive. On the other hand, even though most existing machine learning-based weather prediction (MLWP) approaches are efficient and accurate, they are deterministic and cannot capture the uncertainty of weather forecasting. To tackle these challenges, we propose CoDiCast\texttt{CoDiCast}, a conditional diffusion model to generate accurate global weather prediction, while achieving uncertainty quantification and modest computational cost. The key idea behind the prediction task is to generate realistic weather scenarios at a future\textit{future} time point, conditioned on observations from the recent past\textit{recent past}. Due to the probabilistic nature of diffusion models, they can be properly applied to capture the uncertainty of weather predictions. Therefore, we accomplish uncertainty quantifications by repeatedly sampling from stochastic Gaussian noise for each initial weather state and running the denoising process multiple times. Experimental results demonstrate that CoDiCast\texttt{CoDiCast} outperforms several existing MLWP methods in accuracy, and is faster than NWP models in the inference speed. CoDiCast\texttt{CoDiCast} can generate 3-day global weather forecasts, at 6-hour steps and 5.6255.625^\circ latitude-longitude resolutions, for over 5 variables, in about 12 minutes on a commodity A100 GPU machine with 80GB memory. The anonymous code is provided at \url{https://anonymous.4open.science/r/CoDiCast/}.

314Boosting Offline Multi-Objective Reinforcement Learning via Preference Conditioned Diffusion Models

[openreview] [pdf]

Abstract Multi-objective reinforcement learning (MORL) addresses sequential decision-making problems with multiple objectives by learning policies optimized for diverse preferences. While traditional methods necessitate costly online interaction with the environment, recent approaches leverage static datasets containing pre-collected trajectories, making offline MORL the preferred choice for real-world applications. However, existing offline MORL techniques suffer from limited expressiveness and poor generalization on out-of-distribution (OOD) preferences. To overcome these limitations, we propose Diffusion-based Multi-Objective Reinforcement Learning (DiffMORL), a generalizable diffusion-based planning framework for MORL. Leveraging the strong expressiveness and generation capability of diffusion models, DiffMORL further boosts its generalization through offline data mixup, which mitigates the memorization phenomenon and facilitates feature learning by data augmentation. By training on the augmented data, DiffMORL is able to condition on a given preference, whether in-distribution or OOD, to plan the desired trajectory and extract the corresponding action. Experiments conducted on the D4MORL benchmark demonstrate that DiffMORL achieves state-of-the-art results across nearly all tasks. Notably, it surpasses the best baseline on most tasks, underscoring its remarkable generalization ability in offline MORL scenarios.

315A Trajectory Probability Network for City-Scale Road Volume Prediction

[openreview] [pdf]

Abstract City-scale road volume prediction is a fundamental task in traffic management. However, the observation data are often incomplete and biased, posting a challenge for accurate prediction. Existing methods address this issue through interpolation techniques or manual priors, but they typically provide only a deterministic restoration, overlooking the influence of other potential scenarios. To overcome these limitations, we propose a novel neural network-based probabilistic model, the Trajectory Probability Network (TraPNet), which predicts traffic volume through the aggregation of the joint distribution of potential trajectories. TraPNet makes full use of current observations, historical data, and road network information to offer a comprehensive inference of road volumes. Unlike autoregressive methods, TraPNet makes predictions in a single step, substantially reducing computational time while maintaining high predictive accuracy. Experiments on real-world road networks demonstrate that TraPNet outperforms state-of-the-art methods, and can keep the advantage with only 20% observation ratio. The code will be made publicly available.

316Disentangling data distribution for Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) facilitates collaborative training of a global model whose performance is boosted by private data owned by distributed clients, without compromising data privacy. Yet the wide applicability of FL is hindered by entanglement of data distributions across different clients. This paper demonstrates for the first time that by disentangling data distributions FL can in principle achieve efficiencies comparable to those of distributed systems, requiring only one round of communication. To this end, we propose a novel FedDistr algorithm, which employs stable diffusion models to decouple and recover data distributions. Empirical results on the CIFAR100 and DomainNet datasets show that FedDistr significantly enhances model utility and efficiency in both disentangled and near-disentangled scenarios while ensuring privacy, outperforming traditional federated learning methods.

317Cross-Domain Offline Policy Adaptation with Optimal Transport and Dataset Constraint

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) often struggles with limited data. This work explores cross-domain offline RL where offline datasets (with possibly sufficient data) from another domain can be accessed to facilitate policy learning. However, the underlying environments of the two datasets may have dynamics mismatches, incurring inferior performance when simply merging the data of two domains. Existing methods mitigate this issue by training domain classifiers, using contrastive learning methods, etc. Nevertheless, they still rely on a large amount of target domain data to function well. Instead, we address this problem by establishing a concrete performance bound of a policy given datasets from two domains. Motivated by the theoretical insights, we propose to align transitions in the two datasets using optimal transport and selectively share source domain samples, without training any neural networks. This enables reliable data filtering even given a few target domain data. Additionally, we introduce a dataset regularization term that ensures the learned policy remains within the scope of the target domain dataset, preventing it from being biased towards the source domain data. Consequently, we propose the Optimal Transport Data Filtering (dubbed OTDF) method and examine its effectiveness by conducting extensive experiments across various dynamics shift conditions (e.g., gravity shift, morphology shift), given limited target domain data. It turns out that OTDF exhibits superior performance on many tasks and dataset qualities, often surpassing prior strong baselines by a large margin.

318Federated Adapter on Foundation Models: An Out-Of-Distribution Approach

[openreview] [pdf]

Abstract As foundation models gain increasing attention from both academic and industrial communities, Federated Foundation Models (FedFM) have emerged as a privacy-preserving approach for collaboratively fine-tuning models in federated learning (FL) frameworks using distributed datasets across multiple clients. A key challenge for FedFM, given the versatile nature of foundation models, is addressing out-of-distribution (OOD) generalization, where unseen tasks or clients may exhibit distribution shifts leading to suboptimal performance. Although numerous studies have explored OOD generalization in conventional FL, these methods are inadequate for FedFM due to the challenges posed by large parameter scales and increased data heterogeneity, where large parameter scales would result in high computational and communication costs while increased data heterogeneity like cross-domain would lead to suboptimal performance of the aggregated global model on individual client distributions. To bridge this gap, we propose a new method, called FedOA, to enhance the OOD generalization of FedFM under these conditions. Specifically, our method employs adapter-based parameter-efficient fine-tuning methods for efficient learning, and introduces an additional personalized model with a feature distance-based regularization to ensure distribution alignment and provide OOD generalization guarantees for each client. Theoretically, we demonstrate that the conventional aggregated global model in FedFM inherently retains OOD generalization capabilities, and our proposed method enhances the personalized model’s OOD generalization through regularization informed by the global model, with proven convergence under general non-convex settings. Empirically, the effectiveness of the proposed method is validated on benchmark datasets across various NLP tasks.

319Fine-Tuning of Continuous-Time Diffusion Models as Entropy-Regularized Control

[openreview] [pdf]

Abstract Diffusion models excel at capturing complex data distributions, such as those of natural images and proteins. While diffusion models are trained to represent the distribution in the training dataset, we often are more concerned with other properties, such as the aesthetic quality of the generated images or the functional properties of generated proteins. Diffusion models can be finetuned in a goal-directed way by maximizing the value of some reward function (e.g., the aesthetic quality of an image). However, these approaches may lead to reduced sample diversity, significant deviations from the training data distribution, and even poor sample quality due to the exploitation of an imperfect reward function. The last issue often occurs when the reward function is a learned model meant to approximate a ground-truth “genuine” reward, as is the case in many practical applications. These challenges, collectively termed “reward collapse,” pose a substantial obstacle. To address this reward collapse, we frame the finetuning problem as entropy-regularized control against the pretrained diffusion model, i.e., directly optimizing entropy-enhanced rewards with neural SDEs. We present theoretical and empirical evidence that demonstrates our framework is capable of efficiently generating diverse samples with high genuine rewards, mitigating the overoptimization of imperfect reward models.

320Understanding Impact of Human Feedback via Influence Functions

[openreview] [pdf]

Abstract In Reinforcement Learning from Human Feedback (RLHF), it is crucial to learn suitable reward models from human feedback to align large language models (LLMs) with human intentions. However, human feedback can often be noisy, inconsistent, or biased, especially when evaluating complex responses. Such feedback can lead to misaligned reward signals, potentially causing unintended side effects during the RLHF process. To address these challenges, we explore the use of influence functions to measure the impact of human feedback on the performance of reward models. We propose a compute-efficient approximation method that enables the application of influence functions to LLM-based reward models and large-scale preference datasets. In our experiments, we demonstrate two key applications of influence functions: (1) detecting common forms of labeler bias in human feedback datasets and (2) guiding labelers to refine their strategies to align more closely with expert feedback. By quantifying the impact of human feedback on reward models, we believe that influence functions can enhance feedback interpretability and contribute to scalable oversight in RLHF, helping labelers provide more accurate and consistent feedback.

321Diffusion Models Are Real-Time Game Engines

[openreview] [pdf]

Abstract We present GameNGen, the first game engine powered entirely by a neural model that also enables real-time interaction with a complex environment over long trajectories at high quality. When trained on the classic game DOOM, GameNGen extracts gameplay and uses it to generate a playable environment that can interactively simulate new trajectories. GameNGen runs at 20 frames per second on a single TPU and remains stable over extended multi-minute play sessions. Next frame prediction achieves a PSNR of 29.4, comparable to lossy JPEG compression. Human raters are only slightly better than random chance at distinguishing short clips of the game from clips of the simulation, even after 5 minutes of auto-regressive generation. GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions. Conditioning augmentations help ensure stable auto-regressive generation over long trajectories, and decoder fine-tuning improves the fidelity of visual details and text.

322On the Convergence of FedProx with Extrapolation and Inexact Prox

[openreview] [pdf]

Abstract Enhancing the FedProx federated learning algorithm (Li et al., 2020) with server-side extrapolation, Li et al. (2024a) recently introduced the FedExProx method. Their theoretical analysis, however, relies on the assumption that each client computes a certain proximal operator exactly, which is impractical since this is virtually never possible to do in real settings. In this paper, we investigate the behavior of FedExProx without this exactness assumption in the smooth and globally strongly convex setting. We establish a general convergence result, showing that inexactness leads to convergence to a neighborhood of the solution. Additionally, we demonstrate that, with careful control, the adverse effects of this inexactness can be mitigated. By linking inexactness to biased compression (Beznosikov et al., 2023), we refine our analysis, highlighting robustness of extrapolation to inexact proximal updates. We also examine the local iteration complexity required by each client to achieved the required level of inexactness using various local optimizers. Our theoretical insights are validated through comprehensive numerical experiments.

323Diffusion Actor-Critic: Formulating Constrained Policy Iteration as Diffusion Noise Regression for Offline Reinforcement Learning

[openreview] [pdf]

Abstract In offline reinforcement learning, it is necessary to manage out-of-distribution actions to prevent overestimation of value functions. One class of methods, policy-regularized methods, address this problem by constraining the target policy to stay close to the behavior policy. Although several approaches suggest representing the behavior policy as an expressive diffusion model to boost performance, it remains unclear how to regularize the target policy given a diffusion-modeled behavior sampler. In this paper, we propose Diffusion Actor-Critic (DAC) that formulates the Kullback-Leibler (KL) constraint policy iteration as a diffusion noise regression problem, enabling direct representation of target policies as diffusion models. Our approach follows the actor-critic learning paradigm that we alternatively train a diffusion-modeled target policy and a critic network. The actor training loss includes a soft Q-guidance term from the Q-gradient. The soft Q-guidance grounds on the theoretical solution of the KL constraint policy iteration, which prevents the learned policy from taking out-of-distribution actions. We demonstrate that such diffusion-based policy constraint, along with the coupling of the lower confidence bound of the Q-ensemble as value targets, not only preserves the multi-modality of target policies but also contributes to stable convergence and strong performance in DAC. Our approach is evaluated on the D4RL benchmarks and outperforms the state-of-the-art in nearly all environments.

324Denoising Diffusion Causal Discovery

[openreview] [pdf]

Abstract A common theme across multiple disciplines of science is to understand the underlying dependencies between variables from observational data. Such dependencies are often modeled as Bayesian Network (BNs), which by definition are Directed Acyclic Graphs (DAGs). Recent advancements, such as NOTEARS and DAG-GNN, have focused on formulating continuous DAG constraints and learning DAGs via continuous optimization. However, these methods often have scalability issues and face challenges when applied to real world data. In this paper, we propose Denoising Diffusion Causal Discovery (DDCD), a new learning framework that leverages Denoising Diffusion Probabilistic Models (DDPMs) for causal structural learning. Using the denoising objective, our method allows the model to explore a wider range of noise in the data and effectively captures both linear and nonlinear dependencies. It also has reduced complexity and is more suitable for inference of larger networks. To accommodate potential feedback loops in biological networks, we propose a k-hop DAG constraint. Additionally, we suggest using fixed-size bootstrap sampling to ensure similar training performance across varying dataset sizes. Our experiments on synthetic data demonstrate that DDCD achieves consistent competitive performance compared to existing methods while noticeably reducing computation time. We also show that DDCD can generate trustworthy networks from real-world datasets.

325MGD3: Mode-Guided Dataset Distillation using Diffusion Models

[openreview] [pdf]

Abstract Dataset distillation aims to synthesize a smaller training set from a large dataset such that a model trained on this distilled set performs comparably to one trained on the entire dataset. For image classification, earlier methods proposed optimization strategies in the input space to synthesize a distilled dataset, but they are computationally expensive and difficult to scale to higher resolutions. Also, the datasets synthesized by these methods lack intra-class diversity as they ignore the modes of the data distribution. Recent works propose using generative models, among which diffusion models have shown promising results as they are known to capture the data distribution effectively. However, diffusion models tend to over-sample from the prominent modes of the data distribution, resulting in limited diversity in the generated samples. To address these limitations in this work, we propose a mode-guided diffusion model. Unlike existing works that fine-tune the diffusion models for dataset distillation, we propose to use a pre-trained model without the need for fine-tuning. Our novel approach consists of three stages: Mode Discovery, Mode Guidance, and Stop Guidance. In the first stage, we discover distinct modes in the data distribution of a class to build a representative set. In the second stage, we use a pre-trained diffusion model and guide the diffusion process toward the discovered modes to generate distinct samples, ensuring intra-class diversity. However, mode-guided sampling can introduce artifacts in the synthetic sample, which affect the performance. To control the fidelity of the synthetic dataset, we introduce the stop guidance. We evaluate our method on multiple benchmark datasets, including ImageNette, ImageIDC, ImageNet-100, and ImageNet-1K; Our method improved 4.4%, 2.9%, 1.6%, and 1.6% over the current state-of-the-art on the respective datasets. In addition, our method does not require retraining of the diffusion model, which leads to reduced computational requirements. We also demonstrate that our approach is effective with general-purpose diffusion models such as Text-to-Image Stable Diffusion, eliminating the need for a pre-trained model in the target dataset.

326Direct Judgement Preference Optimization

[openreview] [pdf]

Abstract Auto-evaluation is crucial for assessing response quality and offering feedback for model development. Recent studies have explored training large language models (LLMs) as generative judges to both evaluate model responses and generate natural language critiques. However, existing models have been trained almost exclusively with supervised fine-tuning (SFT), often only on a small number of datasets, resulting in poor generalization across different evaluation settings and tasks. In this paper, we investigate how learning from both positive and negative data with direct preference optimization (DPO) enhances the evaluation capabilities of LLM judges across three evaluation tasks: pairwise, single ratings, and binary classification. We achieve this by creating three forms of DPO data from a diverse collection of human and synthetic judgements on contemporary model outputs, with the goal of training our model to generate meaningful critiques, make accurate judgements, and understand what constitutes good and bad responses for a given user input. To demonstrate the effectiveness of our method, we train judge models of three sizes: 8B parameters, 12B, and 70B, and conduct a comprehensive study over 13 benchmarks (7 pairwise, 4 single rating, and 2 classification), measuring agreement with human and GPT-4 annotations. Our models exhibit the best aggregate performance, with even our 8B model outperforming strong baselines like GPT-4o and specialized judge models, such as OffsetBias-8B, Auto-J-13B, Prometheus-2-8x7B, and Skywork-Critic-70B, in pairwise benchmarks. Further analysis shows that our judge model robustly counters biases such as position and length bias, flexibly adapts to practitioner-specified evaluation protocols, and provides helpful language feedback for improving downstream generator models.

327FedSUV: Validity and Utility-guided Client Selection for Federated Learning

[openreview] [pdf]

Abstract Federated Learning faces significant challenges arising from two critical uncertainties: the validity of a client’s participation, which can be compromised by network and system heterogeneity, and the utility of the data contributed by each client, which varies due to heterogeneous statistical data. Traditional client selection methods often treat these uncertainties as a whole, leading to suboptimal performance. To address this issue, we propose FedSUV, an innovative client selection framework that decouples validity and utility uncertainties. FedSUV approaches client selection from a multi-objective optimization perspective, employing advanced bandit algorithms: a confidence bound-based linear contextual bandit for assessing validity and a Gaussian Process bandit for evaluating utility. We validate the effectiveness of FedSUV through both theoretical analysis and large-scale experiments conducted within our physical cluster.

328Fast constrained sampling in pre-trained diffusion models

[openreview] [pdf]

Abstract Diffusion models have dominated the field of large, generative image models, with the prime examples of Stable Diffusion and DALL-E 3 being widely adopted. These models have been trained to perform text-conditioned generation on vast numbers of image-caption pairs and as a byproduct, have acquired general knowledge about natural image statistics. However, when confronted with the task of constrained sampling, e.g. generating the right half of an image conditioned on the known left half, applying these models is a delicate and slow process, with previously proposed algorithms relying on expensive iterative operations that are usually orders of magnitude slower than text-based inference. This is counter-intuitive, as image-conditioned generation should rely less on the difficult-to-learn semantic knowledge that links captions and imagery, and should instead be achievable by lower-level correlations among image pixels. In practice, inverse models are trained or tuned separately for each inverse problem, e.g. by providing parts of images during training as an additional condition, to allow their application in realistic settings. However, we argue that this is not necessary and propose an algorithm for fast-constrained sampling in large pre-trained diffusion models (Stable Diffusion) that requires no expensive backpropagation operations through the model and produces results comparable even to the state-of-the-art \emph{tuned} models. Our method is based on a novel optimization perspective to sampling under constraints and employs a numerical approximation to the expensive gradients, previously computed using backpropagation, incurring significant speed-ups.

329Can We Ignore Labels in Out of Distribution Detection?

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection methods have recently become more prominent, serving as a core element in safety-critical autonomous systems. One major purpose of OOD detection is to reject invalid inputs that could lead to unpredictable errors and compromise safety. Due to the cost of labeled data, recent works have investigated the feasibility of self-supervised learning (SSL) OOD detection, unlabled OOD detection, and zero shot OOD detection. In this work, we identify a set of conditions for a theoretical guarantee of failure in unlabeled OOD detection algorithms from an information-theoretic perspective. These conditions are present in all OOD tasks dealing with real world data: I) we provide theoretical proof of unlabeled OOD detection failure when there exists zero mutual information between the learning objective and the in-distribution labels, a.k.a. ‘label blindness’, II) we define a new OOD task – Adjacent OOD detection – that tests for label blindness and accounts for a previously ignored safety gap in all OOD detection benchmarks, and III) we perform experiments demonstrating that existing unlabeled OOD methods fail under conditions suggested by our label blindness theory and analyze the implications for future research in unlabeled OOD methods.

330Rectified Diffusion Guidance for Conditional Generation

[openreview] [pdf]

Abstract Classifier-Free Guidance (CFG), which combines the conditional and unconditional score functions with two coefficients summing to one, serves as a practical technique for diffusion model sampling. Theoretically, however, denoising with CFGcannotbe expressed as a reciprocal diffusion process, which may consequently leave some hidden risks during use. In this work, we revisit the theory behind CFG and rigorously confirm that the improper configuration of the combination coefficients (i.e., the widely used summing-to-one version) brings about expectation shift of the generative distribution. To rectify this issue, we propose ReCFG with a relaxation on the guidance coefficients such that denoising with ReCFG strictly aligns with the diffusion theory. We further show that our approach enjoys aclosed-formsolution given the guidance strength. That way, the rectified coefficients can be readily pre-computed via traversing the observed data, leaving the sampling speed barely affected. Empirical evidence on real-world data demonstrate the compatibility of our post-hoc design with existing state-of-the-art diffusion models, including both class-conditioned ones (e.g., EDM2 on ImageNet) and text-conditioned ones (e.g., SD3 on CC12M), without any retraining. We will open-source the code to facilitate further research.

331Quality Diversity Imitation Learning

[openreview] [pdf]

Abstract Imitation learning (IL) has shown great potential in various applications, such as robot control. However, traditional IL methods are usually designed to learn only one specific type of behavior since demonstrations typically correspond to a single expert. In this work, we introduce the first generic framework for Quality Diversity Imitation Learning (QD-IL), which enables the agent to learn a broad range of skills from limited demonstrations. Our framework integrates the principles of quality diversity with adversarial imitation learning (AIL) methods, and can potentially improve any inverse reinforcement learning (IRL) method. Empirically, our framework significantly improves the QD performance of GAIL and VAIL on the challenging continuous control tasks derived from Mujoco environments. Moreover, our method even achieves 2x expert performance in the most challenging Humanoid environment.

332C2INet: Realizing Incremental Trajectory Prediction with Prior-Aware Continual Causal Intervention

[openreview] [pdf]

Abstract Trajectory prediction for multi-agents in complex scenarios is crucial for applications like autonomous driving. However, existing methods often overlook environmental biases, which leads to poor generalization. Additionally, hardware constraints limit the use of large-scale data across environments, and continual learning settings exacerbate the challenge of catastrophic forgetting. To address these issues, we propose the Continual Causal Intervention (C2^{2}INet) method for generalizable multi-agent trajectory prediction within a continual learning framework. Using variational inference, we align environment-related prior with posterior estimator of confounding factors in the latent space, thereby intervening in causal correlations that affect trajectory representation. Furthermore, we store optimal variational priors across various scenarios using a memory queue, ensuring continuous debiasing during incremental task training. The proposed C2^{2}INet enhances adaptability to diverse tasks while preserving previous task information to prevent catastrophic forgetting. It also incorporates pruning strategies to mitigate overfitting. Comparative evaluations on three real and synthetic complex datasets against state-of-the-art methods demonstrate that our proposed method consistently achieves reliable prediction performance, effectively mitigating confounding factors unique to different scenarios. This highlights the practical value of our method for real-world applications.

333Neural Approximate Mirror Maps for Constrained Diffusion Models

[openreview] [pdf]

Abstract Diffusion models excel at creating visually-convincing images, but they often struggle to meet subtle constraints inherent in the training data. Such constraints could be physics-based (e.g., satisfying a PDE), geometric (e.g., respecting symmetry), or semantic (e.g., including a particular number of objects). When the training data all satisfy a certain constraint, enforcing this constraint on a diffusion model makes it more reliable for generating valid synthetic data and solving constrained inverse problems. However, existing methods for constrained diffusion models are restricted in the constraints they can handle. For instance, recent work proposed to learn mirror diffusion models (MDMs), but analytical mirror maps only exist for convex constraints and can be challenging to derive. We proposeneural approximate mirror maps(NAMMs) for general, possibly non-convex constraints. Our approach only requires a differentiable distance function from the constraint set. We learn an approximate mirror map that transforms data into an unconstrained space and a corresponding approximate inverse that maps data back to the constraint set. A generative model, such as an MDM, can then be trained in the learned mirror space and its samples restored to the constraint set by the inverse map. We validate our approach on a variety of constraints, showing that compared to an unconstrained diffusion model, a NAMM-based MDM substantially improves constraint satisfaction. We also demonstrate how existing diffusion-based inverse-problem solvers can be easily applied in the learned mirror space to solve constrained inverse problems.

334Uncertainty-Regularized Diffusional Subgoals for Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract Hierarchical reinforcement learning (HRL) aims to solve complex tasks by making decisions across multiple levels of temporal abstraction. However, off-policy training of hierarchical policies faces non-stationarity issues because the low-level policy is constantly changing, which makes it difficult for the high-level policy that generates subgoals to adapt. In this paper, we propose a conditional diffusion model-based approach for subgoal generation to mitigate these non-stationarity challenges. Specifically, we employ a Gaussian Process (GP) prior on subgoal generation as a surrogate distribution to regularize the diffusion policy and inform the diffusion process about uncertain areas in the action space. We introduce adaptive inducing states to facilitate sparse GP-based subgoal generation, enhancing sample efficiency and promoting better exploration in critical regions of the state space. Building on this framework, we develop an exploration strategy that identifies promising subgoals based on the learned predictive distribution of the diffusional subgoals. Experimental results demonstrate significant improvements in both sample efficiency and performance on challenging continuous control benchmarks compared to prior HRL methods.

335Scaling Optimal LR Across Token Horizons

[openreview] [pdf]

Abstract State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size, and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or transferred from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et. al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this we conduct a large-scale empirical study on how optimal learning rate (LR) depends on the token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates smaller LR. Secondly, we demonstrate that the optimal LR follows a scaling law and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule-of-thumb for transferring LR across token horizons with zero overhead over current practices. Lastly, we provide evidence that LLama-1 used too high LR, and argue that hyperparameter transfer across data size is an overlooked component of LLM training.

336Can foundation models actively gather information in interactive environments to test hypotheses?

[openreview] [pdf]

Abstract While problem solving is a standard evaluation task for foundation models, a crucial component of problem solving---actively and strategically gathering information to test hypotheses---has not been closely investigated. To assess the information gathering abilities of foundation models in interactive environments, we introduce a framework in which a model must determine the factors influencing a hidden reward function by iteratively reasoning about its previously gathered information and proposing its next exploratory action to maximize information gain at each step. We implement this framework in both a text-based environment, which offers a tightly controlled setting and enables high-throughput parameter sweeps, and in an embodied 3D environment, which requires addressing complexities of multi-modal interaction more relevant to real-world applications. We further investigate whether approaches such as self-correction and increased inference time improve information gathering efficiency. In a relatively simple task that requires identifying a single rewarding feature, we find that Gemini’s information gathering capability is close to optimal. However, when the model must identify a conjunction of rewarding features, performance is suboptimal. The hit in performance is due partly to the model translating task description to a policy and partly to the model’s effectiveness in using its in-context memory. Performance is comparable in both text and 3D embodied environments, although imperfect visual object recognition reduces its accuracy in drawing conclusions from gathered information in the 3D embodied case. For single-feature-based rewards, we find that smaller models curiously perform better; for conjunction-based rewards, incorporating self correction into the model improves performance.

337Bootstrapped Model Predictive Control

[openreview] [pdf]

Abstract Model Predictive Control (MPC) has been demonstrated to be effective in continuous control tasks. When a world model and a value function are available, planning a sequence of actions ahead of time leads to a better policy. Existing methods typically obtain the value function and the corresponding policy in a model-free manner. However, we find that such an approach struggles with complex tasks, resulting in poor policy learning and inaccurate value estimation. To address this problem, we leverage the strengths of MPC itself. In this work, we introduce Bootstrapped Model Predictive Control (BMPC), a novel algorithm that performs policy learning in a bootstrapped manner. BMPC learns a network policy by imitating an MPC expert, and in turn, uses this policy to guide the MPC process. Combined with model-based TD-learning, our policy learning yields better value estimation and further boosts the efficiency of MPC. We also introduce a lazy reanalyze mechanism, which enables computationally efficient imitation learning. Our method achieves superior performance over prior works on diverse continuous control tasks. In particular, on challenging high-dimensional locomotion tasks, BMPC significantly improves data efficiency while also enhancing asymptotic performance and training stability, with comparable training time and smaller network sizes. Code is available athttps://github.com/bmpc-anonymous/bmpc.

338Exploring Diffusion Models’ Corruption Stage in Few-Shot Fine-tuning and Mitigating with Bayesian Neural Networks

[openreview] [pdf]

Abstract Few-shot fine-tuning of Diffusion Models (DMs) is a key advancement, significantly reducing training costs and enabling personalized AI applications. However, we explore the training dynamics of DMs and observe an unanticipated phenomenon: during the training process, image fidelity initially improves, then unexpectedly deteriorates with the emergence of noisy patterns, only to recover later with severe overfitting. We term the stage with generated noisy patterns as corruption stage. To understand this corruption stage, we begin by heuristically modeling the one-shot fine-tuning scenario, and then extend this modeling to more general cases. Through this modeling, we identify the primary cause of this corruption stage: a narrowed learning distribution inherent in the nature of few-shot fine-tuning. To tackle this, we apply Bayesian Neural Networks (BNNs) on DMs with variational inference to implicitly broaden the learned distribution, and present that the learning target of the BNNs can be naturally regarded as an expectation of the diffusion loss and a further regularization with the pretrained DMs. This approach is highly compatible with current few-shot fine-tuning methods in DMs and does not introduce any extra inference costs. Experimental results demonstrate that our method significantly mitigates corruption, and improves the fidelity, quality and diversity of the generated images in both object-driven and subject-driven generation tasks. The code is available at an anonymous link.

339FairCoT: Enhancing Fairness in Diffusion Models via Chain of Thought Reasoning of Multimodal Language Models

[openreview] [pdf]

Abstract In the domain of text-to-image generative models, biases inherent in training datasets often propagate into generated content, posing significant ethical challenges, particularly in socially sensitive contexts. We introduce FairCoT, a novel framework that enhances fairness in diffusion models through Chain-of-Thought (CoT) reasoning within multimodal generative large language models (LLMs). FairCoT employs iterative CoT refinement and attire-based attribute prediction to systematically mitigate biases, ensuring diverse and equitable representation in generated images. By integrating iterative reasoning processes, FairCoT addresses the limitations of zero-shot CoT in sensitive scenarios, balancing creativity with ethical responsibility. Experimental evaluations across multiple models, including DALL-E and various Stable Diffusion variants, demonstrate that FairCoT significantly improves fairness and diversity metrics without compromising image quality or relevance. Our approach advances ethical AI practices in generative modeling, promoting socially responsible content generation and setting new standards for fairness in AI-generated imagery.

340Win Rate is All that Can Matter from Preference Data Alone

[openreview] [pdf]

Abstract The surging interest in learning from preference data has resulted in an elaborate landscape of methods and evaluations. This work offers a framework to simplify this landscape. We start with the insight that the only fixed information represented in preference data is the preference classifier, and thus the only evaluation of a model grounded in the data is win rate under this classifier. In other words, win rate is all that can matter from preference data alone. This insight allows us to unlock many follow-up insights. First, we introduce a family of objectives to directly optimize for win rate, called Direct Win Rate Optimization (DWRO) objectives. We show that Reinforcement Learning From Human Feedback (RLHF) is a KL-regularized DWRO objective while SFT on preferred samples is not. We then compare the target distributions of various preference learning objectives and explain how different design choices affect sharpness of the resulting distribution. Furthermore, we provide close-formed solutions for the expected win rate improvement of common preference learning algorithms and explain the intuitions they provide. Our analysis and accompanying experiments not only elucidate the design space of preference learning algorithms but also offer guidance on future directions to advance preference learning.

341Jump Your Steps: Optimizing Sampling Schedule of Discrete Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have seen notable success in continuous domains, leading to the development of discrete diffusion models (DDMs) for discrete variables. Despite recent advances, DDMs face the challenge of slow sampling speeds. While parallel sampling methods like τ\tau-leaping accelerate this process, they introduceCompounding Decoding Error(CDE), where discrepancies arise between the true distribution and the approximation from parallel token generation, leading to degraded sample quality. In this work, we presentJump Your Steps(JYS), a novel approach that optimizes the allocation of discrete sampling timesteps by minimizing CDE without extra computational cost. More precisely, we derive a practical upper bound on CDE and propose an efficient algorithm for searching for the optimal sampling schedule. Extensive experiments across image, music, and text generation show that JYS significantly improves sampling quality, establishing it as a versatile framework for enhancing DDM performance for fast sampling.

342Diffusion-based Prompt Generation for Lifelong Continual Adaptation

[openreview] [pdf]

Abstract Continual Test-time Adaptation (TTA) addresses sequential out-of-distribution scenarios with unlabeled data but overlooks long-term and recurring in-distribution aspects of the real world. Therefore, we introduce Lifelong Continual Adaptation, which enables models to efficiently retrieve domain-specific knowledge when encountering in-distribution data streams with sequential and recurring domains. We found that optimization-based Continual TTA methods underperform on the proposed problem due to two major pitfalls: updating the model’s parameters is expensive and impractical for resource-constrained devices, and these methods exhibit instability when adapting to long-term recurring domains. To address these challenges, we propose a diffusion-based prompt generation method (DiffPrompt). Specifically, instead of continually optimizing the foundation model, we generate domain-specific prompts for it to adapt. We use a conditional diffusion model to learn a prompt-space distribution for various domains. During testing, the diffusion model generates prompts for the current domain based on the incoming batch of data, facilitating the continual adaptation of the foundation model. Our experiments demonstrate that DiffPrompt enables stable and efficient deployment in practical scenarios involving sequential and recurring domains.

343Knowledge Localization: Mission Not Accomplished? Enter Query Localization!

[openreview] [pdf]

Abstract Large language models (LLMs) store extensive factual knowledge, but the mechanisms behind how they store and express this knowledge remain unclear. The Knowledge Neuron (KN) thesis is a prominent theory for explaining these mechanisms. This theory is based on theKnowledge Localization (KL)assumption, which suggests that a fact can be localized to a few knowledge storage units, namely knowledge neurons. However, this assumption has two limitations: first, it may be too rigid regarding knowledge storage, and second, it neglects the role of the attention module in knowledge expression.In this paper, we first re-examine the KL assumption and demonstrate that its limitations do indeed exist. To address these, we then present two new findings, each targeting one of the limitations: one focusing on knowledge storage and the other on knowledge expression. We summarize these findings asQuery Localizationassumption and argue that the KL assumption can be viewed as a simplification of the QL assumption. Based on QL assumption, we further propose the Consistency-Aware KN modification method, which improves the performance of knowledge modification, further validating our new assumption. We conduct 39 sets of experiments, along with additional visualization experiments, to rigorously confirm our conclusions. Code will be made public soon.

344Boosting Latent Diffusion with Perceptual Objectives

[openreview] [pdf]

Abstract Latent diffusion models (LDMs) power state-of-the-art high-resolution generative image models. LDMs learn the data distribution in the latent space of an autoencoder (AE) and produce images by mapping the generated latents into RGB image space using the AE decoder. While this approach allows for efficient model training and sampling, it induces a disconnect between the training of the diffusion model and the decoder, resulting in a loss of detail in the generated images. To remediate this disconnect, we propose to leverage the internal features of the decoder to define a latent perceptual loss (LPL). This loss encourages the models to create sharper and more realistic images. Our loss can be seamlessly integrated with common autoencoders used in latent diffusion models, and can be applied to different generative modeling paradigms such as DDPM with epsilon and velocity prediction, as well as flow matching. Extensive experiments with models trained on three datasets at 256 and 512 resolution show improved quantitative -- with boosts between 6% and 20% in FID -- and qualitative results when using our perceptual loss.

345Alternating Projections With Volume Sampling

[openreview] [pdf]

Abstract The method of Alternating Projections (AP) is a fundamental iterative technique with applications to problems in machine learning, optimization and signal processing. Examples include the Gauss-Seidel algorithm which is used to solve large-scale regression problems and the Kaczmarz and projections onto convex sets (POCS) algorithms that are fundamental to iterative reconstruction. Progress has been made with regards to the questions of efficiency and rate of convergence in the randomized setting of the AP method. Here, we extend these results with volume sampling to block (batch) sizes greater than 1 and provide explicit formulas that relate the convergence rate bounds to the spectrum of the underlying system. These results, together with a trace formula and associated volume sampling, prove that convergence rates monotonically improve with larger block sizes, a feature that can not be guaranteed in general with uniform sampling (e.g., in SGD).

346Event-Driven Online Vertical Federated Learning

[openreview] [pdf]

Abstract Online learning is more adaptable to real-world scenarios in Vertical Federated Learning (VFL) compared to offline learning. However, integrating online learning into VFL presents challenges due to the unique nature of VFL, where clients possess non-intersecting feature sets for the same sample. In real-world scenarios, the clients may not receive data streaming for the disjoint features for the same entity synchronously. Instead, the data are typically generated by aneventrelevant to only a subset of clients. We are the first to identify these challenges in online VFL, which have been overlooked by previous research. To address these challenges, we proposed an event-driven online VFL framework. In this framework, only a subset of clients were activated during each event, while the remaining clients passively collaborated in the learning process. Furthermore, we incorporateddynamic local regret (DLR)into VFL to address the challenges posed by online learning problems with non-convex models within a non-stationary environment. We conducted a comprehensive regret analysis of our proposed framework, specifically examining the DLR under non-convex conditions with event-driven online VFL. Extensive experiments demonstrated that our proposed framework was more stable than the existing online VFL framework under non-stationary data conditions while also significantly reducing communication and computation costs.

347Improving Generalization of Meta Reinforcement Learning via Explanation

[openreview] [pdf]

Abstract Meta reinforcement learning learns a meta-prior (e.g., meta-policy) from a set of training tasks, such that the learned meta-prior can efficiently adapt to all the tasks in a task distribution. However, it has been observed in literature that the learned meta-prior usually has imbalanced generalization, i.e., it adapts well to some tasks but adapts poorly to some other tasks. This paper aims to explain why certain tasks are poorly adapted and, more importantly, use this explanation to improve generalization. Our methodology has two parts. The first part identifies ``critical" training tasks that are most important to achieve good performance on those poorly-adapted tasks. An explanation of the poor generalization is that the meta-prior does not pay enough attention to the critical training tasks. To improve generalization, the second part formulates a bi-level optimization problem where the upper level learns how to augment the critical training tasks such that the meta-prior can pay more attention to the critical tasks, and the lower level computes the meta-prior distribution corresponding to the current augmentation. We propose an algorithm to solve the bi-level optimization problem and theoretically guarantee that (1) the algorithm converges at the rate of O(1/K)O(1/\sqrt{K}), (2) the learned augmentation makes the meta-prior focus more on the critical training tasks, and (3) the generalization improves after the task augmentation. We use two real-world experiments and three MuJoCo experiments to show that our algorithm improves the generalization and outperforms state-of-the-art baselines.

348DiffPath: Generating Road Network based Path with Latent Diffusion Model

[openreview] [pdf]

Abstract With the increasing use of GPS technology, path has become essential for applications such as navigation, urban planning, and traffic optimization. However, obtaining real-world path presents challenges due to privacy concerns and the difficulty of collecting large datasets. Existing methods, including count-based and deep learning approaches, struggle with two main challenges: handling complex distributions of path segments and ensuring global coherence in generated paths. To address these, we introduce DiffPath, a path generation model based on Latent Diffusion Models (LDMs). By embedding path into a continuous latent space and leveraging a transformer architecture, DiffPath captures both local transitions and global dependencies, ensuring the generation of realistic paths. Experimental results demonstrate that our model outperforms existing approaches in generating paths that adhere to real-world road network structures while maintaining privacy.

349Average Certified Radius is a Poor Metric for Randomized Smoothing

[openreview] [pdf]

Abstract Randomized smoothing is a popular approach for providing certified robustness guarantees against adversarial attacks, and has become a very active area of research. Over the past years, the average certified radius (ACR) has emerged as the single most important metric for comparing methods and tracking progress in the field. However, in this work, we show that ACR is an exceptionally poor metric for evaluating robustness guarantees provided by randomized smoothing. We theoretically show not only that a trivial classifier can have arbitrarily large ACR, but also that ACR is much more sensitive to improvements on easy samples than on hard ones. Empirically, we confirm that existing training strategies that improve ACR reduce the model’s robustness on hard samples. Further, we show that by focusing on easy samples, we can effectively replicate the increase in ACR. We develop strategies, including explicitly discarding hard samples, reweighing the dataset with certified radius, and extreme optimization for easy samples, to achieve state-of-the-art ACR, although these strategies ignore robustness for the general data distribution. Overall, our results suggest that ACR has introduced a strong undesired bias to the field, and better metrics are required to holistically evaluate randomized smoothing.

350Ensemble Kalman Diffusion Guidance: A Derivative-free Method for Inverse Problems

[openreview] [pdf]

Abstract When solving inverse problems, it is increasingly popular to use pre-trained diffusion models as plug-and-play priors. This framework can accommodate different forward models without re-training while preserving the generative capability of diffusion models. Despite their success in many imaging inverse problems, most existing methods rely on privileged information such as derivative, pseudo-inverse, or full knowledge about the forward model. This reliance poses a substantial limitation that restricts their use in a wide range of problems where such information is unavailable, such as many scientific applications. To address this, we propose Ensemble Kalman Diffusion Guidance (EnKG) for diffusion models, a derivative-free approach that can solve inverse problems by only accessing forward model evaluations and a pre-trained diffusion model. We study the empirical effectiveness of our method across various inverse problems, including scientific settings such as inferring fluid flows and astronomical objects, which are highly non-linear inverse problems that often only permit black-box access to the forward model.

351Improved Sampling Algorithms for Lévy-Itô Diffusion Models

[openreview] [pdf]

Abstract Lévy-Itô denoising diffusion models relying on isotropic α-stable noise instead of Gaussian distribution have recently been shown to improve performance of conventional diffusion models in image generation on imbalanced datasets while performing comparably in the standard settings. However, the stochastic algorithm of sampling from such models consists in solving the stochastic differential equation describing only an approximate inverse of the process of adding α-stable noise to data which may lead to suboptimal performance. In this paper, we derive a parametric family of stochastic differential equations whose solutions have the same marginal densities as those of the forward diffusion and show that the appropriate choice of the parameter values can improve quality of the generated images when the number of reverse diffusion steps is small. Also, we demonstrate that Lévy-Itô diffusion models are applicable to diverse domains and show that a well-trained text-to-speech Lévy-Itô model may have advantages over standard diffusion models on highly imbalanced datasets.

352Fast and Slow Streams for Online Time Series Forecasting Without Information Leakage

[openreview] [pdf]

Abstract Current research in online time series forecasting suffers from information leakage: models predict and then evaluate on historical time steps that have been backpropagated for parameter updates. This setting also misaligns with the real-world conception of forecasting, which typically emphasizes looking ahead and anticipating future uncertainties. This paper redefines online time series forecasting to focus on predicting unknown future steps and evaluates performance solely based on these predictions. Following this new setting, challenges arise in leveraging incomplete pairs of ground truth and prediction for backpropagation, as well as generalizing accurate information without overfitting to noises from recent data streams. To address these challenges, we propose a novel dual-stream framework for online forecasting (DSOF): a slow stream that updates with complete data using experience replay, and a fast stream that adapts to recent data through temporal difference learning. This dual-stream approach updates a teacher-student model learned through a residual learning strategy, generating predictions in a coarse-to-fine manner. Extensive experiments demonstrate its improvement in forecasting performance in changing environments.

353Flexible Fairness-Aware Learning via Inverse Conditional Permutation

[openreview] [pdf]

Abstract Equalized odds, as a popular notion of algorithmic fairness, aims to ensure that sensitive variables, such as race and gender, do not unfairly influence the algorithm’s prediction when conditioning on the true outcome. Despite rapid advancements, current research primarily focuses on equalized odds violations caused by a single sensitive attribute, leaving the challenge of simultaneously accounting for multiple attributes largely unaddressed. We bridge this gap by introducing an in-processing fairness-aware learning approach, FairICP, which integrates adversarial learning with a novel inverse conditional permutation scheme. FairICP offers a theoretically justified, flexible, and efficient scheme to promote equalized odds under fairness conditions described by complex and multi-dimensional sensitive attributes. The efficacy and adaptability of our method are demonstrated through both simulation studies and empirical analyses of real-world datasets.

354ASOR: Anchor State Oriented Regularization for Policy Optimization under Dynamics Shift

[openreview] [pdf]

Abstract To train neural policies in environments with diverse dynamics, Imitation from Observation (IfO) approaches aim at recovering expert state trajectories. Their success is built upon the assumption that the stationary state distributions induced by optimal policies remain similar despite dynamics shift. However, such an assumption does not hold in many real world scenarios, especially when certain states become inaccessible during environment dynamics change. In this paper, we propose the concept of anchor states which appear in all optimal trajectories under dynamics shift, thereby maintaining consistent state accessibility. Instead of direct imitation, we incorporate anchor state distributions into policy regularization to mitigate the issue of inaccessible states, leading to the ASOR algorithm. By formally characterizing the difference of state accessibility under dynamics shift, we show that the anchor state-based regularization approach provides strong lower- bound performance guarantees for efficient policy optimization. We perform extensive experiments across various online and offline RL benchmarks, including Gridworld, MuJoCo, MetaDrive, D4RL, and a fall-guys like game environment, featuring multiple sources of dynamics shift. Experimental results indicate ASOR can be effectively integrated with several state-of-the-art cross-domain policy transfer algorithms, substantially enhancing their performance.

355Imputation for prediction: beware of diminishing returns.

[openreview] [pdf]

Abstract Missing values are prevalent across various fields, posing challenges for training and deploying predictive models. In this context, imputation is a common practice, driven by the hope that accurate imputations will enhance predictions. However, recent theoretical and empirical studies indicate that simple constant imputation can be consistent and competitive. This empirical study aims at clarifyingifandwheninvesting in advanced imputation methods yields significantly better predictions. Relating imputation and predictive accuracies across combinations of imputation and predictive models on 19 datasets, we show that imputation accuracy matters less i) when using expressive models, ii) when incorporating missingness indicators as complementary inputs, iii) matters much more for generated linear outcomes than for real-data outcomes. Interestingly, we also show that the use of the missingness indicator is beneficial to the prediction performance, even in MCAR scenarios. Overall, on real-data with powerful models, imputation quality has only a minor effect on prediction performance. Thus, investing in better imputations for improved predictions often offers limited benefits.

356DSPO: Direct Score Preference Optimization for Diffusion Model Alignment

[openreview] [pdf]

Abstract Diffusion-based Text-to-Image (T2I) models have achieved impressive success in generating high-quality images from textual prompts. While large language models (LLMs) effectively leverage Direct Preference Optimization (DPO) for fine-tuning on human preference data without the need for reward models, diffusion models have not been extensively explored in this area. Current preference learning methods applied to T2I diffusion models immediately adapt existing techniques from LLMs. However, this adaptation introduces a mismatch between the pretraining and the fine-tuning objectives specific to T2I diffusion models. This inconsistency can potentially lead to suboptimal performance. In this work, we propose Direct Score Preference Optimization (DSPO), a novel algorithm that aligns the pretraining and fine-tuning objectives of diffusion models by leveraging score matching, the same objective used during pretraining. It introduces a new perspective on preference learning for diffusion models. Specifically, DSPO distills the score function of human-preferred image distributions into pretrained diffusion models, fine-tuning the model to generate outputs that align with human preferences. We theoretically show that DSPO shares the same optimization direction as reinforcement learning algorithms in diffusion models under certain conditions. Our experimental results demonstrate that DSPO outperforms preference learning baselines for T2I diffusion models in human preference evaluation tasks and enhances both visual appeal and prompt alignment of generated images.

357Exploring Local Memorization in Diffusion Models via Bright Ending Attention

[openreview] [pdf]

Abstract In this paper, we identify and leverage a novel `bright ending’ (BE) anomaly in diffusion models prone to memorizing training images to address a new task: locating localized memorization regions within these models. BE refers to a distinct cross-attention pattern observed in text-to-image generations using diffusion models. Specifically, memorized image patches exhibit significantly greater attention to the end token during the final inference step compared to non-memorized patches. This attention map effectively highlights regions where the generated image replicates training data. Furthermore, driven by our observation that local memorization significantly underperforms in existing tasks of measuring, detecting, and mitigating memorization in diffusion models compared to global memorization, we propose a simple yet effective method to integrate BE and the results of the new localization task into these existing frameworks. This integration effectively improves their performances by narrowing the performance gap caused by local memorization. Our results not only demonstrate the successful execution of the new localization task but also establish new state-of-the-art performance across all existing tasks, underscoring the significance of the BE phenomenon.

358Backdoor Attacks for LLMs with Weak-To-Strong Knowledge Distillation

[openreview] [pdf]

Abstract Despite being widely applied due to their exceptional capabilities, Large Language Models (LLMs) have been proven to be vulnerable to backdoor attacks. These attacks introduce targeted vulnerabilities into LLMs by poisoning training samples and full-parameter fine-tuning. However, this kind of backdoor attack is limited since they require significant computational resources, especially as the size of LLMs increases. Besides, parameter-efficient fine-tuning (PEFT) offers an alternative but the restricted parameter updating may impede the alignment of triggers with target labels. In this study, we first verify that backdoor attacks with PEFT may encounter challenges in achieving feasible performance. To address these issues and improve the effectiveness of backdoor attacks with PEFT, we propose a novel backdoor attack algorithm from weak to strong based on feature alignment-enhanced knowledge distillation (W2SAttack). Specifically, we poison small-scale language models through full-parameter fine-tuning to serve as the teacher model. The teacher model then covertly transfers the backdoor to the large-scale student model through feature alignment-enhanced knowledge distillation, which employs PEFT. Theoretical analysis reveals that W2SAttack has the potential to augment the effectiveness of backdoor attacks. We demonstrate the superior performance of W2SAttack on classification tasks across four language models, four backdoor attack algorithms, and two different architectures of teacher models. Experimental results indicate success rates close to 100% for backdoor attacks targeting PEFT.

359Stochastic Diffusion: A Diffusion Based Model for Stochastic Time Series Forecasting

[openreview] [pdf]

Abstract Recent successes in diffusion probabilistic models have demonstrated their strength in modelling and generating different types of data, paving the way for their application in generative time series forecasting. However, most existing diffusion based approaches rely on sequential models and unimodal latent variables to capture global dependencies and model entire observable data, resulting in difficulties when it comes to highly stochastic time series data. In this paper, we propose a novelStochasticDiffusion (StochDiff) model that integrates the diffusion process into time series modelling stage and utilizes the representational power of the stochastic latent spaces to capture the variability of the stochastic time series data. Specifically, the model applies diffusion module at each time step within the sequential framework and learns a step-wise, data-driven prior for generative diffusion process. These features enable the model to effectively capture complex temporal dynamics and the multi-modal nature of the highly stochastic time series data. Through extensive experiments on real-world datasets, we demonstrate the effectiveness of our proposed model for probabilistic time series forecasting, particularly in scenarios with high stochasticity. Additionally, with a real-world surgical use case, we highlight the model’s potential in medical application.

360A Modified Proximal-Perturbed Lagrangian for Non-Convex Non-Smooth Representatives of Fairness Constraints

[openreview] [pdf]

Abstract We study classification problems under fairness constraints and introduce an algorithmic framework designed to prevent discrimination against different groups. These problems are often reformulated as continuous constrained optimization problems and are typically solved using continuous relaxations (surrogates) of the fairness constraints. However, many current algorithms do not provide theoretical guarantees, which possibly is due to the resulting fairness constraints being both non-convex and non-smooth. We propose a novel primal-dual algorithm, based on a newly developed Lagrangian, that converges to a stationary solution of the reformulated problem. Our algorithm is not only efficient and robust, but it also enjoys strong performance guarantees on the fairness of its solutions. Furthermore, experimental results demonstrate that our algorithm is highly effective in terms of computational cost and fairness guarantees, outperforming related algorithms that use regularization (penalization) techniques and/or standard Lagrangian relaxation.

361Adaptive Source Localization on Complex Networks via Conditional Diffusion Model

[openreview] [pdf]

Abstract Network propagation issues like the spread of misinformation, cyber threats, or infrastructure breakdowns are prevalent and have significant societal impacts. Identifying the source of such propagation by analyzing snapshots of affected networks is crucial for managing crises like disease outbreaks and enhancing network security. Traditional methods rely on metrics derived from network topology and are limited to specific propagation models, while deep learning models face the challenge of data scarcity. We propose \textbf{ASLDiff}~(\textbf{A}daptive \textbf{S}ource \textbf{L}ocalization \textbf{Diff}sion Model), a novel adaptive source localization diffusion model to achieve accurate and robust source localization across different network topologies and propagation modes by fusing the principles of information propagation and restructuring the label propagation process within the conditioning module. Our approach not only adapts to real-world patterns easily without abundant fine-tuning data but can also generalize to different network topologies easily. Evaluations of various datasets demonstrate ASLDiff’s superior effectiveness, accuracy, and adaptability in real-world applications, showcasing its robust performance across different localization scenarios. The code can be found athttps://anonymous.4open.science/r/ASLDiff-4FE0.

362UTSD: Unified Time Series Diffusion Model

[openreview] [pdf]

Abstract Transformer-based architectures have achieved unprecedented success in time series analysis. However, facing the challenge of across-domain modeling, existing studies utilize statistical prior as prompt engineering fails under the huge distribution shift among various domains. In this paper, a Unified Time Series Diffusion (UTSD) model is established for the first time to model the multi-domain probability distribution, utilizing the powerful probability distribution modeling ability of Diffusion. Unlike the autoregressive models that capture the conditional probabilities of the prediction horizon to the historical sequence, we use a diffusion denoising process to model the mixture distribution of the cross-domain data and generate the prediction sequence for the target domain directly utilizing conditional sampling. The proposed UTSD contains three pivotal designs: (1) The condition network captures the multi-scale fluctuation patterns from the observation sequence, which are utilized as context representations to guide the denoising network to generate the prediction sequence; (2) Adaptor-based fine-tuning strategy, the multi-domain universal representation learned in the pretraining stage is utilized for downstream tasks in target domains; (3) The diffusion and denoising process on the actual sequence space, combined with the improved classifier free guidance as the conditional generation strategy, greatly improves the stability and accuracy of the downstream task. We conduct extensive experiments on mainstream benchmarks, and the pre-trained UTSD outperforms existing foundation models on all data domains, exhibiting superior zero-shot generalization ability. After training from scratch, UTSD achieves comparable performance against domain-specific proprietary models. In particular, UTSD shows stable and reliable time series generation, and the empirical results validate the potential of UTSD as a time series foundational model. The source codes of UTSD are publicly available onhttps://anonymous.4open.science/r/UTSD-1BFF.

363ContraDiff: Planning Towards High Return States via Contrastive Learning

[openreview] [pdf]

Abstract The performance of offline reinforcement learning (RL) is sensitive to the proportion of high-return trajectories in the offline dataset. However, in many simulation environments and real-world scenarios, there are large ratios of low-return trajectories rather than high-return trajectories, which makes learning an efficient policy challenging. In this paper, we propose a method called Contrastive Diffuser (ContraDiff) to make full use of low-return trajectories and improve the performance of offline RL algorithms. Specifically, ContraDiff groups the states of trajectories in the offline dataset into high-return states and low-return states and treats them as positive and negative samples correspondingly. Then, it designs a contrastive mechanism to pull the planned trajectory of an agent toward high-return states and push them away from low-return states. Through the contrast mechanism, trajectories with low returns can serve as negative examples for policy learning, guiding the agent to avoid areas associated with low returns and achieve better performance. Through the contrast mechanism, trajectories with low returns provide a ``counteracting force’’ guides the agent to avoid areas associated with low returns and achieve better performance. Experiments on 27 sub-optimal datasets demonstrate the effectiveness of our proposed method. Our code is publicly available at \url{https://anonymous.4open.science/r/ContraDiff}.

364DiffuSolve: Diffusion-Based Solver for Non-Convex Trajectory Optimization

[openreview] [pdf]

Abstract Optimal trajectory design is computationally expensive for nonlinear and high-dimensional dynamical systems. The challenge arises from the non-convex nature of the optimization problem with multiple local optima, which usually requires a global search. Traditional numerical solvers struggle to find diverse solutions efficiently without appropriate initial guesses. In this paper, we introduce DiffuSolve, a general diffusion model-based solver for non-convex trajectory optimization. An expressive diffusion model is trained on pre-collected locally optimal solutions and efficiently samples initial guesses, which then warm-starts numerical solvers to fine-tune the feasibility and optimality. We also present DiffuSolve+, a novel constrained diffusion model with an additional loss in training that further reduces the problem constraint violations of diffusion samples. Experimental evaluations on three tasks verify the improved robustness, diversity, and a 2×\times to 11×\times increase in computational efficiency with our proposed method, which generalizes well to trajectory optimization problems of varying challenges.

365f-Divergence Policy Optimization in Fully Decentralized Cooperative MARL

[openreview] [pdf]

Abstract Independent learning is a straightforward solution for fully decentralized learning in cooperative multi-agent reinforcement learning (MARL). The study of independent learning has a history of decades, and the representatives, such as independent Q-learning and independent PPO, can obtain good performance in some benchmarks. However, most independent learning algorithms lack convergence guarantees or theoretical support. In this paper, we propose a general formulation of independent policy optimization, ff-divergence policy optimization. We show the generality of such a formulation and analyze its limitation. Based on this formulation, we further propose a novel independent learning algorithm, TVPO, that theoretically guarantees convergence. Empirically, we show that TVPO outperforms state-of-the-art fully decentralized learning methods in three popular cooperative MARL benchmarks, which verifies the efficacy of TVPO.

366Generalization in VAE and Diffusion Models: A Unified Information-Theoretic Analysis

[openreview] [pdf]

Abstract Despite the empirical success of Diffusion Models (DMs) and Variational Autoencoders (VAEs), their generalization performance remains theoretically underexplored, particularly lacking a full consideration of the shared encoder-generator structure. Leveraging recent information-theoretic tools, we propose a unified theoretical framework that guarantees the generalization of both the encoder and generator by treating them as randomized mappings. This framework further enables (1) a refined analysis for VAEs, accounting for the generator’s generalization, which was previously overlooked; (2) illustrating an explicit trade-off in generalization terms for DMs that depends on the diffusion time TT; and (3) providing estimable bounds for DMs based solely on the training data, allowing the selection of the optimal TT and the integration of such bounds into the optimization process to improve model performance. Empirical results on both synthetic and real datasets illustrate the validity of the proposed theory.

367Right Time to Learn: Promoting Generalization via Bio-inspired Spacing Effect in Knowledge Distillation

[openreview] [pdf]

Abstract Knowledge distillation (KD) is a powerful strategy for training deep neural networks (DNNs). While it was originally proposed to train a more compact “student” model from a large “teacher” model, many recent efforts have focused on adapting it as an effective way to promote generalization of the model itself, such as online KD and self KD. Here, we propose an easy-to-use and compatible strategy named Spaced KD to improve the effectiveness of both online KD and self KD, in which the student model distills knowledge from a teacher model trained with a space interval ahead. This strategy is inspired by a prominent theory named spacing effect in the field of biological learning and memory, positing that appropriate intervals between learning trials can significantly enhance learning performance. We provide an in-depth theoretical and empirical analysis showing that the benefits of the proposed spacing effect in KD stem from seeking a flat minima during stochastic gradient descent (SGD). We perform extensive experiments to demonstrate the effectiveness of our Spaced KD in improving the learning performance of DNNs (e.g., the additional performance gain is up to 2.31% and 3.34% on Tiny-ImageNet over online KD and self KD, respectively).

368Combating inherent noise for direct preference optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has recently gained traction as a promising approach to align large models with human feedback. It is notable for its effectiveness and ease of application across various models, including Large Language Models (LLMs) and Diffusion Models (DMs). However, the quality of preference data used in DPO training has been largely overlooked. Current datasets, whether annotated by deep learning metrics or crowd-sourced human judgments, often contain noisy labels. This noise can adversely affect the performance of DPO. To address this issue, we propose a novel approach that incorporates a noise-aware metric into the DPO objective. This metric, which includes intra-annotator confidence and inter-annotator stability, helps identify and mitigate the impact of noisy data. We introduce an Adaptive-DPO loss function which improves the DPO loss in two ways: one aims to reduce the influence of noisy samples, while the other is to amplify the impact of clean samples. Our experiments demonstrate that this method effectively handles both synthetic and natural noisy data, leading to improved performance in visual and textual generation tasks. This underscores the practical value of our approach in enhancing model robustness amidst noisy preference data.

369Attaining Human’s Desirable Outcomes in Indirect Human-AI Interaction via Multi-Agent Influence Diagrams

[openreview] [pdf]

Abstract In human-AI interaction, one of the cutting-edge research questions is how AI agents can assist a human to attain their desirable outcomes. Most related work investigated the paradigm where a human is required to physically interact with AI agents, which we call direct human-AI interaction. However, this paradigm would be inapplicable when the scenarios are hazardous to humans, such as mine rescue and recovery. To alleviate this shortcoming, we consider indirect human-AI interaction in this paper. More detailed, a human would rely on additional AI agents which we call AI proxies to interact with other AI agents, to attain the human’s desirable outcomes. We model this interactive process as multi-agent influence diagrams (MAIDs), an augmentation of Bayesian networks to describe games, with Nash equilibrium (NE) as a solution. Nonetheless, in a MAID there may exist multiple NEs, and only one NE is associated with a human’s desirable outcomes. To reach this optimal NE, we propose pre-strategy intervention which is an action to provide AI proxies with more information to make decision towards a human’s desirable outcomes. Furthermore, we demonstrate that a team reward Markov game can be rendered as a MAID. This connection not only interprets the successes and failures of prevailing multi-agent reinforcement learning (MARL) paradigms, but also underpins the implementation of pre-strategy intervention in MARL. In practice, we incorporate pre-strategy intervention into MARL for the team reward Markov game to model the scenarios where all agents are required to achieve a common goal, with partial agents working as AI proxies to attain a human’s desirable outcomes. During training, these AI proxies receive an additional reward encoding the human’s desirable outcomes, and its feasibility is justified in theory. We evaluate the resulting algorithm ProxyAgent in benchmark MARL environments for teamwork, with additional goals as a human’s desirable outcomes.

370ET-SEED: EFFICIENT TRAJECTORY-LEVEL SE(3) EQUIVARIANT DIFFUSION POLICY

[openreview] [pdf]

Abstract Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks. However, extensive demonstrations are required for policy robustness and generalization. To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient trajectory-level SE(3) equivariant diffusion model for generating action sequences in complex robot manipulation tasks. Further, previous equivariant diffusion models require the per-step equivariance in the Markov process, making it difficult to learn policy under such strong constraints. We theoretically extend equivariant Markov kernels and simplify the condition of equivariant diffusion process, thereby significantly improving training efficiency for trajectory-level SE(3) equivariant diffusion policy in an end-to-end manner. We evaluate ET-SEED on representative robotic manipulation tasks, involving rigid body, articulated and deformable object. Experiments demonstrate superior data efficiency and manipulation proficiency of our proposed method, as well as its ability to generalize to unseen configurations with only a few demonstrations. Website:https://et-seed.github.io/

371Expand and Compress: Exploring Tuning Principles for Continual Spatio-Temporal Graph Forecasting

[openreview] [pdf]

Abstract The widespread deployment of sensing devices leads to a surge in data for spatio-temporal forecasting applications such as traffic flow, air quality, and wind energy. Although spatio-temporal graph neural networks (STGNNs) have achieved success in modeling various static spatio-temporal forecasting scenarios, real-world spatio-temporal data are typically received in a streaming manner, and the network continuously expands with the installation of new sensors. Thus, spatio-temporal forecasting in streaming scenarios faces dual challenges: the inefficiency of retraining models over newly-arrived data and the detrimental effects of catastrophic forgetting over long-term history. To address these challenges, we propose a novel prompt tuning-based continuous forecasting method,EAC, following two fundamental tuning principles guided by empirical and theoretical analysis:expandandcompress, which effectively resolve the aforementioned problems with lightweight tuning parameters. Specifically, we integrate the base STGNN with a continuous prompt pool, utilizing stored prompts (\ie, few learnable parameters) in memory, and jointly optimize them with the base STGNN. This method ensures that the model sequentially learns from the spatio-temporal data stream to accomplish tasks for corresponding periods. Extensive experimental results on multiple real-world datasets demonstrate the multi-faceted superiority ofEACover the state-of-the-art baselines, including effectiveness, efficiency, universality, etc.

372Regret-Optimal List Replicable Bandit Learning: Matching Upper and Lower Bounds

[openreview] [pdf]

Abstract This paper investigateslist replicability[Dixon et al., 2023] in the context of multi-armed (also linear) bandits (MAB). We define an algorithm AA for MAB to be (,δ)(\ell,\delta)-list replicable if with probability at least 1δ1-\delta, AA has at most \ell traces in independent executions even with different random bits, where a trace means sequence of arms played during an execution. For kk-armed bandits, although the total number of traces can be Ω(kT)\Omega(k^T) for a time horizon TT, we present several surprising upper bounds that either independent of or logarithmic of TT: (1) a (2k,δ)(2^{k},\delta)-list replicable algorithm with near-optimal regret, O~kT\widetilde{O}{\sqrt{kT}}, (2) a (O(k/δ),δ)(O(k/\delta),\delta)-list replicable algorithm with regret O~(kδkT)\widetilde{O}\left(\frac{k}{\delta}\sqrt{kT}\right), (3) a ((k+1)B1,δ)((k+1)^{B-1}, \delta)-list replicable algorithm with regret O~(k32T12+2Ω(B))\widetilde{O}(k^{\frac{3}{2}}T^{{\frac{1}{2}}+2^{-\Omega(B)}}) for any integer B>1B>1. We show that result (3) is nearly tight by establishing there are no (k1,δ)(k-1,\delta)-list replicable algorithm with o(T)o(T)-regret, almost exactly matching kk-list replicable upper bound for B=2B=2. We further show that for linear bandits with dd-dimensional features, there is a O~(d2T1/2+2Ω(B))\widetilde{O}(d^2T^{1/2+2^{-\Omega(B)}})-regret algorithm with ((2d+1)B1,δ)((2d+1)^{B-1},\delta)-list replicability, for B>1B>1, even when the number of possible arms can be infinite.

373Diffusion Transformer Captures Spatial-Temporal Dependencies: A Theory for Gaussian Process Data

[openreview] [pdf]

Abstract Diffusion Transformer, the backbone of Sora for video generation, successfully scales the capacity of diffusion models, pioneering new avenues for high-fidelity sequential data generation. Unlike static data such as images, sequential data consists of consecutive data frames indexed by time, exhibiting rich spatial and temporal dependencies. These dependencies represent the underlying dynamic model and are critical to validate the generated data. In this paper, we make the first theoretical step towards bridging diffusion transformers for capturing spatial-temporal dependencies. Specifically, we establish score approximation and distribution estimation guarantees of diffusion transformers for learning Gaussian process data with covariance functions of various decay patterns. We highlight how the spatial-temporal dependencies are captured and affect learning efficiency. Our study proposes a novel transformer approximation theory, where the transformer acts to unroll an algorithm. We support our theoretical results by numerical experiments, providing strong evidence that spatial-temporal dependencies are captured within attention layers, aligning with our approximation theory.

374Inv-PnCO: Invariant Predict-and-Combinatorial Optimization under Distribution Shifts

[openreview] [pdf]

Abstract Machine learning has been well introduced to solve combinatorial optimization (CO) problems over the decade, while most works only consider the deterministic setting. Yet in real-world applications, decisions have often to be made in uncertain environments, which is typically reflected by the stochasticity of the coefficients of the problem at hand, considered as a special case of the more general and emerging “predict-and-optimize” (PnO) paradigm in the sense that the prediction and optimization are jointly learned and performed. In this paper, we consider the problem of learning to solve CO under the above uncertain setting and formulate it as “predict-and-combinatorial optimization” (PnCO), particularly in a challenging yet practical out-of-distribution (OOD) setting, where there is a distribution shift between training and testing CO instances. We propose the Invariant Predict-and-Combinatorial Optimization (Inv-PnCO) framework to alleviate this challenge. Inv-PnCO derives a learning objective that reduces the distance of distribution of solutions with the true distribution and uses a regularization term to learn invariant decision-oriented factors that are stable under various environments, thereby enhancing the generalizability of predictions and subsequent optimizations. We also provide a theoretical analysis of how the proposed loss reduces OOD error. The empirical evaluation across three distinct tasks on knapsack, visual shortest path planning, and traveling salesman problem covering array, image, and graph inputs underscores the efficacy of Inv-PnCO to enhance the generalizability, both for predict-then-optimize and predict-and-optimize approaches.

375A Causal Lens for Learning Long-term Fair Policies

[openreview] [pdf]

Abstract Fairness-aware learning studies the development of algorithms that avoid discriminatory decision outcomes despite biased training data. While most studies have concentrated on immediate bias in static contexts, this paper highlights the importance of investigating long-term fairness in dynamic decision-making systems while simultaneously considering instantaneous fairness requirements. In the context of reinforcement learning, we propose a general framework where long-term fairness is measured by the difference in the average expected qualification gain that individuals from different groups could obtain. Then, through a causal lens, we decompose this metric into three components that represent the direct impact, the delayed impact, as well as the spurious effect the policy has on the qualification gain. We analyze the intrinsic connection between these components and an emerging fairness notion called benefit fairness that aims to control the equity of outcomes in decision-making. Finally, we develop a simple yet effective approach for balancing various fairness notions.

376VideoGuide: Improving Video Diffusion Models without Training Through a Teacher’s Guide

[openreview] [pdf]

Abstract Text-to-image (T2I) diffusion models have revolutionized visual content creation, but extending these capabilities to text-to-video (T2V) generation remains a challenge, particularly in preserving temporal consistency. Existing methods that aim to improve consistency often cause trade-offs such as reduced imaging quality and impractical computational time. To address these issues we introduce VideoGuide, a novel framework that enhances the temporal consistency of pretrained T2V models without the need for additional training or fine-tuning. Instead, VideoGuide leverages any pretrained video diffusion model (VDM) or itself as a guide during the early stages of inference, improving temporal quality by interpolating the guiding model’s denoised samples into the sampling model’s denoising process. The proposed method brings about significant improvement in temporal consistency and image fidelity, providing a cost-effective and practical solution that synergizes the strengths of various video diffusion models. Furthermore, we demonstrate prior distillation, revealing that base models can achieve enhanced text coherence by utilizing the superior data prior of the guiding model through the proposed method. Project Page:https://videoguide2025.github.io/

377A Defense of One-Step Learning: Examining Single-Batch Distillations

[openreview] [pdf]

Abstract Dataset distillation produces a compressed synthetic dataset that approximates a large dataset or other learning task. A model can be trained on a distillation in a single gradient descent step. Conventional wisdom suggests that single-step learning is not generalizable and should yield poor performance; yet, distillation defies these expectations with good approximations of full direct-task training for a large distribution of models. In order to understand how distilled datasets can perform one-shot learning, we examine the distilled data instances and the cost surfaces produced by the distilled datasets. We demonstrate that the distilled dataset not only mimics features of the true dataset but also produces cost surfaces such that one-step training leads models from the initialization space into local minima of the true task’s cost surface. This shows how one-step learning’s counter-intuitive success is not only reasonable but also the expected outcome of dataset distillation.

378Learning by Causality to Improve Channel Dependency Modeling in Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Beyond the conventional long-term temporal dependency modeling, multivariate time series (MTS) forecasting has rapidly shifted toward channel dependency (CD) modeling. This shift significantly improves modeling quality by fully leveraging both multivariate relationships and temporal dependencies. Recent methods primarily model channel dependency through correlation learning (e.g., crossattention) or non-trainable statistical techniques (e.g., cross-correlation). However, these approaches struggle to fully capture the intrinsic relationships within MTS, particularly those stemming from directed cause-effect (i.e., causality) and nonstationary variates originating from diverse sources. In addition, causality may arise from the signals with different temporal behaviors, such as varying periodicity or discrete event sequences, which is not sufficiently discussed before. In this paper, we propose CALAS (Causality-enhanced Attention with Learnable and Adaptive Spacing), the first end-to-end learning method for MTS forecasting that uncover causality among variates without relying on statistical measures or prior knowledge. To model underlying causality, which consists of causal strength and propagation delay, we newly design a hypernetworks-based 1D convolutions mechanism. Inspired by dilated convolution with learnable spacings (DCLS) and spiking neural networks (SNNs), we extend discrete time delay into a continuous Gaussian kernel. Combining the hypernetworks-generated Gaussian kernel and convolutional weights (i.e., attention or causal strength), we achieve the end-to-end dynamic causality modeling mechanism. This mechanism enhances the model’s ability to capture time-varying causality across multi-source variates, ultimately improving the prediction accuracy, quality, and interpretability. For evaluation, we conduct extensive experiments with six real-world datasets and qualitative analysis to demonstrate CALAS’s superiority in capturing varying causality in a data-agnostic manner. The experiment results indicate that CALAS has significantly improved MTS forecasting accuracy compared to state-of-the-art methods by dynamically modeling causality among variates.

379EraseDiff: Erasing Data Influence in Diffusion Models

[openreview] [pdf]

Abstract We introduce EraseDiff, an unlearning algorithm designed for diffusion models to address concerns related to data memorization. Our approach formulates the unlearning task as a constrained optimization problem, aiming to preserve the utility of the diffusion model on retained data while removing the information associated with the data to be forgotten. This is achieved by altering the generative process to deviate away from the ground-truth denoising procedure. To manage the computational complexity inherent in the diffusion process, we develop a first-order method for solving the optimization problem, which has shown empirical benefits. Extensive experiments and thorough comparisons with state-of-the-art algorithms demonstrate that EraseDiff effectively preserves the model’s utility, efficacy, and efficiency.

380Process-Driven Autoformalization in Lean 4

[openreview] [pdf]

Abstract Autoformalization, the conversion of natural language mathematics into formal languages, offers significant potential for advancing mathematical reasoning. However, existing efforts are limited to formal languages with substantial online corpora and struggle to keep pace with rapidly evolving languages like Lean 4. To bridge this gap, we propose a large-scale dataset \textbf{Form}alization for \textbf{L}ean~\textbf{4} (\textbf{\dataset}) designed to comprehensively evaluate the autoformalization capabilities of large language models (LLMs), encompassing both statements and proofs in natural and formal languages. Additionally, we introduce the \textbf{P}rocess-\textbf{D}riven \textbf{A}utoformalization (\textbf{\method}) framework that leverages the precise feedback from Lean 4 compilers to enhance autoformalization. Extensive experiments demonstrate that \method improves autoformalization, enabling higher compiler accuracy and human-evaluation scores using less filtered training data. Moreover, when fine-tuned with data containing detailed process information, \method exhibits enhanced data utilization, resulting in more substantial improvements in autoformalization for Lean 4.

381Bayesian Active Learning By Distribution Disagreement

[openreview] [pdf]

Abstract Active Learning (AL) for regression has been systematically under-researched due to the increased difficulty of measuring uncertainty in regression models. Since normalizing flows offer a full predictive distribution instead of a point forecast, they facilitate direct usage of known heuristics for AL like Entropy or Least-Confident sampling. However, we show that most of these heuristics do not work well for normalizing flows in pool-based AL and we need more sophisticated algorithms to distinguish between aleatoric and epistemic uncertainty. In this work we propose BALSA, an adaptation of the BALD algorithm, tailored for regression with normalizing flows. With this work we extend current research on uncertainty quantification with normalizing flows to real world data and pool-based AL with multiple acquisition functions and query sizes. We report SOTA results for BALSA across 4 different datasets and 2 different architectures.

382LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

[openreview] [pdf]

Abstract Customization generation techniques have significantly advanced the synthesis of specific concepts across varied contexts. Multi-concept customization emerges as the challenging task within this domain. Existing approaches often rely on training a fusion matrix of multiple Low-Rank Adaptations (LoRAs) to merge various concepts into a single image. However, we identify this straightforward method faces two major challenges: 1) concept confusion, where the model struggles to preserve distinct individual characteristics, and 2) concept vanishing, where the model fails to generate the intended subjects. To address these issues, we introduce LoRA-Composer, a training-free framework designed for seamlessly integrating multiple LoRAs, thereby enhancing the harmony among different concepts within generated images. LoRA-Composer addresses concept vanishing through concept injection constraints, enhancing concept visibility via an expanded cross-attention mechanism. To combat concept confusion, concept isolation constraints are introduced, refining the self-attention computation. Furthermore, latent re-initialization is proposed to effectively stimulate concept-specific latent within designated regions. Our extensive testing showcases a notable enhancement in LoRA-Composer’s performance compared to standard baselines, especially when eliminating the image-based conditions like canny edge or pose estimations.

383ϕ-Update: A Class of Policy Update Methods with Policy Convergence Guarantee

[openreview] [pdf]

Abstract Inspired by the similar update pattern of softmax natural policy gradient and Hadamard policy gradient, we propose to study a general policy update rule called ϕ\phi-update, where ϕ\phi refers to a scaling function on advantage functions. Under very mild conditions on ϕ\phi, the global asymptotic state value convergence of ϕ\phi-update is firstly established. Then we show that the policy produced by ϕ\phi-update indeed converges, even when there are multiple optimal policies. This is in stark contrast to existing results where explicit regularizations are required to guarantee the convergence of the policy. Since softmax natural policy gradient is an instance of ϕ\phi-update, it provides an affirmative answer to the question whether the policy produced by softmax natural policy gradient converges. The exact asymptotic convergence rate of state values is further established based on the policy convergence. Lastly, we establish the global linear convergence of ϕ\phi-update.

384Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better

[openreview] [pdf]

Abstract Diffusion Models (DM) and Consistency Models (CM) are two types of popular generative models with good generation quality on various tasks. When training DM and CM, intermediate weight checkpoints are not fully utilized and only the last converged checkpoint is used. In this work, we find proper checkpoint merging can significantly improve the training convergence and final performance. Specifically, we propose LCSC, a simple but effective and efficient method to enhance the performance of DM and CM, by combining checkpoints along the training trajectory with coefficients deduced from evolutionary search. We demonstrate the value of LCSC through two use cases: (a) Reducing training cost. With LCSC, we only need to train DM/CM with fewer number of iterations and/or lower batch sizes to obtain comparable sample quality with the fully trained model. For example, LCSC achieves considerable training speedups for CM (23×\times on CIFAR-10 and 15×\times on ImageNet-64). (b) Enhancing pre-trained models. When full training is already done, LCSC can further improve the generation quality or efficiency of the final converged models. For example, LCSC achieves better FID using 1 number of function evaluation (NFE) than the base model with 2 NFE on consistency distillation, and decreases the NFE of DM from 15 to 9 while maintaining the generation quality. Applying LCSC to large text-to-image models, we also observe clearly enhanced generation quality.

385Long-tailed Adversarial Training with Self-Distillation

[openreview] [pdf]

Abstract Adversarial training significantly enhances adversarial robustness, yet superior performance is predominantly achieved on balanced datasets. Addressing adversarial robustness in the context of unbalanced or long-tailed distributions is considerably more challenging, mainly due to the scarcity of tail data instances. Previous research on adversarial robustness within long-tailed distributions has primarily focused on combining traditional long-tailed natural training with existing adversarial robustness methods. In this study, we provide an in-depth analysis for the challenge that adversarial training struggles to achieve high performance on tail classes in long-tailed distributions. Furthermore, we propose a simple yet effective solution to advance adversarial robustness on long-tailed distributions through a novel self-distillation technique. Specifically, this approach leverages a balanced self-teacher model, which is trained using a balanced dataset sampled from the original long-tailed dataset. Our extensive experiments demonstrate state-of-the-art performance in both clean and robust accuracy for long-tailed adversarial robustness, with significant improvements in tail class performance on various datasets. We improve the accuracy against PGD attacks for tail classes by 20.3, 7.1, and 3.8 percentage points on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, while achieving the highest robust accuracy.

386Guided-BFNs: Towards Visualizing and Understanding Bayesian Flow Networks in the Context of Trajectory Planning

[openreview] [pdf]

Abstract Bayesian Flow Networks (BFNs) represent an emerging class of generative models that exhibit promising capabilities in modeling continuous, discretized, and discrete data. In this paper, we develop Guided-BFNs to integrate BFNs with conditional guidance and gradient guidance to facilitate the effective application of such models in trajectory planning tasks. Based on our developments, we can better comprehend BFNs by inspecting the generation dynamics of the planning trajectories. Through extensive parameter tuning and rigorous ablation experiments, we systematically delineate the functional roles of various parameters and elucidate the pivotal components within the structure of BFNs. Furthermore, we conduct a comparative analysis of the planning results between diffusion models and BFNs, to discern their similarities and differences. Additionally, we undertake efforts to augment the performance of BFNs, including developing a faster and training-free sampling algorithm for sample generation. Our objectives encompass not only a comprehensive exploration of BFNs’ structural insights but also the enhancement of their practical utility.

387DET: Learn to Solve the Tunnel Traveling Salesmen Problem using Double-Encoder Transformer

[openreview] [pdf]

Abstract We delve into a challenging variant of the Traveling Salesman Problem (TSP), namely tunnel TSP, which incorporates a new important constraint requiring the traversal of a prescribed set of tunnels. While traditional deep reinforcement learning (DRL) based neural TSP algorithms excel in optimizing routes without tunnel restrictions, they often struggle to achieve optimal performance in tunnel TSP due to the neglect of the crucial role of tunnel attributes during solution generation. To address this challenge, we propose a simple but effective and flexible technique, called Double-Encoder Transformer (DET), which can be seamlessly integrated into various existing autoregressive neural TSP solvers. DET processes node and tunnel location information separately and encodes them in two distinct feature spaces. Following an efficient fusion strategy, DET then integrates the encoded information from nodes and tunnels, harnessing their intricate interactions. Experimental validation demonstrates that integrating DET into existing autoregressive neural solvers significantly improves performance, enabling us to reduce the average optimality gap for tunnel TSP from 12.58% (of the previous Single-Encoder model) to 7.35%.

388Generating Model Parameters for Controlling: Parameter Diffusion for Controllable Multi-Task Recommendation

[openreview] [pdf]

Abstract Commercial recommender systems face the challenge that task requirements from platforms or users often change dynamically (e.g., varying preferences for accuracy or diversity). Ideally, the model should be re-trained after resetting a new objective function, adapting to these changes in task requirements. However, in practice, the high computational costs associated with retraining make this process impractical for models already deployed to online environments. This raises a new challenging problem: how to efficiently adapt the learning model to different task requirements by controlling model parameters after deployment, without the need for retraining. To address this issue, we propose a novel controllable learning approach via Parameter Diffusion for controllable multi-task Recommendation (PaDiRec), which allows the customization and adaptation of recommendation model parameters to new task requirements without retraining. Specifically, we first obtain the optimized model parameters through adapter tunning based on the feasible task requirements. Then, we utilize the diffusion model as a parameter generator, employing classifier-free guidance in conditional training to learn the distribution of optimized model parameters under various task requirements. Finally, the diffusion model is applied to effectively generate model parameters in a test-time adaptation manner given task requirements. As a model-agnostic approach, PaDiRec can leverage existing recommendation models as backbones to enhance their controllability. Extensive experiments on public datasets and a dataset from a commercial app, indicate that PaDiRec can effectively enhance controllability through efficient model parameter generation. The code is released athttps://anonymous.4open.science/r/PaDiRec-DD13e.

389Characterizing Context Influence and Hallucination in Summarization

[openreview] [pdf]

Abstract Although Large Language Models (LLMs) have achieved remarkable performance in numerous downstream tasks, their ubiquity has raised two significant concerns. One is that LLMs can hallucinate by generating content that contradicts relevant contextual information; the other is that LLMs can inadvertently leak private information due to input regurgitation. Many prior works have extensively studied each concern independently, but none have investigated them simultaneously. Furthermore, auditing the influence of provided context during open-ended generation with a privacy emphasis is understudied. To this end, we comprehensively characterize the influence and hallucination of contextual information during summarization. We introduce a definition for context influence and Context-Influence Decoding (CID), and then we show that amplifying the context (by factoring out prior knowledge) and the context being out of distribution with respect to prior knowledge increases the context’s influence on an LLM. Moreover, we show that context influence gives a lower bound of the private information leakage of CID. We corroborate our analytical findings with experimental evaluations that show improving the F1 ROGUE-L score on CNN-DM for LLaMA 3 by 10\textbf{10}% over regular decoding also leads to 1.5x\textbf{1.5x} more influence by the context. Moreover, we empirically evaluate how context influence and hallucination are affected by (1) model capacity, (2) context size, (3) the length of the current response, and (4) different token nn-grams of the context.

390Which Algorithms Have Tight Generalization Bounds?

[openreview] [pdf]

Abstract We study which machine learning algorithms have tight generalization bounds. First, we present conditions that preclude the existence of tight generalization bounds. Specifically, we show that algorithms that have certain inductive biases that cause them to be unstable do not admit tight generalization bounds. Next, we show that algorithms that are sufficiently stable do have tight generalization bounds. We conclude with a simple characterization that relates the existence of tight generalization bounds to the conditional variance of the algorithm’s loss.

391Frequency-Decoupled Cross-Modal Knowledge Distillation

[openreview] [pdf]

Abstract Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observe that low-frequency features tend to capture modality-agnostic, generalizable information, while high-frequency features are more modality-specific. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. Additionally, we propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show that our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches.

392Learning Transferable Sub-goals by Hypothesizing Generalizing Features

[openreview] [pdf]

Abstract Although transfer is a key promise of hierarchical reinforcement learning, current methods discover nontransferable skills. Typically, skills are defined over all state features simultaneously, preventing generalization as some state features reliably support generalization while others do not. For an agent to effectively transfer a skill it must identify features that generalize and define the skill over this subset. However, this task is under-specified as the agent has no prior knowledge of what future tasks may be introduced. Since successful transfer requires a skill to reliably achieve a sub-goal from different states, we focus our attention on ensuring sub-goals are represented in a transferable way. For each sub-goal, we train an ensemble of classifiers while explicitly incentivizing them to use minimally overlapping features. Each ensemble member represents a unique hypothesis about the transferable features of a sub-goal that the agent can use to learn a skill in previously unseen portions of the environment. Environment reward then determines which hypothesis is most transferable for the given task, based on the intuition that transferable sub-goals lead to better reward maximization. We apply these reusable sub-goals to MiniGrid and Montezuma’s Revenge, allowing us to relearn previously defined skills in unseen parts of the state-space.

393A Super-Aligned Driving Generalist Is Your Cockpit

[openreview] [pdf]

Abstract The intelligent driving cockpit, an important part of intelligent driving, needs to match different users’ comfort, interaction, and safety needs. This paper aims to build a \textbf{s}uper-\textbf{a}ligned and \textbf{ge}neralist \textbf{dr}iving agent, \textbf{sage deer}. Sage Deer achieves two highlights: (1) Super alignment: It achieves different reactions according to different people’s preferences and biases. (2) Generalist: It can understand the user’s physiological indicators, facial emotions, hand movements, body movements, driving scenarios, and behavioral decisions. (3) Multimodal: He can understand RGB, NIR, and depth video to build more robust perception, understanding, and reasoning. To achieve the above requirements, we design retrieval-enhanced multimodal frameworks. We collected multiple data sets and built a large-scale benchmark. This benchmark measures the sage deer’s perceptual decision-making ability and the super alignment’s accuracy.

394SATCH: Specialized Assistant Teacher Distillation to Reduce Catastrophic Forgetting

[openreview] [pdf]

Abstract Continual learning enables models to learn new tasks sequentially without forgetting previously learned knowledge. Knowledge distillation reduces forgetting by using a single teacher model to transfer previous knowledge to the student model. However, existing methods face challenges, specifically loss of task-specific knowledge, limited diversity in the transferred knowledge, and delays in teacher availability. These issues stem from self-distillation, where the teacher is a mere snapshot of the student after learning a new task, inheriting the student’s biases and becoming available only after learning a task. We propose Specialized Assistant TeaCHer distillation (SATCH), a novel method that uses a smaller assistant teacher trained exclusively on the current task. By incorporating the assistant teacher early in the learning process, SATCH provides task-specific guidance, improves the diversity of transferred knowledge, and preserves critical task-specific insights. Our method integrates seamlessly with existing knowledge distillation techniques, and experiments on three standard continual learning benchmarks show that SATCH improves accuracy by up to 12% when combined with four state-of-the-art methods. Code is available in supplementary materials.

395Scaling Diffusion Language Models via Adaptation from Autoregressive Models

[openreview] [pdf]

Abstract Diffusion Language Models (DLMs) have emerged as a promising new paradigm for text generative modeling, potentially addressing limitations of autoregressive (AR) models. However, current DLMs have been studied at a smaller scale compared to their AR counterparts and lack fair comparison on language modeling benchmarks. Additionally, training diffusion models from scratch at scale remains challenging. Given the prevalence of open-source AR language models, we propose adapting these models to build text diffusion models. We demonstrate connections between AR and diffusion modeling objectives and introduce a simple continual pre-training approach for training diffusion models. Through systematic evaluation on language modeling, reasoning, and commonsense benchmarks, we show that we can convert AR models ranging from 127M to 7B parameters (GPT2 and LLaMA) into diffusion models DiffuGPT and DiffuLLaMA, using less than 200B tokens for training. Our experimental results reveal that these models outperform earlier DLMs and are competitive with their AR counterparts. We release a suite of DLMs (with 127M, 355M, and 7B parameters) capable of generating fluent text, performing in-context learning, filling in the middle without prompt re-ordering, and following instructions.

396Hindsight Preference Learning for Offline Preference-based Reinforcement Learning

[openreview] [pdf]

Abstract Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: Humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e. the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains.

397Latent Diffusion with LLMs for Reasoning

[openreview] [pdf]

Abstract Despite the widespread adoption of large language models with hundreds of billions of parameters, these models still struggle on complex reasoning benchmarks. In this paper, we argue that the autoregressive nature of current language models are not suited for reasoning due to fundamental limitations, and that reasoning requires slow accumulation of knowledge through time. We show that combining latent diffusion models with an encoder-decoder transformer architecture provides a scalable way to address some of the fundamental shortcomings posed by autoregressive models. Diffusion models can arrive at predictions through many forward passes in latent space, and their reasoning is not handicapped by the order of the tokens in the dataset. Through our experiments, we show that latent diffusion language models is a feasible approach towards scalable language models that have general complex reasoning abilities.

398Adversarial Generative Flow Network for Solving Vehicle Routing Problems

[openreview] [pdf]

Abstract Recent research into solving vehicle routing problems (VRPs) has gained significant traction, particularly through the application of deep (reinforcement) learning for end-to-end solution construction. However, many current construction-based neural solvers predominantly utilize Transformer architectures, which can face scalability challenges and struggle to produce diverse solutions. To address these limitations, we introduce a novel framework beyond Transformer-based approaches, i.e., Adversarial Generative Flow Networks (AGFN). This framework integrates the generative flow network (GFlowNet)—a probabilistic model inherently adept at generating diverse solutions (routes)—with a complementary model for discriminating (or evaluating) the solutions. These models are trained alternately in an adversarial manner to improve the overall solution quality, followed by a proposed hybrid decoding method to construct the solution. We apply the AGFN framework to solve the capacitated vehicle routing problem (CVRP) and travelling salesman problem (TSP), and our experimental results demonstrate that AGFN surpasses the popular construction-based neural solvers, showcasing strong generalization capabilities on synthetic and real-world benchmark instances.

399Diffusion Trajectory-guided Policy: A Novel Framework for Long-Horizon Robot Manipulation

[openreview] [pdf]

Abstract Recently, Vision-Language Models (VLMs) have made substantial progress in robot imitation learning, benefiting from increased amounts of demonstration data. However, the high cost of data collection remains a significant bottleneck, and the scarcity of demonstrations often result in poor generalization of the imitation policy, especially in long-horizon robotic manipulation tasks. To address these challenges, we propose the Diffusion Trajectory-guided Policy (DTP) framework, which generates task-relevant trajectories through a diffusion model to guide policy learning for long-horizon tasks. Furthermore, we demonstrate that our DTP method offers a useful interface for prompt engineering, providing a novel way to connect robot manipulation skills with interactions involving LLMs or humans. Our approach employs a two-stage training process: initially, we train a generative vision-language model to create diffusion task-relevant trajectories, then refine the imitation policy using these trajectories. We validate that the DTP method achieves substantial performance improvements in extensive experiments on the CALVIN simulation benchmark, starting from scratch without any external pretraining. Our approach outperforms state-of-the-art baselines by an average of 25% in success rate across various settings.

400Length Desensitization in Direct Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) is widely utilized in the Reinforcement Learning from Human Feedback (RLHF) phase to align Large Language Models (LLMs) with human preferences, thereby enhancing both their harmlessness and efficacy. However, it has been observed that DPO tends to over-optimize for verbosity, which can detrimentally affect both performance and user experience. In this paper, we conduct an in-depth theoretical analysis of DPO’s optimization objective and reveal a strong correlation between its implicit reward and data length. This correlation misguides the optimization direction, resulting in length sensitivity during the DPO training and leading to verbosity. To address this issue, we propose a length-desensitization improvement method for DPO, termed LD-DPO. The proposed method aims to desensitize DPO to data length by decoupling explicit length preference, which is relatively insignificant, from the other implicit preferences, thereby enabling more effective learning of the intrinsic preferences. We utilized two settings (Base and Instruct) of Llama2-13B, Llama3-8B, and Qwen2-7B for experimental validation on various benchmarks including MT-Bench and AlpacaEval 2. The experimental results indicate that LD-DPO consistently outperforms DPO and other baseline methods, achieving more concise responses with a 10-40% reduction in length compared to DPO. We conducted in-depth experimental analyses to demonstrate that LD-DPO can indeed achieve length desensitization and align the model more closely with human-like preferences. ”Brevity is the Soul of Wit.‘’—William Shakespeare

401Understanding Generalization of Preference Optimization Under Noisy Feedback

[openreview] [pdf]

Abstract As large language models (LLMs) advance their capabilities, aligning these models with human preferences has become crucial. Preference optimization, which trains models to distinguish between preferred and non-preferred responses based on human feedback, has become a crucial component for aligning LLMs. However, most existing works assume noise-free feedback, which is unrealistic given the inherent errors and inconsistencies in human judgments. This paper addresses the impact of noisy feedback on preference optimization, providing generalization guarantees under these conditions. Unlike traditional analyses that assume convergence, our work focuses on finite-step preference optimization, offering new insights that are more aligned with practical LLM training. We establish generalization guarantees for noisy preference learning under a broad family of preference optimization losses such as DPO, IPO, SLiC, etc. Our analysis provides the basis for a general model that closely describes how the generalization decays with the noise rate. Empirical validation on contemporary LLMs confirms the practical relevance of our findings, offering valuable insights for developing AI systems that align with human preferences.

402FDN: Interpretable Spatiotemporal Forecasting with Future Decomposition Networks

[openreview] [pdf]

Abstract Spatiotemporal systems comprise a collection of spatially distributed yet interdependent entities each generating unique dynamic signals. Highly sophisticated methods have been proposed in recent years delivering state-of-the-art (SOTA) forecasts but few have focused on interpretability. To address this, we propose the Future Decomposition Network (FDN), a novel forecast model capable of (a) providing interpretable predictions through classification (b) revealing latent activity patterns in the target time-series and (c) delivering forecasts competitive with SOTA methods at a fraction of their memory and runtime cost. We conduct comprehensive analyses on FDN for multiple datasets from hydrologic, traffic, and energy systems demonstrating its improved accuracy and interpretability.

403Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) is essential when online exploration is costly or unsafe, but it often struggles with high epistemic uncertainty due to limited data. Existing methods learn fixed conservative policies, which limit adaptivity and generalization. To tackle these challenges, we proposeReflect-then-Plan (RefPlan), a noveldoubly Bayesianapproach for offline model-based (MB) planning that enhances offline-learned policies for improved adaptivity and generalization. RefPlan integrates uncertainty modeling and MB planning in a unified probabilistic framework, recasting planning as Bayesian posterior estimation. During deployment, it updates a belief distribution over environment dynamics based on real-time observations. By incorporating this uncertainty into MB planning via marginalization, RefPlan derives plans that account for unknowns beyond the agent’s limited knowledge. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.

404Anomaly Detection through Conditional Diffusion Probability Modeling on Graphs

[openreview] [pdf]

Abstract Existing Graph Neural Network-based anomaly detection methods suffer from over-smoothing issues during feature aggregation. Moreover, most existing methods are discriminative models that learn the boundaries between anomalous and normal data points, allowing malicious nodes in a dynamic adversarial environment to bypass detection boundaries. To address these issues, existing methods primarily focus on enhancing the discriminative boundary for each individual node, rather than considering the interdependencies of node anomalies from a holistic graph perspective. We propose an advanced Conditional Graph Anomaly Diffusion Model (CGADM) to model and capture the joint distribution of anomalies on the whole graph, thereby enabling generative graph anomaly detection. To avoid starting the diffusion process from a random state, CGADM introduces a prior-guided denoising diffusion probability model. To circumvent the need for iterative denoising samplings for each node on large-scale graphs, we adopt a prior confidence-aware mechanism to dynamically adjust the reverse sampling steps for each node, significantly reducing the computational burden on large-scale graphs. We conducted experiments on CGADM using standard benchmarks, and the results demonstrated excellent performance in graph anomaly detection tasks. Additional ablation studies confirmed our framework’s computational advantages.

405Targeted Attack Improves Protection against Unauthorized Diffusion Customization

[openreview] [pdf]

Abstract Diffusion models build a new milestone for image generation yet raising public concerns, for they can be fine-tuned on unauthorized images for customization. Protection based on adversarial attacks rises to encounter this unauthorized diffusion customization, by adding protective watermarks to images and poisoning diffusion models. However, current protection, leveraging untargeted attacks, does not appear to be effective enough. In this paper, we propose a simple yet effective improvement for the protection against unauthorized diffusion customization by introducing targeted attacks. We show that by carefully selecting the target, targeted attacks significantly outperform untargeted attacks in poisoning diffusion models and degrading the customization image quality. Extensive experiments validate the superiority of our method on two mainstream customization methods of diffusion models, compared to existing protections. To explain the surprising success of targeted attacks, we delve into the mechanism of attack-based protections and propose a hypothesis based on our observation, which enhances the comprehension of attack-based protections. To the best of our knowledge, we are the first to both reveal the vulnerability of diffusion models to targeted attacks and leverage targeted attacks to enhance protection against unauthorized diffusion customization.

406Diffusion Minimization and Sheaf Neural Networks for Recommender Systems

[openreview] [pdf]

Abstract Graph Neural Networks (GNN) are well-known for successful applications in recommender systems. Despite recent advances in GNN development, various authors report that in certain cases GNN suffer from so-called oversmoothing problems. Sheaf Neural Networks (SNN) is one of the ways to address the issue of oversmoothing. In the present work we propose a novel approach for training SNN together with user and item embeddings. In that approach parameters of the sheaf are inferred via minimization of the classical BPR loss and sheaf diffusion on graphs subjected to orthogonality and consistency constraints. Performance of the novel technique is evaluated on synthetic test cases and standard benchmarks for recommendations.

407Learning Augmentation Policies from A Model Zoo for Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting models typically rely on a fixed-size training set and treat all data uniformly, which may not effectively capture the specific patterns present in more challenging training samples. To address this issue, we introduce AutoTSAug, a learnable data augmentation method based on reinforcement learning. Our approach begins with an empirical analysis to determine which parts of the training data should be augmented. Specifically, we identify the so-called marginal samples by considering the prediction diversity across a set of pretrained forecasting models. Next, we propose using variational masked autoencoders as the augmentation model and applying the REINFORCE algorithm to transform the marginal samples into new data. The goal of this generative model is not only to mimic the distribution of real data but also to reduce the variance of prediction errors across the model zoo. By augmenting the marginal samples with a learnable policy, AutoTSAug substantially improves forecasting performance, advancing the prior art in this field with minimal additional computational cost.

408One Training Fits All: Generalized Data Condensation via Mixture-of-Information Bottleneck Guidance

[openreview] [pdf]

Abstract Data condensation (DC) technologies are widely used in buffer-constrained scenarios to reduce the memory demand of training samples and maintain DNN training performance. However, due to the storage constraint of deployment devices and the high energy costs of condensation procedure, synthetic datasets generated by DC often have inferior performance in terms of training efficiency and scalability, which greatly limits its practical application on various edge devices. This dilemma arises due to two reasons: i) existing state-of-the-art (SoTA) data condensation approaches that update synthetic datasets by intuitively matching intermediate training outputs (e.g., gradients, features and distributions) between real datasets and synthetic datasets without improving their representational information capabilities from the perspective of the useful information contained. ii) DC lacks sufficient consideration for the heterogeneity of storage constraints among various edge devices, which will result in large training overheads (i.e., consumption or storage). To tackle the above issue, We propose a novel method named Mixture-of-Information Bottleneck Dataset Condensation (MIBDC), which employs information bottlenecks from synthetic datasets with various Image Per Class (IPC) numbers to improve the overall DC generalization and scalability. Specifically, in this paper, the following two phenomena are found: i) The quality of synthetic datasets improves with increased synthetic dataset quantity. ii) The smaller the number of synthetic datasets, the earlier they can reach the convergence peak. Based on the above two findings, this paper proposes that i) large synthetic datasets can guide the better convergence of smaller ones. ii) information contained in synthetic datasets with different IPC numbers can play a collaborative role in the guidance of dataset condensation generalization. Comprehensive experimental results on three well-known datasets show that, compared with state-of-the-art dataset condensation methods, MIBDC can not only enhance the generalization performance of trained models but also achieve superior scalability.

409HDDI: A Historical Data-Based Diffusion Imputation Method for High-Accuracy Recovery in Multivariate Time Series with High Missing Rate and Long-Term Gap

[openreview] [pdf]

Abstract Multivariate time series data often face the challenge of missing values, which can impact the performance of subsequent tasks. Although some deep learning-based imputation methods perform well, they still struggle with insufficient training data due to high missing rate and long-term missing data. To address these challenges, we propose a Historical Data-based Multivariate Time Series Diffusion Imputation (HDDI) method. Unlike existing deep learning-based imputation methods, we design a historical data supplement module to match and fuse historical data to supplement the training data. Additionally, we propose a diffusion imputation module that utilizes the supplement training data to achieve high-accuracy imputation even under high missing rate and long-term missing scenario. We conduct extensive experiments on five public multivariate time series datasets, the results show that our HDDI outperforms baseline methods across five datasets. Particularly, when the data missing rate is 90%, HDDI improves accuracy by 25.15% compared to the best baseline method in the random missing scenario, and by 13.64% in the long-term missing scenario. The code is available athttps://github.com/liuyu3880/HDDIproject.

410Scenario-Wise Rec: A Multi-Scenario Recommendation Benchmark

[openreview] [pdf]

Abstract Multi Scenario Recommendation (MSR) tasks, referring to building a unified model to enhance performance across all recommendation scenarios, have recently gained much attention. However, current research in MSR faces two significant challenges that hinder the field’s development: the absence of uniform procedures for multi-scenario dataset processing, thus hindering fair comparisons, and most models being closed-sourced, which complicates comparisons with current SOTA models. Consequently, we introduce our benchmark, Scenario-Wise Rec, which comprises six public datasets and twelve benchmark models, along with a training and evaluation pipeline. We have also validated our benchmark using an industrial advertising dataset, further enhancing its real-world reliability. We aim for this benchmark to provide researchers with valuable insights from prior works, enabling the development of novel models based on our benchmark and thereby fostering a collaborative research ecosystem in MSR. Our source code is also available.

411RAPID: Retrieval Augmented Training of Differentially Private Diffusion Models

[openreview] [pdf]

Abstract Differentially private diffusion models (DPDMs) harness the remarkable generative capabilities of diffusion models while enforcing differential privacy (DP) for sensitive data. However, existing DPDM training approaches often suffer from significant utility loss, large memory footprint, and expensive inference cost, impeding their practical uses.To overcome such limitations, we present RAPID: Retrieval Augmented PrIvate Diffusion model, a novel approach that integrates retrieval augmented generation (RAG) into DPDM training. Specifically, RAPID leverages available public data to build a knowledge base of sample trajectories; when training the diffusion model on private data, RAPID computes the early sampling steps as queries, retrieves similar trajectories from the knowledge base as surrogates, and focuses on training the later sampling steps in a differentially private manner. Extensive evaluation using benchmark datasets and models demonstrates that, with the same privacy guarantee, RAPID significantly outperforms state-of-the-art approaches by large margins in generative quality, memory footprint, and inference cost, suggesting that retrieval-augmented DP training represents a promising direction for developing future privacy-preserving generative models (code and data are available in the submitted supplemental materials).

412Prototype-based Optimal Transport for Out-of-Distribution Detection

[openreview] [pdf]

Abstract Detecting Out-of-Distribution (OOD) inputs is crucial for improving the reliability of deep neural networks in the real-world deployment. In this paper, inspired by the inherent distribution shift between ID and OOD data, we propose a novel method that leverages optimal transport to measure the distribution discrepancy between test inputs and ID prototypes. The resulting transport costs are used to quantify the individual contribution of each test input to the overall discrepancy, serving as a desirable measure for OOD detection. To address the issue that solely relying on the transport costs to ID prototypes is inadequate for identifying OOD inputs closer to ID data, we generate virtual outliers to approximate the OOD region via linear extrapolation. By combining the transport costs to ID prototypes with the costs to virtual outliers, the detection of OOD data near ID data is emphasized, thereby enhancing the distinction between ID and OOD inputs. Experiments demonstrate the superiority of our method over state-of-the-art methods.

413Pullback Flow Matching on Data Manifolds

[openreview] [pdf]

Abstract We propose Pullback Flow Matching (PFM), a novel framework for generative modeling on data manifolds. Unlike existing methods that assume or learn restrictive closed-form manifold mappings for training Riemannian Flow Matching (RFM) models, PFM leverages pullback geometry and isometric learning to preserve the underlying manifold’s geometry while enabling efficient generation and precise interpolation in latent space. This approach not only facilitates closed-form mappings on the data manifold but also allows for designable latent spaces, using assumed metrics on both data and latent manifolds. By enhancing isometric learning through Neural ODEs and proposing a scalable training objective, we achieve a latent space more suitable for interpolation, leading to improved manifold learning and generative performance. We demonstrate PFM’s effectiveness through applications in synthetic data, protein dynamics and protein sequence data, generating novel proteins with specific properties. This method shows strong potential for drug discovery and materials science, where generating novel samples with specific properties is of great interest.

414Alignment without Over-optimization: Training-Free Solution for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models excel in generative tasks, but aligning them with specific objec- tives while maintaining their versatility remains challenging. Existing fine-tuning methods often suffer from reward over-optimization, while approximate guidance approaches fail to effectively optimize target rewards. Addressing these limita- tions, we propose a training-free sampling method based on Sequential Monte Carlo (SMC) to sample from the reward-aligned target distribution. Our approach, tailored for diffusion sampling and incorporating tempering techniques, achieves comparable or superior target rewards to fine-tuning methods while preserving diversity and cross-reward generalization. We demonstrate its effectiveness in single-reward optimization, multi-objective scenarios, and online black-box opti- mization. This work offers a robust solution for aligning diffusion models with diverse downstream objectives without compromising their general capabilities.

415AN INFORMATION THEORETIC EVALUATION METRIC FOR STRONG UNLEARNING

[openreview] [pdf]

Abstract Machine unlearning (MU) aims to remove the influence of specific data from trained models, addressing privacy concerns and ensuring compliance with regulations such as the “right to be forgotten.” Evaluating strong unlearning, where the unlearned model is indistinguishable from one retrained without the forgetting data, remains a significant challenge in deep neural networks (DNNs). Common black-box metrics, such as variants of membership inference attacks and accuracy comparisons, primarily assess model outputs but often fail to capture residual information in intermediate layers. To bridge this gap, we introduce the Information Difference Index (IDI), a novel white-box metric inspired by information theory. IDI quantifies retained information in intermediate features by measuring mutual information between those features and the labels to be forgotten, offering a more comprehensive assessment of unlearning efficacy. Our experiments demonstrate that IDI effectively measures the degree of unlearning across various datasets and architectures, providing a reliable tool for evaluating strong unlearning in DNNs.

416A Contrastive Teacher-Student Framework for Novelty Detection under Style Shifts

[openreview] [pdf]

Abstract There have been several efforts to improve Novelty Detection (ND) performance. However, ND methods often suffer significant performance drops under minor distribution shifts caused by changes in the environment, known as style shifts. This challenge arises from the ND setup, where the absence of out-of-distribution (OOD) samples during training causes the detector to be biased toward the dominant style features in the in-distribution (ID) data. As a result, the model mistakenly learns to correlate style with core features, using this shortcut for detection. Robust ND is crucial for real-world applications like autonomous driving and medical imaging, where test samples may have different styles than the training data. Motivated by this, we propose a robust ND method that crafts an auxiliary OOD set with style features similar to the ID set but with different core features. Then, a task-based knowledge distillation strategy is utilized to distinguish core features from style features and help our model rely on core features for discriminating crafted OOD and ID sets. We verified the effectiveness of our method through extensive experimental evaluations on several datasets, including synthetic and real-world benchmarks, against nine different ND methods.

417T-Graphormer: Using Transformers for Spatiotemporal Forecasting

[openreview] [pdf]

Abstract Time series data is ubiquitous and appears in all fields of study. In multivariate time series, observations are interconnected both temporally and across components. For instance, in traffic flow analysis, traffic speeds at different intersections exhibit complex spatiotemporal correlations. This dual structure presents unique challenges for modelling. Most existing forecasting methods address this by learning the spatial and temporal dependencies separately. Here, we propose Temporal Graphormer (T-Graphormer), a transformer-based method that models spatiotemporal correlations directly. By extending the Graphormer architecture over time, each node is updated based on all other nodes within the historical context window, allowing the model to learn powerful representations. We demonstrate the efficacy of T-Graphormer by evaluating it on two real-world traffic prediction benchmarking datasets. Compared to state-of-the-art methods, our method shows a reduction in root mean squared error (RMSE) by up to 10% and mean absolute percentage error (MAPE) by up to 10%.

418Hydra-MDP++: Advancing End-to-End Driving via Hydra-Distillation with Expert-Guided Decision Analysis

[openreview] [pdf]

Abstract We introduce HydraMDP++, a novel end-to-end autonomous driving framework that integrates rule-based and neural planners by learning from human demonstrations and distilling knowledge from rule-based experts. We propose a teacher-student knowledge distillation framework with a multi-head student decoder that integrates feedback from rule-based expert teachers. The student model achieves state-of-the-art performance on the NAVSIM benchmark with a tiny image encoder. Moreover, to address limitations in existing evaluation metrics, we expand the teacher model to include traffic light compliance, lane-keeping ability, and extended comfort. This is intended to ensure a more robust decision synthesis in driving. HydraMDP++ demonstrates robust and efficient performance across diverse driving scenarios, achieving a 91.0% drive score on NAVSIM by simply scaling the image encoder. Our work contributes to developing more reliable and adaptable autonomous driving systems that combine the strengths of rule-based and neural planning approaches.

419GRADIENT-OPTIMIZED CONTRASTIVE LEARNING

[openreview] [pdf]

Abstract Contrastive learning is a crucial technique in representation learning, producing robust embeddings by distinguishing between similar and dissimilar pairs. In this paper, we introduce a novel framework, Gradient-Optimized Contrastive Learning (GOAL), which enhances network training by optimizing gradient updates during backpropagation as a bilevel optimization problem. Our approach offers three key insights that set it apart from existing methods: (1) Contrastive learning can be seen as an approximation of a one-class support vector machine (OC-SVM) using multiple neural tangent kernels (NTKs) in the network’s parameter space; (2) Hard triplet samples are vital for defining support vectors and outliers in OC-SVMs within NTK spaces, with their difficulty measured using Lagrangian multipliers; (3) Contrastive losses like InfoNCE provide efficient yet dense approximations of sparse Lagrangian multipliers by implicitly leveraging gradients. To address the computational complexity of GOAL, we propose a novel contrastive loss function, Sparse InfoNCE (SINCE), which improves the Lagrangian multiplier approximation by incorporating hard triplet sampling into InfoNCE. Our experimental results demonstrate the effectiveness and efficiency of SINCE in tasks such as image classification and point cloud completion. Demo code is attached in the supplementary file.

420G-Transformer for Conditional Average Potential Outcome Estimation over Time

[openreview] [pdf]

Abstract Estimating potential outcomes for treatments over time based on observational data is important for personalized decision-making in medicine. Yet, existing neural methods for this task either (1) do not perform proper adjustments for time-varying confounders, or (2) suffer from large estimation variance. In order to address both limitations, we introduce the G-transformer (GT). Our GT is a novel, neural end-to-end model which adjusts for time-varying confounders, and provides low-variance estimation of conditional average potential outcomes (CAPOs) over time. Specifically, our GT is the first neural model to perform regression-based iterative G-computation for CAPOs in the time-varying setting. We evaluate the effectiveness of our GT across various experiments. In sum, this work represents a significant step towards personalized decision-making from electronic health records.

421Looking Beyond the Top-1: Transformers Determine Top Tokens in Order

[openreview] [pdf]

Abstract Understanding the inner workings of Transformers is crucial for achieving more accurate and efficient predictions. In this work, we analyze the computation performed by Transformers in the layers after the top-1 prediction has become fixed, which has been previously referred to as the “saturation event”. We expand the concept of saturation events for top-k tokens, demonstrating that similar saturation events occur across language, vision, and speech models. We find that these saturation events happen in order of the corresponding tokens’ ranking, i.e., the model first decides on the top ranking token, then the second highest ranking token, and so on. This phenomenon seems intrinsic to the Transformer architecture, occurring across different architectural variants (decoder-only, encoder-only, and to a lesser extent full-Transformer), and even in untrained Transformers. We propose an underlying mechanism of task transition for this sequential saturation, where task k corresponds to predicting the k-th most probable token, and the saturation events are in fact discrete transitions between the tasks. In support of this we show that it is possible to predict the current task from hidden layer embedding. Furthermore, using an intervention method we demonstrate that we can cause the model to switch from one task to the next. Finally, leveraging our findings, we introduce a novel token-level early-exit strategy, which surpasses existing methods in balancing performance and efficiency.

422Repulsive Latent Score Distillation for Solving Inverse Problems

[openreview] [pdf]

Abstract Score Distillation Sampling (SDS) has been pivotal for leveraging pre-trained diffusion models in downstream tasks such as inverse problems, but it faces two major challenges: (i)(i) mode collapse and (ii)(ii) latent space inversion, which become more pronounced in high-dimensional data. To address mode collapse, we introduce a novel variational framework for posterior sampling. Utilizing the Wasserstein gradient flow interpretation of SDS, we propose a multimodal variational approximation with a \emph{repulsion} mechanism that promotes diversity among particles by penalizing pairwise kernel-based similarity. This repulsion acts as a simple regularizer, encouraging a more diverse set of solutions. To mitigate latent space ambiguity, we extend this framework with an \emph{augmented} variational distribution that disentangles the latent and data. This repulsive augmented formulation balances computational efficiency, quality, and diversity. Extensive experiments on linear and nonlinear inverse tasks with high-resolution images (512×512512 \times 512) using pre-trained Stable Diffusion models demonstrate the effectiveness of our approach.

423Overcoming Lookback Window Limitations: Exploring Longer Windows in Long-Term Time Series Forecasting

[openreview] [pdf]

Abstract Long-term time series forecasting (LTSF) aims to predict future trends based on historical data. While longer lookback windows theoretically provide more comprehensive insights, current Transformer-based models face the Lookback Window Limitation (LWL). On one hand, longer windows introduce redundant information, which can hinder model learning. On the other hand, Transformers tend to overfit temporal noise rather than extract meaningful temporal information when dealing with longer sequences, compounded by their quadratic complexity. In this paper, we aim to overcome LWL, enabling models to leverage more historical information for improved performance. Specifically, to mitigate information redundancy, we introduce the Information Bottleneck Filter (IBF), which applies information bottleneck theory to extract essential subsequences from the input. Additionally, to address the limitations of the Transformer architecture in handling long sequences, we propose the Hybrid-Transformer-Mamba (HTM), which combines the linear complexity and long-range modeling capabilities of Mamba with the Transformer’s strength in modeling short sequences. We integrate these two model-agnostic modules into various existing methods and conduct experiments on seven datasets. The results demonstrate that incorporating these modules effectively overcomes the lookback window limitations. Notably, by combining them with the Patch strategy, we design the PIH (\textbf{P}atch-\textbf{I}BF-\textbf{H}TM), successfully extending the window length to 1024—a significantly larger window than previously achieved—and achieving state-of-the-art results, highlighting the potential of exploring even longer windows.

424Does learning the right latent variables necessarily improve in-context learning?

[openreview] [pdf]

Abstract Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights, suggesting avenues for efficiently solving new tasks. For many tasks, e.g., linear regression, the data factorizes: examples are independent given a task latent that generates the data, e.g., linear coefficients. While an optimal predictor leverages this factorization by inferring task latents, it is unclear if Transformers implicitly do so or if they instead exploit heuristics and statistical shortcuts enabled by attention layers. Both scenarios have inspired active ongoing work. In this paper, we systematically investigate the effect of explicitly inferring task latents. We minimally modify the Transformer architecture with a bottleneck designed to prevent shortcuts in favor of more structured solutions, and then compare performance against standard Transformers across various ICL tasks. Contrary to intuition and some recent works, we find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance, in general. Curiously, we find that while the bottleneck effectively learns to extract latent task variables from context, downstream processing struggles to utilize them for robust prediction. Our study highlights the intrinsic limitations of Transformers in achieving structured ICL solutions that generalize, and shows that while inferring the right latents aids interpretability, it is not sufficient to alleviate this problem.

425Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model

[openreview] [pdf]

Abstract ControlNets are widely used for adding spatial control to text-to-image diffusion models. However, when it comes to controllable video generation, ControlNets cannot be directly integrated into new backbones due to feature space mismatches, and training ControlNets for new backbones can be a significant burden for many users. Furthermore, applying ControlNets independently to different frames can not effectively maintain object temporal consistency. To address these challenges, we introduce Ctrl-Adapter, an efficient and versatile framework that adds diverse controls to any image/video diffusion models through the adaptation of pretrained ControlNets. Ctrl-Adapter offers strong and diverse capabilities, including image and video control, sparse-frame video control, fine-grained patch-level multi-condition control, zero-shot adaptation to unseen conditions, and supports a variety of downstream tasks beyond spatial control, including video editing, video style transfer, and text-guided motion control. With six diverse U-Net/DiT-based image/video diffusion models (SDXL, PixArt-α, I2VGen-XL, SVD, Latte, Hotshot-XL), Ctrl-Adapter matches the performance of pretrained ControlNets on COCO and achieves the state-of-the-art on DAVIS 2017 with significantly lower computation (< 10 GPU hours). We provide video examples inhttps://ctrladapterexamples.github.ioand code in the supplementary material.

[openreview] [pdf]

Abstract State-of-the-art link prediction (LP) models demonstrate impressive benchmark results. However, popular benchmark datasets often assume that training, validation, and testing samples are representative of the overall dataset distribution. In real-world situations, this assumption is often incorrect; since uncontrolled factors lead to the problem where new dataset samples come from different distributions than training samples. The vast majority of recent work focuses on dataset shift affecting node- and graph-level tasks, largely ignoring link-level tasks. To bridge this gap, we introduce a novel splitting strategy, known as LPShift, which utilizes structural properties to induce a controlled distribution shift. We verify the effect of LPShift through empirical evaluation of SOTA LP methods on 16 LPShift generated splits of Open Graph Benchmark (OGB) datasets. When benchmarked with LPShift datasets, GNN4LP methods frequently generalize worse than heuristics or basic GNNs. Furthermore, LP-specific generalization techniques do little to improve performance under LPShift. Finally, further analysis provides insight on why LP models lose much of their architectural advantages under LPShift.

[openreview] [pdf]

Abstract Enhancing the capability of large language models (LLMs) in reasoning has gained significant attention in recent years. Previous studies have demonstrated the effectiveness of various prompting strategies in aiding LLMs in reasoning (called “reasoning actions”), such as step-by-step thinking, reflecting before answering, solving with programs, and their combinations. However, these approaches often applied static, predefined reasoning actions uniformly to all questions, without considering the specific characteristics of each question or the capability of the task-solving LLM. In this paper, we propose DOTS, an approach enabling LLMs to reason Dynamically via Optimal reasoning Trajectories Search, tailored to the specific characteristics of each question and the inherent capability of the task-solving LLM. Our approach involves three key steps: i) defining atomic reasoning action modules that can be composed into various reasoning action trajectories; ii) searching for the optimal action trajectory for each training question through iterative exploration and evaluation for the specific task-solving LLM; and iii) using the collected optimal trajectories to train an LLM to plan for the reasoning trajectories of unseen questions. In particular, we propose two learning paradigms, i.e., fine-tuning an external LLM as a planner to guide the task-solving LLM, or directly fine-tuning the task-solving LLM with an internalized capability for reasoning actions planning. Our experiments across eight reasoning tasks show that our method consistently outperforms static reasoning techniques and the vanilla instruction tuning approach. Further analysis reveals that our method enables LLMs to adjust their computation based on problem complexity, allocating deeper thinking and reasoning to harder problems.

428Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning

[openreview] [pdf]

Abstract Autoregressive language models, despite their impressive capabilities, struggle with complex reasoning and long-term planning tasks. We introduce discrete diffusion models as a novel solution to these challenges. Through the lens of subgoal imbalance, we demonstrate how diffusion models effectively learn difficult subgoals that elude autoregressive approaches. We propose Multi-granularity Diffusion Modeling (MDM), which prioritizes subgoals based on difficulty during learning. On complex tasks like Countdown, Sudoku, and Boolean Satisfiability Problems, MDM significantly outperforms autoregressive models without using search techniques. For instance, MDM achieves 91.5% and 100% accuracy on Countdown and Sudoku, respectively, compared to 45.8% and 20.7% for autoregressive models. Our work highlights the potential of diffusion-based approaches in advancing AI capabilities for sophisticated language understanding and problem-solving tasks.

429Training on more Reachable Tasks for Generalisation in Reinforcement Learning

[openreview] [pdf]

Abstract In multi-task reinforcement learning, agents train on a fixed set of tasks and have to generalise to new ones. Recent work has shown that increased exploration improves this generalisation, but it remains unclear why exactly that is. In this paper, we introduce the concept of reachability in multi-task reinforcement learning and show that an initial exploration phase increases the number of reachable tasks the agent is trained on. This, and not the increased exploration, is responsible for the improved generalisation, even to unreachable tasks. Inspired by this, we propose a novel method Explore-Go that implements such an exploration phase at the beginning of each episode. Explore-Go only modifies the way experience is collected and can be used with most existing on-policy or off-policy reinforcement learning algorithms. We demonstrate the effectiveness of our method when combined with some popular algorithms and show an increase in generalisation performance across several environments.

430An Online Learning Theory of Trading-Volume Maximization

[openreview] [pdf]

Abstract We explore brokerage between traders in an online learning framework. At any round tt, two traders meet to exchange an asset, provided the exchange is mutually beneficial. The broker proposes a trading price, and each trader tries to sell their asset or buy the asset from the other party, depending on whether the price is higher or lower than their private valuations. A trade happens if one trader is willing to sell and the other is willing to buy at the proposed price. Previous work provided guidance to a broker aiming at enhancing traders’ total earnings by maximizing thegain from trade, defined as the sum of the traders’ net utilities after each interaction. This classical notion of reward can be highly unfair to traders with small profit margins, and far from the real-life utility of the broker. For these reasons, we investigate how the broker should behave to maximize the trading volume, i.e., thetotal number of trades. We model the traders’ valuations as an i.i.d. process with an unknown distribution. If the traders’ valuations are revealed after each interaction (full-feedback), and the traders’ valuations cumulative distribution function (cdf) is continuous, we provide an algorithm achieving logarithmic regret and show its optimality up to constants. If only their willingness to sell or buy at the proposed price is revealed after each interaction (2-bit feedback), we provide an algorithm achieving poly-logarithmic regret when the traders’ valuations cdf is Lipschitz and show its near-optimality. We complement our results by analyzing the implications of dropping the regularity assumptions on the unknown traders’ valuations cdf. If we drop the continuous cdf assumption, the regret rate degrades to Θ(T)\Theta(\sqrt{T}) in the full-feedback case, where TT is the time horizon. If we drop the Lipschitz cdf assumption, learning becomes impossible in the 2-bit feedback case.

431An Online Learning Theory of Trading-Volume Maximization

[openreview] [pdf]

Abstract No absctract

432Accelerate High-Quality Diffusion Models with Inner Loop Feedback

[openreview] [pdf]

Abstract We propose Inner Loop Feedback (ILF), a novel approach to accelerate diffusion models’ inference. ILF trains a lightweight module to predict future features in the denoising process by leveraging the outputs from a chosen diffusion backbone block at a given time step. This approach exploits two key intuitions; (1) the outputs of a given block at adjacent time steps are similar, and (2) performing partial computations for a step imposes a lower burden on the model than skipping the step entirely. Our method is highly flexible, since we find that the feedback module itself can simply be a block from the diffusion backbone, with all settings copied. Its influence on the diffusion forward can be tempered with a learnable scaling factor from zero initialization. We train this module using distillation losses; however, unlike some prior work where a full diffusion backbone serves as the student, our model freezes the backbone, training only the feedback module. While many efforts to optimize diffusion models focus on achieving acceptable image quality in extremely few steps (1-4 steps), our emphasis is on matching best case results (typically achieved in 20 steps) while significantly reducing runtime. ILF achieves this balance effectively, demonstrating strong performance for both class-to-image generation with diffusion transformer (DiT) and text-to-image generation with DiT-based PixArt-alpha and PixArt-sigma. The quality of ILF’s 1.7x-1.8x speedups are confirmed by FID, CLIP score, CLIP Image Quality Assessment, ImageReward, and qualitative comparisons.

433Entropy-Based Uncertainty Modeling for Trajectory Prediction in Autonomous Driving

[openreview] [pdf]

Abstract In autonomous driving, accurate motion prediction is essential for safe and efficient motion planning. To ensure safety, planners must rely on reliable uncertainties in the future behavior of surrounding agents, yet this aspect has received limited attention. This paper addresses the problem of uncertainty modeling in trajectory prediction. We adopt a holistic approach that focuses on uncertainty quantification, decomposition, and the influence of model composition. Our method is based on a theoretically-grounded information-theoretic approach to measure uncertainty, allowing us to decompose total uncertainty into its aleatoric and epistemic components. We conduct extensive experiments on the nuScenes dataset to assess how different model architectures and configurations affect uncertainty quantification and model robustness. Our analysis thoroughly explores the uncertainty quantification capabilities of several state-of-the-art prediction models, examining the relationship between uncertainty and prediction error in both in- and out-of-distribution scenarios, as well as robustness in out-of-distribution.

434Ensembling Diffusion Models via Adaptive Feature Aggregation

[openreview] [pdf]

Abstract The success of the text-guided diffusion model has inspired the development and release of numerous powerful diffusion models within the open-source community. These models are typically fine-tuned on various expert datasets, showcasing diverse denoising capabilities. Leveraging multiple high-quality models to produce stronger generation ability is valuable, but has not been extensively studied. Existing methods primarily adopt parameter merging strategies to produce a new static model. However, they overlook the fact that the divergent denoising capabilities of the models may dynamically change across different states, such as when experiencing different prompts, initial noises, denoising steps, and spatial locations. In this paper, we propose a novel ensembling method, Adaptive Feature Aggregation (AFA), which dynamically adjusts the contributions of multiple models at the feature level according to various states (i.e., prompts, initial noises, denoising steps, and spatial locations), thereby keeping the advantages of multiple diffusion models, while suppressing their disadvantages. Specifically, we design a lightweight Spatial-Aware Block-Wise (SABW) feature aggregator that adaptive aggregates the block-wise intermediate features from multiple U-Net denoisers into a unified one. The core idea lies in dynamically producing an individual attention map for each model’s features by comprehensively considering various states. It is worth noting that only SABW is trainable with about 50 million parameters, while other models are frozen. Both the quantitative and qualitative experiments demonstrate the effectiveness of our proposed method.

435Prevalence of Negative Transfer in Continual Reinforcement Learning: Analyses and a Simple Baseline

[openreview] [pdf]

Abstract We argue that the negative transfer problem occurring when the new task to learn arrives is an important problem that needs not be overlooked when developing effective Continual Reinforcement Learning (CRL) algorithms. Through comprehensive experimental validation, we demonstrate that such issue frequently exists in CRL and cannot be effectively addressed by several recent work on either mitigating plasticity loss of RL agents or enhancing the positive transfer in CRL scenario. To that end, we develop Reset & Distill (R&D), a simple yet highly effective baseline method, to overcome the negative transfer problem in CRL. R&D combines a strategy of resetting the agent’s online actor and critic networks to learn a new task and an offline learning step for distilling the knowledge from the online actor and previous expert’s action probabilities. We carried out extensive experiments on long sequence of Meta World tasks and show that our simple baseline method consistently outperforms recent approaches, achieving significantly higher success rates across a range of tasks. Our findings highlight the importance of considering negative transfer in CRL and emphasize the need for robust strategies like R&D to mitigate its detrimental effects.

436Inverse Flow and Consistency Models

[openreview] [pdf]

Abstract Inverse generation problems, such as denoising without ground truth observations, is a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models achieved impressive results by casting generation as denoising problems, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows, Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalized consistency training to any forward diffusion processes or conditional flows, which have applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF’s utility in scientific problems. This work opens up the use of powerful generative models for denoising.

437Understanding the Stability-based Generalization of Personalized Federated Learning

[openreview] [pdf]

Abstract Despite great achievements in algorithm design for Personalized Federated Learning (PFL), research on the theoretical analysis of generalization is still in its early stages. Some recent theoretical results have investigated the generalization performance of personalized models under the problem setting and hypothesis in the convex condition, which do not consider the real iteration performance during the non-convex training. To further understand the testing performance from the theoretical perspective, we propose the first algorithm-matter generalization analysis with uniform stability for the typical PFL method Partial Model Personalization on smooth and non-convex objectives. In an attempt to distinguish the shared and personalized errors, we decouple the shared aggregation and the local fine-tuning progress and illustrate the interaction mechanism between the shared and personalized variables. The algorithm-matter generalization bounds analyze the impact of the trivial hyperparameters like learning steps and stepsizes as well as the communication modes in both Centralized and Decentralized PFL (C-PFL and D-PFL), which also concludes that C-PFL generalizes better than D-PFL. Combined with the convergence errors, we then obtain the excess risk analysis and establish the better early stopping point for the optimal population risk of PFL. Promising experiments on CIFAR dataset also corroborate our theoretical results.

438Dual-Model Defense: Safeguarding Diffusion Models from Membership Inference Attacks through Disjoint Data Splitting

[openreview] [pdf]

Abstract Diffusion models have demonstrated remarkable capabilities in image synthesis, but their recently proven vulnerability to Membership Inference Attacks (MIAs) poses a critical privacy concern. This paper introduces two novel and efficient approaches (DualMD and DistillMD) to protect diffusion models against MIAs while maintaining high utility. Both methods are based on training two separate diffusion models on disjoint subsets of the original dataset. DualMD then employs a private inference pipeline that utilizes both models. This strategy significantly reduces the risk of black-box MIAs by limiting the information any single model contains about individual training samples. The dual models can also generate “soft targets” to train a private student model in DistillMD, enhancing privacy guarantees against all types of MIAs. Extensive evaluations of DualMD and DistillMD against state-of-the-art MIAs across various datasets in white-box and black-box settings demonstrate their effectiveness in substantially reducing MIA success rates while preserving competitive image generation performance. Notably, our experiments reveal that DistillMD not only defends against MIAs but also mitigates model memorization, indicating that both vulnerabilities stem from overfitting and can be addressed simultaneously with our unified approach.

439Constrained Exploitability Descent: Finding Mixed-Strategy Nash Equilibrium by Offline Reinforcement Learning

[openreview] [pdf]

Abstract This paper presents Constrained Exploitability Descent (CED), a novel model-free offline reinforcement learning algorithm for solving adversarial Markov games. CED is a game-theoretic approach combined with policy constraint methods from offline RL. While policy constraints can perturb the optimal pure-strategy solutions in single-agent scenarios, we find this side effect can be mitigated when it comes to solving adversarial games, where the optimal policy can be a mixed-strategy Nash equilibrium. We theoretically prove that, under the uniform coverage assumption on the dataset, CED converges to a stationary point in deterministic two-player zero-sum Markov games. The min-player policy at the stationary point satisfies the necessary condition for making up an exact mixed-strategy Nash equilibrium, even when the offline dataset is fixed and finite. Compared to the model-based method of Exploitability Descent that optimizes the max-player policy, our convergence result no longer relies on the generalized gradient. Experiments in matrix games, a tree-form game, and an infinite-horizon soccer game verify that a single run of CED leads to an optimal min-player policy when the practical offline data guarantees uniform coverage. Besides, CED achieves significantly lower NashConv compared to an existing pessimism-based method and can gradually improve the behavior policy even under non-uniform coverage.

440Attention Is All You Need For Mixture-of-Depths Routing

[openreview] [pdf]

Abstract Advancements in deep learning are driven by training models with increasingly larger numbers of parameters, which in turn heightens the computational demands. To address this issue, Mixture-of-Depths (MoD) models have been proposed to dynamically assign computations only to the most relevant parts of the inputs, thereby enabling the deployment of large-parameter models with high efficiency during inference and training. These MoD models utilize a routing mechanism to determine which tokens should be processed by a layer, or skipped. However, conventional MoD models employ additional network layers specifically for the routing which are difficult to train, and add complexity and deployment overhead to the model. In this paper, we introduce a novel attention-based routing mechanismA-MoDthat leverages the existing attention map of the preceding layer for routing decisions within the current layer. Compared to standard routing,A-MoDallows for more efficient training as it introduces no additional trainable parameters and can be easily adapted from pretrained transformer models. Furthermore, it can increase the performance of the MoD model. For instance, we observe up to 2% higher accuracy on ImageNet and up to 2×2\times faster transfer learning, for the first time showing the benefits of MoD on various computer vision tasks.

441Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

[openreview] [pdf]

Abstract Autonomous agents have demonstrated significant potential in automating complex multistep decision-making tasks. However, even state-of-the-art vision-language models (VLMs), such as GPT-4o, still fall short of human-level performance, particularly in intricate web environments and long-horizon planning tasks. To address these limitations, we introduce Reflective Monte Carlo Tree Search (R-MCTS), a novel test-time algorithm designed to enhance the ability of AI agents, e.g., powered by GPT-4o, to explore decision space on the fly. R-MCTS extends traditional MCTS by 1) incorporating contrastive reflection, allowing agents to learn from past interactions and dynamically improve their search efficiency; and 2) using multi-agent debate to provide reliable state evaluation. Moreover, we improve the agent’s performance by fine-tuning GPT-4o through self-learning, using R-MCTS generated tree traversals without any human-provided labels. On the challenging VisualWebArena benchmark, our GPT-4o-based R-MCTS agent achieves a 6% to 30% relative improvement across various tasks compared to the previous state-of-the-art. Additionally, we show that the knowledge gained from test-time search can be effectively transferred back to GPT-4o via fine-tuning. The fine-tuned GPT-4o matches 97% of R-MCTS’s performance while reducing compute usage by a factor of four at test time. Furthermore, qualitative results reveal that the fine-tuned GPT-4o model demonstrates the ability to explore the environment, evaluate a state, and backtrack to viable ones when it detects that the current state cannot lead to success. Moreover, our work demonstrates the compute scaling properties in both training - data collection with R-MCTS - and testing time. These results suggest a promising research direction to enhance VLMs’ reasoning and planning capabilities for agentic applications via test-time search and self-learning.

442RouteLLM: Learning to Route LLMs from Preference Data

[openreview] [pdf]

Abstract Large language models (LLMs) excel at a wide range of tasks, but choosing the right model often involves balancing performance and cost. Powerful models offer better results but are expensive, while smaller models are more cost-effective but less capable. To address this trade-off, we introduce a training framework for learning efficient router models that dynamically select between a stronger and weaker LLM during inference. Our framework leverages human preference data and employs data augmentation techniques to enhance performance. Evaluations on public benchmarks show that our approach can reduce costs by over 2 times without sacrificing response quality. Moreover, our routers exhibit strong generalization capabilities, maintaining performance even when routing between LLMs not included in training. This highlights the potential of our framework to deliver cost-effective, high-performance LLM solutions.

443B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

[openreview] [pdf]

Abstract In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. Recently, the approach to self-improvement has shifted toward a more dynamic, online fashion through iterative training processes. However, the critical factors underlying the mechanism of these self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what are the bottlenecks in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model’s ability to explore and generate high-quality responses among multiple candidates (exploration); and (2) the reliability of external rewards in selecting the best responses from the generated outputs (exploitation). These factors are inherently moving targets throughout the self-improvement cycles, yet their dynamics are rarely discussed in prior research -- It remains unclear what impedes continual model enhancement after only a few iterations. Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model’s exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well due to shifts in distribution from the original policy. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-teaching effectiveness based on the current policy model and available rewards. Our experiments in mathematical reasoning demonstrate that B-STaR not only enhances the model’s exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance. Crucially, this work deconstructs the opaque nature of self-training algorithms, elucidating the interpretable dynamics throughout the process and highlighting current limitations for future research to address.

444Counterfactual Realizability

[openreview] [pdf]

Abstract It is commonly believed that, in a real-world environment, samples can only be drawn from observational and interventional distributions, corresponding to Layers 1 and 2 of the Pearl Causal Hierarchy. Layer 3, representing counterfactual distributions, is believed to be inaccessible almost by definition. However, Bareinboim, Forney, and Pearl (2015) introduced a procedure that allows an agent to sample directly from a counterfactual distribution, leaving open the question of what other counterfactual quantities can be estimated directly via physical experimentation. We resolve this by introducing a formal definition ofrealizability, the ability to draw samples from a distribution, and then developing a complete algorithm to determine whether an arbitrary counterfactual distribution is realizable given fundamental physical constraints, such as the inability to go back in time and subject the same unit to a different experimental condition. We illustrate the implications of this new framework for counterfactual data collection using motivating examples from causal fairness and causal reinforcement learning. While the baseline approach in these motivating settings typically follows an interventional or observational strategy, we show that a counterfactual strategy provably dominates both.

445How to Evaluate Reward Models for RLHF

[openreview] [pdf]

Abstract Reward models are critical to the LLM fine-tuning pipeline, serving as the proxy reference signal during Reinforcement Learning from Human Feedback (RLHF). As a result, the RLHF-ed model’s success strongly depends on the reward model’s ability to reproduce human preferences with high fidelity. However, this exact dependence is unknown, making it difficult to know which reward model is best. Undergoing a full RLHF training pipeline to directly probe downstream LLM performance, while the gold standard, is completely impractical given the resource-intensive nature of RLHF. To address this, we study downstream RLHF outcomes to create a predictive reward model evaluation. We ground our evaluations with our large-scale human preference and verifiable correctness preference datasets, compiling 12 metrics across 12 domains. To investigate which reward model metrics are most correlated to RLHF outcomes, we launch a full end-to-end RLHF experiment on a large-scale crowdsourced human preference platform to view real reward model downstream performance as ground truth. Ultimately, we compile our data and findings into Preference Proxy Evaluations (PPE), the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance which we will open-source for public use and further development.

446Structured Diffusion Models with Mixture of Gaussians as Prior Distribution

[openreview] [pdf]

Abstract We propose a class of structured diffusion models, in which the prior distribution is chosen as a mixture of Gaussians, rather than a standard Gaussian distribution. The specific mixed Gaussian distribution, as prior, can be chosen to incorporate certain structured information of the data. We develop a simple-to-implement training procedure that smoothly accommodates the use of mixed Gaussian as prior. Theory is provided to quantify the benefits of our proposed models, compared to the classical diffusion models. Numerical experiments with synthetic, image and operational data are conducted to show comparative advantages of our model. Our method is shown to be robust to mis-specifications and in particular suits situations where training resources are limited or faster training in real time is desired.

447DP-SGD for non-decomposable objective functions

[openreview] [pdf]

Abstract Unsupervised pre-training is a common step in developing computer vision models and large language models. In this setting, the absence of labels requires the use of similarity-based loss functions, such as the contrastive loss, that favor minimizing the distance between similar inputs and maximizing the distance between distinct inputs. As privacy concerns mount, training these models using differential privacy has become more important. However, due to how inputs are generated for these losses, one of their undesirable properties is that their L2L_2 sensitivity grows with the batch size. This property is particularly disadvantageous for differentially private training methods, such as DP-SGD. To overcome this issue, we develop a new DP-SGD variant for similarity based loss functions --- in particular, the commonly-used contrastive loss --- that manipulates gradients of the objective function in a novel way to obtain a sensitivity of the summed gradient that is O(1)O(1) for batch size nn. We test our DP-SGD variant on some CIFAR-10 pre-training and CIFAR-100 finetuning tasks and show that, in both tasks, our method’s performance comes close to that of a non-private model and generally outperforms DP-SGD applied directly to the contrastive loss.

448SATE: A Two-Stage Approach for Performance Prediction in Subpopulation Shift Scenarios

[openreview] [pdf]

Abstract Subpopulation shift refers to the difference in the distribution of subgroups between training and test datasets. When an underrepresented group becomes predominant during testing, it can lead to significant performance degradation, making performance prediction prior to deployment particularly important. Existing performance prediction methods often fail to address this type of shift effectively due to their usage of unreliable model confidence and mis-specified distributional distances. In this paper, we propose a novel performance prediction method specifically designed to tackle subpopulation shifts, called Subpopulation-Aware Two-stage Estimator (SATE). Our approach first estimates the subgroup proportions in the test set by linearly expressing the test embedding with training subgroup embeddings. Then, it predicts the accuracy for each subgroup using the accuracy on augmented training set, aggregating them into an overall performance estimate. We provide theoretical proof of our method’s unbiasedness and consistency, and demonstrate that it outperforms numerous baselines across various datasets, including vision, medical, and language tasks, offering a reliable tool for performance prediction in scenarios involving subpopulation shifts.

449Does Spatial Cognition Emerge in Frontier Models?

[openreview] [pdf]

Abstract Not yet. We present SPACE, a benchmark that systematically evaluates spatial cognition in frontier models. Our benchmark builds on decades of research in cognitive science. It evaluates large-scale mapping abilities that are brought to bear when an organism traverses physical environments, smaller-scale reasoning about object shapes and layouts, and cognitive infrastructure such as spatial attention and memory. For many tasks, we instantiate parallel presentations via text and images, allowing us to benchmark both large language models and large multimodal models. Results suggest that contemporary frontier models fall short of the spatial intelligence of animals, performing near chance level on a number of classic tests of animal cognition.

450Distribution-Dependent Rates for Multi-Distribution Learning

[openreview] [pdf]

Abstract To address the needs of modeling uncertainty in sensitive machine learning applications, the setup of distributionally robust optimization (DRO) seeks good performance uniformly across a variety of tasks. The recent multi-distribution learning (MDL) framework \cite{pmlr-v195-awasthi23a-open-prob} tackles this objective in a dynamic interaction with the environment, where the learner has sampling access to each target distribution. Drawing inspiration from the field of pure-exploration multi-armed bandits, we provide \textit{distribution-dependent} guarantees in the MDL regime, that scale with suboptimality gaps and result in superior dependence on the sample size when compared to the existing distribution-independent analyses. We investigate two non-adaptive strategies, uniform and non-uniform exploration, and present non-asymptotic regret bounds using novel tools from empirical process theory. Furthermore, we devise an adaptive optimistic algorithm, LCB-DR, that showcases enhanced dependence on the gaps, mirroring the contrast between uniform and optimistic allocation in the multi-armed bandit literature.

451Local Patterns Generalize Better for Novel Anomalies

[openreview] [pdf]

Abstract Video anomaly detection (VAD) aims at identifying novel actions or events which are unseen during training. Existing mainstream VAD techniques typically focus on the global patterns of events but struggle to generalize to novel samples. In this paper, we propose a framework that identifies the local patterns which generalize to novel samples and models the dynamics of local patterns. The capability of extracting spatial local patterns is achieved through a two-stage process involving image-text alignment and cross-modality attention. Generalizable representations are built by focusing on text-informative features that filter out unnecessary visual data variances. To enhance spatial local patterns with temporal clues, we introduce a State Machine Module (SMM) that combines tokens from different moments to improve sentence generation within cross-modality attention. Furthermore, temporal motion estimation complements spatial local patterns to detect anomalies characterized by novel spatial distributions or distinctive dynamics. Extensive experiments on popular benchmark datasets demonstrate the achievement of state-of-the-art performance. Code is available athttps://anonymous.4open.science/r/Local-Patterns-Generalize-Better-1E30/.

452State & Image Guidance: Teaching Old Text-to-Video Diffusion Models New Tricks

[openreview] [pdf]

Abstract Current text-to-video (T2V) models have made significant progress in generating high-quality video. However, these models are limited when it comes to generating dynamic video scenes where the description per frame can vary dramatically. Changing the color, shape, position and state of objects in the scene is a challenge that current video models cannot handle. In addition, the lack of a cheap image-based conditioning mechanism limits their creative application. To address these challenges and extend the applicability of T2V models, we propose two innovative approaches:State GuidanceandImage Guidance.State Guidanceuses advanced guidance mechanisms to control motion dynamics and scene transformation smoothness by navigating the diffusion process between a state triplet <initial state, transition state, final state>. This mechanism enables the generation of dynamic video scenes (Dynamic Scene T2V) and allows to control the speed and the expressiveness of the scene transformation by introducing temporal dynamics via a guidance weight schedule across video frames.Image Guidanceenables Zero-Shot Image-to-Video generation (Zero-Shot I2V) by injecting reference image into the initial diffusion steps noise predictions. Furthermore, the combination ofState GuidanceandImage Guidanceallows for zero-shot transitions between two input reference frames of a video (Zero-Shot II2V). Finally, we introduce the novelDynamic Scene Benchmarkto evaluate the ability of the models to generate dynamic video scenes. Extensive experiments show thatState GuidanceandImage Guidancesuccessfully address the aforementioned challenges and significantly improve the generation capabilities of existing T2V architectures.

453UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations

[openreview] [pdf]

Abstract We address the problem of offline learning a policy that avoids undesirable demonstrations. Unlike conventional offline imitation learning approaches that aim to imitate expert or near-optimal demonstrations, our setting involves avoiding undesirable behavior (specified using undesirable demonstrations). To tackle this problem, unlike standard imitation learning where the aim is to minimize the distance between learning policy and expert demonstrations, we formulate the learning task as maximizing a statistical distance, in the space of state-action stationary distributions, between the learning policy and the undesirable policy. This significantly different approach results in a novel training objective that necessitates a new algorithm to address it. Our algorithm, UNIQ, tackles these challenges by building on the inverse Q-learning framework, framing the learning problem as a cooperative (non-adversarial) task. We then demonstrate how to efficiently leverage unlabeled data for practical training. Our method is evaluated on standard benchmark environments, where it consistently outperforms state-of-the-art baselines.

454The Directionality of Optimization Trajectories in Neural Networks

[openreview] [pdf]

Abstract The regularity or implicit bias in neural network optimization has been typically studied via the parameter norms or the landscape curvature, often overlooking the trajectory leading to these parameters. However, properties of the trajectory --- particularly its directionality --- capture critical aspects of how gradient descent navigates the landscape to converge to a solution. In this work, we introduce the notion of a Trajectory Map and derive natural complexity measures that highlight the directional characteristics of optimization trajectories. Our comprehensive analysis across vision and language modeling tasks reveals that (a) the trajectory’s directionality at the macro-level saturates by the initial phase of training, wherein weight decay and momentum play a crucial but understated role; and (b) in subsequent training, trajectory directionality manifests in micro-level behaviors, such as oscillations, for which we also provide a theoretical analysis. This implies that neural optimization trajectories have, overall, a more linear form than zig-zaggy, as evident by high directional similarity, especially towards the end. To further hone this point, we show that when the trajectory direction gathers such an inertia, optimization proceeds largely unaltered even if the network is severely decapacitated (by freezing >99% of the parameters), --- thereby demonstrating the potential for significant computational and resource savings without compromising performance.

455TWO STAGES DOMAIN INVARIANT REPRESENTATION LEARNERS SOLVE THE LARGE CO-VARIATE SHIFT IN UNSUPERVISED DOMAIN ADAPTATION WITH TWO DIMENSIONAL DATA DOMAINS

[openreview] [pdf]

Abstract Recent developments in the unsupervised domain adaptation (UDA) enable the unsupervised machine learning (ML) prediction for target data, thus this will accelerate real world applications with ML models such as image recognition tasks in self-driving. Researchers have reported the UDA techniques are not working well under large co-variate shift problems where e.g. supervised source data consists of handwritten digits data in monotone color and unsupervised target data colored digits data from the street view. Thus there is a need for a method to resolve co-variate shift and transfer source labelling rules under this dynamics. We perform two stages domain invariant representation learning to bridge the gap between source and target with semantic intermediate data (unsupervised). The proposed method can learn domain invariant features simultaneously between source and intermediate also intermediate and target. Finally this achieves good domain invariant representation between source and target plus task discriminability owing to source labels. This induction for the gradient descent search greatly eases learning convergence in terms of classification performance for target data even when large co-variate shift. We also derive a theorem for measuring the gap between trained models and unsupervised target labelling rules, which is necessary for the free parameters optimization. Finally we demonstrate that proposing method is superiority to previous UDA methods using 4 representative ML classification datasets including 38 UDA tasks. Our experiment will be a basis for challenging UDA problems with large co-variate shift.

456Dual-Branch HNSW Approach with Skip Bridges and LID-Driven Optimization

[openreview] [pdf]

Abstract The Hierarchical Navigable Small World (HNSW) algorithm is widely used for approximate nearest neighbor (ANN) search, leveraging the principles of navigable small-world graphs. However, it faces some limitations. The first is the local optima problem, which arises from the algorithm’s greedy search strategy, selecting neighbors based solely on proximity at each step. This often leads to cluster disconnections. The second limitation is that HNSW frequently fails to achieve logarithmic complexity, particularly in high-dimensional datasets, due to the exhaustive traversal through each layer. To address these limitations, we propose a novel algorithm that mitigates local optima and cluster disconnections while improving inference speed. The first component is a dual-branch HNSW structure with LID-based insertion mechanisms, enabling traversal from multiple directions. This improves outlier node capture, enhances cluster connectivity, and reduces the risk of local minima. The second component introduces a bridge-building technique that adds shortcuts between layers, enabling direct jumps and speeding up inference. Experiments on various benchmarks and datasets showed that our algorithm outperforms the original HNSW in both accuracy and speed. We evaluated six datasets across Computer Vision (CV), deep learning (DL), and Natural Language Processing (NLP), showing improvements of 2.5% in NLP, 15% in DL, and up to 35% in CV tasks. Inference speed is also improved by 12% across all datasets. Ablation studies revealed that LID-based insertion had the greatest impact on performance, followed by the dual-branch structure and bridge-building components.

457Toward Exploratory Inverse Constraint Inference with Generative Diffusion Verifiers

[openreview] [pdf]

Abstract An important prerequisite for safe control is aligning the policy with the underlying constraints in the environment. In many real-world applications, due to the difficulty of manually specifying these constraints, existing works have proposed recovering constraints from expert demonstrations by solving the Inverse Constraint Learning (ICL) problem. However, ICL is inherently ill-posed, as multiple constraints can equivalently explain the experts’ preferences, making the optimal solutions not uniquely identifiable. In this work, instead of focusing solely on a single constraint, we propose the novel approach of Exploratory ICL (ExICL). The goal of ExICL is to recover a diverse set of feasible constraints, thereby providing practitioners the flexibility to select the most appropriate constraint based on the needs of practical deployment. To achieve this goal, we design a generative diffusion verifier, which guides the trajectory generation process using the probabilistic representation of an optimal constrained policy. By comparing these decisions with those made by expert agents, we can efficiently verify a candidate constraint. Driven by the verification feedback, ExICL implements an exploratory constraint update mechanism that strategically facilitates the diversity within the collection of feasible constraints. Our empirical results demonstrate that ExICL can seamlessly and reliably generalize across different tasks and environments.

458DRIVE: Distributional Model-Based Reinforcement Learning via Variational Inference

[openreview] [pdf]

Abstract Distributional reinforcement learning (RL) provides a natural framework for estimating the distribution of returns rather than a single expected value. However, the control aspect of distributional RL has not been as thoroughly explored as the evaluation part, typically relying on the greedy selection rule with respect to either the expected value, akin to standard approaches, or risk-sensitive measures derived from the return distribution. On the other hand, casting RL as a probabilistic inference problem allows for flexible control solutions utilizing a toolbox of approximate inference techniques; however, its connection to distributional RL remains underexplored. In this paper, we bridge this gap by proposing a variational approach for efficient policy search. Our method leverages the log-likelihood of optimality as a learning proxy, decoupling it from traditional value functions. This learning proxy incorporates aleatoric uncertainty of the return distribution, enabling risk-aware decision-making. We provide a theoretical analysis of our framework, detailing the conditions for convergence. Empirical results on vision-based tasks in DMControl Suite demonstrate the effectiveness of our approach compared to various algorithms, as well as its ability to balance exploration and exploitation at different training stages.

459Learn from the Past: Dynamic Data Pruning with Historically Weighted Bernoulli Sampling

[openreview] [pdf]

Abstract Dynamic data pruning, which also known as data importance sampling, has been proposed to improve training efficiency. For the case of sampling with replacement, the optimal sampling distribution to minimize the variance is to sample proportional to the gradient norm, which can be approximated by the gradient norm of the logits from an extra forward pass. However, this could result in repeated samples, which can be an undesirable property. Noticing that most dynamic data pruning methods that avoids repeated samples can be seen as weighted Bernoulli sampling, in this work we study the optimal distribution to reduce its variance. Furthermore, to avoid an extra forward pass, we study the use of historic statistics. We propose the use of exponential moving average and probability smoothing to improve the performance.

460A Simple Baseline for Predicting Future Events with Auto-Regressive Tabular Transformers

[openreview] [pdf]

Abstract Many real-world applications of tabular data involve using historic events to predict properties of new ones, for example whether a credit card transaction is fraudulent or what rating a customer will assign a product on a retail platform. Existing approaches to event prediction include costly, brittle, and application-dependent techniques such as time-aware positional embeddings, learned row and field encodings, and oversampling methods for addressing class imbalance. Moreover, these approaches often assume specific use-cases, for example that we know the labels of all historic events or that we only predict a pre-specified label and not the data’s features themselves. In this work, we propose a simple but flexible baseline using standard autoregressive LLM-style transformers with elementary positional embeddings and a causal language modeling objective. Our baseline outperforms existing approaches across popular datasets and can be employed for various use-cases. We demonstrate that the same model can predict labels, impute missing values, or model event sequences.

461Learning Policy Committees for Effective Personalization in MDPs with Diverse Tasks

[openreview] [pdf]

Abstract Many dynamic decision problems, such as robotic control, involve a series of tasks, many of which are unknown at training time. Typical approaches for these problems, such as multi-task and meta reinforcement learning, do not generalize well when the tasks are diverse. We propose a general framework to address this issue. In our framework, the goal is to learn a set of policies—a policy committee—such that at least one is near-optimal for most tasks that may be encountered at execution time. While we show that even a special case of this problem is inapproximable, we present two effective algorithmic approaches for it. The first of these yields provably approximation guarantees, albeit in small-dimensional settings (the best we can do due to inapproximability), whereas the second is a general and practical gradient-based approach. In addition, we provide provable sample complexity bounds for few-shot learning settings. Our experiments in personalized and multi-task RL settings using MuJoCo and Meta-World benchmarks show that the proposed approach outperforms state-of-the-art multi-task, meta-, and personalized RL baselines on training and test tasks, as well as in few-shot learning, often by a large margin.

462The Phase Transition Phenomenon of Shuffled Regression

[openreview] [pdf]

Abstract We study the phase transition phenomenon inherent in the shuffled (permuted) regression problem, which has found numerous applications in databases, privacy, data analysis, etc. For the permuted regression task: Y=ΠXB\mathbf{Y} = \mathbf{\Pi}\mathbf{X}\mathbf{B}, the goal is to recover the permutation matrix Π\mathbf{\Pi} as well as the coefficient matrix B\mathbf{B}. It has been empirically observed in prior studies that when recovering Π\mathbf{\Pi}, there exists a phase transition phenomenon: the error rate drops to zero rapidly once the parameters reach certain thresholds. In this study, we aim to precisely identify the locations of the phase transition points by leveraging techniques from {\em message passing} (MP).In our analysis, we first transform the permutation recovery problem into a probabilistic graphical model. Then, we leverage the analytical tools rooted in the message passing (MP) algorithm and derive an equation to track the convergence of the MP algorithm. By linking this equation to the branching random walk process, we are able to characterize the impact of the \emph{signal-to-noise-ratio} (snr\mathsf{snr}) on the permutation recovery. Depending on whether the signal is given or not, we separately investigate the oracle case and the non-oracle case. The bottleneck in identifying the phase transition regimes lies in deriving closed-form formulas for the corresponding critical points, but only in rare scenarios can one obtain such precise expressions. To tackle this challenge, we propose the Gaussian approximation method, which allows us to obtain the closed-form formulas in almost all scenarios. In the oracle case, our method can fairly accurately predict the phase transition snr\mathsf{snr}. In the non-oracle case, our proposed algorithm can predict the maximum allowed number of permuted rows and uncover its dependency on the sample number.

463Group Distributionally Robust Dataset Distillation with Risk Minimization

[openreview] [pdf]

Abstract Dataset distillation (DD) has emerged as a widely adopted technique for crafting a synthetic dataset that captures the essential information of a training dataset, facilitating the training of accurate neural models. Its applications span various domains, including transfer learning, federated learning, and neural architecture search. The most popular methods for constructing the synthetic data rely on matching the convergence properties of training the model with the synthetic dataset and the training dataset. However, targeting the training dataset must be thought of as auxiliary in the same sense that the training set is an approximate substitute for the population distribution, and the latter is the data of interest. Yet despite its popularity, an aspect that remains unexplored is the relationship of DD to its generalization, particularly across uncommon subgroups. That is, how can we ensure that a model trained on the synthetic dataset performs well when faced with samples from regions with low population density? Here, the representativeness and coverage of the dataset become salient over the guaranteed training error at inference. Drawing inspiration from distributionally robust optimization, we introduce an algorithm that combines clustering with the minimization of a risk measure on the loss to conduct DD. We provide a theoretical rationale for our approach and demonstrate its effective generalization and robustness across subgroups through numerical experiments.

464Learning the Partially Dynamic Travelling Salesman Problem

[openreview] [pdf]

Abstract Learning to solve the Travelling Salesman Problem (TSP) using Deep Reinforcement Learning (Deep RL) and Graph Neural Networks (GNNs) has shown promising results for small instances of the problem. We demonstrate that these methods can be extended to solve instances of a partially dynamic variant of the TSP. Solving this partially dynamic variant more effectively exploits the strengths of reinforcement learning and also presents challenges for more established methods of solving the TSP. We show the policies trained using Deep RL outperform modified versions of TSP solvers and heuristics for different distributions of dynamic vertices, including on larger instances than the policies were trained on. This shows the promise of Deep RL for solving this type of dynamic routing problem which is predicted to become of great importance as logistical services become more flexible and responsive to customer demand. Furthermore, our method is a general purpose approach to Deep RL where the problem consists of selecting items from a dynamically-evolving and arbitrarily-sized set.

465Value Residual Learning For Alleviating Attention Concentration In Transformers

[openreview] [pdf]

Abstract Transformers can capture long-range dependencies using self-attention, allowing tokens to attend to all others directly. However, stacking multiple attention layers leads to attention concentration. One natural way to address this issue is to use cross-layer attention, allowing information from earlier layers to be directly accessible to later layers. However, this approach is computationally expensive. To address this problem, we propose Transformer with residual value (ResFormer) which approximates cross-layer attention through adding a residual connection from the values of the the first layer to all subsequent layers. Based on this method, one variant is the Transformer with single layer value (SVFormer), where all layers share the same value embedding from first layer, reducing the KVKV cache by nearly 50%. Comprehensive empirical evidence demonstrates that ResFormer mitigates attention concentration problem in deeper layers and enhances representation across most layers, outperforming the vanilla Transformer, DenseFormer, and NeuTRENO in training error as well as downstream tasks. SVFormer trains significantly faster than the vanilla Transformer and performs better than other methods like GQA and CLA, with performance influenced by sequence length and cumulative learning rate.

466SELF-EVOLVED REWARD LEARNING FOR LLMS

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) is a crucial technique for aligning language models with human preferences and is a key factor in the success of modern conversational models like GPT-4, ChatGPT, and Llama 2. A significant challenge in employing RLHF lies in training a reliable RM, which relies on high-quality labels. Typically, these labels are provided by human experts or a stronger AI, both of which can be costly and introduce bias that may affect the language model’s responses. As models improve, human input may become less effective in enhancing their performance. This paper explores the potential of using the RM itself to generate additional training data for a more robust RM. Our experiments demonstrate that reinforcement learning from self-feedback outperforms baseline approaches. We conducted extensive experiments with our approach on multiple datasets, such as HH-RLHF and UltraFeedback, and models including Mistral and Llama 3, comparing it against various baselines. Our results indicate that, even with a limited amount of human-labeled data, learning from self-feedback can robustly enhance the performance of the RM, thereby improving the capabilities of large language models.

467DOPL: Direct Online Preference Learning for Restless Bandits with Preference Feedback

[openreview] [pdf]

Abstract Restless multi-armed bandits (RMAB) has been widely used to model constrained sequential decision making problems, where the state of each restless arm evolves according to a Markov chain and each state transition generates a scalar reward. However, the success of RMAB crucially relies on the availability and quality of reward signals. Unfortunately, specifying an exact reward function in practice can be challenging and even infeasible. In this paper, we introduce Pref-RMAB, a new RMAB model in the presence of preference signals, where the decision maker only observes pairwise preference feedback rather than scalar reward from the activated arms at each decision epoch. Preference feedback, however, arguably contains less information than the scalar reward, which makes Pref-RMAB seemingly more difficult. To address this challenge, we present a direct online preference learning (DOPL) algorithm for Pref-RMAB to efficiently explore the unknown environments, adaptively collect preference data in an online manner, and directly leverage the preference feedback for decision-makings. We prove that DOPL yields a sublinear regret. To our best knowledge, this is the first algorithm to ensure O~(TlnT)\tilde{\mathcal{O}}(\sqrt{T\ln T}) regret for RMAB with preference feedback. Experimental results further demonstrate the effectiveness of DOPL.

468Off-Policy Maximum Entropy RL with Visitation Measures

[openreview] [pdf]

Abstract We introduce a new maximum entropy reinforcement learning framework based on the distribution of states and actions visited by a policy. More precisely, an intrinsic reward function is added to the reward function of the Markov decision process that shall be controlled. For each state and action, this intrinsic reward is the relative entropy of the discounted distribution of states and actions (or features from these states and actions) during the next time steps. We prove that this distribution is the fixed point of a contractive operator. Furthermore, the problem of maximizing the expected discounted sum of these intrinsic rewards is proven to be an approximation of the minimization of an upper bound on the suboptimality gap of the state-action value function of the policy. We finally describe how existing algorithms can integrate these intrinsic rewards to enhance exploration and introduce a practical algorithm for learning this fixed point off-policy, using state-action transitions, relying on N-step bootstrapping of the operator. Empirically, this maximum entropy reinforcement learning framework provides exploration policies with good coverage of the state-action space, and high-performing control policies, which both can be computed off-policy.

469How Discrete and Continuous Diffusion Meet: Comprehensive Analysis of Discrete Diffusion Models via a Stochastic Integral Framework

[openreview] [pdf]

Abstract Discrete diffusion models have gained increasing attention for their ability to model complex distributions with tractable sampling and inference. However, the error analysis for discrete diffusion models remains less well-understood. In this work, we propose a comprehensive framework for the error analysis of discrete diffusion models based on Lévy-type stochastic integrals. By generalizing the Poisson random measure to that with a time-independent and state-dependent intensity, we rigorously establish a stochastic integral formulation of discrete diffusion models and provide the corresponding change of measure theorems that are intriguingly analogous to Itô integrals and Girsanov’s theorem for their continuous counterparts. Our framework unifies and strengthens the current theoretical results on discrete diffusion models and obtains the first error bound for the τ\tau-leaping scheme in KL divergence. With error sources clearly identified, our analysis gives new insight into the mathematical properties of discrete diffusion models and offers guidance for the design of efficient and accurate algorithms for real-world discrete diffusion model applications.

470Only-IF: Revealing the Decisive Effect of Instruction Diversity on Generalization

[openreview] [pdf]

Abstract Understanding and accurately following instructions is critical for large language models (LLMs) to be effective across diverse tasks. In this work, we conduct a rigorous investigation into the factors that enable generalization to unseen instructions. Through controlled experiments, inspired by the Turing-complete Markov algorithm, we demonstrate that such generalization only emerges\textbf{only emerges} when training data is diversified enough across semantic domains. Our findings also reveal that merely diversifying within limited domains fails to ensure robust generalization. In contrast, cross-domain data diversification, even under constrained data budgets, significantly enhances a model’s adaptability. We further extend our analysis to real-world scenarios, including fine-tuning of specialist\textit{\textbf{{specialist}}} and generalist\textit{\textbf{{generalist}}} models. Our research provides important insights for dataset collation, particularly when optimizing model performance by expanding training data for both specialist and generalist scenarios. We show that careful consideration of data diversification is key: training specialist models with data extending beyond their core domain leads to significant performance improvements, while generalist models benefit from diverse data mixtures that enhance their overall instruction-following capabilities across a wide range of applications. . Our results highlight the critical role of strategic diversification and offer clear guidelines for improving data quality.

471Learning Interpretable and Influential Directions with Signal Vectors and Uncertainty Region Alignment

[openreview] [pdf]

Abstract Latent space directions have played a key role in understanding, debugging, and fixing deep learning models. Concepts are often encoded in distinct feature space directions, and evaluating impact of these directions on the model’s predictions, highlights their importance in the decision-making process. Additionally, recent studies have shown that penalizing directions associated with spurious artifacts during training can force models to unlearn features irrelevant to their prediction task. Identifying these directions, therefore, provides numerous benefits, including a deeper understanding of the model’s strategy, fostering trust, and enabling model correction and improvement. We introduce a novel unsupervised approach utilizing signal vectors and uncertainty region alignment to discover latent space directions that meet two key debugging criteria: significant influence on model predictions and high level of interpretability. To our knowledge, this method is the first of its kind to uncover such directions, leveraging the inherent structure of the feature space and the knowledge encoded in the deep network. We validate our approach using both synthetic and real-world benchmarks, demonstrating that the discovered directions effectively fulfill the critical debugging criteria.

472Unlocking Guidance for Discrete State-Space Diffusion and Flow Models

[openreview] [pdf]

Abstract Generative models on discrete state-spaces have a wide range of potential applications, particularly in the domain of natural sciences. In continuous state-spaces, controllable and flexible generation of samples with desired properties has been realized using guidance on diffusion and flow models. However, these guidance approaches are not readily amenable to discrete state-space models. Consequently, we introduce a general and principled method for applying guidance on such models. Our method depends on leveraging continuous-time Markov processes on discrete state-spaces, which unlocks computational tractability for sampling from a desired guided distribution. We demonstrate the utility of our approach, Discrete Guidance, on a range of applications including guided generation of small-molecules, DNA sequences and protein sequences.

473Learning and Steering Game Dynamics Towards Desirable Outcomes

[openreview] [pdf]

Abstract Game dynamics, which describe how agents’ strategies evolve over time based on past interactions, can exhibit a variety of undesirable behaviours, including convergence to suboptimal equilibria, cycling, and chaos. While central planners can employ incentives to mitigate such behaviors and steer game dynamics towards desirable outcomes, the effectiveness of such interventions critically relies on accurately predicting agents’ responses to these incentives---a task made particularly challenging when the underlying dynamics are unknown and observations are limited. To address this challenge, this work introduces the Side Information Assisted Regression with Model Predictive Control (SIAR-MPC) framework. We extend the recently introduced SIAR method to incorporate the effect of control, enabling it to utilize side-information constraints inherent to game theoretic applications to model agent responses to incentives from scarce data. MPC then leverages this model to implement adaptive incentive adjustments. Our experiments demonstrate the efficiency of SIAR-MPC in guiding systems towards socially optimal equilibria, stabilizing chaotic and cycling behaviors. Comparative analyses in data-scarce settings show SIAR-MPC’s superior performance compared to pairing MPC with state-of-the-art alternatives like Sparse Identification of Nonlinear Dynamics (SINDy) and Physics Informed Neural Networks (PINNs).

474Nesterov acceleration in benignly non-convex landscapes

[openreview] [pdf]

Abstract While momentum-based optimization algorithms are commonly used in the notoriously non-convex optimization problems of deep learning, their analysis has historically been restricted to the convex and strongly convex setting. In this article, we partially close this gap between theory and practice and demonstrate that virtually identical guarantees can be obtained in optimization problems with a `benign’ non-convexity. We show that these weaker geometric assumptions are well justified in overparametrized deep learning, at least locally. Variations of this result are obtained for a continuous time model of Nesterov’s accelerated gradient descent algorithm (NAG), the classical discrete time version of NAG, and versions of NAG with stochastic gradient estimates with purely additive noise and with noise that exhibits both additive and multiplicative scaling.

475Toward Efficient Multi-Agent Exploration With Trajectory Entropy Maximization

[openreview] [pdf]

Abstract Recent works have increasingly focused on learning decentralized policies for agents as a solution to the scalability challenges in Multi-Agent Reinforcement Learning (MARL), where agents typically share the parameters of a policy network to make action decisions. However, this parameter sharing can impede efficient exploration, as it may lead to similar behaviors among agents. Different from previous mutual information-based methods that promote multi-agent diversity, we introduce a novel multi-agent exploration method called Trajectory Entropy Exploration (TEE). Our method employs a particle-based entropy estimator to maximize the entropy of different agents’ trajectories in a contrastive trajectory representation space, resulting in diverse trajectories and efficient exploration. This entropy estimator avoids challenging density modeling and scales effectively in high-dimensional multi-agent settings. We integrate our method with MARL algorithms by deploying an intrinsic reward for each agent to encourage entropy maximization. To validate the effectiveness of our method, we test our method in challenging multi-agent tasks from several MARL benchmarks. The results demonstrate that our method consistently outperforms existing state-of-the-art methods.

476Probing the Latent Hierarchical Structure of Data via Diffusion Models

[openreview] [pdf]

Abstract High-dimensional data must be highly structured to be learnable. Although the compositional and hierarchical nature of data is often put forward to explain learnability, quantitative measurements establishing these properties are scarce. Likewise, accessing the latent variables underlying such a data structure remains a challenge. Forward-backward experiments in diffusion-based models, where a datum is noised and then denoised, are a promising tool to achieve these goals. We predict in simple hierarchical models that, in this process, changes in data occur by correlated chunks, with a length scale that diverges at a noise level where a phase transition is known to take place. Remarkably, we confirm this prediction in both text and image datasets using state-of-the-art diffusion models. Our results suggest that forward-backward experiments are informative on the nature of latent variables, and that the effect of changing deeper ones is revealed near the transition.

477TimeDiT: General-purpose Diffusion Transformers for Time Series Foundation Model

[openreview] [pdf]

Abstract With recent advances in building foundation models for text and video data, there is a surge of interest in foundation modeling for time series. Many families of models have been developed utilizing a temporal autoregressive Transformer architecture, whose effectiveness has been proven in Large Language Models (LLMs). However, real-world time series exhibit unique challenges, such as variable channel sizes across domains, missing values, and varying signal sampling intervals due to the multi-resolution nature of real-world data. Additionally, the unidirectional nature of temporally autoregressive decoding typically learns a deterministic mapping relationship and limits the incorporation of domain knowledge, such as physical laws. To address these challenges, we introduce the Time Diffusion Transformer (TimeDiT), a general foundation model for time series that jointly leverages the transformer inductive bias to capture temporal dependencies and the diffusion processes to generate high-quality candidate samples. The proposed mask unit for task-agnostic pretraining and task-specific sampling enables direct processing of multivariate inputs even with missing values or multi-resolution. Furthermore, we introduce a theoretically justified finetuning-free model editing strategy that allows the flexible integration of external knowledge during the sampling process. Extensive experiments conducted on a variety of tasks, such as forecasting, imputation, and anomaly detection highlight TimeDiT’s adaptability as a foundation model, addressing diverse time series challenges and advancing analysis in various fields.

478InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

[openreview] [pdf]

Abstract Data analytics is essential for extracting valuable insights from data that can assist organizations in making effective decisions. We introduce InsightBench, a benchmark dataset with three key features. First, it consists of 100 datasets representing diverse business use cases such as finance and incident management, each accompanied by a carefully curated set of insights planted in the datasets. Second, unlike existing benchmarks focusing on answering single queries, InsightBench evaluates agents based on their ability to perform end-to-end data analytics, including formulating questions, interpreting answers, and generating a summary of insights and actionable steps. Third, we conducted comprehensive quality assurance to ensure that each dataset in the benchmark had clear goals and included relevant and meaningful questions and analysis. Furthermore, we implement a two-way evaluation mechanism using LLaMA-3 as an effective, open-source evaluator to assess agents’ ability to extract insights. We also propose AgentPoirot, our baseline data analysis agent capable of performing end-to-end data analytics. Our evaluation on InsightBench shows that AgentPoirot outperforms existing approaches (such as Pandas Agent) that focus on resolving single queries. We also compare the performance of open- and closed-source LLMs and various evaluation strategies. Overall, this benchmark serves as a testbed to motivate further development in comprehensive automated data analytics

479Noise Prompt Learning: Learning the Winning Tickets for Diffusion Sampling

[openreview] [pdf]

Abstract Text-to-image diffusion model is a popular paradigm that synthesizes personalized images by providing a text prompt and a random Gaussian noise. While people observe that some noises are winning tickets that can achieve better text-image alignment and higher human preference than others, we still lack a machine learning framework to obtain those winning noises. To learn winning noises for diffusion sampling, we mainly make three contributions in this paper. First, we identify a new concept termed the noise prompt\textit{noise prompt}, which aims at turning a random Gaussian noise into a winning noise ticket by adding a small desirable perturbation derived from the text prompt. Following the concept, we first formulate the noise prompt learning\textit{noise prompt learning} framework that systematically learns "prompted’’ winning noise tickets associated with a text prompt for diffusion models. Second, we design a noise prompt data collection pipeline and collect a large-scale noise prompt dataset\textit{noise prompt dataset} (NPD) that contains 100k pairs of random noises and winning noises with the associated text prompts. With the prepared NPD as the training dataset, we trained a small noise prompt network\textit{noise prompt network} (NPNet) that can directly learn to transform a random noise ticket into a winning noise ticket. The learned winning noise perturbation can be considered as a kind of prompt for noise, as it is rich in semantic information and tailored to the given text prompt. Third, our extensive experiments demonstrate the impressive effectiveness and generalization of NPNet on improving the quality of synthesized images across various diffusion models, including SDXL, DreamShaper-xl-v2-turbo, and Hunyuan-DiT. Moreover, NPNet is a small and efficient controller that acts as a plug-and-play module with very limited additional inference and computational costs, as it just provides a winning noise instead of a random noise without accessing the original pipeline.

480WARP: On the Benefits of Weight Averaged Rewarded Policies

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) aligns large language models by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its initialization, though it hinders the reward optimization. To address the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP), merging policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration’s final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with Gemma policies validate that WARP improves their quality and alignment, outperforming open-source models.

481Unifying Back-Propagation and Forward-Forward Algorithms through Model Predictive Control

[openreview] [pdf]

Abstract We introduce a Model Predictive Control (MPC) framework for training deep neural networks, systematically unifying the Back-Propagation (BP) and Forward-Forward (FF) algorithms. At the same time, it gives rise to a range of intermediate training algorithms with varying look-forward horizons, leading to a performance-efficiency trade-off. We perform a precise analysis of this trade-off on a deep linear network, where the qualitative conclusions carry over to general networks. Based on our analysis, we propose a principled method to choose the optimization horizon based on given objectives and model specifications. Numerical results on various models and tasks demonstrate the versatility of our method.

482Minifinetuning: Low-Data Generation Domain Adaptation through Corrective Self-Distillation

[openreview] [pdf]

Abstract Finetuning language models for a new domain inevitably leads to the deterioration of their general performance. This becomes more pronounced the more limited the finetuning data resource.We introduce minifinetuning (MFT), a method for language model domain adaptation that considerably reduces the effects of overfitting-induced degeneralization in low-data settings and which does so in the absence of any pre-training data for replay. MFT demonstrates 2-10x more favourable specialization-to-degeneralization ratios than standard finetuning across a wide range of models and domains and exhibits an intrinsic robustness to overfitting when data in the new domain is scarce and down to as little as 500 samples.Employing corrective self-distillation that is individualized on the sample level, MFT outperforms parameter-efficient finetuning methods, demonstrates replay-like forgetting mitigation properties, and is composable with either for a combined effect.

483Investigating Memorization in Video Diffusion Models

[openreview] [pdf]

Abstract Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference, potentially generating unauthorized copyrighted content. While prior research has focused on image diffusion models (IDMs), video diffusion models (VDMs) remain underexplored. To address this, we introduce new metrics specifically designed to separately assess content and motion memorization in VDMs. By applying these metrics, we systematically analyze memorization in various pretrained VDMs, including text-conditional and unconditional models on various datasets, revealing that memorization is widespread across both video and image datasets. Finally, we propose effective detection strategies for both content and motion memorization, offering a foundational approach for improving privacy in VDMs.

484THE ROBUSTNESS OF DIFFERENTIABLE CAUSAL DISCOVERY IN MISSPECIFIED SCENARIOS

[openreview] [pdf]

Abstract Causal discovery aims to learn causal relationships between variables from targeted data, making it a fundamental task in machine learning. However, causal discovery algorithms often rely on unverifiable causal assumptions, which are usually difficult to satisfy in real-world data, thereby limiting the broad application of causal discovery in practical scenarios. Inspired by these considerations, this work extensively benchmarks the empirical performance of various mainstream causal discovery algorithms, which assume i.i.d. data, under eight model assumption violations. Our experimental results show that differentiable causal discovery methods exhibit counter-intuitive robustness under the metrics of Structural Hamming Distance and Structural Intervention Distance of the inferred graphs in challenging scenarios, except for scale variation. We also provide the theoretical explanations for the performance of differentiable causal discovery methods. Finally, our work aims to comprehensively benchmark the performance of recent differentiable causal discovery methods under model assumption violations, and provide the standard for reasonable evaluation of causal discovery, as well as to further promote its application in real-world scenarios.

485ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

[openreview] [pdf]

Abstract Reward shaping is a critical component in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. While shaping rewards have been introduced to provide additional guidance, selecting effective shaping functions remains challenging and computationally expensive. This paper introduces Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames shaping reward selection as an online model selection problem. ORSO employs principled exploration strategies to automatically identify promising shaping reward functions without human intervention, balancing exploration and exploitation with provable regret guarantees. We demonstrate ORSO’s effectiveness across various continuous control tasks using the Isaac Gym simulator. Compared to traditional methods that fully evaluate each shaping reward function, ORSO significantly improves sample efficiency, reduces computational time, and consistently identifies high-quality reward functions that produce policies comparable to those generated by domain experts through hand-engineered rewards.

486The Hidden Cost of Waiting for Accurate Predictions

[openreview] [pdf]

Abstract Algorithmic predictions are increasingly informing societal resource allocations by identifying individuals for targeting. Policymakers often build these systems with the assumption that by gathering more observations on individuals, they can improve predictive accuracy and, consequently, allocation efficiency. An overlooked yet consequential aspect of prediction-driven allocations is that of timing. The planner has to trade off relying on earlier and potentially noisier predictions to intervene before individuals experience undesirable outcomes, or they may wait to gather more observations to make more precise allocations. We examine this tension using a simple mathematical model, where the planner collects observations on individuals to improve predictions over time. We analyze both the ranking induced by these predictions and optimal resource allocation. We show that though individual prediction accuracy may improve over time, counter-intuitively, the average ranking loss can worsen. As a result, the planner’s ability to improve social welfare can decline. We identify inequality as a driving factor behind this phenomenon. Our findings provide a nuanced perspective and challenge the conversational wisdom that it is preferable to wait for more accurate predictions to ensure the most efficient allocations.

487Stable batched bandit: Optimal regret with free inference

[openreview] [pdf]

Abstract In this paper, we discuss statistical inference when using a sequential strategy to collect data. While inferential tasks become challenging with sequentially collected data, we argue that this problem can be alleviated when the sequential algorithm satisfies certain stability properties; we call such algorithms stable bandit algorithms. Focusing on batched bandit problems, we first demonstrate that popular algorithms including the greedy-UCB algorithm and ϵ\epsilon-greedy ETC algorithms are not stable, complicating downstream inferential tasks. Our main result shows that a form of elimination algorithm is stable in the batched bandit setup, and we characterize the asymptotic distribution of the sample means. This result allows us to construct asymptotically exact confidence intervals for arm-means which are sharper than existing concentration-based bounds. As a byproduct of our main results, we propose an Explore and Commit (ETC) strategy, which is stable --- thus allowing easy statistical inference--- and also attains optimal regret up to a factor of 4.Our work connects two historically conflicting paradigms in sequential learning environments: regret minimization and statistical inference. Ultimately, we demonstrate that it is possible to minimize regret without sacrificing the ease of performing statistical inference, bridging the gap between these two important aspects of sequential decision-making.

488Secure Diffusion Model Unlocked: Efficient Inference via Score Distillation

[openreview] [pdf]

Abstract As services based on diffusion models expand across various domains, preserving the privacy of client data becomes more critical. Fully homomorphic encryption and secure multi-party computation have been employed for privacy-preserving inference, but these methods are computationally expensive and primarily work for linear computations, making them challenging to apply to large diffusion models. While homomorphic encryption has been recently applied to diffusion models, it falls short of fully safeguarding privacy, as inputs used in the ϵ\epsilon prediction are not encrypted. In this paper, we propose a novel framework for private inference for both inputs and outputs. To ensure robust approximations, we introduce several techniques for handling non-linear operations. Additionally, to reduce latency, we curtail the number of denoising steps while minimizing performance degradation of conditional generation through score distillation from the unconditional generation of the original model with full denoising steps. Experimental results show that our model produces high-quality images comparable to the original, and the proposed score distillation significantly enhances performance, compensating for fewer steps and approximation errors.

489A Simple Approach to Unifying Diffusion-based Conditional Generation

[openreview] [pdf]

Abstract Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation. Instead of proposing another specialized technique, we introduce a simple, unified framework to handle diverse conditional generation tasks involving a specific image-condition correlation. By learning a joint distribution over a correlated image pair (e.g. image and depth) with a diffusion model, our approach enables versatile capabilities via different inference-time sampling schemes, including controllable image generation (e.g. depth to image), estimation (e.g. image to depth), signal guidance, joint generation (image & depth), and coarse control. Previous attempts at unification often introduce complexity through multi-stage training, architectural modification, or increased parameter counts. In contrast, our simplified formulation requires a single, computationally efficient training stage, maintains the standard model input, and adds minimal learned parameters (15% of the base model). Moreover, our model supports additional capabilities like non-spatially aligned and coarse conditioning. Extensive results show that our single model can produce comparable results with specialized methods and better results than prior unified methods. We also demonstrate that multiple models can be effectively combined for multi-signal conditional generation.

490Retrieval Augmented Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting uses historical data to predict future trends, leveraging the relationships between past observations and available features. In this paper, we propose, RAFT, a retrieval-augmented time series forecasting method to provide sufficient inductive biases and complement the model’s learning capacity. When forecasting the subsequent time frames, we directly retrieve historical data candidates from the training dataset with patterns most similar to the input, and utilize the future values of these candidates alongside the inputs to obtain predictions. This simple approach augments the model’s capacity by externally providing information about past patterns via retrieval modules. Our empirical evaluations on eight benchmark datasets show that RAFT consistently outperforms contemporary baselines, an average win ratio of 86% for multivariate forecasting and 80% for univariate forecasting tasks.

491Subject Information Extraction for Novelty Detection with Domain Shifts

[openreview] [pdf]

Abstract Unsupervised novelty detection (UND), aimed at identifying novel samples, is essential in fields like medical diagnosis, cybersecurity, and industrial quality control. Most existing UND methods assume that the training data and testing normal data originate from the same domain and only consider the distribution variation between training data and testing data. However, in real scenarios, it is common for normal testing and training data to originate from different domains, a challenge known as domain shift. The discrepancies between training and testing data often lead to incorrect classification of normal data as novel by existing methods. A typical situation is that testing normal data and training data describe the same subject, yet they differ in the background conditions. To address this problem, we introduce a novel method that separates subject information from background variation encapsulating the domain information to enhance detection performance under domain shifts. The proposed method minimizes the mutual information between the representations of the subject and background while modelling the background variation using a deep Gaussian mixture model, where the novelty detection is conducted on the subject representations solely and hence is not affected by the variation of domains. Extensive experiments demonstrate that our model generalizes effectively to unseen domains and significantly outperforms baseline methods, especially under substantial domain shifts between training and testing data.

492Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation

[openreview] [pdf]

Abstract As Large Language Models (LLMs) scale up and gain powerful Chain-of-Thoughts (CoTs) reasoning abilities, practical resource constraints drive efforts to distill these capabilities into more compact Smaller Language Models (SLMs). We find that CoTs consist mainly of simple reasoning forms, with a small proportion (~4.7%) of key reasoning steps that truly impact conclusions. However, previous distillation methods typically involve supervised fine-tuning student SLMs only on correct CoTs data produced by teacher LLMs, resulting in students struggling to learn the key reasoning steps, instead imitating the teacher’s reasoning forms and making errors or omissions on these steps. To address these issues, drawing an analogy to human learning, where analyzing mistakes according to correct solutions often reveals the crucial steps leading to successes or failures, we propose mistak\textbf{E}-\textbf{D}riven key reason\textbf{I}ng step distilla\textbf{T}ion (\textbf{EDIT}), a novel method that further aids SLMs learning key reasoning steps rather than mere simple fine-tuning. Firstly, to expose these crucial steps in CoTs, we design specific prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions. Then, we apply the minimum edit distance algorithm on the dual CoTs data to locate these key steps and optimize the likelihood of these steps. Extensive experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets. Further analysis shows that EDIT can generate high-quality CoTs with more correct key reasoning steps. Notably, we also explore how different mistake patterns affect performance and find that EDIT benefits more from logical errors than from knowledge or mathematical calculation errors in dual CoTs. Code can be found athttps://anonymous.4open.science/r/eb77sh-F564

493Learning Time-shared Hidden Heterogeneity for Counterfactual Outcome Forecast

[openreview] [pdf]

Abstract Forecasting counterfactual outcome in the longitudinal setting can be critical for many time-related applications. To solve this problem, the previous works propose to apply different sequence models including long short-term memory (LSTM) networks and transformers to model the relationship between the observed histories, treatments and outcomes, and apply various approaches to remove treatment selection bias. However, these methods neglect the hidden heterogeneity of outcome generation among samples induced by hidden factors which can bring hurdles to counterfactual outcome forecast. To alleviate this problem, we capture the hidden heterogeneity by recovering the hidden factors and incorporate it into the outcome prediction process. Specifically, we propose a Time-shared Heterogeneity Learning from Time Series (THLTS) method which infers the shared part of hidden factors characterizing the heterogeneity across time steps with the architecture of variational encoders (VAE). This method can be a flexible component and combined with arbitrary counterfactual outcome forecast method. Experimental results on (semi-)synthetic datasets demonstrate that combined with our method, the mainstream models can improve their performance.

494Understanding Distribution Alignment Through Category Separability In An Infant-Inspired Domain Adaptation Task

[openreview] [pdf]

Abstract We introduce a novel distribution shift considering the tradeoff between object instances and viewpoints occurring in human and embodied visual experience; we study this problem through the lens of domain adaptation. We show that the performance of a well-known domain adaptation method, Joint Adaptation Network (JAN), deteriorates in the absence of ImageNet pretraining. We hypothesize that the separability of source and target category clusters in the feature space plays a crucial role in the effectiveness of JAN. To this end, we propose 3 metrics to measure category separability in the feature space and show that separability in the pretrained network is strongly correlated with downstream JAN accuracy. Further, we propose two novel loss functions increasing target separability by aligning the distribution of within-domain pairwise distances between the source and target cluster. Our experiments show that the application of these loss functions improves downstream performance on the test set.

495DisCoNet: Rethinking Adversarial Networks for Discriminator-Driven Distribution Modeling

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection holds significant importance across various applications. While semantic and domain-shift OOD problems are well-documented, this work focuses on the nuances of covariate shifts, which entail subtle perturbations or variations in the data distribution. These disturbances have proven to negatively impact machine learning performance. We have found that existing OOD detection methods often struggle to effectively distinguish covariate shifts from in-distribution instances, emphasizing the need for specialized solutions. Therefore, we propose DisCoNet, an Adversarial Variational Autoencoder (VAE) that rethinks the Generative Adversarial Networks paradigm. Instead of prioritizing the generator as the network’s core, we focus on the discriminator, using the generator as a supporting training tool. DisCoNet uses the VAE’s suboptimal outputs as negative samples to train the discriminator, thereby improving its ability to delineate the boundary between in-distribution samples and covariate shifts. By tightening this in-distribution boundary, DisCoNet achieves state-of-the-art results in public OOD detection benchmarks. The proposed model not only excels in detecting covariate shifts, achieving 98.9% AUROC on ImageNet-1K(-C), but also outperforms all prior methods on public semantic OOD benchmarks. With a model size of \leq 25MB, it is highly effective on Far-OOD (OpenImage-O (99.4%) and iNaturalist (100.0%)) and Near-OOD (SSB-hard (99.9%) and NINCO (99.7%)) detection. The code will be made publicly available.

496Unveiling the Secret of AdaLN-Zero in Diffusion Transformer

[openreview] [pdf]

Abstract Diffusion transformer (DiT), a rapidly emerging architecture for image generation, has gained much attention. However, despite ongoing efforts to improve its performance, the understanding of DiT remains superficial. In this work, we delve into and investigate a critical conditioning mechanism within DiT, adaLN-Zero, which achieves superior performance compared to adaLN. Our work studies three potential elements driving this performance, including an SE-like structure, zero-initialization, and a “gradual” update order, among which zero-initialization is proved to be the most influential. Building on this insight, we heuristically leverage Gaussian distributions to initialize each condition modulation, termed adaLN-Gaussian, leading to more stable and effective training. Extensive experiments following DiT on ImageNet1K demonstrate the effectiveness and generalization of adaLN-Gaussian, e.g., a notable improvement of 2.16% in FID score over adaLN-Zero.

497Safety Alignment Shouldn’t Be Complicated

[openreview] [pdf]

Abstract As large language models (LLMs) are overwhelmingly more and more integrated into various applications, ensuring they generate safe and aligned responses is a pressing need. Previous research on alignment has largely focused on general instruction-following but has often overlooked the unique properties and challenges of safety alignment, such as the brittleness of safety mechanisms. To bridge the gap, we propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction - interpreted as a specialized binary classification task - and incorporate a refusal mechanism with multiple reserved fallback options. Furthermore, through SSAH, we hypothesize that safety guardrails in LLMs can be established by just a small number of essential components. To verify this, we conduct an ablation study and successfully identify four types of attribute-critical components in safety-aligned LLMs: Exclusive Safety Unit (ESU), Exclusive Utility Unit (EUU), Complex Unit (CU), and Redundant Unit (RU). Our findings show that freezing certain safety-critical components \textbf{(7.5%)} during fine-tuning allows the model to retain its safety attributes while adapting to new tasks. Additionally, we show that leveraging redundant units \textbf{(20%)} in the pre-trained model as an ``alignment budget’’ can effectively minimize the alignment tax while achieving the alignment goal. All considered, this paper concludes that the atomic functional unit for safety in LLMs is at the neuron level and underscores that safety alignment should not be complicated. We believe this work contributes to the foundation of efficient and scalable safety alignment for future LLMs.

498Dialogue Action Tokens: Steering Language Models in Goal-Directed Dialogue with a Multi-Turn Planner

[openreview] [pdf]

Abstract We present an approach called Dialogue Action Tokens (DAT) that adapts language model agents to plan goal-directed dialogues. The core idea is to treat each utterance as an action, thereby converting dialogues into games where existing approaches such as reinforcement learning can be applied. Specifically, we freeze a pretrained language model and train a small planner model that predicts a continuous action vector, used for controlled generation in each round. This design avoids the problem of language degradation under reward optimization. When evaluated on the Sotopia platform for social simulations, the DAT-steered LLaMA model surpasses GPT-4’s performance. We also apply DAT to steer an attacker language model in a novel multi-turn red-teaming setting, revealing a potential new attack surface.

499Never Forget the Basics: In-distribution Knowledge Retention for Continual Test-time Adaptation in Human Motion Prediction

[openreview] [pdf]

Abstract This paper presents a novel approach to addressing the underexplored challenge of human pose prediction in dynamic target domains that simultaneously contain in-distribution (ID) and out-of-distribution (OOD) data. Existing test-time adaptation (TTA) techniques predominantly focus on OOD data, neglecting the fact that ID data, which closely resembles the training distribution, is often encountered during real-world deployment, leading to significant degradation in ID performance. To address this, we introduce In-Distribution Knowledge Retention (IDKR), a continual TTA framework designed to preserve critical knowledge about ID data while adapting to unseen OOD sequences. Our method introduces an ID-informative subgraph learning strategy that leverages the structural characteristics of human skeletal data to compute a structural graph Fisher Information Matrix (SG-FIM). Unlike prior work, IDKR simultaneously considers both node and edge features in the skeletal graph, with edge features, representing the invariant bone lengths between parent-child joint pairs, being essential for maintaining structural consistency across poses. These edge features are key to extracting reliable SG-FIM parameters, enabling the model to retain parameters critical for ID performance while selectively updating those needed for OOD adaptation. Extensive experiments on multiple benchmark datasets demonstrate that IDKR consistently outperforms state-of-the-art methods, particularly in scenarios involving mixed ID and OOD data, setting a new standard for robust human pose prediction in dynamic environments.

500Flexible Active Learning of PDE Trajectories

[openreview] [pdf]

Abstract Accurately solving partial differential equations (PDEs) is critical for understanding complex scientific and engineering phenomena, yet traditional numerical solvers are computationally expensive. Surrogate models offer a more efficient alternative, but their development is hindered by the cost of generating sufficient ground-truth data from numerical simulations. In this paper, we present a novel framework for active learning (AL) in PDE surrogate modeling that reduces the data acquisition cost and improves model accuracy. Unlike the existing AL methods for PDEs that always acquire entire PDE trajectories, our approach strategically queries only a subset of the time steps from a numerical solver along a trajectory, while employing a surrogate model to approximate values for the remaining steps. This dramatically reduces the cost of data acquisition, which is proportional to the number of time steps simulated by the numerical solver, and thus allows the active learning algorithm to try out a more diverse set of trajectories given the same computational budget. To accommodate this novel framework, we develop an acquisition function that estimates the utility of a set of time steps by approximating its resulting variance reduction. We demonstrate the effectiveness of our method on several benchmark PDEs, including the Heat equation, Korteweg–De Vries equation, Kuramoto–Sivashinsky equation, and the incompressible Navier-Stokes equation. Extensive experiments validate that our approach outperforms existing methods, offering a cost-efficient solution to surrogate modeling for PDEs.

501Learning a Fast Mixing Exogenous Block MDP using a Single Trajectory

[openreview] [pdf]

Abstract In order to train agents that can quickly adapt to new objectives or reward functions, efficient unsupervised representation learning in sequential decision-making environments can be important. Frameworks such as the Exogenous Block Markov Decision Process (Ex-BMDP) have been proposed to formalize this representation-learning problem (Efroni et al., 2022b). In the Ex-BMDP framework, the agent’s high-dimensional observations of the environment have two latent factors: a controllable factor, which evolves deterministically within a small state space according to the agent’s actions, and an exogenous factor, which represents time-correlated noise, and can be highly complex. The goal of the representation learning problem is to learn an encoder that maps from observations into the controllable latent space, as well as the dynamics of this space. Efroni et al. (2022b) has shown that this is possible with a sample complexity that depends only on the size of the controllable latent space, and not on the size of the noise factor. However, this prior work has focused on the episodic setting, where the controllable latent state resets to a specific start state after a finite horizon.By contrast, if the agent can only interact with the environment in a single continuous trajectory, prior works have not established sample-complexity bounds. We propose STEEL, the first provably sample-efficient algorithm for learning the controllable dynamics of an Ex-BMDP from a single trajectory, in the function approximation setting. STEEL has a sample complexity that depends only on the sizes of the controllable latent space and the encoder function class, and (at worst linearly) on the mixing time of the exogenous noise factor. We prove that STEEL is correct and sample-efficient, and demonstrate STEEL on two toy problems.

502Diffusion Process with Implicit Latents via Energy Models

[openreview] [pdf]

Abstract We present a generative model based on an ordered sequence of latent variables for intermediate distributions between a given source and a desired target distribution. We construct the probabilistic transitions among the latent variables using energy models that are in the form of classifiers. In our work, the intermediate transitional distributions are implicitly defined by the energy models during training, where the statistical properties of the data distribution are naturally taken into account. This is in contrast to denoising diffusion probabilistic models (DDPMs) where they are explicitly defined by the predefined scheduling of a sequential noise degradation process. Over the course of training, our model is designed to optimally determine the intermediate distributions by Langevin dynamics driven by the energy model. In contrast, energy-based models (EBMs) typically require an additional generator since the intermediate distributions are not explicitly defined in the training procedure. We demonstrate the effectiveness and efficiency of the proposed algorithm in the context of image generation, achieving high fidelity results with less inference steps on a variety of datasets.

503SEMDICE: Off-policy State Entropy Maximization via Stationary Distribution Correction Estimation

[openreview] [pdf]

Abstract In the unsupervised pre-training for reinforcement learning, the agent aims to learn a prior policy for downstream tasks without relying on task-specific reward functions. We focus on state entropy maximization (SEM), where the goal is to learn a policy that maximizes the entropy of the state’s stationary distribution. In this paper, we introduce SEMDICE, a principled off-policy algorithm that computes an SEM policy from an arbitrary off-policy dataset, which optimizes the policy directly within the space of stationary distributions. SEMDICE computes a single, stationary Markov state-entropy-maximizing policy from an arbitrary off-policy dataset. Experimental results demonstrate that SEMDICE outperforms baseline algorithms in maximizing state entropy while achieving the best adaptation efficiency for downstream tasks among SEM-based unsupervised RL pre-training methods.

504The Crucial Role of Samplers in Online Direct Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, theoptimizationproperties, particularly the impact of samplers on its convergence rates, remain underexplored. In this paper, we provide a rigorous analysis of DPO’sconvergence rateswith different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieveslinearconvergence, while our proposed online sampler achievesquadraticconvergence. We further adapt the sampler to practical settings by incorporating posterior distributions andlogit mixing, demonstrating significant improvements over previous approaches. On Safe-RLHF dataset, our method exhibits a 4.5% improvement over vanilla DPO and a 3.0% improvement over on-policy DPO; on Iterative-Prompt, our approach outperforms vanilla DPO, on-policy DPO, and Hybrid GSHF by over 4.2%. Our results not only offer insights into the theoretical standing of DPO but also pave the way for potential algorithm designs in the future.

505From discrete-time policies to continuous-time diffusion samplers: Asymptotic equivalences and faster training

[openreview] [pdf]

Abstract We study the problem of training neural stochastic differential equations, or diffusion models, to sample from a Boltzmann distribution without access to target samples. Existing methods for training such models enforce time-reversal of the generative and noising processes, using either differentiable simulation or off-policy reinforcement learning (RL). We prove equivalences between families of objectives in the limit of infinitesimal discretization steps, linking entropic RL methods (GFlowNets) with continuous-time objects (partial differential equations and path space measures). We further show that an appropriate choice of coarse time discretization during training allows greatly improved sample efficiency and the use of time-local objectives, achieving competitive performance on standard sampling benchmarks with reduced computational cost.

506SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

[openreview] [pdf]

Abstract As large language models (LLMs) continue to advance and find applications across a growing number of fields, ensuring the safety of LLMs has become increasingly critical. To address safety concerns, recent studies have proposed integrating safety constraints into reinforcement learning from human feedback (RLHF). However, these approaches tend to be complex and often unstable, as they encompass complicated procedures in RLHF along with additional procedures required by the safety constraints. Inspired by direct preference optimization (DPO), we introduce a new algorithm called \textit{SafeDPO}, which is designed to implicitly optimize the safety alignment objective within a single stage of policy learning. The resulting algorithm can be implemented by introducing only one additional hyperparameter, which aims to further enhance safety, along with minor modifications to the DPO implementation. Consequently, SafeDPO successfully eliminates the necessity of fitting a reward and a cost model, as well as sampling from the language model during fine-tuning, while still enhancing the safety of LLMs. Finally, we demonstrate that SafeDPO achieves competitive performance compared to the current state-of-the-art safety alignment algorithm, both in terms of aligning with human preferences and improving safety.

507Towards Understanding the Universality of Transformers for Next-Token Prediction

[openreview] [pdf]

Abstract Causal Transformers are trained to predict the next token for a given context. While it is widely accepted that self-attention is crucial for encoding the causal structure of sequences, the precise underlying mechanism behind this in-context autoregressive learning ability remains unclear. In this paper, we take a step towards understanding this phenomenon by studying the approximation ability of Transformers for next-token prediction. Specifically, we explore the capacity of causal Transformers to predict the next token xt+1x_{t+1} given an autoregressive sequence (x1,,xt)(x_1, \dots, x_t) as a prompt, where xt+1=f(xt) x_{t+1} = f(x_t) , and f f is a context-dependent function that varies with each sequence. On the theoretical side, we focus on specific instances, namely when f f is linear or when (xt) (x_t) is periodic. We explicitly construct a Transformer (with linear, exponential, or softmax attention) that learns the mapping ff in-context through a causal kernel descent method. The causal kernel descent method we propose provably estimates xt+1x_{t+1} based solely on past and current observations (x1,,xt) (x_1, \dots, x_t) , with connections to the Kaczmarz algorithm in Hilbert spaces. We present experimental results that validate our theoretical findings and suggest their applicability to more general mappings ff.

508Exploration in the Face of Strategic Responses: Provable Learning of Online Stackelberg Games

[openreview] [pdf]

Abstract We study online leader-follower games where the leader interacts with a myopic follower using a quantal response policy. The leader’s objective is to design an algorithm without prior knowledge of her reward function or the state transition dynamics. Crucially, the leader also lacks insight into the follower’s reward function and realized rewards, posing a significant challenge. To address this, the leader must learn the follower’s quantal response mapping solely through strategic interactions --- announcing policies and observing responses. We introduce a unified algorithm, Planning after Estimation, which updates the leader’s policies in a two-step approach.In particular, we first jointly estimate the leader’s value function and the follower’s response mapping by maximizing a sum of the Bellman error of the value function, the likelihood of the quantal response model, and a regularization term that encourages exploration. The leader’s policy is then updated through a greedy planning step based on these estimates. Our algorithm achieves a T\sqrt{T}-regret in the context of general function approximation. Moroever, this algorithm avoids the intractable optimistic planning and thus enhances implementation simplicity.

509Cross-Domain Off-Policy Evaluation and Learning for Contextual Bandits

[openreview] [pdf]

Abstract Off-Policy Evaluation and Learning (OPE/L) in contextual bandits is rapidly gaining popularity in real systems because new policies can be evaluated and learned securely using only historical logged data. However, existing methods in OPE/L cannot handle many challenging but prevalent scenarios such as few-shot data, deterministic logging policies, and new actions. In many applications, such as personalized medicine, content recommendations, education, and advertising, we need to evaluate and learn new policies in the presence of these challenges. Existing methods cannot evaluate and optimize effectively in these situations due to the notorious variance issue or limited exploration in the logged data. To enable OPE/L even under these unsolved challenges, we propose a new problem setup of Cross-Domain OPE/L, where we have access not only to the logged data from the target domain in which the new policy will be implemented but also to logged datasets collected from other domains. This novel formulation is widely applicable because we can often use historical data not only from the target hospital, country, device, or user segment but also from other hospitals, countries, devices, or segments. We develop a new estimator and policy gradient method to solve OPE/L by leveraging both target and source datasets, resulting in substantially enhanced OPE/L in the previously unsolved situations in our empirical evaluations.

510Graph Concept Bottleneck Models

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) provide explicit interpretations for deep neural networks through concepts and allow intervention with concepts to adjust final predictions. Existing CBMs assume concepts are conditionally independent given labels and isolated from each other, ignoring the hidden relationships among concepts. However, the set of concepts in CBMs often has an intrinsic structure where concepts are generally correlated: changing one concept will inherently impact its related concepts. To mitigate this limitation, we proposeGraph CBMs: a new variant of CBM that facilitates concept relationships by constructing latent concept graphs, which can be combined with CBMs to enhance model performance while retaining their interpretability. Empirical results on real-world image classification tasks demonstrate Graph CBMs are (1) superior in image classification tasks while providing more concept structure information for interpretability; (2) able to utilize concept graphs for more effective interventions; and (3) robust across different training and architecture settings.

511Series-to-Series Diffusion Bridge Model

[openreview] [pdf]

Abstract Diffusion models have risen to prominence in time series forecasting, showcasing their robust capability to model complex data distributions. However, their effectiveness in deterministic predictions is often constrained by instability arising from their inherent stochasticity. In this paper, we revisit time series diffusion models and present a comprehensive framework that encompasses most existing diffusion-based methods. Building on this theoretical foundation, we propose a novel diffusion-based time series forecasting model, the Series-to-Series Diffusion Bridge Model (S2DBM\mathrm{S^2DBM}), which leverages the Brownian Bridge process to reduce randomness in reverse estimations and improves accuracy by incorporating informative priors and conditions derived from historical time series data. Experimental results demonstrate that S2DBM\mathrm{S^2DBM} delivers superior performance in point-to-point forecasting and competes effectively with other diffusion-based models in probabilistic forecasting.

512Flow Tree: A dynamic model for navigation paths and strategies

[openreview] [pdf]

Abstract Navigation is a dynamic process that involves learning how to represent the environment, along with positions in and trajectories through it. Spatial navigation skills vary significantly among individual humans. But what exactly differentiates a good navigator from a bad one, or an easy-to-navigate path from a hard one, is not well understood. Several studies have analysed exploration and navigation behaviour using static quantitative measures, like counts of positions visited or distance travelled. These static measures, however, are inherently limited in their ability to describe dynamic behaviors, providing a coarse quantification of the navigation process. To fill this gap, we introduce the \emph{Flow Tree}, a novel data structure, which quantifies the dynamics of a group of trajectories through time. This is a discrete adaptation of the Reeb graph, a mathematical structure from topology, computed from multiple trajectories (from different people or the same person over time). Each divergence in trajectory is captured as a node, encoding the variability of the collection of trajectories. A Flow Tree encodes how difficult it will be to navigate a certain path for a group of humans. We apply the Flow Tree to a behavioural dataset of 100 humans exploring and then navigating a small, closed-form maze in virtual reality. In this paper we (1) describe what a Flow Tree is and how to calculate it, (2) show that Flow Trees can be used to predict path difficulty more effectively than static metrics, and (3) demonstrate that a trajectory through the Flow Tree is predictive of that individual’s success. We (4) introduce a hypothesis testing framework over Flow Trees to quantitatively differentiate between the strategies of the best navigators from those of worst. Thus, we show that Flow Trees are a powerful tool to analyse dynamic trajectory data.\footnote{The code will be made publicly available at [anon-github-link].}

513Oracle efficient truncated statistics

[openreview] [pdf]

Abstract We study the problem of learning from truncated samples: instead of observing samples from some underlying population pp^\ast, we observe only the examples that fall in some survival set SRdS \subset \mathbb{R}^d whose probability mass (measured with respect to pp^\ast) is at least α\alpha. Assuming membership oracle access to the truncation set SS, prior works obtained algorithms for the case where pp^\ast is Gaussian or more generally an exponential family with strongly convex likelihood --- albeit with a super-polynomial dependency on the (inverse) survival mass 1/α1/\alpha both in terms of runtime and in number of oracle calls to the set SS. In this work we design a new learning method with runtime and query complexity polynomial in 1/α1/\alpha.Our result significantly improves over the prior works by focusing on efficiently solving the underlying optimization problem using a general purpose optimization algorithm with minimal assumptions.

514Model Collapse in the Chain of Diffusion Finetuning: A Novel Perspective from Quantitative Trait Modeling

[openreview] [pdf]

Abstract The success of generative models has reached a unique threshold where their outputs are indistinguishable from real data, leading to the inevitable contamination of future data collection pipelines with synthetic data. While their potential to generate infinite samples initially offers promise for reducing data collection costs and addressing challenges in data-scarce fields, the severe degradation in performance has been observed when iterative loops of training and generation occur---known as ‘‘model collapse.’’ This paper explores a practical scenario in which a pretrained text-to-image diffusion model is finetuned using synthetic images generated from a previous iteration, a process we refer to as the ‘‘Chain of Diffusion.’’ We first demonstrate the significant degradation in image qualities caused by this iterative process and identify the key factor driving this decline through rigorous empirical investigations. Drawing on an analogy between the Chain of Diffusion and biological evolution, we then introduce a novel theoretical analysis based on quantitative trait modeling. Our theoretical analysis aligns with empirical observations of the generated images in the Chain of Diffusion. Finally, we propose Reusable Diffusion Finetuning (ReDiFine), a simple yet effective strategy inspired by genetic mutations. ReDiFine mitigates model collapse without requiring any hyperparameter tuning, making it a plug-and-play solution for reusable image generation.

515Distilling Reinforcement Learning Algorithms for In-Context Model-Based Planning

[openreview] [pdf]

Abstract Recent studies have demonstrated that Transformers can perform in-context reinforcement learning (RL) by imitating a source RL algorithm. This enables them to adapt to new tasks in a sample-efficient manner without parameter updates. However, since the Transformers are trained to mimic the source algorithm, they also reproduce its suboptimal behaviors. Model-based planning offers a promising solution to this limitation by allowing the agents to simulate potential outcomes before taking action, providing an additional mechanism to deviate from the source algorithm’s behavior. Rather than learning a separate dynamics model, we propose Distillation for In-Context Planning (DICP), an in-context model-based RL framework where the Transformer simultaneously learns environment dynamics and improves policy in-context. With experiments across a diverse set of discrete and continuous environments such as Darkroom variants and Meta-World, we show that this method achieves state-of-the-art performance, requiring significantly fewer environmental interactions than the baselines including both in-context model-free counterparts and existing meta-RL methods.

516Exploring Complex Trade-offs in Information Bottleneck through Multi-Objective Optimization

[openreview] [pdf]

Abstract Information Bottleneck (IB) theory provides a principled approach to analyze and optimize how neural networks extract and learn latent representations from data, aiming to enhance network performance and generalization. The IB framework has been applied and validated across various domains in deep learning. However, most studies employing IB require tuning of Lagrange multipliers to balance compression and prediction during optimization. Finding the optimal Lagrange multiplier β\beta to achieve the best balance between compression and prediction is challenging, relying heavily on empirical tuning and potentially failing to capture the complex trade-offs present within the IB paradigm. In this paper, we redefine the IB problem as a multi-objective optimization problem with respect to compression and prediction objectives. We employ a gradient-based multi-objective optimization algorithm that adaptively determines the weights for this optimization challenge. Our method is demonstrated to automatically find Pareto-optimal solutions, achieving a balance between compression and prediction, and exploring more complex Pareto frontiers than linear weighting. We compare our approach with the Variational Information Bottleneck and its variants across different datasets. Empirical results confirm that our method achieves a more stable and optimal trade-off compared to Information Bottleneck approaches with manually-tuned multipliers. The code is available in \url{https://anonymous.4open.science/r/ASDGASDG}.

517Mostly Exploration-free Algorithms for Multi-Objective Linear Bandits

[openreview] [pdf]

Abstract We address the challenge of solving multi-objective bandit problems, which are increasingly relevant in real-world applications where multiple possibly conflicting objectives must be optimized simultaneously. Existing multi-objective algorithms often rely on complex, computationally intensive methods, making them impractical for real-world use. In this paper, we propose a novel perspective by showing that objective diversity can naturally induce free exploration, allowing for simpler, near-greedy algorithms to achieve state-of-the-art regret bounds. We introduce simple and efficient algorithms for multi-objective linear bandits, which do not require constructing empirical Pareto fronts and achieve a regret bound of O~(dT)\tilde{\mathcal{O}}(\sqrt{dT}) under sufficient objective diversity and suitable regularity. We also introduce the concept of objective fairness, ensuring equal treatment of all objectives, and show that our algorithms satisfy this criterion. Numerical experiments validate our theoretical findings, demonstrating that objective diversity can enhance algorithm performance while simplifying the solution process.

518Learning Generalizable and Well-Shaped Reward Functions from Too Few Demonstrations

[openreview] [pdf]

Abstract Inverse reinforcement learning (IRL) is an important problem that aims to learn a reward function and policy directly from demonstrations, which can often be easier to provide than a well-shaped reward function. However, many real-world tasks include natural variations (i.e., a cleaning robot in a house with different furniture configurations), making it costly to provide demonstrations of every possible scenario. We tackle the problem of few-shot IRL with multi-task data where the goal is for an agent to learn from a few demonstrations, not sufficient to fully specify the task, by utilizing an offline multi-task demonstration dataset. Prior work utilizes meta-learning or imitation learning which additionally requires reward labels, a multi-task training environment, or cannot improve with online interactions. We propose Multitask Discriminator Proximity-guided IRL (MPIRL), an IRL method that learns a generalizable and well-shaped reward function by learning a multi-task generative adversarial discriminator with an auxiliary proximity-to-expert reward. We demonstrate the effectiveness of our method on multiple navigation and manipulation tasks.

519Partially Observed Trajectory Inference using Optimal Transport and a Dynamics Prior

[openreview] [pdf]

Abstract Trajectory inference seeks to recover the temporal dynamics of a population from snapshots of its (uncoupled) temporal marginals, i.e. where observed particles are \emph{not} tracked over time. Lavenant et al. (2023) addressed this challenging problem under a stochastic differential equation (SDE) model with a gradient-driven drift in the observed space, introducing a minimum entropy estimator relative to the Wiener measure. Chizat et al. (2022) then provided a practical grid-free mean-field Langevin (MFL) algorithm using Schrodinger bridges. Motivated by the overwhelming success of observable state space models in the traditional paired trajectory inference problem (e.g. target tracking), we extend the above framework to a class of latent SDEs in the form of \emph{observable state space models}. In this setting, we use partial observations to infer trajectories in the latent space under a specified dynamics model (e.g. the constant velocity/acceleration models from target tracking). We introduce PO-MFL to solve this latent trajectory inference problem and provide theoretical guarantees by extending the results of Lavenant et al. (2023) to the partially observed setting. We leverage the MFL framework of Chizat et al. (2022), yielding an algorithm based on entropic OT between dynamics-adjusted adjacent time marginals. Experiments validate the robustness of our method and the exponential convergence of the MFL dynamics, and demonstrate significant outperformance over the latent-free method of Chizat et al. (2022) in key scenarios.

[openreview] [pdf]

Abstract Maximum Inner Product Search (MIPS) is essential for machine learning and information retrieval, particularly in applications that operate on high-dimensional data, such as recommender systems and retrieval-augmented generation (RAG), using inner product or cosine similarity. While numerous techniques have been developed for efficient MIPS, their performance often suffers due to a limited understanding of the geometric properties of Inner Product (IP) space. Many approaches reduce MIPS to Nearest Neighbor Search (NNS) through nonlinear transformations, which rely on strong assumptions and can hinder performance. To address these limitations, we propose a novel approach that directly leverages the geometry of IP space. We focus on a class of special vectors called dominators and introduce the Monotonic Relative Dominator Graph MRDG, an IP-space-native, sparse, and strongly-connected graph designed for efficient MIPS, offering theoretical solid foundations. To ensure scalability, we further introduce the Approximate Relative Dominator Graph (ARDG), which retains MRDG’s benefits while significantly reducing indexing complexity. Extensive experiments on 8 public datasets demonstrate that ARDG achieves a 30% average speedup in search at high precision and reduces index size by 2x compared to state-of-the-art graph-based methods.

521IO-LVM: Inverse optimization latent variable models with applications to inferring and explaining paths

[openreview] [pdf]

Abstract Learning representations from solutions of constrained optimization problems (COPs) with unknown cost functions is challenging, as models like (Variational) Autoencoders struggle to capture constraints to decode structured outputs. We propose an inverse optimization latent variable model (IO-LVM) that constructs a latent space of COP costs based on observed decisions, enabling the inference of feasible and meaningful solutions by reconstructing them with a COP solver. To achieve this, we leverage estimated gradients of a Fenchel-Young loss through a non-differentiable deterministic solver while shaping the embedding space. In contrast to established Inverse Optimization or Inverse Reinforcement Learning methods, which typically identify a single or context-conditioned cost function, we exploit the learned representation to capture underlying COP cost structures and identify solutions likely originating from different agents, each using distinct or slightly different cost functions when making decisions. Using both synthetic and actual ship routing data, we validate our approach through experiments on path planning problems using the Dijkstra algorithm, demonstrating the interpretability of the latent space and its effectiveness in path inference and path distribution reconstruction.

522Evolving Multi-Scale Normalization for Time Series Forecasting under Distribution Shifts

[openreview] [pdf]

Abstract Complex distribution shifts are the main obstacle to achieving accurate long-term time series forecasting. Several efforts have been conducted to capture the distribution characteristics and propose adaptive normalization techniques to alleviate the influence of distribution shifts. However, these methods neglect intricate distribution dynamics that are observed from various scales and the evolving functions of both distribution dynamics and normalized mapping relationships. To this end, we propose a novel model-agnostic Evolving Multi-Scale Normalization (EvoMSN) framework to tackle the distribution shift problem. Flexible normalization and denormalization are proposed based on the multi-scale statistics prediction module and adaptive ensembling. An evolving optimization strategy is designed to update the forecasting model and statistics prediction module collaboratively to track the shifting distributions. We evaluate the effectiveness of EvoMSN in improving the performance of five mainstream forecasting methods on benchmark datasets and also show its superiority compared to existing advanced normalization and online learning approaches.

523Federated Maximum Likelihood Inverse Reinforcement Learning with Convergence Guarantee

[openreview] [pdf]

Abstract Inverse Reinforcement Learning (IRL) aims to recover the latent reward function and corresponding optimal policy from observed demonstrations. Existing IRL research predominantly focuses on a centralized learning approach, not suitable for real-world problems with distributed data and privacy restrictions. To this end, this paper proposes a novel algorithm for federated maximum-likelihood IRL (F-ML-IRL) and provides a rigorous analysis of its convergence and time-complexity. The proposed F-ML-IRL leverages a dual-aggregation to update the shared global model and performs bi-level local updates -- an upper-level learning task to optimize the parameterized reward function by maximizing the discounted likelihood of observing expert trajectories under the current policy and a low-level learning task to find the optimal policy concerning the entropy-regularized discounted cumulative reward under the current reward function. We analyze the convergence and time-complexity of the proposed F-ML-IRL algorithm and show that the global model in F-ML-IRL converges to a stationary point for both the reward and policy parameters within finite time, i.e., the log-distance between the recovered policy and the optimal policy, as well as the gradient of the likelihood objective, converge to zero. Finally, evaluating our F-ML-IRL algorithm on high-dimensional robotic control tasks in MuJoCo, we show that it ensures convergences of the recovered reward in decentralized learning and even outperforms centralized baselines due to its ability to utilize distributed data.

524Is Large-scale Pretraining the Secret to Good Domain Generalization?

[openreview] [pdf]

Abstract Multi-Source Domain Generalization (DG) is the task of training on multiple source domains and achieving high classification performance on unseen target domains. Recent methods combine robust features from web-scale pretrained backbones with new features learned from source data, and this has dramatically improved benchmark results. However, it remains unclear if DG finetuning methods are becoming better over time, or if improved benchmark performance is simply an artifact of stronger pre-training. Prior studies have shown that perceptual similarity to pre-training data correlates with zero-shot performance, but we find the effect limited in the DG setting. Instead, we posit that having perceptually similar data in pretraining is not enough; and that it is how well these data were learned that determines performance. This leads us to introduce the Alignment Hypothesis, which states that the final DG performance will be high if and only if alignment of image and class label text embeddings is high. Our experiments confirm the Alignment Hypothesis is true, and we use it as an analysis tool of existing DG methods evaluated on DomainBed datasets by splitting evaluation data into In-pretraining (IP) and Out-of-pretraining (OOP). We show that all evaluated DG methods struggle on DomainBed-OOP, while recent methods excel on DomainBed-IP. Put together, our findings highlight the need for DG methods which can generalize beyond pretraining alignment.

525FunBO: Discovering Acquisition Functions forBayesian Optimization with FunSearch

[openreview] [pdf]

Abstract The sample efficiency of Bayesian optimization algorithms depends on carefully crafted acquisition functions (AFs) guiding the sequential collection of function evaluations. The best-performing AF can vary significantly across optimization problems, often requiring ad-hoc and problem-specific choices. This work tackles the challenge of designing novel AFs that perform well across a variety of experimental settings. Based on FunSearch, a recent work using Large Language Models (LLMs) for discovery in mathematical sciences, we propose FunBO, an LLM-based method that can be used to learn new AFs written in computer code by leveraging access to a limited number of evaluations for a set of objective functions. We provide the analytic expression of all discovered AFs and evaluate them on various global optimization benchmarks and hyperparameter optimization tasks. We show how FunBO identifies AFs that generalize well in and out of the training distribution of functions, thus outperforming established general-purpose AFs and achieving competitive performance against AFs that are customized to specific function types and are learned via transfer-learning algorithms.

526Optimizing Knowledge Distillation in Transformers: Enabling Power of Multi-Head Attention without Alignment Barriers

[openreview] [pdf]

Abstract Knowledge distillation has been proven effective for compressing transformer architectures by transferring knowledge from teacher to student models. Logits-based methods of knowledge distillation cannot fully capture the intermediate representations and features within the teacher model, which may result in the student model not fully learning all the knowledge from the teacher model. Thus, previous work focuses on transferring knowledge through intermediate features or attention maps. However, leveraging multi-head attention maps in transformers for knowledge distillation presents challenges due to head misalignment and suboptimal feature alignment, often requiring projectors to align features or special modifications to the model architecture. To address above limitations, we propose the Squeezing-Heads Distillation (SHD) method. This method reduces the number of attention maps to any desired number through linear approximation, without requiring additional projectors or parameters. This facilitates better alignment and knowledge transfer between models with different numbers of heads, enhancing both flexibility and efficiency. Experimental results demonstrate significant improvements in both language and vision generative models, validating the effectiveness of our method.

527Machine Unlearning For Alleviating Negative Transfer In Partial-Set Source-Free Unsupervised Domain Adaptation

[openreview] [pdf]

Abstract Source-free Unsupervised Domain Adaptation (SFUDA) aims to adjust a source model trained on a labeled source domain to a related but unlabeled target domain without accessing the source data. Many SFUDA methods are studied in closed-set scenarios where the target domain and source domain categories are perfectly aligned. However, a more practical scenario is a partial-set scenario where the source label space subsumes the target one. In this paper, we prove that reducing the differences between the source and target domains in the partial-set scenario helps to achieve domain adaptation. And we propose a simple yet effective SFUDA framework called the Machine Unlearning Framework to alleviate the negative transfer problem in the partial-set scenario, thereby allowing the model to focus on the target domain category. Specifically, we first generate noise samples for each category that only exists in the source domain and generate pseudo-labeled samples from the target domain. Then, in the forgetting stage, we use these samples to train the model, making it behave like the model has never seen the class that only exists in the source domain before. Finally, in the adaptation stage, we use only the pseudo-labeled samples to conduct self-supervised training on the model, making it more adaptable to the target domain. Our method is easy to implement and pluggable, suitable for various pre-trained models. Experimental results show that our method can well alleviate the negative transfer problem and improve model performance under various target domain category settings.

528Propensity-driven Uncertainty Learning for Sample Exploration in Source-Free Active Domain Adaptation

[openreview] [pdf]

Abstract Source-free active domain adaptation (SFADA) addresses the challenge of adapting a pre-trained model to new domains without access to source data while minimizing the need for target domain annotations. This scenario is particularly relevant in real-world applications where data privacy, storage limitations, or labeling costs are significant concerns. Key challenges in SFADA include selecting the most informative samples from the target domain for labeling, effectively leveraging both labeled and unlabeled target data, and adapting the model without relying on source domain information. Additionally, existing methods often struggle with noisy or outlier samples and may require impractical progressive labeling during training. To effectively select more informative samples without frequently requesting human annotations, we propose the Propensity-driven Uncertainty Learning (ProULearn) framework. ProULearn utilizes a novel homogeneity propensity estimation mechanism combined with correlation index calculation to evaluate feature-level relationships. This approach enables the identification of representative and challenging samples while avoiding noisy outliers. Additionally, we develop a central correlation loss to refine pseudo-labels and create compact class distributions during adaptation. In this way, ProULearn effectively bridges the domain gap and maximizes adaptation performance. The principles of informative sample selection underlying ProULearn have broad implications beyond SFADA, offering benefits across various deep learning tasks where identifying key data points or features is crucial. Extensive experiments on four benchmark datasets demonstrate that ProULearn consistently outperforms state-of-the-art methods in domain adaptation scenarios.

529Influential Language Data Selection via Gradient Trajectory Pursuit

[openreview] [pdf]

Abstract Curating a desirable dataset for training has been the core of building highly capable large language models (Touvron et al., 2023; Achiam et al., 2023; Team et al., 2024). Gradient influence scores (Pruthi et al., 2020; Xia et al., 2024) have been shown to be correlated with model performance and are commonly used as the criterion for data selection. However, existing methods are built upon either individual sample rankings or inefficient matching process, leading to suboptimal performance or scaling up issues. In this paper, we propose Gradient Trajectory Pursuit (GTP), an algorithm that performs pursuit of gradient trajectories via jointly selecting data points under an L0-norm regularized objective. The proposed algorithm highlights: (1) joint selection instead of independent top-k selection, which automatically de-duplicates samples; (2) higher efficiency with compressive sampling processes, which can be further sped up using a distributed framework. In the experiments, we demonstrate the algorithm in both in-domain and target-domain selection benchmarks and show that it outperforms top-k selection and competitive algorithms consistently, for example, our algorithm chooses as low as 0.5% data to achieve full performance on the targeted instruction tuning tasks.

530Counterfactual Learning under Rank Preservation

[openreview] [pdf]

Abstract Counterfactual inference aims to estimate the counterfactual outcome given knowledge of an observed treatment and the factual outcome, with broad applications in fields such as epidemiology, econometrics, and management science. In this paper, we propose a principled approach for identifying and estimating the counterfactual outcome. Specifically, we introduce a simple and intuitive rank preservation assumption to identify the counterfactual outcome without relying on a known structural causal model. Building on this, we propose a novel ideal loss for theoretically unbiased learning of the counterfactual outcome and further develop a kernel-based estimator for its empirical estimation. Our theoretical analysis shows that the proposed ideal loss is convex, and the proposed estimator is unbiased. Extensive semi-synthetic and real-world experiments are conducted to demonstrate the effectiveness of the proposed method.

531FAST: Federated Average with Snapshot Unleashes Arbitrary Client Participation

[openreview] [pdf]

Abstract Federated Learning (FL) provides a flexible distributed platform where numerous clients with high degrees of heterogeneity in data and system can collaborate to learn a model jointly. Previous research has shown that Federated Learning is effective in handling diverse data, but often assumes idealized conditions. Specifically, client participation is often simplified in these studies, while real-world factors make it difficult to predict or design individual client participation. This complexity often diverges from the ideal client participation assumption, rendering an unknown pattern of client participation, referred to asarbitrary client participation. Hence, it is an important open problem to explore the impact of client participation and find a lightweight mechanism to enable arbitrary client participation in FL. In this paper, we first empirically investigate the influence of client participation on FL, revealing that FL algorithms are significantly impacted by arbitrary client participation. Afterward, to alleviate the influence, we propose a lightweight solution, Federated Average with Snapshot (FAST), to unleash the almost arbitrary client participation for FL. It can seamlessly integrate with other classic FL algorithms. Specifically, FAST enforces the clients to take a snapshot once in a while and facilitates arbitrary client participation for the majority of the training process. We show the convergence rates of FAST in non-convex and strongly-convex cases, which could match the rates with those in ideal client participation. Furthermore, we empirically introduce an adaptive strategy for dynamically configuring the snapshot frequency, tailored to accommodate diverse FL systems. Our extensive numerical results demonstrate that our FAST algorithm attains significant improvements under the conditions of arbitrary client participation and highly heterogeneous data.

532VideoPanda: Video Panoramic Diffusion With Multi-view Attention

[openreview] [pdf]

Abstract High resolution panoramic video content is paramount for immersive experiences in Virtual Reality, but is non-trivial to collect as it requires specialized equipment and intricate camera setups. In this work, we introduce \ourmodel, a novel approach for synthesizing 360360^\circ videos conditioned on text or single-view video data. \ourmodel leverages multi-view attention layers to augment a video diffusion model, enabling it to generate consistent multi-view videos that can be combined into immersive panoramic content. \ourmodel is trained jointly using two conditions: text-only and single-view video, and supports autoregressive generation of long-videos. To overcome the computational burden of multi-view video generation, we randomly subsample the duration and camera views used during training and show that the model is able to gracefully generalize to generating more frames during inference. Extensive evaluations on both real-world and synthetic video datasets demonstrate that \ourmodel generates more realistic and coherent 360360^\circ panoramas across all input conditions compared to existing methods. Visit the project website athttps://mvpanovideo.github.io/VideoPanda/for results.

533Many-Objective Multi-Solution Transport

[openreview] [pdf]

Abstract Optimizing the performance of many objectives (instantiated by tasks or clients) jointly with a few Pareto stationary solutions (models) is critical in machine learning. However, previous multi-objective optimization methods often focus on a few objectives and cannot scale to many objectives that outnumber the solutions, leading to either subpar performance or ignored objectives. We introduce ‘‘Many-objective multi-solution Transport (MosT)’’, a framework that finds multiple diverse solutions in the Pareto front of many objectives. Our insight is to seek multiple solutions, each performing as a domain expert and focusing on a specific subset of objectives while collectively covering all of them. MosT formulates the problem as a bi-level optimization of weighted objectives for each solution, where the weights are defined by an optimal transport between objectives and solutions. Our algorithm ensures convergence to Pareto stationary solutions for complementary subsets of objectives. On a range of applications in federated learning, multi-task learning, and mixture-of-prompt learning for LLMs, MosT distinctly outperforms strong baselines, delivering high-quality, diverse solutions that profile the entire Pareto frontier, thus ensuring balanced trade-offs across many objectives.

534Taming Continuous Spurious Shift in Domain Adaptation

[openreview] [pdf]

Abstract Recent advances in domain adaptation have shown promise in transferring knowledge across domains characterized by a continuous value or vector, such as varying patient ages, where "age’’ serves as a continuous index. However, these approaches often fail when spurious features shift continuously along with the domain index. This paper introduces the first method designed to withstand the continuous shifting of spurious features during domain adaptation. Our method enhances domain adaptation performance by aligning causally transportable encodings across continuously indexed domains. Theoretical analysis demonstrates that our approach more effectively ensures causal transportability across different domains. Empirical results, from both semi-synthetic and real-world medical datasets, indicate that our method outperforms state-of-the-art domain adaptation methods.

535Understanding Constraint Inference in Safety-Critical Inverse Reinforcement Learning

[openreview] [pdf]

Abstract In practical applications, the underlying constraint knowledge is often unknown and difficult to specify. To address this issue, recent advances in Inverse Constrained Reinforcement Learning (ICRL) have focused on inferring these constraints from expert demonstrations. However, the ICRL approach typically characterizes constraint learning as a tri-level optimization problem, which is inherently complex due to its interdependent variables and multiple layers of optimization. Considering these challenges, a critical question arises:Can we implicitly embed constraint signals into reward functions and effectively solve this problem using a classic reward inference algorithm?The resulting method, known as Inverse Reward Correction (IRC), merits investigation. In this work, we conduct a theoretical analysis comparing the sample complexities of both solvers. Our findings confirm that the IRC solver achieves lower sample complexity than its ICRL counterpart. Nevertheless, this reduction in complexity comes at the expense of generalizability. Specifically, in the target environment, the reward correction terms may fail to guarantee the safety of the resulting policy, whereas this issue can be effectively mitigated by transferring the constraints via the ICRL solver. Advancing our inquiry, we investigate conditions under which the ICRL solver ensures ϵ\epsilon-optimality when transferring to new environments. Empirical results across various environments validate our theoretical findings, underscoring the nuanced trade-offs between complexity reduction and generalizability in safety-critical applications.

536Boosting LLM Translation Skills without General Ability Loss via Rationale Distillation

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved impressive results across numerous NLP tasks but still encounter difficulties in machine translation. Traditional methods to improve translation have typically involved fine-tuning LLMs using parallel corpora. However, vanilla fine-tuning often leads to catastrophic forgetting of the instruction-following capabilities and alignment with human preferences, compromising their broad general abilities and introducing potential security risks. These abilities, which are developed using proprietary and unavailable training data, make existing continual instruction tuning methods ineffective. To overcome this issue, we propose a novel approach called RaDis\textbf{RaDis} (Ra\textbf{Ra}tionale Dis\textbf{Dis}tillation). RaDis harnesses the strong generative capabilities of LLMs to create rationales for training data, which are then “replayed” to prevent forgetting. These rationales encapsulate general knowledge and safety principles\textit{encapsulate general knowledge and safety principles} and act as self-distillation targets\textit{self-distillation targets} to regulate the training process. By jointly training on both reference translations and self-generated rationales, the model can learn new translation skills while preserving its overall general abilities. Extensive experiments demonstrate that our method enhances machine translation performance while maintaining the broader capabilities of LLMs across other tasks. This work presents a pathway for creating more versatile LLMs that excel in specialized tasks without compromising generality and safety.

537Early Period of Training Impacts Adaptation for Out-of-Distribution Generalization: An Empirical Study

[openreview] [pdf]

Abstract Prior research shows that differences in the early period of neural network training significantly impact the performance of in-distribution (ID) data of tasks. Yet, the implications of early learning dynamics on out-of-distribution (OOD) generalization remain poorly understood, primarily due to the complexities and limitations of existing analytical techniques. In this work, we investigate the relationship between learning dynamics, OOD generalization under covariate shift and the early period of neural network training. We utilize the trace of Fisher Information and sharpness, focusing on gradual unfreezing (i.e., progressively unfreezing parameters during training) as our methodology for investigation. Through a series of empirical experiments, we show that 1) changing the number of trainable parameters during the early period of training via gradual unfreezing can significantly improve OOD results; 2) the trace of Fisher Information and sharpness can be used as indicators for the removal of interventions during the early period of training for better OOD generalization. Our experiments on both image and text data show that the early period of training is a general phenomenon that can provide Pareto improvements in ID and OOD performance with minimal complexity. Our work represents a first step towards understanding how early learning dynamics affect neural network OOD generalization and suggests a new avenue to improve and study this problem.

538Prompt Diffusion Robustifies Any-Modality Prompt Learning

[openreview] [pdf]

Abstract Foundation models enable prompt-based classifiers for zero-shot and few-shot learning. Nonetheless, the conventional method of employing fixed prompts suffers from distributional shifts that negatively impact generalizability to unseen samples. This paper introduces prompt diffusion, which uses a diffusion model to gradually refine prompts to obtain a customized prompt for each sample. Specifically, we first optimize a collection of prompts to obtain over-fitted prompts per sample. Then, we propose a prompt diffusion model within the prompt space, enabling the training of a generative transition process from a random prompt to its overfitted prompt. As we cannot access the label of a test image during inference, our model gradually generates customized prompts solely from random prompts using our trained, prompt diffusion. Our prompt diffusion is generic, flexible, and modality-agnostic, making it a simple plug-and-play module seamlessly embedded into existing prompt learning methods for textual, visual, or multi-modal prompt learning. Our diffusion model uses a fast ODE-based sampling strategy to optimize test sample prompts in just five steps, offering a good trade-off between performance improvement and computational efficiency. For all prompt learning methods tested, adding prompt diffusion yields more robust results for base-to-new generalization, cross-dataset generalization, and domain generalization in classification tasks tested over 15 diverse datasets.

539Is the Fairness Metric Truly Fair?

[openreview] [pdf]

Abstract Image classification is a fundamental task in computer vision that has been widely adopted in critical applications such as face recognition and medical imaging, drawing considerable attention to its predictive fairness. Some researchers have proposed various fairness metrics and pipelines to enhance the fairness of deep learning models. However, recent studies indicate that existing fairness evaluation specifications and metrics have inherent flaws, as they focus on low-dimensional inputs, such as numerical data, and overlook partial correlations between target and sensitive attributes, leading to some degree of mutual exclusivity. This raises the question: Is the fairness metric truly fair? Through in-depth analysis, experiments conclude that the fairness of deep models is closely related to attribute sampling and the interdependencies among attributes. In this work, we address this challenge by introducing a new specification based on dynamic perturbation for image classification models. Specifically, we introduce an Attribute Projection Perturbation Strategy (APPS) that moves beyond the constraints of directly statistical discrete predictions by mapping sensitive attributes that may influence task attributes onto the same dimension for evaluation. Building on this, a Projection Fairness Metric System is proposed to quantifing the upper and lower bounds of fairness perturbations, examining and evaluating the impact of mapped sensitive attributes on the fairness of task predictions from different perspectives. Additionally, we conducted systematic evaluation experiments and extensive discussions, demonstrating that the proposed evaluation specification offers better objectivity and interpretability compared to existing metrics, in 24 image classification models including CNN and ViT architectures. It is hoped that this work will promote the standardization of fairness evaluation pipeline and metrics.

540Do Influence Functions Work on Large Language Models?

[openreview] [pdf]

Abstract Influence functions aim to quantify the impact of individual training data points on a model’s predictions. While extensive research has been conducted on influence functions in traditional machine learning models, their application to large language models (LLMs) has been limited. In this work, we conduct a systematic study to address a key question: do influence functions work on LLMs? Specifically, we evaluate influence functions across multiple tasks and find that they consistently perform poorly in most settings. Our further investigation reveals that their poor performance can be attributed to: (1) inevitable approximation errors when estimating the iHVP component due to the scale of LLMs, (2) uncertain convergence during fine-tuning, and, more fundamentally, (3) the definition itself, as changes in model parameters do not necessarily correlate with changes in LLM behavior. Our study thus suggests the need for alternative approaches for identifying influential samples. To support future work, our code is made available athttps://github.com/anonymous.

541Unintentional Unalignment: Likelihood Displacement in Direct Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO), and its numerous variants, are increasingly used for aligning language models. Although they are designed to teach a model to generate preferred responses more frequently relative to dispreferred responses, prior work has observed that the likelihood of preferred responses often decreases during training. The current work sheds light on the causes and implications of this counter-intuitive phenomenon, which we termlikelihood displacement. We demonstrate that likelihood displacement can becatastrophic, shifting probability mass from preferred responses to semantically opposite ones. As a simple example, training a model to prefer No\texttt{No} over Never\texttt{Never} can sharply increase the probability of Yes\texttt{Yes}. Moreover, when aligning the model to refuse unsafe prompts, we show that such displacement canunintentionally lead to unalignment, by shifting probability mass from preferred refusal responses to harmful responses (e.g., reducing the refusal rate of Llama-3-8B-Instruct from 74.4% to 33.4%). We theoretically characterize that likelihood displacement is driven by preferences that induce similar embeddings, as measured by acentered hidden embedding similarity (CHES)score. Empirically, the CHES score enables identifying which training samples contribute most to likelihood displacement in a given dataset. Filtering out these samples effectively mitigated unintentional unalignment in our experiments. More broadly, our results highlight the importance of curating data with sufficiently distinct preferences, for which we believe the CHES score may prove valuable.

542Towards counterfactual fairness thorough auxiliary variables

[openreview] [pdf]

Abstract The challenge of balancing fairness and predictive accuracy in machine learning models, especially when sensitive attributes such as race, gender, or age are considered, has motivated substantial research in recent years. Counterfactual fairness ensures that predictions remain consistent across counterfactual variations of sensitive attributes, which is a crucial concept in addressing societal biases. However, existing counterfactual fairness approaches usually overlook intrinsic information about sensitive features, limiting their ability to achieve fairness while simultaneously maintaining performance. To tackle this challenge, we introduce EXOgenous Causal reasoning (EXOC), a novel causal reasoning framework motivated by exogenous variables. It leverages auxiliary variables to uncover intrinsic properties that give rise to sensitive attributes. Our framework explicitly defines an auxiliary node and a control node that contribute to counterfactual fairness and control the information flow within the model. Our evaluation, conducted on synthetic and real-world datasets, validates EXOC’s superiority, showing that it outperforms state-of-the-art approaches in achieving counterfactual fairness without sacrificing accuracy.

543Broaden your SCOPE! Efficient Conversation Planning for LLMs with Semantic Space

[openreview] [pdf]

Abstract Large language models (LLMs) are used in chatbots or AI assistants to hold conversations with a human user. In such applications, the quality (e.g., user engagement, safety) of a conversation is important and can only be exactly known at the end of the conversation. To maximize its expected quality, conversation planning reasons about the stochastic transitions within a conversation to select the optimal LLM response at each turn. Existing simulation-based conversation planning algorithms typically select the optimal response by simulating future conversations with a large number of LLM queries at every turn. However, this process is extremely time-consuming and hence impractical for real-time conversations. This paper presents a novel approach called Semantic space COnversation Planning with improved Efficiency (SCOPE) that exploits the dense semantic representation of conversations to perform conversation planning efficiently. In particular, SCOPE models the stochastic transitions in conversation semantics and their associated rewards to plan entirely within the semantic space. This gives the advantage of allowing the optimal LLM response to be selected at every conversation turn without needing additional LLM queries for simulation. As a result, SCOPE can perform conversation planning 70 times faster than conventional simulation-based planning algorithms when applied to a wide variety of conversation starters and two reward functions seen in the real world, yet achieving a higher reward within a practical planning budget.

544Hybrid Preference Optimization: Augmenting Direct Preference Optimization with Auxiliary Objectives

[openreview] [pdf]

Abstract For aligning large language models (LLMs), prior work has leveraged reinforcement learning via human feedback (RLHF) or variations of direct preference optimization (DPO). While DPO offers a simpler framework based on maximum likelihood estimation, it compromises on the ability to tune language models to easily maximize non-differentiable objectives according to the LLM designer’s preferences (e.g., using simpler language or minimizing specific kinds of harmful content). These may neither align with user preferences nor even be able to be captured tractably by binary preference data. To leverage the simplicity and performance of DPO with the generalizability of RL, we propose a hybrid approach between DPO and RLHF. With a simple augmentation to the implicit reward decomposition of DPO, we allow for tuning LLMs to maximize a set of arbitrary auxiliary rewards using offline RL. The proposed method, Hybrid Preference Optimization (HPO), shows the ability to effectively generalize to both user preferences and auxiliary designer objectives, while preserving alignment performance across a range of challenging benchmarks and model sizes.

545No-Regret is not enough! Bandits with General Constraints through Adaptive Regret Minimization

[openreview] [pdf]

Abstract In the bandits with knapsacks framework (BwK) the learner has mm resource-consumption (i.e., packing) constraints. We focus on the generalization of BwK in which the learner has a set of general long-term constraints. The goal of the learner is to maximize their cumulative reward, while at the same time achieving small cumulative constraints violations. In this scenario, there exist simple instances where conventional methods for BwK fail to yield sublinear violations of constraints. We show that it is possible to circumvent this issue by requiring the primal and dual algorithm to be weakly adaptive. Indeed, even in absence on any information on the Slater’s parameter ρ\rho characterizing the problem, the interplay between weakly adaptive primal and dual regret minimizers yields a ``self-bounding’’ property of dual variables. In particular, their norm remains suitably upper bounded across the entire time horizon even without explicit projection steps. By exploiting this property, we provide best-of-both-worlds guarantees for stochastic and adversarial inputs. In the first case, we show that the algorithm guarantees sublinear regret. In the latter case, we establish a tight competitive ratio of ρ/(1+ρ)\rho/(1+\rho). In both settings, constraints violations are guaranteed to be sublinear in time. Finally, this results allow us to obtain new result for the problem of contextual bandits with linear constraints, providing the first no-α\alpha-regret guarantees for adversarial contexts.

546Reconstruct the Understanding of Grokking through Dynamical Systems

[openreview] [pdf]

Abstract \textbf{Grokking}, or the \textbf{delayed generalization phenomenon}, describes the abrupt and rapid improvement in test accuracy that occurs after a model has been overfitted for a prolonged period. This phenomenon was first identified by Power in the context of operations on a prime number field. Over the past two years, a range of mathematical analyses has been conducted to investigate grokking, typically involving the use of the hidden progress measure which mean a function that can anticipate the occurrence of grokking. We believe that a comprehensive and rigorous mathematical modeling approach can invigorate the research on this task and provide a unified perspective for understanding previous research. This paper introduces a novel approach by modeling the task as a unique dynamical system. Using mathematical derivation within this framework, we propose a robust hidden progress measure that effectively captures the grokking phenomenon across all operations on prime number fields. This approach not only provides a more complete understanding but also offers deeper insights into the underlying architecture of the model. Based on this understanding, we also proposed a method to accelerate grokking without involving regularization or altering the model architecture.

547SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

[openreview] [pdf]

Abstract The development of diffusion models has led to significant progress in image and video generation tasks, with pre-trained models like the Stable Diffusion series playing a crucial role. However, a key challenge remains in downstream task applications: how to effectively and efficiently adapt pre-trained diffusion models to new tasks. Inspired by model pruning which lightens large pre-trained models by removing unimportant parameters, we propose a novel model fine-tuning method to make full use of these ineffective parameters and enable the pre-trained model with new task-specified capabilities. In this work, we first investigate the importance of parameters in pre-trained diffusion models and discover that parameters with the smallest absolute values do not contribute to the generation process due to training instabilities. Based on this observation, we propose a fine-tuning method termed SaRA that re-utilizes these temporarily ineffective parameters, equating to optimizing a sparse weight matrix to learn the task-specific knowledge. To mitigate potential overfitting, we propose a nuclear-norm-based low-rank sparse training scheme for efficient fine-tuning. Furthermore, we design a new progressive parameter adjustment strategy to make full use of the finetuned parameters. Finally, we propose a novel unstructural backpropagation strategy, which significantly reduces memory costs during fine-tuning. Our method enhances the generative capabilities of pre-trained models in downstream applications and outperforms existing fine-tuning methods in maintaining model’s generalization ability.

548Enriching Knowledge Distillation with Intra-Class Contrastive Learning

[openreview] [pdf]

Abstract Since the advent of knowledge distillation, much research has focused on how the soft labels generated by the teacher model can be utilized effectively. A study points out that the implicit knowledge within soft labels originates from the multi-view structure present in the data. Feature variations within samples of the same class allow the student model to generalize better by learning diverse representations. However, in existing distillation methods, teacher models predominantly adhere to ground-truth labels as targets, without considering the diverse representations within the same class. Therefore, we propose incorporating an intra-class contrastive loss during teacher training to enrich the intra-class information contained in soft labels. In practice, we find that intra-class loss causes instability in training and slows convergence. To mitigate these issues, margin loss is integrated into intra-class contrastive learning to improve the training stability and convergence speed. Simultaneously, we theoretically analyze the impact of this loss on the intra-class distances and inter-class distances. It has been proved that the intra-class contrastive loss can enrich the intra-class diversity. Experimental results demonstrate the effectiveness of the proposed method.

549Diffusion-Nested Auto-Regressive Synthesis of Heterogeneous Tabular Data

[openreview] [pdf]

Abstract Autoregressive models are predominant in natural language generation, while their application in tabular data remains underexplored. We posit that this can be attributed to two factors: 1) tabular data contains heterogeneous data type, while the autoregressive model is primarily designed to model discrete-valued data; 2) tabular data is column permutation-invariant, requiring a generation model to generate columns in arbitrary order. This paper proposes a Diffusion-nested Autoregressive model (TabDAR) to address these issues. To enable autoregressive methods for continuous columns, TabDAR employs a diffusion model to parameterize the conditional distribution of continuous features. To ensure arbitrary generation order, TabDAR resorts to masked transformers with bi-directional attention, which simulate various permutations of column order, hence enabling it to learn the conditional distribution of a target column given an arbitrary combination of other columns. These designs enable TabDAR to not only freely handle heterogeneous tabular data but also support convenient and flexible unconditional/conditional sampling. We conduct extensive experiments on ten datasets with distinct properties, and the proposed TabDAR outperforms previous state-of-the-art methods by 18% to 45% on eight metrics across three distinct aspects.

550Boundless Socratic Learning

[openreview] [pdf]

Abstract An agent trained within a closed system can master any desired capability, as long as the following three conditions hold: (a) it receives sufficiently informative and aligned feedback, (b) its coverage of experience/data is broad enough, and (c) it has sufficient capacity and resource. In this white paper, we justify these conditions, and consider what limitations arise from (a) and (b) in closed systems, when assuming that (c) is not a bottleneck. Considering the special case of agents with matching input and output spaces (namely, language), we argue that such pure recursive self-improvement, dubbed “Socratic learning”, can boost performance vastly beyond what is present in its initial data or knowledge, and is only limited by time, as well as gradual misalignment concerns. Furthermore, we propose a constructive framework to implement it, based on the notion oflanguage games.

551Turning Challenges into Opportunities: How Distribution Shifts Enhance Identifiability in Causal Representation Learning

[openreview] [pdf]

Abstract Causal representation learning seeks to uncover latent causal variables and their relationships from observed, unstructured data, a task complicated by identifiability challenges. While distribution shifts, viewed as natural interventions on latent causal variables, often present difficulties in traditional machine learning tasks, they also create valuable opportunities for identifiability by introducing variability in latent variables. In this paper, we study a non-parametric condition characterizing the types of distribution shifts that contribute to identifiability within the context of latent additive noise models. We also present partial identifiability results when only a portion of distribution shifts meets the condition. Furthermore, we extend our findings to latent post-nonlinear causal models. Building on our theoretical results, we propose a practical algorithm facilitating the acquisition of reliable latent causal representations. Our algorithm, guided by our underlying theory, has demonstrated outstanding performance across a diverse range of synthetic and real-world datasets. The empirical observations closely align with the theoretical findings, affirming the robustness and effectiveness of our proposed approach.

552GROD: Enhancing Generalization of Transformer with Out-of-Distribution Detection

[openreview] [pdf]

Abstract Transformer networks excel in natural language processing (NLP) and computer vision (CV) tasks. However, they face challenges in generalizing to Out-of-Distribution (OOD) datasets, that is, data whose distribution differs from that seen during training. The OOD detection aims to distinguish data that deviates from the expected distribution, while maintaining optimal performance on in-distribution (ID) data. This paper introduces a novel approach based on OOD detection, termed the Generate Rounded OOD Data (GROD) algorithm, which significantly bolsters the generalization performance of transformer networks across various tasks. GROD is motivated by our new OOD detection Probably Approximately Correct (PAC) Theory for transformer. The transformer has learnability in terms of OOD detection that is, when the data is sufficient the outlier can be well represented. By penalizing the misclassification of OOD data within the loss function and generating synthetic outliers, GROD guarantees learnability and refines the decision boundaries between inlier and outlier. This strategy demonstrates robust adaptability and general applicability across different data types. Evaluated across diverse OOD detection tasks in NLP and CV, GROD achieves SOTA regardless of data format. The code is available athttps://anonymous.4open.science/r/GROD-OOD-Detection-with-transformers-B70F.

553Memory retaining finetuning via distillation

[openreview] [pdf]

Abstract Large language models (LLMs) pretrained on large corpora of internet text possess much of the world knowledge. Following pretraining, one often needs to conduct continued pretraining on certain capabilities such as math and coding, or “posttraining” (a.k.a., alignment) techniques to make the models follow users’ instructions and align them with human preferences. One challenge during these finetuning stages is that the model can lose the pretraining knowledge or forget certain capabilities (e.g., in-context learning ability). Moreover, although there exist strong open-weight LLMs such as Llama 3, both their pretraining and posttraining data are not open to the public, making it difficult to mix the finetuning data with the models’ own pretraining data as a solution for mitigating forgetting. We propose label annealing, a method that mitigates forgetting during finetuning without requiring access to the original pretraining data. Label annealing distills pretraining knowledge during finetuing by adding a KL divergence term in the loss function, regularizing the divergence between the finetuned model’s predictions to those of the initial pretrained model. In mathematics and code finetuning, label annealing improves the model’s performance in target domains without sacrificing other capabilities of the pretrained model. In alignment finetuning, our method introduces a smooth tradeoff between the instruction-following capability and the pretraining knowledge. We complement our empirical investigation with a mathematical model with overparameterized linear regression that provides geometric intuition why label annealing would help.

554Multi-aspect Knowledge Distillation with Large Language Model

[openreview] [pdf]

Abstract Recent advancements in deep learning have significantly improved performance on computer vision tasks. Previous image classification methods primarily modify model architectures or add features, and they optimize models using cross-entropy loss on class logits. Since they focus on classifying images with considering class labels, these methods may struggle to learn various aspects of classes (e.g., natural positions and shape changes). In contrast, humans classify images by naturally referring to multi-aspects such as context, shape, color, and other features. Inspired by this, rethinking the previous approach from a novel view, we propose a multi-aspect knowledge distillation method using Multimodal Large Language Models (MLLMs). Our approach involves: 1) querying Large Language Model with multi-aspect questions relevant to the knowledge we want to transfer to the model, 2) extracting corresponding logits from MLLM, and 3) expanding the model’s output dimensions to distill these multi-aspect logits. We then apply cross-entropy loss to class logits and binary cross-entropy loss to multi-aspect logits. Through our method, the model can learn not only the knowledge about visual aspects but also the abstract and complex aspects that require a deeper understanding. We primarily apply our method to image classification, and to explore the potential for extending our model, we expand it to other tasks, such as object detection. In all experimental results, our method improves the performance of the baselines. Additionally, we analyze the effect of multi-aspect knowledge distillation. These results demonstrate that our method can transfer knowledge about various aspects to the model and the aspect knowledge can enhance model performance in computer vision tasks. This paper demonstrates the great potential of multi-aspect knowledge distillation, and we believe it offers a promising direction for future research in computer vision and beyond.

555Rethinking the Bias of Foundation Model under Long-tailed Distribution

[openreview] [pdf]

Abstract Long-tailed learning has garnered increasing attention due to its practical significance. Among the various approaches, the fine-tuning paradigm has gained considerable interest with the advent of foundation models. However, most existing methods primarily focus on leveraging knowledge from these models, overlooking the inherent biases introduced by the imbalanced training data they rely on. In this paper, we examine how such imbalances affect long-tailed downstream tasks. Specifically, we refer to the biases in foundation models and downstream tasks as parameter imbalance and data imbalance, respectively. Through fine-tuning, we observe that parameter imbalance plays a more critical role, while data imbalance can be mitigated using existing re-balancing strategies. Moreover, we find that parameter imbalance cannot be effectively addressed by current re-balancing techniques, such as adjusting the logits, during training, unlike data imbalance. To tackle both imbalances simultaneously, we constitute a causal structure graph and view the partial semantic factor as the confounder, which brings spurious correlations between input samples and labels. To resolve the negative effects of this, we propose a novel backdoor adjustment method that learns the true causal effect between input samples and labels, rather than merely fitting the correlations in the data. Experimental results validate the effectiveness of our method.

556100 instances is all you need: predicting LLM success by testing on a few instances

[openreview] [pdf]

Abstract Predicting if LLMs will succeed on individual task instances is essential to ensure their reliability in high-stakes applications. To do so, we can evaluate a LLM on a set of instances and train an “assessor” to predict its performance. However, this requires evaluating each new LLM on sufficiently many instances. In this work, we build a “generic assessor” predicting the performance of any LLM on an instance by using the LLM’s performance on a small set of reference instances and the features of the considered instance. In practice, we make use of existing evaluation results to extract the representative instances and train the assessor. Thus, the performance of a new LLM can be predicted by only testing it on the reference instances, leveraging the information contained in other LLMs’ evaluations. We conduct empirical studies on HELM-Lite and KindsOfReasoning, a new collection of existing reasoning datasets that we introduce, where we evaluate all instruction-fine-tuned OpenAI models until gpt4-0125-preview\texttt{gpt4-0125-preview}. We find that a few instances (around 100) are enough to achieve predictive power comparable to the LLM-specific assessors trained on the complete set of several thousand instances. Interestingly, randomly selecting the reference instances performs comparably to the advanced selection methods we tested. Finally, we identify a sharp drop in the predictive power of the generic and specific assessors in out-of-distribution scenarios, suggesting that the inherent predictability of LLMs is low.

557Aggregation of Multi Diffusion Models for Enhancing Learned Representations

[openreview] [pdf]

Abstract Diffusion models have achieved remarkable success in image generation, particularly with the various applications of classifier-free guidance conditional diffusion models. While many diffusion models perform well when controlling for particular aspect among style, character, and interaction, they struggle with fine-grained control due to dataset limitations and intricate model architecture design. This paper introduces a novel algorithm, Aggregation of Multi Diffusion Models (AMDM), which synthesizes features from multiple diffusion models into a specified model, enhancing its learned representations to activate specific features for fine-grained control. AMDM consists of two key components: spherical aggregation and manifold optimization. Spherical aggregation merges intermediate variables from different diffusion models with minimal manifold deviation, while manifold optimization refines these variables to align with the intermediate data manifold, enhancing sampling quality. Experimental results demonstrate that AMDM significantly improves fine-grained control without additional training or inference time, proving its effectiveness. Additionally, it reveals that diffusion models initially focus on features such as position, attributes, and style, with later stages improving generation quality and consistency. AMDM offers a new perspective for tackling the challenges of fine-grained conditional control generation in diffusion models: We can fully utilize existing conditional diffusion models that control specific aspects, or develop new ones, and then aggregate them using the AMDM algorithm. This eliminates the need for constructing complex datasets, designing intricate model architectures, and incurring high training costs. Code is available at:https://github.com/Hammour-steak/AMDM

558OASIS Uncovers: High-Quality T2I Models, Same Old Stereotypes

[openreview] [pdf]

Abstract Images generated by text-to-image (T2I) models often exhibit visual biases and stereotypes of concepts such as culture and profession. Existing quantitative measures of stereotypes are based on statistical parity that does not align with the sociological definition of stereotypes and, therefore, incorrectly categorizes biases as stereotypes. Instead of oversimplifying stereotypes as biases, we propose a quantitative measure of stereotypes that aligns with its sociological definition. We then propose OASIS to measure the stereotypes in a generated dataset and understand their origins within the T2I model. OASIS includes two scores to measure stereotypes from a generated image dataset:(M1)Stereotype Score to measure the distributional violation of stereotypical attributes, and(M2)WALS to measure spectral variance in the images along a stereotypical attribute. OASIS also includes two methods to understand the origins of stereotypes in T2I models:(U1)StOP to discover attributes that the T2I model internally associates with a given concept, and(U2)SPI to quantify the emergence of stereotypical attributes in the latent space of the T2I model during image generation. Despite the considerable progress in image fidelity, using OASIS, we conclude that newer T2I models such as FLUX.1 and SDv3 contain strong stereotypical predispositions about concepts and still generate images with widespread stereotypical attributes. Additionally, the quantity of stereotypes worsens for nationalities with lower Internet footprints.

559Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning

[openreview] [pdf]

Abstract In this work, we address the problem of large language model (LLM) unlearning, aiming to remove unwanted data influences and associated model capabilities e.g., copyrighted data or harmful content generation) while preserving essential model utilities, without the need for retraining from scratch. Despite the growing need for LLM unlearning, a principled optimization framework remains lacking. To this end, we revisit the state-of-the-art approach, negative preference optimization (NPO), and identify the issue of reference model bias, which could undermine NPO’s effectiveness, particularly when unlearning forget data of varying difficulty. Given that, we propose a simple yet effective unlearning optimization framework, called SimNPO, showing that `simplicity’ in removing the reliance on a reference model (through the lens of simple preference optimization) benefits unlearning. We also provide deeper insights into SimNPO’s advantages, supported by analysis using mixtures of Markov chains. Furthermore, we present extensive experiments validating SimNPO’s superiority over existing unlearning baselines in benchmarks like TOFU and MUSE, and robustness against relearning attacks.

560Enforcing Interpretability in Time Series Transformers: A Concept Bottleneck Framework

[openreview] [pdf]

Abstract There has been a recent push of research on Transformer-based models for long-term time series forecasting, even though they are inherently difficult to interpret and explain. While there is a large body of work on interpretability methods for various domains and architectures, the interpretability of Transformer-based forecasting models remains largely unexplored. To address this gap, we develop a framework based on Concept Bottleneck Models to enforce interpretability of time series Transformers. We modify the training objective to encourage a model to develop representations similar to predefined interpretable concepts. In our experiments, we enforce similarity using Centered Kernel Alignment, and the predefined concepts include time features and an interpretable, autoregressive surrogate model (AR). We apply the framework to the Autoformer model, and present an in-depth analysis for a variety of benchmark tasks. We find that the model performance remains mostly unaffected, while the model shows much improved interpretability. Additionally, interpretable concepts become local, which makes the trained model easily intervenable. As a proof of concept, we demonstrate a successful intervention in the scenario of a time shift in the data, which eliminates the need to retrain.

561KnobGen: Controlling the Sophistication of Artwork in Sketch-Based Diffusion Models

[openreview] [pdf]

Abstract Recent advances in diffusion models have significantly improved text-to-image (T2I) generation, but they often struggle to balance fine-grained precision with high-level control. Methods like ControlNet and T2I-Adapter excel at following sketches by seasoned artists but tend to be overly rigid, replicating unintentional flaws in sketches from novice users. Meanwhile, coarse-grained methods, such as sketch-based abstraction frameworks, offer more accessible input handling but lack the precise control needed for detailed, professional use. To address these limitations, we propose KnobGen, a dual-pathway framework that democratizes sketch-based image generation by seamlessly adapting to varying levels of sketch complexity and user skill. KnobGen uses a Coarse-Grained Controller (CGC) module for high-level semantics and a Fine-Grained Controller (FGC) module for detailed refinement. The relative strength of these two modules can be adjusted through our knob inference mechanism to align with the user’s specific needs. These mechanisms ensure that KnobGen can flexibly generate images from both novice sketches and those drawn by seasoned artists. This maintains control over the final output while preserving the natural appearance of the image, as evidenced on the MultiGen-20M dataset and a newly collected sketch dataset.

562Bias Mitigation in Graph Diffusion Models

[openreview] [pdf]

Abstract Most existing graph generative diffusion models suffer from significant exposure bias during graph sampling. We observe that the forward diffusion’s maximum perturbation distribution in most models deviates from the standard normal distribution, while reverse sampling consistently starts from a standard normal distribution. This mismatch results in a reverse starting bias, which, together with the exposure bias, degrades generation quality. The exposure bias typically accumulates and propagates throughout the sampling process. In this paper, we effectively address both biases. To mitigate reverse starting bias, we employ a newly designed Langevin sampling algorithm to align with the forward maximum perturbation distribution, establishing a new reverse starting point. To address the exposure bias, we introduce a fraction correction mechanism based on a newly defined score difference. Our approach, which requires no network modifications, is validated across multiple models, datasets, and tasks, achieving state-of-the-art results.

563Forward Learning with Differential Privacy

[openreview] [pdf]

Abstract Differential privacy (DP) in deep learning is a critical concern as it ensures the confidentiality of training data while maintaining model utility. Existing DP training algorithms provide privacy guarantees by clipping each individual backpropagated gradient and then injecting noise. Different from backpropagation, forward-learning algorithms based on perturbation inherently utilize randomness to estimate the gradient of each sample in parallel. These algorithms offer high parallelizability, suitability for non-differentiable modules, and applicability in black-box settings. Moreover, the introduction of noise during the forward pass indirectly provides randomness protection to the model parameters and their gradients, suggesting its potential for naturally providing differential privacy. In this paper, we propose a forward-learning algorithm, Differential Private Unified Likelihood Ratio method (DP-ULR), and demonstrate its differential privacy guarantees. DP-ULR features a novel batch sampling operation with rejection, which we theoretically analyze in conjunction with classic differential privacy mechanisms. DP-ULR is also underpinned by a theoretically guided privacy controller that dynamically adjusts noise levels to manage privacy costs effectively in each training step. Our experiments indicate that DP-ULR achieves competitive performance compared to traditional differential privacy training algorithms based on backpropagation, maintaining the same privacy loss limits.

564Bridging Jensen Gap for Max-Min Group Fairness Optimization in Recommendation

[openreview] [pdf]

Abstract Group max-min fairness (MMF) is commonly used in fairness-aware recommender systems (RS) as an optimization objective, as it aims to protect marginalized item groups and ensures a fair competition platform. However, our theoretical analysis indicates that integrating MMF constraint violates the assumption of sample independence during optimization, causing the loss function to deviate from linear additivity. Such nonlinearity property introduces the Jensen gap between the model’s convergence point and the optimal point if mini-batch sampling is applied. Both theoretical and empirical studies show that as the mini-batch size decreases and the group size increases, the Jensen gap will widen accordingly. Some methods using heuristic re-weighting or debiasing strategies have the potential to bridge the Jensen gap. However, they either lack theoretical guarantees or suffer from heavy computational costs. To overcome these limitations, we first theoretically demonstrate that the MMF-constrained objective can be essentially reformulated as a group-weighted optimization objective. Then we present an efficient and effective algorithm named FairDual, which utilizes a dual optimization technique to minimize Jensen gap. Our theoretical analysis demonstrates that FairDual can achieve a sub-linear convergence rate to the globally optimal solution and the Jensen gap can be well bounded under a mini-batch sampling strategy with random shuffle. Extensive experiments conducted using three large-scale RS backbone models on two publicly available datasets demonstrate that FairDual outperforms all baselines in terms of both accuracy and fairness.

565Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning

[openreview] [pdf]

Abstract Off-policy learning and evaluation scenarios leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator’s bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming bounded (1+ϵ)(1+\epsilon)-th moment of weighted reward. Notably, we achieve a convergence rate of O(nϵ/(1+ϵ))O(n^{-\epsilon/(1+\epsilon)}), where nn is the number of training samples for the regret bounds. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach.

566Longhorn: State Space Models are Amortized Online Learners

[openreview] [pdf]

Abstract The most fundamental capability of modern AI methods such as Large Language Models (LLMs) is the ability to predict the next token in a long sequence of tokens, known as “sequence modeling.” Although the Transformers model is the current dominant approach to sequence modeling, its quadratic computational cost with respect to sequence length is a significant drawback. State-space models (SSMs) offer a promising alternative due to their linear decoding efficiency and high parallelizability during training. However, existing SSMs often rely on seemingly ad hoc linear recurrence designs. In this work, we explore SSM design through the lens of online learning, conceptualizing SSMs as meta-modules for specific online learning problems. This approach links SSM design to formulating precise online learning objectives, with state transition rules derived from optimizing these objectives. Based on this insight, we introduce a novel deep SSM architecture based on the implicit update for optimizing an online regression objective. Our experimental results show that our models outperform state-of-the-art SSMs, including the Mamba model, on standard sequence modeling benchmarks and language modeling tasks.

567Following the Human Thread in Social Navigation

[openreview] [pdf]

Abstract The success of collaboration between humans and robots in shared environments relies on the robot’s real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human trajectories emerge as crucial cues in Social Navigation, but they are partially observable from the robot’s egocentric view and computationally complex to process.We present the first Social Dynamics Adaptation model (SDA) based on the robot’s state-action history to infer the social dynamics. We propose a two-stage Reinforcement Learning framework: the first learns to encode the human trajectories into social dynamics and learns a motion policy conditioned on this encoded information, the current status, and the previous action. Here, the trajectories are fully visible, i.e., assumed as privileged information. In the second stage, the trained policy operates without direct access to trajectories. Instead, the model infers the social dynamics solely from the history of previous actions and statuses in real-time. Tested on the novel Habitat 3.0 platform, SDA sets a novel state-of-the-art (SotA) performance in finding and following humans.The code will be released upon acceptance.

568Open-World Reinforcement Learning over Long Short-Term Imagination

[openreview] [pdf]

Abstract Training visual reinforcement learning agents in a high-dimensional open world presents significant challenges. While various model-based methods have improved sample efficiency by learning interactive world models, these agents tend to be “short-sighted”, as they are typically trained on short snippets of imagined experiences. We argue that the primary obstacle in open-world decision-making is improving the efficiency of off-policy exploration across an extensive state space. In this paper, we present LS-Imagine, which extends the imagination horizon within a limited number of state transition steps, enabling the agent to explore behaviors that potentially lead to promising long-term feedback. The foundation of our approach is to build a long short-term world model\textit{long short-term world model}. To achieve this, we simulate goal-conditioned jumpy state transitions and compute corresponding affordance maps by zooming in on specific areas within single images. This facilitates the integration of direct long-term values into behavior learning. Our method demonstrates significant improvements over state-of-the-art techniques in MineDojo.

569Inverse Constitutional AI: Compressing Preferences into Principles

[openreview] [pdf]

Abstract Feedback data is widely used to align or evaluate state-of-the-art AI models according to human preferences. Pairwise text preferences, where human (or AI) annotators select the “better” of two options, are particularly common. This data is typically used to train reward models or to compute aggregate statistics, asserting one model to be “better” than another. For many applications, however, it is desirable to understand human preferences in addition to modeling them. Neither black-box reward models nor statistics can answer why one model is better than another. Pairwise preference datasets, therefore, pose an interpretability challenge. The raw data consists of numerous (long) response pairs that are often infeasible to interpret manually. Prior work has demonstrated that human-annotated preference data often exhibits unintended biases, underscoring the urgent need for good interpretability tools to detect and alleviate such biases. In this paper, we introduce the Inverse Constitutional AI (ICAI) problem, formulating the interpretation of pairwise text preference data as a compression task. In constitutional AI, a set of principles (a constitution) is used to provide feedback and fine-tune AI models. ICAI inverts this process: given a feedback dataset, we aim to extract a constitution that best enables a large language model (LLM) to reconstruct the original annotations. We propose a corresponding algorithm and validate its generated constitutions quantitatively based on annotation reconstruction accuracy on a variety of datasets: (a) synthetic feedback data with known underlying principles; (b) AlpacaEval data with cross-annotated human feedback; (c) crowdsourced Chatbot Arena data; and (d) PRISM data from diverse demographic groups. As a short and interpretable representation of the original dataset, generated constitutions have many potential use cases — they may help identify undesirable annotator biases, better understand model performance, scale feedback to unseen data, or assist with adapting LLMs to individual user or group preferences. We release the code for our experiments athidden url.

570FullDiffusion: Diffusion Models Without Time Truncation

[openreview] [pdf]

Abstract Diffusion models are predominantly used for generative modeling, which synthesize samples by simulating the reverse process of a stochastic differential equation (SDE) that diffuses data into Gaussian noise. However, when simulating the reverse SDE, the SDE solver suffers from numerical instability near the time boundary; hence, in practice, the simulation is terminated before reaching the boundary point. This heuristic time truncation hinders the rigorous formulation of diffusion models, and requires additional costs of hyperparameter tuning. Moreover, such numerical instability often occurs even in training, especially when using a maximum likelihood loss. Therefore, the current diffusion model heavily relies on the time truncation technique in both training and inference. In this paper, we propose a method that completely eliminates the heuristic of time truncation. Our method eliminates numerical instability during maximum likelihood training by modifying the parameterization of the noise predictor and the noise schedule. We also propose a novel SDE solver that can simulate without time truncation by taking advantage of the semi-linear structure of the reverse SDE. These improvements enable stable training and sampling of diffusion models without relying on time truncation. In our experiments, we tested the effectiveness of our method on the CIFAR-10 and ImageNet-32 datasets by evaluating the test likelihood and the sample quality measured by the Fréchet inception distance (FID). We observe that our method consistently improve performance in both test likelihood and the FID compared to the baseline model of DDPM++.

571Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning

[openreview] [pdf]

Abstract Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Therefore, continuing with the current architectures will present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling, reducing the inference cost significantly. To achieve this, MoDE uses sparse experts combined with a novel routing strategy that conditions the expert selection on the current noise level of the denoising process. This is combined with a noise-conditioned self-attention mechanism for further improvements. MoDE achieves state-of-the-art performance across 134 tasks in four established imitation learning benchmarks (CALVIN and LIBERO). It surpasses both CNN-based and Transformer Diffusion Policies by an average of 20% in all settings, while using 40% fewer FLOPs and fewer active parameters. Furthermore, we conduct comprehensive ablations on MoDE’s components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies.

572Rainbow Generator: Generating Diverse Data for Name Only Continual Learning

[openreview] [pdf]

Abstract Requiring extensive human supervision is often impractical for continual learning due to its cost, leading to the emergence of ‘name-only continual learning’ that only provides the name of new concepts (e.g., classes) without providing supervised samples. To address the task, recent approach uses web-scraped data but results in issues such as data imbalance, copyright, and privacy concerns. To overcome the limitations of both human supervision and webly supervision, we propose Generative name only Continual Learning (GenCL) using generative models for the name only continual learning. But naïve application of generative models results in limited diversity of generated data. So, we specifically propose a diverse prompt generation method, HIerarchical Recurrent Prompt Generation (HIRPG) as well as COmplexity-NAvigating eNsembler (CONAN) that selects samples with minimal overlap from multiple generative models. We empirically validate that the proposed GenCL outperforms prior arts, even a model trained with fully supervised data, in various tasks including image recognition and multi-modal visual reasoning. Data generated by GenCL is available athttps://anonymous.4open.science/r/name-only-continual-E079.

573Sail into the Headwind: Alignment via Robust Rewards and Dynamic Labels against Reward Hacking

[openreview] [pdf]

Abstract Aligning AI systems with human preferences typically suffers from the infamousreward hackingproblem, where optimization of an imperfect reward model leads to undesired behaviors. In this paper, we investigate reward hacking in offline preference optimization (PO), which aims to improve an initial model using a preference dataset. We identify two types of reward hacking stemming from statistical fluctuations in the dataset: Type I Reward Hacking due to subpar choices appearing more favorable, and Type II Reward Hacking due to decent choices appearing less desirable. We prove that many (mainstream or theoretical) PO methods suffer from both types of reward hacking. To address Type I Reward Hacking, we propose POWER, a new PO method that combines Guiaus’s Weighted Entropy with a Robust Reward maximization objective. POWER enjoys finite-sample guarantees under general function approximation, competing with the best covered policy in the data. To address Type II Reward Hacking, we analyze the learning dynamics of POWER and combine it with a novel technique that dynamically updates preference labels (POWER-DL) toward certain “stationary labels”, resulting in diminishing gradients for untrustworthy samples. Empirically, POWER-DL consistently outperforms state-of-the-art methods on alignment benchmarks, achieving improvements of up to13.0points on AlpacaEval 2 and11.5points on Arena Hard over DPO. Strong theoretical guarantees and empirical performance demonstrate the promise of POWER-DL in mitigating reward hacking.

574Federated Learning Can Find Friends That Are Advantageous

[openreview] [pdf]

Abstract In Federated Learning (FL), the distributed nature and heterogeneity of client data present both opportunities and challenges. While collaboration among clients can significantly enhance the learning process, not all collaborations are beneficial; some may even be detrimental. In this study, we introduce a novel algorithm that assigns adaptive aggregation weights to clients participating in FL training, identifying those with data distributions most conducive to a specific learning objective. We demonstrate that our aggregation method converges no worse than the method that aggregates only the updates received from clients with the same data distribution. Furthermore, empirical evaluations consistently reveal that collaborations guided by our algorithm outperform traditional FL approaches. This underscores the critical role of judicious client selection and lays the foundation for more streamlined and effective FL implementations in the coming years.

575STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning

[openreview] [pdf]

Abstract Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task, generalist policy by training broadly across them. Notably, while these generalist policies can improve the average performance across many tasks, the performance of generalist policies on any one task is often suboptimal due to negative transfer between partitions of the data, compared to task-specific specialist policies. In this work, we argue for the paradigm of training policies during deployment given the scenarios they encounter: rather than deploying pre-trained policies to unseen problems in a zero-shot manner, we non-parametrically retrieve and train models directly on relevant data at test time. Furthermore, we show that many robotics tasks share considerable amounts of low-level behaviors and that retrieval at the “sub”-trajectory granularity enables significantly improved data utilization, generalization, and robustness in adapting policies to novel problems. In contrast, existing full-trajectory retrieval methods tend to underutilize the data and miss out on shared cross-task content. This work proposes STRAP, a technique for leveraging pre-trained vision foundation models and dynamic time warping to retrieve sub-sequences of trajectories from large training corpora in a robust fashion. STRAP outperforms both prior retrieval algorithms and multi-task learning methods in simulated and real experiments, showing the ability to scale to much larger offline datasets in the real world as well as the ability to learn robust control policies with just a handful of real-world demonstrations.

576Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

[openreview] [pdf]

Abstract Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel online algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no- regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 42.6% length-controlled win rate on AlpacaEval 2.0 and a 37.8% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art online RLHF algorithms.

577Towards Machine Theory of Mind with Large Language Model-Augmented Inverse Planning

[openreview] [pdf]

Abstract We propose a hybrid approach to machine Theory of Mind (ToM) that uses large language models (LLMs) as a mechanism for generating hypotheses and likelihood functions with a Bayesian inverse planning model that computes posterior probabilities for an agent’s likely mental states given its actions. Bayesian inverse planning models can accurately predict human reasoning on a variety of ToM tasks, but these models are constrained in their ability to scale these predictions to scenarios with a large number of possible hypotheses and actions. Conversely, LLM-based approaches have recently demonstrated promise in solving ToM benchmarks, but can exhibit brittleness and failures on reasoning tasks even when they pass otherwise structurally identical versions. By combining these two methods, our approach leverages the strengths of each component, closely matching optimal results on a task inspired by prior inverse planning models and improving performance relative to models that utilize LLMs alone or with chain-of-thought prompting. We also exhibit the model’s potential to predict mental states on open-ended tasks, offering a promising direction for future development of ToM models and the creation of socially intelligent generative agent models.

578In Search of Forgotten Domain Generalization

[openreview] [pdf]

Abstract Out-of-Domain (OOD) generalization is the ability of a model trained on one or more domains to generalize to unseen domains. In the ImageNet era of computer vision, evaluation sets for measuring a model’s OOD performance were designed to be strictly OOD with respect to style. However, the emergence of foundation models and expansive web-scale datasets has obfuscated this evaluation process, as datasets cover a broad range of domains and risk test domain contamination. In search of the forgotten domain generalization, we create large-scale datasets subsampled from LAION---LAION-Natural and LAION-Rendition---that are strictly OOD to corresponding ImageNet and DomainNet test sets in terms of style. Training CLIP models on these datasets reveals that a significant portion of their performance is explained by in-domain examples. This indicates that the OOD generalization challenges from the ImageNet era still prevail and that training on web-scale data merely creates the illusion of OOD generalization. Furthermore, through a systematic exploration of combining natural and rendition datasets in varying proportions, we identify optimal mixing ratios for model generalization across these domains. Our datasets and results re-enable meaningful assessment of OOD robustness at scale---a crucial prerequisite for improving model robustness.

579Test-Time Training for Out-of-Distribution Industrial Anomaly Detection via Robust Distribution Alignment

[openreview] [pdf]

Abstract Detecting anomalous patterns is essential for quality control in industrial applications, with state-of-the-art methods relying on large defect-free datasets to model normal distributions. However, robustness under domain shift, such as changes in lighting or sensor drift, remains a critical challenge in real-world deployment. An existing work, Generalized Normality Learning (GNL), addresses domain shifts by enforcing feature consistency through training-time augmentation, but its reliance on prior knowledge of target distributions and access to training data at inference limits flexibility. To overcome these limitations, we propose a memory bank-based anomaly detection method that avoids retraining or access to training data during inference. We improve the robustness to distribution shifts via distribution alignment based test-time training. Our approach leverages a modified Sinkhorn distance to align distributions and handle outliers, offering a more resilient solution for industrial anomaly detection under realistic constraints. Extensive evaluations on out-of-distribution anomaly detection benchmarks demonstrate the effectiveness.

580A Theoretical Perspective: When and How Self-consuming Training Loops Generalize

[openreview] [pdf]

Abstract High-quality data is essential for training large generative models, yet the vast reservoir of real data available online has become nearly depleted. Consequently, models increasingly generate their own data for further training, forming Self-consuming Training Loops (STLs). However, the empirical results have been strikingly inconsistent: some models degrade or even collapse, while others successfully avoid these failures, leaving a significant gap in theoretical understanding to explain this discrepancy. This paper introduces the intriguing notion ofrecursive stabilityand presents the first theoretical generalization analysis, revealing how both model architecture and the proportion between real and synthetic data influence the success of STLs. We further extend this analysis to transformers in in-context learning, showing that even a constant-sized proportion of real data ensures convergence, while also providing insights into optimal synthetic data sizing.

581Rethinking Diffusion Posterior Sampling: From Conditional Score Estimator to Maximizing a Posterior

[openreview] [pdf]

Abstract Recent advancements in diffusion models have been leveraged to address inverse problems without additional training, and Diffusion Posterior Sampling (DPS) (Chung et al., 2022a) is among the most popular approaches. Previous analyses suggest that DPS accomplishes posterior sampling by approximating the conditional score. While in this paper, we demonstrate that the conditional score approximation employed by DPS is not as effective as previously assumed, but rather aligns more closely with the principle of maximizing a posterior (MAP). This assertion is substantiated through an examination of DPS on 512×\times512 ImageNet images, revealing that: 1) DPS’s conditional score estimation significantly diverges from the score of a well-trained conditional diffusion model and is even inferior to the unconditional score; 2) The mean of DPS’s conditional score estimation deviates significantly from zero, rendering it an invalid score estimation; 3) DPS generates high-quality samples with significantly lower diversity. In light of the above findings, we posit that DPS more closely resembles MAP than a conditional score estimator, and accordingly propose the following enhancements to DPS: 1) we explicitly maximize the posterior through multi-step gradient ascent and projection; 2) we utilize a light-weighted conditional score estimator trained with only 100 images and 8 GPU hours. Extensive experimental results indicate that these proposed improvements significantly enhance DPS’s performance. The source code for these improvements is provided in the supplementary material.

582Provably Efficient Multi-Objective Bandit Algorithms under Preference-Centric Customization

[openreview] [pdf]

Abstract Existing multi-objective multi-armed bandit (MO-MAB) approaches mainly focus on achieving Pareto optimality. However, a Pareto optimal arm that receives a high score from one user may lead to a low score from another, since in real-world scenarios, users often have diverse preferences across different objectives. Instead, these preferences should informcustomized learning, a factor usually neglected in prior research. To address this need, we study apreference-awareMO-MAB framework in the presence of explicit user preferences, where each user’s overall-reward is modeled as the inner product of user preference and arm reward. This new framework shifts the focus from merely achieving Pareto optimality to further optimizing within the Pareto front under preference-centric customization. To the best of our knowledge, this is the first theoretical exploration of customized MO-MAB optimization based on explicit user preferences. This framework introduces new and unique challenges for algorithm design for customized optimization. To address these challenges, we incorporatepreference estimationandpreference-aware optimizationas key mechanisms for preference adaptation, and develop new analytical techniques to rigorously account for the impact of preference estimation errors on overall performance. Under this framework, we consider three preference structures inspired by practical applications, with tailored algorithms that are proven to achieve near-optimal regret, and show good numerical performance.

583Statistical Tractability of Off-policy Evaluation of History-dependent Policies in POMDPs

[openreview] [pdf]

Abstract We investigate off-policy evaluation (OPE), a central and fundamental problem in reinforcement learning (RL), in the challenging setting of Partially Observable Markov Decision Processes (POMDPs) with large observation spaces. Recent works of Uehara et al. (2023a); Zhang & Jiang (2024) developed a model-free framework and identified important coverage assumptions (called belief and outcome coverage) that enable accurate OPE of memoryless policies with polynomial sample complexities, but handling more general target policies that depend on the entire observable history remained an open problem. In this work, we prove information-theoretic hardness for model-free OPE of history-dependent policies in several settings, characterized by additional assumptions imposed on the behavior policy (memoryless vs. history-dependent) and/or the state-revealing property of the POMDP (single-step vs. multi-step revealing). We further show that some hardness can be circumvented by a natural model-based algorithm—whose analysis has surprisingly eluded the literature despite the algorithm’s simplicity—demonstrating provable separation between model-free and model-based OPE in POMDPs.

584Diffusion Attribution Score: Which Training Sample Determines Your Generation?

[openreview] [pdf]

Abstract As diffusion models advance, the scientific community is actively developing methods to curb the misuse of generative models, which aims to prevent the reproduction of copyrighted, explicitly violent, or personally sensitive information in generated images. One strategy is to identify the contribution of training samples in generative models by evaluating their influence to the generated images, a task known as data attribution. Existing data attribution approaches on diffusion models suggest representing the contribution of a specific training sample by evaluating the change in the diffusion loss when the sample is included versus excluded from the training process. However, we argue that the direct usage of diffusion loss cannot represent such a contribution accurately due to the diffusion loss calculation. Specifically, these approaches measure the divergence between predicted and ground truth distributions, which leads to an indirect comparison between the predicted distributions and cannot represent the variances between model behaviors. To address these issues, we aim to measure the direct comparison between predicted distributions with an attribution score to analyse the training sample importance, which is achieved by Diffusion Attribution Score (DAS). Underpinned by rigorous theoretical analysis, we elucidate the effectiveness of DAS. Additionally, we explore strategies to accelerate DAS calculations, facilitating its application to large-scale diffusion models. Our extensive experiments across various datasets and diffusion models demonstrate that DAS significantly surpasses previous benchmarks in terms of the linear data-modelling score, establishing new state-of-the-art performance. Code is available athttps://anonymous.4open.science/r/Diffusion-Attribution-Score-411F.

585Comparing Targeting Strategies for Maximizing Social Welfare with Limited Resources

[openreview] [pdf]

Abstract Machine learning is increasingly used to select which individuals receive limited resource interventions in domains such as human services, education, development, and more. However, it is often not apparent what the right quantity is for models to predict. In particular, policymakers rarely have access to data from a randomized controlled trial (RCT) that would enable accurate estimates of treatment effects – which individuals would benefit more from the intervention. Observational data is more likely to be available, creating a substantial risk of bias in treatment effect estimates. Practitioners instead commonly use a technique termed “risk-based targeting” where the model is just used to predict each individual’s status quo outcome (an easier, non-causal task). Those with higher predicted risk are offered treatment. There is currently almost no empirical evidence to inform which choices lead to the most effect machine learning-informed targeting strategies in social domains. In this work, we use data from 5 real-world RCTs in a variety of domains to empirically assess such choices. We find that risk-based targeting is almost always inferior to targeting based on even biased estimates of treatment effects. Moreover, these results hold even when the policymaker has strong normative preferences for assisting higher-risk individuals. Our results imply that, despite the widespread use of risk prediction models in applied settings, practitioners may be better off incorporating even weak evidence about heterogeneous causal effects to inform targeting.

586Efficient and Accurate Explanation Estimation with Distribution Compression

[openreview] [pdf]

Abstract Exact computation of various machine learning explanations requires numerous model evaluations and in extreme cases becomes impractical. The computational cost of approximation increases with an ever-increasing size of data and model parameters. Many heuristics have been proposed to approximate post-hoc explanations efficiently. This paper shows that the standard i.i.d. sampling used in a broad spectrum of algorithms for explanation estimation leads to an approximation error worthy of improvement. To this end, we introduce compress then explain (CTE), a new paradigm for more efficient and accurate explanation estimation. CTE uses distribution compression through kernel thinning to obtain a data sample that best approximates the marginal distribution. We show that CTE improves the estimation of removal-based local and global explanations with negligible computational overhead. It often achieves an on-par explanation approximation error using 2-3x fewer samples, i.e. requiring 2-3x fewer model evaluations. CTE is a simple yet powerful plug-in for any explanation method that now relies on i.i.d. sampling.

587Adaptive Priors from Learning Trajectories for Function-Space Bayesian Neural Networks

[openreview] [pdf]

Abstract Tractable Function-space Variational Inference (T-FVI) provides a way to estimate the function-space Kullback-Leibler (KL) divergence between a random prior function and its posterior. This allows the optimization of the function-space KL divergence via Stochastic Gradient Descent (SGD) and thus simplifies the training of function-space Bayesian Neural Networks (BNNs). However, function-space BNNs on high-dimensional datasets typically require deep neural networks (DNN) with numerous parameters, and thus defining suitable function-space priors remains challenging. For instance, the Gaussian Process (GP) prior suffers from scalability issues, and DNNs do not provide a clear way to set appropriate weight parameters to achieve meaningful function-space priors. To address this issue, we propose an explicit form of function-space priors that can be easily integrated into widely-used DNN architectures, while adaptively incorporating different levels of uncertainty based on the function’s inputs. To achieve this, we consider DNNs as Bayesian last-layer models to obtain the explicit mean and variance functions of our prior. The parameters of these explicit functions are determined using the weight statistics over the learning trajectory. Our empirical experiments show improved uncertainty estimation in image classification, transfer learning, and UCI regression tasks.

588Generalization Performance Gap Analysis between Centralized and Federated Learning: How to Bridge this Gap?

[openreview] [pdf]

Abstract The rising interest in decentralized data and privacy protection has led to the emergence of Federated Learning. Many studies have compared federated training with classical training approaches using centralized data and found from experiments that models trained in a federated setup with equal resources perform poorly on tasks. However, these studies have generally been empirical and have not explored the performance gap further from a theoretical perspective. The lack of theoretical understanding prevents figuring out whether federated algorithms are necessarily inferior to centralized algorithms in performance and how large this gap is according to the training settings. Also, it hinders identifying valid ways to close this performance distance. This paper fills this theoretical gap by formulating federated training as an SGD (Stochastic Gradient Descent) optimization problem over decentralized data and defining the performance gap within the PAC-Bayes (Probably Approximately Correct Bayesian) framework. Through theoretical analysis, we derive non-vacuous bounds on this performance gap, revealing that the difference in generalization performance necessarily exists when training resources are equal for both training setups and that variations in the training parameters affect the gap. Moreover, we also prove that the complete elimination of the performance gap is only possible by introducing new clients or adding new data to existing clients. Advantages in other training resources are not feasible for closing the gap, such as giving larger models or more communication rounds to federated scenarios. Our theoretical findings are validated by extensive experimental results from different model architectures and datasets.

589FrugalNeRF: Fast Convergence for Few-shot Novel View Synthesis without Learned Priors

[openreview] [pdf]

Abstract Neural Radiance Fields (NeRF) face significant challenges in few-shot scenarios, particularly due to overfitting and long training times for high-fidelity rendering. While current approaches like FreeNeRF and SparseNeRF use frequency regularization or pre-trained priors, they can be limited by complex scheduling or potential biases. We introduce FrugalNeRF, a novel few-shot NeRF framework that leverages weight-sharing voxels across multiple scales to efficiently represent scene details. Our key contribution is a cross-scale geometric adaptation training scheme that selects pseudo ground truth depth based on reprojection error from both training and novel views across scales. This guides training without relying on externally learned priors, allowing FrugalNeRF to fully utilize available data. While not dependent on pre-trained priors, FrugalNeRF can optionally integrate them for enhanced quality without affecting convergence speed. Our method generalizes effectively across diverse scenes and converges more rapidly than state-of-the-art approaches. Our experiments on standard LLFF, DTU, and RealEstate-10K datasets demonstrate that FrugalNeRF outperforms existing few-shot NeRF models, including those using pre-trained priors, while significantly reducing training time, making it a practical solution for efficient and accurate 3D scene reconstruction.

590Sampling Process Brings Additional Bias for Debiased Recommendation

[openreview] [pdf]

Abstract In recommender systems, selection bias arises from the users’ selective interactions with items, which poses a widely-recognized challenge for unbiased evaluation and learning for recommendation models. Recently, doubly robust and its variants have been widely studied to achieve debiased learning of prediction models. However, if the users and items in the training set are not exactly the same as those in the test set, even if the imputed errors and learned propensities are accurate, all previous doubly robust based debiasing methods are biased. To tackle this problem, in this paper, we first derive the bias of doubly robust learning methods and provide alternative unbiasedness conditions when users and items are sampled from a superpopulation. Then we propose a novel superpopulation doubly robust target learning approach (SuperDR), which is unbiased when either the imputation model or propensity model is correctly specified. We further derive the generalization error bound of the proposed method under superpopulation, and show that it can be effectively controlled by the proposed target learning approach. We conduct extensive experiments on three real-world datasets, including a large-scale industrial dataset, to demonstrate the effectiveness of our method.

591Gap Preserving Distillation by Building Bidirectional Mappings with A Dynamic Teacher

[openreview] [pdf]

Abstract Knowledge distillation aims to transfer knowledge from a large teacher model to a compact student counterpart, often coming with a significant performance gap between them. We find that a too-large performance gap can hamper the training process, which is also verified in recent studies. To address this, we propose a Gap Preserving Distillation (GPD) method that trains an additional dynamic teacher model from scratch along with training the student to bridge this gap. In this way, it becomes possible to maintain a reasonable performance gap between teacher and student during the whole distillation process. To further strengthen distillation from the dynamic teacher to the student, we develop a hard strategy by enforcing them to share parameters and encouraging parameter inheritance. Besides hard strategy, we also build the soft bidirectional mappings between them which are built on an Inverse Reparameterization (IR) method and a Channel-Branch Reparameterization (CBR) strategy. We highlight that our IR is able to initialize a larger dynamic teacher with an arbitrary expansion ratio, while preserving exactly the same accuracy as the given student model. In this way, it guarantees that the dynamic teacher and student start from the same point and avoid a too large gap in early stage of training. As for our CBR, with parameter-sharing, it directly extracts an effective student model from the well-learned dynamic teacher without any post-training, making our method highly flexible for model deployment. In the experiments, GPD significantly outperforms existing distillation methods on top of both CNNs and transformers architectures, achieving up to 1.58% accuracy improvement. Interestingly, GPD also generalizes well to the scenarios without a pretrained teacher, including training from scratch and fine-tuning, yielding a large improvement of 1.80% and 0.89% on ResNet18, respectively.

592High dimensional Bayesian Optimization via Condensing-Expansion Projection

[openreview] [pdf]

Abstract In high-dimensional settings, Bayesian optimization (BO) can be expensive and infeasible. The random embedding Bayesian optimization algorithm is commonly used to address high-dimensional BO challenges. However, this method relies on the effective subspace assumption on the optimization problem’s objective function, which limits its applicability. In this paper, we introduce Condensing-Expansion Projection Bayesian optimization (CEPBO), a novel random projection-based approach for high-dimensional BO that does not reply on the effective subspace assumption. The approach is both simple to implement and highly practical. We present two algorithms based on different random projection matrices: the Gaussian projection matrix and the hashing projection matrix. Experimental results demonstrate that both algorithms outperform existing random embedding-based algorithms in most cases, achieving superior performance on high-dimensional BO problems. The code is available in \url{https://anonymous.4open.science/r/CEPBO-14429}.

593Learning system dynamics without forgetting

[openreview] [pdf]

Abstract Observation-based trajectory prediction for systems with unknown dynamics is essential in fields such as physics and biology. Most existing approaches are limited to learning within a single system with fixed dynamics patterns. However, many real-world applications require learning across systems with evolving dynamics patterns, a challenge that has been largely overlooked. To address this, we systematically investigate the problem of Continual Dynamics Learning (CDL), examining task configurations and evaluating the applicability of existing techniques, while identifying key challenges. In response, we propose the Mode-switching Graph ODE (MS-GODE) model, which integrates the strengths LG-ODE and sub-network learning with a mode-switching module, enabling efficient learning over varying dynamics. Moreover, we construct a novel benchmark of biological dynamic systems for CDL, Bio-CDL, featuring diverse systems with disparate dynamics and significantly enriching the research field of machine learning for dynamic systems. Our code and benchmark datasets will be publicly available.

594Expected Sliced Transport Plans

[openreview] [pdf]

Abstract The optimal transport (OT) problem has gained significant traction in modern machine learning for its ability to: (1) provide versatile metrics, such as Wasserstein distances and their variants, and (2) determine optimal couplings between probability measures. To reduce the computational complexity of OT solvers, methods like entropic regularization and sliced optimal transport have been proposed. The sliced OT framework improves efficiency by comparing one-dimensional projections (slices) of high-dimensional distributions. However, despite their computational efficiency, sliced-Wasserstein approaches lack a transportation plan between the input measures, limiting their use in scenarios requiring explicit coupling. In this paper, we address two key questions: Can a transportation plan be constructed between two probability measures using the sliced transport framework? If so, can this plan be used to define a metric between the measures? We propose a ‘lifting’ operation to extend one-dimensional optimal transport plans back to the original space of the measures. By computing the expectation of these lifted plans, we derive a new transportation plan, termed expected sliced transport (EST) plans. We further prove that using the EST plan to weight the sum of the individual Euclidean costs xyp|x - y|^p for moving from xx to yy results in a valid metric between the input discrete probability measures. Finally, we demonstrate the connection between our approach and the recently proposed min-SWGG, along with illustrative numerical examples that support our theoretical findings.

595LEARN TO LEARN CONSISTENTLY

[openreview] [pdf]

Abstract In the few-shot learning problem, a model trained on a disjoint meta-train dataset is required to address novel tasks with limited novel examples. A key challenge in few-shot learning is the model’s propensity to learn biased shortcut features(e.g., background, noise, shape, color), which are sufficient to distinguish the few ex- amples during fast adaptation but lead to poor generalization. In our work, we observed when the model learns with higher consistency, the model tends to be less influenced by shortcut features, resulting in better generalization. Based on the observation, we propose a simple yet effective meta-learning method named Meta Self-Distillation. By maximizing the consistency of the learned knowledge during the meta-train phase, the model initialized by our method shows better generalization in the meta-test phase. Extensive experiments demonstrate that our method improves the model’s generalization across various few-shot classification scenarios and enhances the model’s ability to learn consistently.

596RecFlow: An Industrial Full Flow Recommendation Dataset

[openreview] [pdf]

Abstract Industrial recommendation systems (RS) rely on the multi-stage pipeline to balance effectiveness and efficiency when delivering items from a vast corpus to users. Existing RS benchmark datasets primarily focus on the exposure space, where novel RS algorithms are trained and evaluated. However, when these algorithms transition to real-world industrial RS, they face a critical challenge: handling unexposed items—a significantly larger space than the exposed one. This discrepancy profoundly impacts their practical performance. Additionally, these algorithms often overlook the intricate interplay between multiple RS stages, resulting in suboptimal overall system performance. To address this issue, we introduce RecFlow—an industrial full-flow recommendation dataset designed to bridge the gap between offline RS benchmarks and the real online environment. Unlike existing datasets, RecFlow includes samples not only from the exposure space but also unexposed items filtered at each stage of the RS funnel. Our dataset comprises 38M interactions from 42K users across nearly 9M items with additional 1.9B stage samples collected from 9.3M online requests over 37 days and spanning 6 stages. Leveraging the RecFlow dataset, we conduct courageous exploration experiments, showcasing its potential in designing new algorithms to enhance effectiveness by incorporating stage-specific samples. Some of these algorithms have already been deployed online, consistently yielding significant gains. We propose RecFlow as the first comprehensive benchmark dataset for the RS community, supporting research on designing algorithms at any stage, study of selection bias, debiased algorithms, multi-stage consistency and optimality, multi-task recommendation, and user behavior modeling. The RecFlow dataset, along with the corresponding source code, is publicly available at \textcolor{red}{\url{https://github.com/RecFlow-ICLR/RecFlow}}. The dataset is licensed under CC-BY-NC-SA-4.0 International License.

597OOD-Chameleon: Is Algorithm Selection for OOD Generalization Learnable?

[openreview] [pdf]

Abstract Out-of-distribution (OOD) generalization is challenging because distribution shifts come in many forms. A multitude of learning algorithms exist and each can improve performance inspecificOOD situations. We posit that much of the challenge of OOD generalization lies inchoosing the right algorithm for the right dataset. However, such algorithm selection is often elusive under complex real-world shifts. In this work, we formalize the task ofalgorithm selection for OOD generalizationand investigate whether it could be approached by learning.We propose a solution, dubbed OOD-Chameleon that treats the task as a supervised classification over candidate algorithms. We construct adataset of datasetsto learn from, which represents diverse types, magnitudes and combinations of shifts (covariate shift, label shift, spurious correlations). We train the model to predict the relative performance of algorithms given a dataset’s characteristics. This enablesa prioriselection of the best learning strategy, i.e. without training various models as needed with traditional model selection.Our experiments show that the adaptive selection outperforms any individual algorithm and simple selection heuristics, on unseen datasets of controllable and realistic image data. Inspecting the model shows that it learns non-trivial data/algorithms interactions, and reveals the conditions for any one algorithm to surpass another. This opens new avenues for (1) enhancing OOD generalization with existing algorithms instead of designing new ones, and (2) gaining insights into the applicability of existing algorithms with respect to datasets’ properties.

598Breaking Free: Hacking Diffusion Models for Generating Adversarial Examples and Bypassing Safety Guardrails

[openreview] [pdf]

Abstract Deep neural networks can be exploited using natural adversarial samples, which do not impact human perception. Current approaches often rely on synthetically altering the distribution of adversarial samples compared to the training distribution. In contrast, we propose EvoSeed, a novel evolutionary strategy-based algorithmic framework that uses auxiliary Conditional Diffusion and Classifier models to generate photo-realistic natural adversarial samples. We employ CMA-ES to optimize the initial seed vector search, which, when processed by the Conditional Diffusion Model, results in the natural adversarial sample misclassified by the Classifier Model. Experiments show that generated adversarial images are of high image quality, raising concerns about generating harmful content bypassing safety classifiers. We also show that beyond generating adversarial images, EvoSeed can also be used as a red-teaming tool to understand classification systems’ misclassification. Our research opens new avenues for understanding the limitations of current safety mechanisms and the risk of plausible attacks against classifier systems using image generation.

599Imagine to Ensure Safety in Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract This work investigates the safety exploration problem, where an agent must maximize performance while satisfying safety constraints. To address this problem, we propose a method that includes a learnable world model and two policies, a high-level policy and a low-level policy, that ensure safety at both levels. The high-level policy generates safe subgoals for the low-level policy, which progressively guide the agent towards the final goal. Through trajectory imagination, the low-level policy learns to safely reach these subgoals. The proposed method was evaluated on the standard benchmark, SafetyGym, and demonstrated superior performance quality while maintaining comparable safety violations compared to state-of-the-art approaches. In addition, we investigated an alternative implementation of safety in hierarchical reinforcement learning (HRL) algorithms using Lagrange multipliers, and demonstrated in the custom long-horizon environment SafeAntMaze that our approach achieves comparable performance while more effectively satisfying safety constraints, while the flat safe policy fails to accomplish this task.

600Beyond-Expert Performance with Limited Demonstrations: Efficient Imitation Learning with Double Exploration

[openreview] [pdf]

Abstract learning where the goal is to learn a policy that mimics the expert’s behavior. In practice, it is often challenging to learn the expert policy from a limited number of demonstrations accurately due to the complexity of the state space. Moreover, it is essential to explore the environment and collect data to achieve beyond-expert performance. To overcome these challenges, we propose a novel imitation learning algorithm namely Imitation Learning with Double Exploration (ILDE), which implements exploration in two aspects: (1) optimistic policy optimization via an exploration bonus that rewards state-action pairs with high uncertainty to potentially improve the convergence to the expert policy, and (2) curiosity-driven exploration of the states that deviate from the demonstration trajectories to potentially yield beyond-expert performance. Empirically, we demonstrate that ILDE outperforms the state-of-the-art imitation learning algorithms in terms of sample efficiency and achieves beyond-expert performance on Atari and MuJoCo tasks with fewer demonstrations than those in previous work. We also provide theoretical justification of ILDE as an uncertainty-regularized policy optimization method with optimistic exploration, leading to a regret growing sublinearly in the number of episodes.

601Inertial Confinement Fusion Forecasting via Large Language Models

[openreview] [pdf]

Abstract Controlled fusion energy is deemed pivotal for the advancement of human civilization. In this study, we introduce LPI-LLM\textbf{LPI-LLM}, a novel integration of Large Language Models (LLMs) with classical reservoir computing paradigms tailored to address a critical challenge, Laser-Plasma Instabilities (LPI\texttt{LPI}), in Inertial Confinement Fusion (ICF\texttt{ICF}). Our approach offers several key contributions: Firstly, we propose the LLM-anchored Reservoir\textit{LLM-anchored Reservoir}, augmented with a Fusion-specific Prompt\textit{Fusion-specific Prompt}, enabling accurate forecasting of LPI\texttt{LPI}-generated-hot electron dynamics during implosion. Secondly, we develop Signal-Digesting Channels\textit{Signal-Digesting Channels} to temporally and spatially describe the driver laser intensity across time, capturing the unique characteristics of ICF\texttt{ICF} inputs. Lastly, we design the Confidence Scanner\textit{Confidence Scanner} to quantify the confidence level in forecasting, providing valuable insights for domain experts to design the ICF\texttt{ICF} process. Extensive experiments demonstrate the superior performance of our method, achieving 1.90 CAE, 0.14 top-1\texttt{top-1} MAE, and 0.11 top-5\texttt{top-5} MAE in predicting Hard X-ray (HXR\texttt{HXR}) energies emitted by the hot electrons in ICF\texttt{ICF} implosions, which presents state-of-the-art comparisons against concurrent best systems. Additionally, we present LPI4AI\textbf{LPI4AI}, the first LPI\texttt{LPI} benchmark based on physical experiments, aimed at fostering novel ideas in LPI\texttt{LPI} research and enhancing the utility of LLMs in scientific exploration. Overall, our work strives to forge an innovative synergy between AI and ICF\texttt{ICF} for advancing fusion energy.

602Can a Bayesian oracle prevent harm from an agent?

[openreview] [pdf]

Abstract Is there a way to design powerful AI systems based on machine learning methods that would satisfy probabilistic safety guarantees? With the long-term goal of obtaining a probabilistic guarantee that would apply in every context, we consider estimating a context-dependent bound on the probability of violating a given safety specification. Such a risk evaluation would need to be performed at run-time to provide a guardrail against dangerous actions of an AI. Noting that different plausible hypotheses about the world could produce very different outcomes, and because we do not know which one is right, we derive bounds on the safety violation probability predicted under the true but unknown hypothesis. Such bounds could be used to reject potentially dangerous actions. Our main results involve searching for cautious but plausible hypotheses, obtained by a maximization that involves Bayesian posteriors over hypotheses. We consider two forms of this result, in the i.i.d. case and in the non-i.i.d. case, and conclude with open problems towards turning such theoretical results into practical AI guardrails.

603Pan for gold

[openreview] [pdf]

Abstract Training a deep model is fundamentally about reducing loss, and we often believe that a ‘‘good model’’ is one that trained with a ‘‘good loss.’’ This paper investigates that belief. We show that even when learning with unstructured, randomized labels, models can still discover generalized features. We propose that generalization in deep learning is not about learning the structure of data through a well-structured loss, but rather a process akin to ‘‘pan for gold,’’ where gradient descent shakes through the function space, naturally stabilizing useful features. To support this, we present quantitative and qualitative experimental evidence, and introduce the Panning through Unstructured Label (PUL) algorithm. We demonstrate its effectiveness across various fields, showing improvements in unsupervised domain adaptation, state-of-the-art performance in object discovery, and its ability to mitigate massive attention issues. Finally, we offer a new interpretation of existing deep learning assumptions, challenging the conventional beliefs in the field.

604Synthetic Theorem Generation in Lean

[openreview] [pdf]

Abstract The application of large language models (LLMs) to theorem proving presents a promising avenue for advancing formal mathematics. Interactive theorem provers, such as Lean, offer a rigorous framework within which these models can assist in or automate proof discovery, grounding their reasoning capabilities in a sound, verifiable formal system. However, the potential of LLMs in this domain is constrained by the limited availability of formal proof corpora for training. To address this limitation, we introduce a synthetic theorem generator capable of producing novel Lean theorems and their corresponding proofs. Our approach employs forward reasoning to synthesize new propositions from premises drawn from existing Lean libraries. We explore candidate reasoning steps using a search strategy that optimizes for diversity of output, apply them in a linear fashion that avoids irrelevant proof steps, and assess their effect by meta-programmatically executing corresponding Lean tactics. These methods enable the generation of an arbitrary number of new theorems and proofs across various mathematical domains, using common Lean proof tactics while ensuring the correctness of generated theorems by construction. We demonstrate the efficacy of the generated theorems and training data by fine-tuning models on synthetic theorems and evaluating them on the miniF2F-test benchmark. Our results show improvements in theorem-proving capabilities, with accuracy increasing from 37.3% to 38.5% for the Falcon2-11B model trained solely on Mathlib, and from 38.1% to 39.3% for the same model trained on a mix of rich datasets. These improvements highlight the value of our diverse synthetic data in augmenting limited existing corpora of formal proofs, providing complementary information that enhances LLMs’ performance on theorem-proving tasks even when combined with other datasets.

605Do LLM Agents Have Regret? A Case Study in Online Learning and Games

[openreview] [pdf]

Abstract Large language models (LLMs) have been increasingly employed for (interactive) decision-making, via the development of LLM-based autonomous agents. Despite their emerging successes, the performance of LLM agents in decision-making has not been fully investigated through quantitative metrics, especially in the multi-agent setting when they interact with each other, a typical scenario in real-world LLM-agent applications. To better understand the limits of LLM agents in these interactive environments, we propose to study their interactions in benchmark decision-making settings in online learning and game theory, through the performance metric of regret. We first empirically study the no-regret behaviors of LLMs in canonical non-stochastic online learning problems, as well as the emergence of equilibria when LLM agents interact through playing repeated games. We then provide some theoretical insights into the no-regret behaviors of LLM agents, under certain assumptions on the supervised pre-training and the rationality model of human decision-makers who generate the data. Notably, we also identify (simple) cases where advanced LLMs such as GPT-4 fail to be no-regret. To further promote the no-regret behaviors, we propose a novel unsupervised training loss of regret-loss, which, in contrast to the supervised pre-training loss, does not require the labels of (optimal) actions. Finally, we establish the statistical guarantee of generalization bound for regret-loss minimization, and more importantly, the optimization guarantee that minimizing such a loss may automatically lead to known no-regret learning algorithms, when single-layer self-attention models are used. Our further experiments demonstrate the effectiveness of our regret-loss, especially in addressing the above “regrettable” cases.

606MissDiff: Training Diffusion Models on Tabular Data with Missing Values

[openreview] [pdf]

Abstract The diffusion model has shown remarkable performance in modeling data distributions and synthesizing data. However, the vanilla diffusion model requires complete or fully observed training data. Incomplete data is a common issue in various real-world applications, including healthcare and finance, particularly when dealing with tabular datasets. This work considers learning from data with missing values for missing value imputations and generating synthetic complete data in a unified framework. With minimal assumptions on the missing mechanisms, our method models the score of complete data distribution by denoising score matching on data with missing values. We prove that the proposed method can recover the score of the complete data distribution, and the proposed training objective serves as an upper bound for the negative likelihood of observed data. Extensive experiments on imputation tasks together with generation tasks demonstrate that our proposed framework outperforms existing state-of-the-art approaches on multiple tabular datasets.

607Unified Perspectives on Signal-to-Noise Diffusion Models

[openreview] [pdf]

Abstract Diffusion models (DM) have become essential components of generative modeling, demonstrating exceptional performance in domains like image synthesis, audio generation, and complex data interpolation. Signal-to-Noise diffusion models represent a broad family encompassing many state-of-the-art models. Although several efforts have been made to explore Signal-to-Noise (S2N) diffusion models from different angles, a comprehensive study that connects these viewpoints and introduces new insights is still needed. In this work, we provide an in-depth perspective on noise schedulers, analyzing their role through the lens of the signal-to-noise ratio (SNR) and its relationship to information theory. Based on this framework, we introduce a generalized backward equation to improve the efficiency of the inference process.

608Causally Motivated Diffusion Sampling Frameworks for Harnessing Contextual Bias

[openreview] [pdf]

Abstract Diffusion models have shown remarkable performance in text-guided image generation when trained on large-scale datasets, usually collected from the Internet. These large-scale datasets have contextual biases (e.g., co-occurrence of objects) which will naturally cascade into the diffusion model. For example, given a text prompt of ``a photo of the living room’', diffusion models frequently generate a couch, a rug, and a lamp together while rarely generating objects that do not commonly occur in a living room. Intuitively, contextual bias can be helpful because it naturally draws the scene even without detailed information (i.e., visual autofill). On the other hand, contextual bias can limit the diversity of generated images (e.g., diverse object combinations) to focus on common image compositions. To have the best of both worlds, we argue that contextual bias needs to be strengthened or weakened depending on the situation. Previous causally-motivated studies have tried to deal with such issues by analyzing confounders (i.e., contextual bias) and augmenting training data or designing their models to directly learn the interventional distribution. However, due to the large-scale nature of these models, obtaining and analyzing the data or training the huge model from scratch is beyond reach in practice. To tackle this problem, we propose two novel frameworks for strengthening or weakening the contextual bias of pretrained diffusion models without training any parameters or accessing training data. Briefly, we first propose causal graphs to explicitly model contextual bias in the generation process. We then sample the hidden confounder due to contextual bias by sampling from a chain of pretrained large-scale models. Finally, we use samples from the confounder to strengthen or weaken the contextual bias based on methods from causal inference. Experiment results show that our proposed methods are effective in generating more realistic and diverse images than the regular sampling method.

609Critique-out-Loud Reward Models

[openreview] [pdf]

Abstract Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the capabilities of reward models as they must reason implicitly about the quality of a response, i.e., preference modeling must be performed in a single forward pass through the model. To enable reward models to reason explicitly about the quality of a response, we introduce Critique-out-Loud (CLoud) reward models. CLoud reward models operate by first generating a natural language critique of the assistant’s response that is then used to predict a scalar reward for the quality of the response. We demonstrate the success of CLoud reward models for both Llama-3-8B and 70B base models: compared to classic reward models CLoud reward models improve pairwise preference classification accuracy on RewardBench by 4.65 and 5.84 percentage points for the 8B and 70B base models respectively. Furthermore, CLoud reward models lead to a Pareto improvement for win rate on ArenaHard when used as the scoring model for Best-of-N. Finally, we explore how to exploit the dynamic inference compute capabilities of CLoud reward models by performing self-consistency decoding for reward prediction.

610Federated Learning in Streaming Subspace

[openreview] [pdf]

Abstract Federated learning (FL) has received widespread attention due to its distributed training and privacy protection. However, existing federated learning methods encounter significant challenges, such as increased communication costs and degraded model performance, when processing non-independently and identically distributed (non-IID) data. This paper jointly alleviates these problems by analyzing and exploiting the low-rank properties of global model trajectories.Primarily, we introduce a streaming subspace update strategy and then propose a general federated learning framework, F\textbf{F}erated L\textbf{L}earning in S\textbf{S}treaming S\textbf{S}ubspace (FLSS\texttt{FLSS}). In FLSS\texttt{FLSS}, local model updates are restricted to the global streaming subspace, resulting in low-dimensional trajectories. The server then aggregates these trajectories to update the global model. Comprehensive experiments verify the effectiveness of our framework. In Cifar100, the FLSS\texttt{FLSS}-equipped FL method outperforms the baseline by 2.14%\% and reduces the communication cost by 80%\%. FLSS\texttt{FLSS} utilizes the early training information of the global model to simultaneously improve the performance and communication efficiency of federated learning.

611Taming Transformer Without Using Learning Rate Warmup

[openreview] [pdf]

Abstract Scaling Transformer to a large scale without using some technical tricks such as learning rate warump and an obviously lower learning rate, is an extremely challenging task, and is increasingly gaining more attention. In this paper, we provide a theoretical analysis for training Transformer and reveal a key problem behind the model crash phenomenon in the training, \ie the \textit{spectral energy concentration} of WqWk{W_q}^{\top} W_k, which is the reason for a malignant entropy collapse. To remedy this problem, motivated by \textit{Weyl’s Inequality}, we present a novel optimization strategy---making weight updating in successive steps smooth, that is, if the ratio σ1(Wt)σ1(Wt1)\frac{\sigma_{1}(\nabla W_t)}{\sigma_{1}(W_{t-1})} is larger than a threshold, where \nabla \bW_t is the updating quantity in step tt, we will automatically bound the learning rate to a weighted multiply of σ1(Wt1)σ1(Wt)\frac{\sigma_{1}(W_{t-1})}{\sigma_{1}(\nabla W_t)}. Our optimization strategy is able to prevent the rapid spectral energy concentration to only a few directions, and thus is able to avoid the malignant entropy collapse that will trigger the model crash. We conduct extensive experiments using ViT, Swin-Transformer and GPT, showing that our optimization strategy can effectively and stably train these (Transformer) models without using learning rate warmup.

612Reward Learning From Preference With Ties

[openreview] [pdf]

Abstract Reward learning plays a pivotal role in Reinforcement Learning from Human Feedback (RLHF), ensuring the alignment of language models. The Bradley-Terry (BT) model stands as the prevalent choice for capturing human preferences from datasets containing pairs of chosen and rejected responses. In preference modeling, the focus is not on absolute values but rather on the reward difference between chosen and rejected responses, referred to as preference strength. Thus, precise evaluation of preference strength holds paramount importance in preference modeling. However, an easily overlooked factor significantly affecting preference strength measurement is that human attitudes towards two responses may not solely indicate a preference for one over the other and ties are also a common occurrence. To address this, we propose the adoption of the generalized Bradley-Terry model -- the Bradley-Terry model with ties (BTT) -- to accommodate tied preferences, thus leveraging additional information. We prove that even with the access to the true distributions of prompt and response, disregarding ties can lead to a notable bias in preference strength measurement. Comprehensive experiments further validate the advantages of incorporating ties in preference modeling. Notably, fine-tuning with BTT significantly outperforms fine-tuning with BT on synthetic preference datasets with ties, labeled by state-of-the-art open-source LLMs.

613Enhancing Group Fairness in Federated Learning through Personalization

[openreview] [pdf]

Abstract Personalized Federated Learning (FL) algorithms collaboratively train customized models for each client, enhancing the accuracy of the learned models on the client’s local data (e.g., by clustering similar clients, by fine-tuning models locally, or by imposing regularization terms). In this paper, we investigate the impact of such personalization techniques on the group fairness of the learned models, and show that personalization can also lead to improved (local) fairness as an unintended benefit. We begin by illustrating these benefits of personalization through numerical experiments comparing several classes of personalized FL algorithms against a baseline FedAvg algorithm, elaborating on the reasons behind improved fairness using personalized FL, and then providing analytical support. Motivated by these, we then show how to build on this (unintended) fairness benefit, by further integrating a fairness metric into the cluster-selection procedure of clustering-based personalized FL algorithms, and improve the fairness-accuracy trade-off attainable through them. Specifically, we propose two new fairness-aware federated clustering algorithms, Fair-FCA and Fair-FL+HC, extending the existing IFCA and FL+HC algorithms, and demonstrate their ability to strike a (tuneable) balance between accuracy and fairness at the client level.

614Towards Understanding Text Hallucination of Diffusion Models via Local Generation Bias

[openreview] [pdf]

Abstract Score-based diffusion models have achieved incredible performance in generating realistic images, audio, and video data. While these models produce high-quality samples with impressive details, they often introduce unrealistic artifacts, such as distorted fingers or hallucinated texts with no meaning. This paper focuses on textual hallucinations, where diffusion models correctly generate individual symbols but assemble them in a nonsensical manner. Through experimental probing, we consistently observe that such phenomenon is attributed it to the network’s local generation bias. Denoising networks tend to produce outputs that rely heavily on highly correlated local regions, particularly when different dimensions of the data distribution are nearly pairwise independent. This behavior leads to a generation process that decomposes the global distribution into separate, independent distributions for each symbol, ultimately failing to capture the global structure, including underlying grammar. Intriguingly, this bias persists across various denoising network architectures including MLP and transformers which have the structure to model global dependency. These findings also provide insights into understanding other types of hallucinations, extending beyond text, as a result of implicit biases in the denoising models. Additionally, we theoretically analyze the training dynamics for a specific case involving a two-layer MLP learning parity points on a hypercube, offering an explanation of its underlying mechanism.

615Accelerated Diffusion using Closed-form Discriminator Guidance

[openreview] [pdf]

Abstract Diffusion models are a state-of-the-art generative modeling framework that transform noise to images via Langevin sampling, guided by the score, which is the gradient of the logarithm of the data distribution. Recent works have shown empirically that the generation quality can be improved when guided by classifier network, which is typically the discriminator trained in a generative adversarial network (GAN) setting. In this paper, we propose a theoretical framework to analyze the effect of the GAN discriminator on Langevin-based sampling, and show that in IPM GANs, the optimal generator matches {\it score-like} functions, involving the flow-field of the kernel associated with a chosen IPM constraint space. Further, we show that IPM-GAN optimization can be seen as one of smoothed score-matching, where the scores of the data and the generator distributions are convolved with the kernel associated with the constraint. The proposed approach serves to unify score-based training and optimization of IPM-GANs. Based on these insights, we demonstrate that closed-form discriminator guidance, using a kernel-based implementation, results in improvements (in terms of CLIP-FID and KID metrics) when applied atop baseline diffusion models. We demonstrate these results by applying closed-form discriminator guidance to denoising diffusion implicit model (DDIM) and latent diffusion model (LDM) settings on the FFHQ and CelebA-HQ datasets. We also demonstrate improvements to accelerated time-step-shifted diffusion, when coupled with a wavelet-based noise estimator for latent-space image generation.

616FreqPrior: Improving Diffusion Models with Frequency Filtering Gaussian Noise as Prior

[openreview] [pdf]

Abstract Text-driven video generation has advanced significantly due to developments in diffusion models. Beyond the training and sampling phases, recent studies have investigated noise priors of diffusion models, as improved noise priors yield better generation results. One recent approach employs Fourier transform to manipulate noise, marking the initial exploration of frequency operations in this context. However, it often generates videos that lack motion dynamics and imaging details. In this work, we provide a comprehensive theoretical analysis of the variance decay issue present in existing methods, contributing to the loss of details and motion dynamics. Recognizing the critical impact of noise distribution on generation quality, we introduce FreqPrior, a novel noise initialization strategy that refines noise in the frequency domain. Our method features a novel filtering technique designed to address different frequency signals while maintaining the noise prior distribution that closely approximates a standard Gaussian distribution. Additionally, we propose a partial sampling process by perturbing the latent at an intermediate timestep during finding the noise prior, significantly reducing inference time without compromising quality. Extensive experiments on VBench demonstrate that our method achieves the highest scores in both quality and semantic assessments, resulting in the best overall total score. These results highlight the superiority of our proposed noise prior.

617Going Beyond Static: Understanding Shifts with Time-Series Attribution

[openreview] [pdf]

Abstract Distribution shifts in time-series data are complex due to temporal dependencies, multivariable interactions, and trend changes. However, robust methods often rely on structural assumptions that lack thorough empirical validation, limiting their practical applicability. In order to support an empirically grounded inductive approach to research, we introduce our Time-Series Shift Attribution (TSSA) framework, which analyzes application-specific patterns of distribution shifts. Our framework attributes performance degradation from various types of shifts to eachtemporal data propertyin a detailed manner, supported by theoretical analysis of unbiasedness and asymptotic properties. Empirical studies in real-world healthcare applications highlight how the TSSA framework enhances the understanding of time-series shifts, facilitating reliable model deployment and driving targeted improvements from both algorithmic and data-centric perspectives.

618SelKD: Selective Knowledge Distillation via Optimal Transport Perspective

[openreview] [pdf]

Abstract Knowledge Distillation (KD) has been a popular paradigm for training a (smaller) student model from its teacher model. However, little research has been done on the practical scenario where only a subset of the teacher’s knowledge needs to be distilled, which we term selective KD (SelKD). This demand is especially pronounced in the era of foundation models, where the teacher model can be significantly larger than the student model. To address this issue, we propose to rethink the knowledge distillation problem from the perspective of Inverse Optimal Transport (IOT). Previous Bayesian frameworks mapped each sample to the probabilities of corresponding labels in an end-to-end manner, which fixed the number of classification categories and hindered effective local knowledge transfer. In contrast, IOT calculates from the standpoint of transportation or matching, allowing for the flexible selection of samples and their quantities for matching. Traditional logit-based KD can be viewed as a special case within the IOT framework. Building on this IOT foundation, we formalize this setting in the context of classification, where only selected categories from the teacher’s category space are required to be recognized by the student in the context of closed-set recognition, which we call closed-set SelKD, enhancing the student’s performance on specific subtasks. Furthermore, we extend the closed-set SelKD, introducing an open-set version of SelKD, where the student model is required to provide a ``not selected" response for categories outside its assigned task. Experimental results on standard benchmarks demonstrate the superiority of our approach.

619Subsampled Ensemble Can Improve Generalization Tail Exponentially

[openreview] [pdf]

Abstract Ensemble learning is a popular technique to improve the accuracy of machine learning models. It hinges on the rationale that aggregating multiple weak models can lead to better models with lower variance and hence higher stability, especially for discontinuous base learners. In this paper, we provide a new perspective on ensembling. By selecting the best model trained on subsamples via majority voting, we can attain exponentially decaying tails for the excess risk, even if the base learner suffers from slow (i.e., polynomial) decay rates. This tail enhancement power of ensembling is agnostic to the underlying base learner and is stronger than variance reduction in the sense of exhibiting rate improvement. We demonstrate how our ensemble methods can substantially improve out-of-sample performances in a range of examples involving heavy-tailed data or intrinsically slow rates.

620Glauber Generative Model: Discrete Diffusion Models via Binary Classification

[openreview] [pdf]

Abstract We introduce the Glauber Generative Model (GGM), a new class of discrete diffusion models, to obtain new samples from a distribution given samples from a discrete space. GGM deploys a discrete Markov chain called the heat bath dynamics (or the Glauber dynamics) to denoise a sequence of noisy tokens to a sample from a joint distribution of discrete tokens. Our novel conceptual framework provides an exact reduction of the task of learning the denoising Markov chain to solving a class of binary classification tasks. More specifically, the model learns to classify a given token in a noisy sequence as signal or noise. In contrast, prior works on discrete diffusion models either solve regression problems to learn importance ratios, or minimize loss functions given by variational approximations. We apply GGM to language modeling and image generation, where images are discretized using image tokenizers like VQGANs. We show that it outperforms existing discrete diffusion models in language generation, and demonstrates strong performance for image generation without using dataset-specific image tokenizers. We also show that our model is capable of performing well in zero-shot control settings like text and image infilling.

621Training on the Test Task Confounds Evaluation and Emergence

[openreview] [pdf]

Abstract We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not a malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for the effect of training on the test task on benchmark evaluations. Put simply, to fine-tune each model under comparison on the same task-relevant data before evaluation. Lastly, we show that instances of emergent behavior disappear gradually as models train on the test task. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.

622Denoising Task Difficulty-based Curriculum for Training Diffusion Models

[openreview] [pdf]

Abstract Diffusion-based generative models have emerged as powerful tools in the realm of generative modeling. Despite extensive research on denoising across various timesteps and noise levels, a conflict persists regarding the relative difficulties of the denoising tasks. While various studies argue that lower timesteps present more challenging tasks, others contend that higher timesteps are more difficult. To address this conflict, our study undertakes a comprehensive examination of task difficulties, focusing on convergence behavior and changes in relative entropy between consecutive probability distributions across timesteps. Our observational study reveals that denoising at earlier timesteps poses challenges characterized by slower convergence and higher relative entropy, indicating increased task difficulty at these lower timesteps. Building on these observations, we introduce an easy-to-hard learning scheme, drawing from curriculum learning, to enhance the training process of diffusion models. By organizing timesteps or noise levels into clusters and training models with ascending orders of difficulty, we facilitate an order-aware training regime, progressing from easier to harder denoising tasks, thereby deviating from the conventional approach of training diffusion models simultaneously across all timesteps. Our approach leads to improved performance and faster convergence by leveraging benefits of curriculum learning, while maintaining orthogonality with existing improvements in diffusion training techniques. We validate these advantages through comprehensive experiments in image generation tasks, including unconditional, class-conditional, and text-to-image generation.

623DICE: Data Influence Cascade in Decentralized Learning

[openreview] [pdf]

Abstract Decentralized learning offers a promising approach to crowdsource computational workloads across geographically distributed compute interconnected through peer-to-peer networks, accommodating the exponentially increasing compute demands in the era of large models. However, the absence of proper incentives in locally connected decentralized networks poses significant risks of free riding and malicious behaviors. Data influence, which ensures fair attribution of data source contributions, holds great potential for establishing effective incentive mechanisms. Despite the importance, little effort has been made to analyze data influence in decentralized scenarios, due to non-trivial challenges arising from the distributed nature and the localized connections inherent in decentralized networks. To overcome this fundamental incentive problem, we propose DICE, the first comprehensive framework for analyzing Data Influence CascadEs in decentralized environments. Our framework characterizes how data influence cascades across the communication network and highlights the interplay between original data and network structure in shaping data influence in decentralized learning. We anticipate that DICE can open new avenues for incentive mechanism design and enable impactful applications of influence in decentralized learning, including anomaly detection, collaborator selection and machine unlearning.

624Breaking the Detection-Generalization Paradox on Out-Of-Distribution Data

[openreview] [pdf]

Abstract This work studies the trade-off between out-of-distribution (OOD) detection and generalization. We identify the Detection-Generalization Paradox in OOD data, where optimizing one objective can degrade the other. We investigate this paradox by analyzing the behaviors of models trained under different paradigms, focusing on representation, logits, and loss across in-distribution, covariate-shift, and semantic-shift data. Based on our findings, we propose Distribution-Robust Sharpness-Aware Minimization (DR-SAM), an optimization framework that balances OOD detection and generalization. Extensive experiments demonstrate the method’s effectiveness, offering a clear, empirically validated approach for improving detection and generalizationability in different benchmarks.

625Combining Analytical Smoothing with Surrogate Losses for Improved Decision-Focused Learning

[openreview] [pdf]

Abstract Many combinatorial optimization problems in routing, scheduling, and assignment involve parameters such as price or travel time that must be predicted from data; so-called predict-then-optimize (PtO) problems. Decision-focused learning (DFL) is a family of successful end-to-end techniques for PtO that trains machine learning models to minimize the error of the downstream optimization problems. For each instance, this requires computing the derivative of the optimization problem’s solution with respect to the predicted input parameters. Previous works in DFL employ two main approaches when the parameters appear linearly in the objective: (a) using a differentiable surrogate loss instead of regret; or (b) turning the combinatorial optimization problem into a differentiable mapping by smoothing the optimization to a quadratic program or other smooth convex optimization problem and minimizing the regret of that. We argue that while smoothing makes the optimization differentiable, for a large part, the derivative remains approximately zero almost everywhere, with highly non-zero values near the transition points. To address this plateau effect, we propose minimizing a surrogate loss even after smoothing. We experimentally demonstrate the advantage of minimizing surrogate losses instead of the regret after smoothing across a series of problems. Furthermore, we show that by minimizing a surrogate loss, a recently developed fast, fully neural optimization layer matches state-of-the-art performance while dramatically reducing training time up to five-fold. Thus, our paper opens new avenues for efficient and scalable DFL techniques.

626Outcome-based Semifactual Explanation For Reinforcement Learning

[openreview] [pdf]

Abstract Counterfactual explanations in reinforcement learning (RL) aim to answer what-if questions by showing sparse and minimal changes to states, which results in the probability mass moving from one action to another. Although these explanations are effective in classification tasks that look for the presence of concepts, RL brings new challenges that current counterfactual methods for RL still need to solve. These challenges include defining similarity in RL, out-of-distribution states, and lack of discriminative power. Given a state of interest called the query state, we solve these problems by asking how long the agent can execute the query state action without incurring a negative outcome regarding the expected return. We coin this outcome-based semifactual (OSF) explanation and find the OSF state by simulating trajectories from the query state. The last state in a subtrajectory where we can take the same action as in the query state without incurring a negative outcome is the OSF state. This state is discriminative, plausible, and similar to the query state. It abstracts away unimportant action switching with little explanatory value and shows the boundary between positive and negative outcomes. Qualitatively, we show that our method explains when it is necessary to switch actions. As a result, it is easier to understand the agent’s behavior. Quantitatively, we demonstrate that our method can increase policy performance and, at the same time, reduce how often the agent switches its action across six environments. The code and trained models are available athttps://anonymous.4open.science/r/osf-explanation-for-rl-E312/.

627Efficient Online Reinforcement Learning Fine-Tuning Should Not Retain Offline Data

[openreview] [pdf]

Abstract The modern paradigm in machine learning involves pre-training models on diverse data, followed by task-specific fine-tuning. In reinforcement learning (RL), this translates to learning via offline RL on a static dataset, followed by rapid online RL fine-tuning using autonomous interaction data. Most RL fine-tuning methods require continued training on offline data for stability and performance. This is undesirable because retaining offline data is both slow and expensive for large datasets, but has been inevitable so far. In this paper, we show that retaining offline data is completely unnecessary as long as we use a correctly-designed online RL approach for fine-tuning offline RL initializations. We start by analyzing the role of retaining offline data in online fine-tuning. We find that continued training on offline data is mostly useful for preventing a sudden unlearning of the offline RL value function at the onset of fine-tuning, caused by a distribution mismatch between the offline data and online rollouts. As a result, this unlearning erases the benefits of offline pre-training. Our approach, WSRL, mitigates this sudden unlearning by using a warmup phase that seeds the online RL run with a very small number of rollouts from the pre-trained policy. The data collected during warmup helps ``recalibrate’’ the offline Q-function to the online data better, allowing us to completely discard offline data without risking of destabilizing the online RL training. We show that WSRL is able to fine-tune without retaining any offline data, and is able to learn faster and attains higher performance than existing algorithms irrespective of whether they do or do not retain offline data.

628Replay concurrently or sequentially? A theoretical perspective on replay in continual learning

[openreview] [pdf]

Abstract Replay-based methods have shown superior performance to address catastrophic forgetting in continual learning (CL), where a subset of past data is stored and generally replayed together with new data in current task learning. While seemingly natural, it is questionable, though rarely questioned, if such a concurrent replay strategy is always the right way for replay in CL. Inspired by the fact in human learning that revisiting very different courses sequentially before final exams is more effective for students, an interesting open question to ask is whether a sequential replay can benefit CL more compared to a standard concurrent replay. However, answering this question is highly nontrivial considering a major lack of theoretical understanding in replay-based CL methods. To this end, we investigate CL in overparameterized linear models and provide a comprehensive theoretical analysis to compare two replay schemes: 1) Concurrent Replay, where the model is trained on replay data and new data concurrently; 2) Sequential Replay, where the model is trained first on new data and then sequentially on replay data for each old task. By characterizing the explicit form of forgetting and generalization error, we show in theory that sequential replay tends to outperform concurrent replay when tasks are less similar, which is corroborated by our simulations in linear models. More importantly, our results inspire a novel design of a hybrid replay method, where only replay data of similar tasks are used concurrently with the current data and dissimilar tasks are sequentially revisited using their replay data. As depicted in our experiments on real datasets using deep neural networks, such a hybrid replay method improves the performance of standard concurrent replay by leveraging sequential replay for dissimilar tasks. By providing the first comprehensive theoretical analysis on replay, our work has great potentials to open up more principled designs for replay-based CL.

629Invariance to Planning in Goal-Conditioned RL

[openreview] [pdf]

Abstract We study goal-conditioned RL through the lens of generalization, but not in the traditional sense of random augmentations and domain randomization. Rather, we aim to learn goal-directed policies that generalize with respect to the horizon: after training to reach nearby goals (which are easy to learn), these policies should succeed in reaching distant goals (which are quite challenging to learn). In the same way that invariance is closely linked with generalization is other areas of machine learning (e.g., normalization layers make a network invariant to scale, and therefore generalize to inputs of varying scales), we show that this notion of horizon generalization is closely linked with invariance to planning: a policy navigating towards a goal will select the same actions as if it were navigating to a waypoint en route to that goal. Horizon generalization and invariance to planning are appealing because of their potential reach: they imply that a policy trained to reach nearby goals would succeed at reaching goals that are arbitrarily more distant.Our theoretical analysis proves that both horizon generalization and planning invariance are possible, under some assumptions. We present new experimental results, as well as recalling results from prior work, in support of our theoretical results. Taken together, our results open the door to studying how techniques for invariance and generalization developed in other areas of machine learning might be adapted to achieve this alluring property.

630FairDropout: Using Example-Tied Dropout to Enhance Generalization for Minority Groups

[openreview] [pdf]

Abstract Deep learning models frequently exploit spurious features in training data to achieve low training error, often resulting in poor generalization when faced with shifted testing distributions. To address this issue, various methods from imbalanced learning, representation learning, and classifier recalibration have been proposed to enhance the robustness of deep neural networks against spurious correlations. In this paper, we observe that models trained with empirical risk minimization tend to generalize well for examples from the majority groups while memorizing instances from minority groups. Building on recent findings that show memorization can be localized to a limited number of neurons, we apply example-tied dropout as a method we term \textit{FairDropout}, aimed at redirecting this memorization to specific neurons that we subsequently drop out during inference. We empirically evaluate FairDropout using the subpopulation benchmark suite encompassing vision, language, and healthcare tasks, demonstrating that it significantly reduces reliance on spurious correlations.

631Offline-to-Online Reinforcement Learning with Prioritized Experience Selection

[openreview] [pdf]

Abstract Offline-to-online reinforcement learning (O2O RL) offers a promising paradigm that first pre-trains an offline policy and fine-tunes it with further online interactions. Nevertheless, the distribution shift between the offline and online phase often hinders the fine-tuning performance, sometimes even incurring performance collapse. Existing methods mitigate this by enhancing training robustness with Q-ensemble, training a density ratio estimator to balance offline and online data, etc. But they often rely on components like ensemble and have higher training costs. In this paper, we address this issue by establishing a concrete performance bound for the optimal policies between two consecutive online steps. Motivated by the theoretical insight, we propose a simple yet effective fine-tuning method, \textbf{P}rioritized \textbf{E}xperience \textbf{S}election (PES). During the online stage, PES maintains a dynamically updated priority queue containing a portion of high-return trajectories, and only selects online samples that are close to the samples in the queue for fine-tuning. In this way, the distribution shift issue can be mitigated and the fine-tuning performance can be boosted. PES is computationally efficient and compatible with numerous approaches. Experimental results on a variety of D4RL datasets show that PES can benefit different offline and O2O RL algorithms and enhance Q-value estimate. Our code is available and will be open-source.

632Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling

[openreview] [pdf]

Abstract Recent advances in knowledge distillation (KD) have enabled smaller student models to approach the performance of larger teacher models. However, popular methods such as supervised KD and on-policy KD, are adversely impacted by the knowledge gaps between teacher-student in practical scenarios. Supervised KD suffers from a distribution mismatch between training with a static dataset and inference over final student-generated outputs. Conversely, on-policy KD, which uses student-generated samples for training, can suffer from low-quality training examples with which teacher models are not familiar, resulting in inaccurate teacher feedback. To address these limitations, we introduce Speculative Knowledge Distillation (SKD), a novel approach that leverages cooperation between student and teacher models to generate high-quality training data on-the-fly while aligning with the student’s inference-time distribution. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution, transferring high-quality knowledge adaptively. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following, and show that SKD consistently outperforms existing KD methods across different domains, data sizes, and model initialization strategies.

633PREDICT: Preference Reasoning by Evaluating Decomposed preferences Inferred from Candidate Trajectories

[openreview] [pdf]

Abstract Accommodating human preferences is essential for creating AI agents that deliver personalized and effective interactions. Recent work has shown the potential for LLMs to infer preferences from user interactions, but they often produce broad and generic preferences, failing to capture the unique and individualized nature of human preferences. This paper introduces PREDICT, a method designed to enhance the precision and adaptability of inferring preferences. PREDICT incorporates three key elements: (1) iterative refinement of inferred preferences, (2) decomposition of preferences into constituent components, and (3) validation of preferences across multiple trajectories. We evaluate PREDICT on two distinct environments: a gridworld setting and a new text-domain environment (PLUME). PREDICT more accurately infers nuanced human preferences improving over existing baselines by 66.2% (gridworld environment) and 41.0% (PLUME).

634Online Bandit Nonlinear Control with Dynamic Batch Length and Adaptive Learning Rate

[openreview] [pdf]

Abstract This paper is concerned with the online bandit nonlinear control, which aims to learn the best stabilizing controller from a pool of stabilizing and destabilizing controllers of unknown types for a given nonlinear dynamical system. We develop an algorithm, named Dynamic Batch length and Adaptive learning Rate (DBAR), and study its stability and regret. Unlike the existing Exp3 algorithm requiring an exponentially stabilizing controller, DBAR only needs a significantly weaker notion of controller stability, in which case substantial time may be required to certify the system stability. Dynamic batch length in DBAR effectively addresses this issue and enables the system to attain asymptotic stability, where the algorithm behaves as if there were no destabilizing controllers. Moreover, adaptive learning rate in DBAR only uses the state norm information to achieve a tight regret bound even when none of the stabilizing controllers in the pool are exponentially stabilizing.

635AlphaQCM: Alpha Discovery with Distributional Reinforcement Learning

[openreview] [pdf]

Abstract Finding synergistic formulaic alphas is very important but challenging for researchers and practitioners in finance. In this paper, we reconsider the discovery of formulaic alphas from the viewpoint of sequential decision-making, and conceptualize the entire alpha-mining process as a non-stationary and reward-sparse Markov decision process. To overcome the challenges of non-stationarity and reward-sparsity, we propose the AlphaQCM method, a novel distributional reinforcement learning method designed to search for synergistic formulaic alphas efficiently. The AlphaQCM method first learns the Q function and quantiles via a Q network and a quantile network, respectively. Then, the AlphaQCM method applies the quantiled conditional moment method to learn unbiased variance from the potentially biased quantiles. Guided by the learned Q function and variance, the AlphaQCM method navigates the non-stationarity and reward-sparsity to explore the vast search space of formulaic alphas with high efficacy. Empirical applications to real-world datasets demonstrate that our AlphaQCM method significantly outperforms its competitors, particularly when dealing with large datasets comprising numerous stocks.

636Federated Learning with Dynamic Client Arrival and Departure: Convergence and Rapid Adaptation via Initial Model Construction

[openreview] [pdf]

Abstract While most existing federated learning (FL) approaches assume a fixed set of clients in the system, in practice, clients can dynamically leave or join the system depending on their needs or interest in the specific task. This dynamic FL setting introduces several key challenges: (1) the objective function dynamically changes depending on the current set of clients, unlike traditional FL approaches that maintain a static optimization goal; (2) the current global model may not serve as the best initial point for the next FL rounds and could potentially lead to slow adaptation, given the possibility of clients leaving or joining the system. In this paper, we consider a dynamic optimization objective in FL that seeks the optimal model tailored to the currently active set of clients. Building on our probabilistic framework that provides direct insights into how the arrival and departure of different types of clients influence the shifts in optimal points, we establish an upper bound on the optimality gap, accounting for factors such as stochastic gradient noise, local training iterations, non-IIDness of data distribution, and deviations between optimal points caused by dynamic client pattern. We also propose an adaptive initial model construction strategy that employs weighted averaging guided by gradient similarity, prioritizing models trained on clients whose data characteristics align closely with the current one, thereby enhancing adaptability to the current clients. The proposed approach is validated on various datasets and FL algorithms, demonstrating robust performance across diverse client arrival and departure patterns, underscoring its effectiveness in dynamic FL environments.

637Beyond the Boundaries of Proximal Policy Optimization

[openreview] [pdf]

Abstract Proximal policy optimization (PPO) is a widely-used algorithm for on-policy reinforcement learning. This work offers an alternative perspective of PPO, in which it is decomposed into the inner-loop estimation of update vectors, and the outer-loop application of updates using gradient ascent with unity learning rate. Using this insight we propose outer proximal policy optimization (outer-PPO); a framework wherein these update vectors are applied using an arbitrary gradient-based optimizer. The decoupling of update estimation and update application enabled by outer-PPO highlights several implicit design choices in PPO that we challenge through empirical investigation. In particular we consider non-unity learning rates and momentum applied to the outer loop, and a momentum-bias applied to the inner estimation loop. Methods are evaluated against an aggressively tuned PPO baseline on Brax, Jumaji and MinAtar environments; non-unity learning rates and momentum both achieve statistically significant improvement on Brax and Jumaji, given the same hyperparameter tuning budget.

638Foundation Models for Enhanced Exploration in Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning agents often struggle with sample inefficiency, requiring extensive interactions with the environment to develop effective policies. This inefficiency is partly due to the challenge of balancing exploration and exploitation without the abstract reasoning and prior knowledge that humans use to quickly identify rewarding actions. Recent advancements in foundation models, such as large language models (LLMs) and vision-language models (VLMs), have shown human-level reasoning capabilities in some domains but have been underutilized in directly selecting low-level actions for exploration in reinforcement learning. In this paper, we investigate the potential of foundation models to enhance exploration in reinforcement learning tasks. We conduct an in-depth analysis of their exploration behaviour in multi-armed bandit problems and Gridworld environments, comparing their performance against traditional exploration strategies and reinforcement learning agents. Our empirical results suggest foundation models can significantly improve exploration efficiency by leveraging their reasoning abilities to infer optimal actions. Building on these findings, we introduce Foundation Model Exploration (FME), a novel exploration scheme that integrates foundation models into the reinforcement learning framework for intelligent exploration behaviour. We use VLMs and demonstrate that they can infer environment dynamics and objectives from raw image observations. This means FME only requires the action space as environment-specific manual text input. We find that agents equipped with FME achieve superior performance in sparse reward Gridworld environments and scale to more complex tasks like Atari games. Moreover, the effectiveness of FME increases with the capacity of the VLM used, indicating that future advancements in foundation models will further enhance such exploration strategies.

639Is multitask learning all you need in continual learning?

[openreview] [pdf]

Abstract Continual Learning solutions often treat multitask learning as an upper-bound of what the learning process can achieve.This is a natural assumption, given that this objective directly addresses the catastrophic forgetting problem, which has been a central focus in early works. However, depending on the nature of the distributional shift in the data, the multi-task solution is not always optimal for the broader continual learning problem. In this work, we draw on principles from online learning to formalize the limitations of multitask objectives, especially when viewed through the lens of cumulative loss, which also serves as an indicator of forward transfer. We provide empirical evidence on when multi-task solutions are suboptimal, and argue that continual learning solutions should not and do not have to adhere to this assumption. Moreover, we argue for the utility of estimating the distributional drift as the data is being received and show preliminary results of how this could be exploited by a simple replay based method to move beyond the multitask solution.

640Contextual Bandits with Entropy-based Human Feedback

[openreview] [pdf]

Abstract In recent years, preference-based human feedback mechanisms have become integral to improving model performance across a range of applications, including conversational AI systems like ChatGPT. However, existing methodologies often overlook critical factors such as model uncertainty and variability in feedback quality. To address these limitations, we propose an innovative entropy-based human feedback framework designed for contextual bandits, which balances exploration and exploitation by soliciting expert feedback when model entropy surpasses a predefined threshold. Our method is model-agnostic and adaptable to any contextual bandit agent employing stochastic policies. Through rigorous experimentation, we demonstrate that our approach requires minimal human feedback to achieve significant performance gains, even with suboptimal feedback quality. Our work not only introduces a novel feedback solicitation strategy but also underscores the robustness of integrating human guidance into machine learning systems. Our code is publicly available: \url{https://anonymous.4open.science/r/CBHF-33C5}

641Preference Optimization for Reasoning with Pseudo Feedback

[openreview] [pdf]

Abstract Preference optimization techniques, such as Direct Preference Optimization (DPO), are frequently employed to enhance the reasoning capabilities of large language models (LLMs) in domains like mathematical reasoning and coding, typically following supervised fine-tuning. These methods rely on high-quality labels for reasoning tasks to generate preference pairs; however, the availability of reasoning datasets with human-verified labels is limited. In this study, we introduce a novel approach to generate pseudo feedback for reasoning tasks by framing the labeling of solutions to reason problems as an evaluation against associated \emph{test cases}. We explore two forms of pseudo feedback based on test cases: one generated by frontier LLMs and the other by extending self-consistency to multi-test-case. We conduct experiments on both mathematical reasoning and coding tasks using pseudo feedback for preference optimization, and observe improvements across both tasks. Specifically, using Mathstral-7B as our base model, we improve MATH results from 58.3 to 68.6, surpassing both NuminaMath-72B and GPT-4-Turbo-1106-preview. In GSM8K and College Math, our scores increase from 85.6 to 90.3 and from 34.3 to 42.3, respectively. Building on Deepseek-coder-7B-v1.5, we achieve a score of 24.3 on LiveCodeBench (from 21.1), surpassing Claude-3-Haiku.

642Differential Transformer

[openreview] [pdf]

Abstract Transformer tends to overallocate attention to irrelevant context. In this work, we introduce Diff Transformer, which amplifies attention to the relevant context while canceling noise. Specifically, the differential attention mechanism calculates attention scores as the difference between two separate softmax attention maps. The subtraction cancels noise, promoting the emergence of sparse attention patterns. Experimental results on language modeling show that Diff Transformer outperforms Transformer in various settings of scaling up model size and training tokens. More intriguingly, it offers notable advantages in practical applications, such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers. By being less distracted by irrelevant context, Diff Transformer can mitigate hallucination in question answering and text summarization. For in-context learning, Diff Transformer not only enhances accuracy but is also more robust to order permutation, which was considered as a chronic robustness issue. The results position Diff Transformer as a highly effective and promising architecture for large language models.

643Mobility Networked Time-Series Forecasting Benchmark Datasets

[openreview] [pdf]

Abstract Human mobility is crucial for urban planning (e.g., public transportation) and epidemic response strategies. However, existing research often neglects integrating comprehensive perspectives on spatial dynamics, temporal trends, and other contextual views due to the limitations of existing mobility datasets. To bridge this gap, we introduceMOBINS(MOBIlityNetworked timeSeries), a novel dataset collection designed for networked time-series forecasting of dynamic human movements.MOBINSfeatures diverse and explainable datasets that capture various mobility patterns across different transportation modes in four cities and two countries and cover both transportation and epidemic domains at the administrative area level. Our experiments with nine baseline methods reveal the significant impact of different model backbones on the proposed six datasets. We provide a valuable resource for advancing urban mobility research, and our dataset collection is available athttps://anonymous.4open.science/r/MOBINS.

644Avoiding Catastrophe in Online Learning by Asking for Help

[openreview] [pdf]

Abstract Most learning algorithms with formal regret guarantees assume that no mistake is irreparable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes arecatastrophic, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We first show that in general, any algorithm either constantly queries the mentor or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.

645Convergence of Distributed Adaptive Optimization with Local Updates

[openreview] [pdf]

Abstract We study distributed adaptive algorithms with local updates (intermittent communication). Despite the great empirical success of adaptive methods in distributed training of modern machine learning models, the theoretical benefits of local updates within adaptive methods, particularly in terms of reducing communication complexity, have not been fully understood yet. In this paper, we prove that \em Local SGD \em with momentum (\em Local \em SGDM) and \em Local \em Adam can outperform their minibatch counterparts in convex and weakly convex settings, respectively. Our analysis relies on a novel technique to prove contraction during local iterations, which is a crucial yet challenging step to show the advantages of local updates, under generalized smoothness assumption and gradient clipping strategy.

646Linear Multistep Solver Distillation for Fast Sampling of Diffusion Models

[openreview] [pdf]

Abstract Sampling from diffusion models can be seen as solving the corresponding probability flow ordinary differential equation (ODE). The solving process requires a significant number of function evaluations (NFE), making it time-consuming. Recently, several solver search frameworks have attempted to find better-performing model-specific solvers. However, predicting the impact of intermediate solving strategies on final sample quality remains challenging, rendering the search process inefficient. In this paper, we propose a novel method for designing solving strategies. We first introduce a unified prediction formula for linear multistep solvers. Subsequently, we present a solver distillation framework, which enables a student solver to mimic the sampling trajectory generated by a teacher solver with more steps. We utilize the mean Euclidean distance between the student and teacher sampling trajectories as a metric, facilitating rapid adjustment and optimization of intermediate solving strategies. The design space of our framework encompasses multiple aspects, including prediction coefficients, time step schedules, and time scaling factors. Our framework has the ability to complete a solver search for Stable-Diffusion in less than 10 total GPU hours. Compared to previous reinforcement learning-based search frameworks, our approach achieves over a 10×\times increase in search efficiency. With just 5 NFE, we achieve FID scores of 3.23 on CIFAR10, 7.16 on ImageNet-64, 5.44 on LSUN-Bedroom, and 15.69 on MS-COCO, resulting in a 2×\times sampling acceleration ratio compared to handcrafted solvers.

647Distributed In-Context Learning under Non-IID Among Clients

[openreview] [pdf]

Abstract Advancements in large language models (LLMs) have shown their effectiveness in multiple compli- cated natural language reasoning tasks. A key challenge remains in adapting these models efficiently to new or unfamiliar tasks. In-context learning (ICL) provides a promising solution for few-shot adaptation by retrieving a set of data points relevant to a query, called in-context examples (ICE), from a training dataset and providing them during the inference as context. Most existing studies utilize a centralized training dataset, yet many real-world datasets may be distributed among multiple clients, and remote data retrieval can be associated with costs. Especially when the client data are non-identical independent distributions (non-IID), retrieving from clients a proper set of ICEs needed for a test query presents critical challenges. In this paper, we first show that in this challenging setting, test queries will have different preferences among clients because of non-IIDness, and equal contribution often leads to suboptimal performance. We then introduce a novel approach to tackle the distributed non-IID ICL problem when a data usage budget is present. The principle is that each client’s proper contribution (budget) should be designed according to the preference of each query for that client. Our approach uses a data-driven manner to allocate a budget for each client, tailored to each test query. Through extensive empirical studies on diverse datasets, our framework demonstrates superior performance relative to competing baselines.

648Accelerated Online Reinforcement Learning using Auxiliary Start State Distributions

[openreview] [pdf]

Abstract Learning a robust policy that is performant across the state space, in a sample efficient manner, is a long-standing problem in online reinforcement learning (RL). This challenge arises from the inability of algorithms to explore the environment efficiently. Most attempts at efficient exploration tackle this problem in a setting where learning begins from scratch, without prior information available to bootstrap learning. However, such approaches often fail to fully leverage expert demonstrations and simulators that can reset to arbitrary states. These affordances are valuable resources that offer enormous potential to guide exploration and speed up learning. In this paper, we explore how a small number of expert demonstrations and a simulator allowing arbitrary resets can accelerate learning during online RL. We show that by leveraging expert state information to form an auxiliary start state distribution, we significantly improve sample efficiency. Specifically, we show that using a notion of safety to inform the choice of auxiliary distribution significantly accelerates learning. We highlight the effectiveness of our approach by matching or exceeding state-of-the-art performance in sparse reward and dense reward setups, even when competing with algorithms with access to expert actions and rewards. Moreover, we find that the improved exploration ability facilitates learning more robust policies in spare reward, hard exploration environments.

649Towards Marginal Fairness Sliced Wasserstein Barycenter

[openreview] [pdf]

Abstract The Sliced Wasserstein barycenter (SWB) is a widely acknowledged method for efficiently generalizing the averaging operation within probability measure spaces. However, achieving marginal fairness SWB, ensuring approximately equal distances from the barycenter to marginals, remains unexplored. The uniform weighted SWB is not necessarily the optimal choice to obtain the desired marginal fairness barycenter due to the heterogeneous structure of marginals and the non-optimality of the optimization. As the first attempt to tackle the problem, we define the marginal fairness sliced Wasserstein barycenter (MFSWB) as a constrained SWB problem. Due to the computational disadvantages of the formal definition, we propose two hyperparameter-free and computationally tractable surrogate MFSWB problems that implicitly minimize the distances to marginals and encourage marginal fairness at the same time. To further improve the efficiency, we perform slicing distribution selection and obtain the third surrogate definition by introducing a new slicing distribution that focuses more on marginally unfair projecting directions. We discuss the relationship of the three proposed problems and their relationship to sliced multi-marginal Wasserstein distance. Finally, we conduct experiments on finding 3D point-clouds averaging, color harmonization, and training of sliced Wasserstein autoencoder with class-fairness representation to show the favorable performance of the proposed surrogate MFSWB problems.

650On Generalization Within Multi-Objective Reinforcement Learning Algorithms

[openreview] [pdf]

Abstract Real-world sequential decision-making tasks often require balancing trade-offs between multiple conflicting objectives, making Multi-Objective Reinforcement Learning (MORL) an increasingly prominent field of research. Despite recent advances, existing MORL literature has narrowly focused on performance within static environments, neglecting the importance of generalizing across diverse settings. Conversely, existing research on generalization in RL has always assumed scalar rewards, overlooking the inherent multi-objectivity of real-world problems. Generalization in the multi-objective context is fundamentally more challenging, as it requires learning a Pareto set of policies addressing varying preferences across multiple objectives. In this paper, we formalize the concept of generalization in MORL and how it can be evaluated. We then contribute a novel testbed featuring diverse multi-objective domains with parameterized environment configurations to facilitate future studies in this area. Our baseline evaluations of state-of-the-art MORL algorithms on this testbed reveals limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyzed the algorithmic complexities within current MORL approaches that could impede the transfer in performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments.

651Tuning Timestep-Distilled Diffusion Model Using Pairwise Sample Optimization

[openreview] [pdf]

Abstract Recent advancements in timestep-distilled diffusion models have enabled high-quality image generation that rivals non-distilled multi-step models, but with significantly fewer inference steps. While such models are attractive for applications due to the low inference cost and latency, fine-tuning them with a naive diffusion objective would result in degraded and blurry outputs. An intuitive alternative is to repeat the diffusion distillation process with a fine-tuned teacher model, which produces good results but is cumbersome and computationally intensive: the distillation training usually requires magnitude higher of training compute compared to fine-tuning for specific image styles. In this paper, we present an algorithm named pairwise sample optimization (PSO), which enables the direct fine-tuning of an arbitrary timestep-distilled diffusion model. PSO introduces additional reference images sampled from the current time-step distilled model, and increases the relative likelihood margin between the training images and reference images. This enables the model to retain its few-step generation ability, while allowing for fine-tuning of its output distribution. We also demonstrate that PSO is a generalized formulation which be flexible extended to both offline-sampled and online-sampled pairwise data, covering various popular objectives for diffusion model preference optimization. We evaluate PSO in both preference optimization and other fine-tuning tasks, including style transfer and concept customization. We show that PSO can directly adapt distilled models to human-preferred generation with both offline and online-generated pairwise preference image data. PSO also demonstrates effectiveness in style transfer and concept customization by directly tuning timestep-distilled diffusion models.

652Learn Your Reference Model for Real Good Alignment

[openreview] [pdf]

Abstract Despite the fact that offline methods for Large Language Models (LLMs) alignment do not require a direct reward model, they remain susceptible to overoptimization. This issue arises when the trained model deviates excessively from the reference policy, leading to a decrease in sample quality. We propose a new paradigm of offline alignment methods, called Trust Region (including variants TR-DPO, TR-IPO, TR-KTO), which dynamically updates the reference policy throughout the training process. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy. We demonstrate the efficacy of these approaches not only through toy examples that exhibit reduced overoptimization, but also through direct, side-by-side comparisons in specific tasks such as helpful and harmless dialogue, as well as summarization, where they surpass conventional methods. Additionally, we report significant improvements in general-purpose assistant setups with the Llama3 model on the AlpacaEval 2 and Arena-Hard benchmarks, highlighting the advantages of Trust Region methods over classical approaches.

653Reward Dimension Reduction for Scalable Multi-Objective Reinforcement Learning

[openreview] [pdf]

Abstract In this paper, we introduce a simple yet effective reward dimension reduction method to tackle the scalability challenges of multi-objective reinforcement learning algorithms. While most existing approaches focus on optimizing two to four objectives, their abilities to scale to environments with more objectives remain uncertain. Our method uses a dimension reduction approach to enhance learning efficiency and policy performance in multi-objective settings. While most traditional dimension reduction methods are designed for static datasets, our approach is tailored for online learning and preserves Pareto-optimality after transformation. We propose a new training and evaluation framework for reward dimension reduction in multi-objective reinforcement learning and demonstrate the superiority of our method in an environment with sixteen objectives, significantly outperforming existing online dimension reduction methods.

654Generalized Anomaly Detection with Knowledge Exposure:The Dual Effects of Augmentation

[openreview] [pdf]

Abstract Anomaly detection involves identifying samples that deviate from the training data. While previous methods have demonstrated significant performance, our experiments reveal that their generalization ability declines substantially when faced with slight shifts in the test data. This limitation stems from an underlying assumption: these methods generally expect the distribution of normal test samples to closely resemble that of the training set, while anomalies are presumed to be far from this distribution. However, in real-world scenarios, test samples often experience varying degrees of distributional shift while retaining their semantic consistency. The ability to generalize successfully to semantically preserved transformations while accurately detecting normal samples whose semantic meaning has changed as anomalies is critical for a model’s trustworthiness and reliability. For instance, while a rotation may alter the semantic meaning of a car in the context of anomaly detection, it typically preserves the meaning of an apple. Yet, current methods, particularly those based on contrastive learning, are likely to detect both as anomalies. This complexity underscores the need for dynamic learning procedures grounded in a deeper understanding of outliers. To address this, we propose a novel approach called Knowledge Exposure (KE), which incorporates external knowledge to interpret concept dynamics and distinguish between transformations that induce semantic shifts. Our approach improves generalization by leveraging insights from a pre-trained CLIP model to assess the significance of anomalies for each concept. Evaluations on datasets such as CIFAR-10, CIFAR-100, SVHN demonstrate superior performance compared to previous methods, validating the effectiveness of our approach.

655Strategic Classification With Externalities

[openreview] [pdf]

Abstract We propose a new variant of the strategic classification problem: a principal reveals a classifier, and nn agents report their (possibly manipulated) features to be classified. Motivated by real-world applications, our model crucially allows the manipulation of one agent to affect another; that is, it explicitly captures inter-agent externalities. The principal-agent interactions are formally modeled as a Stackelberg game, with the resulting agent manipulation dynamics captured as a simultaneous game. We show that under certain assumptions, the pure Nash Equilibrium of this agent manipulation game is unique and can be efficiently computed. Leveraging this result, PAC learning guarantees are established for the learner: informally, we show that it is possible to learn classifiers that minimize loss on the distribution, even when a random number of agents are manipulating their way to a pure Nash Equilibrium. We also comment on the optimization of such classifiers through gradient-based approaches. This work sets the theoretical foundations for a more realistic analysis of classifiers that are robust against multiple strategic actors interacting in a common environment.

656Mitigating Dialogue Hallucination for Large Vision Language Models via Adversarial Instruction Tuning

[openreview] [pdf]

Abstract Mitigating hallucinations of Large Vision Language Models (LVLMs) is crucial to enhance their reliability for general-purpose assistants. This paper shows that such hallucinations of LVLMs can be significantly exacerbated by preceding user-system dialogues. To precisely measure this, we first present an evaluation benchmark by extending popular multi-modal benchmark datasets with prepended hallucinatory dialogues powered by our novel Adversarial Question Generator (AQG), which can automatically generate image-related yet adversarial dialogues by adopting adversarial attacks on LVLMs. On our benchmark, the zero-shot performance of state-of-the-art LVLMs drops significantly for both the VQA and Captioning tasks. Next, we further reveal this hallucination is mainly due to the prediction bias toward preceding dialogues rather than visual content. To reduce this bias, we propose Adversarial Instruction Tuning (AIT) that robustly fine-tunes LVLMs against hallucinatory dialogues. Extensive experiments show our proposed approach successfully reduces dialogue hallucination while maintaining performance.

657Concept-driven Off Policy Evaluation

[openreview] [pdf]

Abstract Evaluating a set of decisions based on batch data as in off-policy evaluation is challenging as high variance and limited sample sizes can severely hinder reliable evaluation. Identifying and addressing the sources of this variance is essential for improving OPE performance. Recent work on Concept Bottleneck Models (CBMs) shows how a set of human-explainable concepts can be used for predictions, enabling clearer understanding and inspection of these models. Our work proposes incorporating concepts into OPE to identify and reduce variance through targeted interventions. For example, concepts such as shared disease characteristics could help predict better treatments, despite differing vital signs among two patients. We introduce a family of concept-based OPE estimators, and provide theoretical guarantees that when given a set of known concepts, these estimators are unbiased and reduce variance compared to traditional methods. However, in many real-world applications, these concepts are often unknown and need to be estimated. We develop an end-to-end algorithm for learning parameterized concepts that are interpretable, concise, diverse, and optimized for variance reduction in OPE. Through extensive experiments on synthetic and real-world datasets, we demonstrate that both known and learned concept-based estimators significantly improve OPE performance. Crucially, we show that unlike other methods for OPE, concept-based estimators can easily be interpreted and offer opportunities for targeted interventions on specific concepts of interest to further improve the quality of these estimators.

658FreeVS: Generative View Synthesis on Free Driving Trajectory

[openreview] [pdf]

Abstract Existing reconstruction-based novel view synthesis methods for driving scenes focus on synthesizing camera views along the recorded trajectory of the ego vehicle. Their image rendering performance will severely degrade on viewpoints falling out of the recorded trajectory, where camera rays are untrained. We propose FreeVS, a novel fully generative approach that can synthesize camera views on free new trajectories in real driving scenes. To control the generation results to be 3D consistent with the real scenes and accurate in viewpoint pose, we propose the pseudo-image representation of view priors to control the generation process. Viewpoint translation simulation is applied on pseudo-images to simulate camera movement in each direction. Once trained, FreeVS can be applied to any validation sequences without reconstruction process and synthesis views on novel trajectories. Moreover, we propose two new challenging benchmarks tailored to driving scenes, which are novel camera synthesis and novel trajectory synthesis, emphasizing the freedom of viewpoints. Given that no ground truth images are available on novel trajectories, we also propose to evaluate the consistency of images synthesized on novel trajectories with 3D perception models. Experiments on the Waymo Open Dataset show that FreeVS has a strong image synthesis performance on both the recorded trajectories and novel trajectories. The code will be released.

659A Single Goal is All You Need: Skills and Exploration Emerge from Contrastive RL without Rewards, Demonstrations, or Subgoals

[openreview] [pdf]

Abstract In this paper, we present empirical evidence of skills and directed exploration emerging from a simple RL algorithm long before any successful trials are observed. For example, in a manipulation task, the agent is given a single observation of the goal state (see Fig. 1) and learns skills, first for moving its end-effector, then for pushing the block, and finally for picking up and placing the block. These skills emerge before the agent has ever successfully placed the block at the goal location and without the aid of any reward functions, demonstrations, or manually-specified distance metrics. Once the agent has learned to reach the goal state reliably, exploration is reduced. Implementing our method involves a simple modification of prior work and does not require density estimates, ensembles, or any additional hyperparameters. Intuitively, the proposed method seems like it should be terrible at exploration, and we lack a clear theoretical understanding of why it works so effectively, though our experiments provide some hints.

660DIMS: Channel-Dependent and Seasonal-Trend Independent Transformer Using Multi-Stage Training for Time Series Forecasting

[openreview] [pdf]

Abstract Due to the limited size of real-world time series data, current transformer-based time series forecasting algorithms often struggle with overfitting. Common techniques used to mitigate overfitting include channel-independence and seasonal-trend decomposition. However, channel-independent inevitably results in the loss of inter-channel dependencies, and existing seasonal-trend decomposition methods are insufficient in effectively mitigating overfitting. In this study, we propose DIMS, a time series forecasting model that uses multi-stage training to capture inter-channel dependencies while ensuring the independence of seasonal and trend components. The computation of channel dependency is postponed to the later stage, following the channel-independent training, while the seasonal and trend components remain fully independent during the early training phases. This approach enables the model to effectively capture inter-channel dependencies while minimizing overfitting. Experiments show that our model outperforms the state-of-the-art transformer-based models on several datasets.

661Safety Alignment Should be Made More Than Just a Few Tokens Deep

[openreview] [pdf]

Abstract The safety alignment of current Large Language Models (LLMs) is vulnerable. Simple attacks, or even benign fine-tuning, can jailbreak aligned models. We note that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model’s generative distribution primarily over only its very first few output tokens. We unifiedly refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and show how this issue universally contributes to multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. The key contribution of this work is that we demonstrate how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. We show that deepening the safety alignment beyond the first few tokens can meaningfully improve robustness against some common exploits. We also design a regularized fine-tuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.

662Low Variance: A Bottleneck in Diffusion-Based Graph Imputation

[openreview] [pdf]

Abstract In this paper, we tackle learning tasks on graphs with missing features, improving the applicability of graph neural networks to real-world graph-structured data. Existing imputation methods based upon graph diffusion produce channels that have nearly identical values within each channel, and these low-variance channels contribute very little to performance in graph learning tasks. To prevent diffusion-based imputation from producing low-variance channels, we introduce synthetic features that address the cause of the production, thereby increasing variance in low-variance channels. Since the synthetic features prevent diffusion-based imputation models from generating meaningless feature values shared across all nodes, our synthetic feature propagation design prevents significant performance degradation, even under extreme missing rates. Extensive experiments demonstrate the effectiveness of our scheme across various graph learning tasks with missing features, ranging from low to extremely high missing rates. Moreover, we provide empirical evidence and theoretical proof that validate the low-variance problem.

663A Primal-Dual Approach for Dynamic Pricing of Sequentially Displayed Complementary Items under Sale Constraints

[openreview] [pdf]

Abstract We address the challenging problem of dynamically pricing complementary items that are sequentially displayed to customers. An illustrative example is the online sale of flight tickets, where customers navigate through multiple web pages. Initially, they view the ticket cost, followed by ancillary expenses such as insurance and additional luggage fees. Coherent pricing policies for complementary items are essential because optimizing the pricing of each item individually is ineffective. Our scenario also involves a sales constraint, which specifies a minimum number of items to sell, and uncertainty regarding customer demand curves. To tackle this problem, we originally formulate it as a Markov decision process with constraints. Leveraging online learning tools, we design a primal-dual online optimization algorithm. We empirically evaluate our approach using synthetic settings randomly generated from real-world data, covering various configurations from stationary to non-stationary, and compare its performance in terms of constraints violation and regret against well-known baselines optimizing each state singularly.

664Are LLMs Prescient? A Continuous Evaluation using Daily News as the Oracle

[openreview] [pdf]

Abstract Existing evaluation benchmarks of Large Language Models (LLMs) can become outdated due to continuous model updates and the evolving information landscape. This presents a significant challenge: How can we effectively evaluate LLMs in a way that remains relevant over time? To address this, we explore the potential of future event prediction as a continuous evaluation for LLMs, assessing their ability to make predictions about real-world events and exhibit temporal generalization. Towards this goal, we propose a continuous LLM evaluation using daily news. We automatically generate question-answer (QA) pairs from daily news, constructing our Daily Oracle dataset, which challenges LLMs to predict “future” events based on its pre-training data. Our findings show that as pre-training data becomes outdated, LLMs exhibit performance degradation over time. While the Retrieval Augmented Generation (RAG) technique can enhance prediction accuracy, the performance degradation pattern still exists, underscoring the necessity for ongoing model updates.

665Emerging Tracking from Video Diffusion

[openreview] [pdf]

Abstract We find video diffusion models, renowned for their generative capabilities, surprisingly excel at pixel-level object tracking without any explicit training for this task. We introduce a simple and effective method to extract motion representations from video diffusion models, achieving state-of-the-art tracking results. Our approach enables the tracking of identical objects, overcoming limitations of previous methods reliant on intra-frame appearance correspondence. Visualizations and empirical results show that our approach outperforms recent supervised and self-supervised tracking methods, including the state-of-the-art, by up to 6 points. Our work demonstrates video generative models can learn intrinsic temporal dynamics of video, and excel in tracking tasks beyond original video synthesis.

666Accelerating Diffusion Transformers with Token-wise Feature Caching

[openreview] [pdf]

Abstract Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10X more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-alpha, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36X and 1.93X acceleration are achieved on OpenSora and PixArt-alpha with almost no drop in generation quality. Codes have been released in the supplementary material and will be released in Github.

667On the Convergence of No-Regret Dynamics in Information Retrieval Games with Proportional Ranking Functions

[openreview] [pdf]

Abstract Publishers who publish their content on the web act strategically, in a behavior that can be modeled within the online learning framework. Regret, a central concept in machine learning, serves as a canonical measure for assessing the performance of learning agents within this framework. We prove that any proportional content ranking function with a concave activation function induces games in which no-regret learning dynamics converge. Moreover, for proportional ranking functions, we prove the equivalence of the concavity of the activation function, the social concavity of the induced games and the concavity of the induced games. We also study the empirical trade-offs between publishers’ and users’ welfare, under different choices of the activation function, using a state-of-the-art no-regret dynamics algorithm. Furthermore, we demonstrate how the choice of the ranking function and changes in the ecosystem structure affect these welfare measures, as well as the dynamics’ convergence rate.

668High Probability Bounds for Cross-Learning Contextual Bandits with Unknown Context Distributions

[openreview] [pdf]

Abstract Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round’s context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider & Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability\textit{high probability}. There are steps in the original analysis by Schneider & Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.

669Bayesian Policy Distillation via Offline RL for Lightweight and Fast Inference

[openreview] [pdf]

Abstract High-performance deep reinforcement learning faces tremendous challenges when implemented on cost-effective low-end embedded systems due to its heavy computational burden. To address this issue, we propose a policy distillation method called Bayesian Policy Distillation (BPD), which effectively retrains small-sized neural networks through an offline reinforcement learning approach. BPD exploits Bayesian neural networks to distill already designed high-performance policy networks by adopting value optimizing, behavior cloning, and sparsity-inducing strategies. Simulation results reveal that the proposed BPD successfully compresses the policy networks, making them lighter and achieving faster inference time. Furthermore, the proposed approach is demonstrated with a real inverted pendulum system and reduced the inference time and memory size by 78 % and 98 %, respectively.

670Revisiting Source-Free Domain Adaptation: a New Perspective via Uncertainty Control

[openreview] [pdf]

Abstract Source-Free Domain Adaptation (SFDA) seeks to adapt a pre-trained source model to the target domain using only unlabeled target data, without access to the original source data. While current state-of-the-art (SOTA) methods rely on leveraging weak supervision from the source model to extract reliable information for self-supervised adaptation, they often overlook the uncertainty that arises during the transfer process. In this paper, we conduct a systematic and theoretical analysis of the uncertainty inherent in existing SFDA methods and demonstrate its impact on transfer performance through the lens of Distributionally Robust Optimization (DRO). Building upon the theoretical results, we propose a novel instance-dependent uncertainty control algorithm for SFDA. Our method is designed to quantify and exploit the uncertainty during the adaptation process, significantly improving the model performance. Extensive experiments on benchmark datasets and empirical analyses confirm the validity of our theoretical findings and the effectiveness of the proposed method. This work offers new insights into understanding and advancing SFDA performance.

671Positive Mining in Graph Contrastive Learning

[openreview] [pdf]

Abstract Graph Contrastive Learning (GCL), which aims to capture representations from unlabeled graphs, has made significant progress in recent years. In GCL, InfoNCE-based loss functions play a crucial role by ensuring that positive node pairs—those that are similar—are drawn closer together in the representational space, while negative pairs, which are dissimilar, are pushed apart. The primary focus of recent research has been on refining the contrastive loss function, particularly by adjusting the weighting of negative nodes. This is achieved by changing the weight between negative node pairs, or by using node similarity to select the positive node associated with the anchor node. Despite the substantial success of these GCL techniques, there remains a belief that the nodes identified as positive or negative may not accurately reflect the true positives and negatives. To tackle this challenge, we introduce an innovative method known as Positive Mining Graph Contrastive Learning (PMGCL). This method consists in calculating the probability of positive samples between the anchor node and other nodes using a mixture model, thereby identifying nodes that have a higher likelihood of being true positives in relation to the anchor node. We have conducted a comprehensive evaluation of PMGCL on a range of real-world graph datasets. The experimental findings indicate that PMGCL significantly outperforms traditional GCL methods. Our method not only achieves state-of-the-art results in unsupervised learning benchmarks but also exceeds the performance of supervised learning benchmarks in certain scenarios.

672DPaI: Differentiable Pruning at Initialization with Node-Path Balance Principle

[openreview] [pdf]

Abstract Pruning at Initialization (PaI) is a technique in neural network optimization characterized by the proactive elimination of weights before the network’s training on designated tasks. This innovative strategy potentially reduces the costs for training and inference, significantly advancing computational efficiency. A key element of PaI’s effectiveness is that it considers the significance of weights in an untrained network. It prioritizes the trainability and optimization potential of the pruned subnetworks. Recent methods can effectively prevent the formation of hard-to-optimize networks, e.g. through iterative adjustments at each network layer. However, this way often results inlarge-scale discrete optimization problems, which could make PaI further challenging. This paper introduces a novel method, calledDPaI, that involves a differentiable optimization of the pruning mask. DPaI adopts a dynamic and adaptable pruning process, allowing easier optimisation processes and better solutions. More importantly, our differentiable formulation enables readily use of the existing rich body of efficient gradient-based methods for PaI. Our empirical results demonstrate that DPaI significantly outperforms current state-of-the-art PaI methods on various architectures, such as Convolutional Neural Networks and Vision-Transformers.

673POTEC: Off-Policy Contextual Bandits for Large Action Spaces via Policy Decomposition

[openreview] [pdf]

Abstract We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces where existing methods -- most of which rely crucially on reward-regression models or importance-weighted policy gradients -- fail due to excessive bias or variance. To overcome these issues in OPL, we propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition (POTEC). It leverages clustering in the action space and learns two different policies via policy- and regression-based approaches, respectively. In particular, we derive a novel low-variance gradient estimator that enables to learn a first-stage policy for cluster selection efficiently via a policy-based approach. To select a specific action within the cluster sampled by the first-stage policy, POTEC uses a second-stage policy derived from a regression-based approach within each cluster. We show that a local correctness condition, which only requires that the regression model preserves the relative expected reward differences of the actions within each cluster, ensures that our policy-gradient estimator is unbiased and the second-stage policy is optimal. We also show that POTEC provides a strict generalization of policy- and regression-based approaches and their associated assumptions. Comprehensive experiments demonstrate that POTEC provides substantial improvements in OPL effectiveness particularly in large and structured action spaces.

674Long-Term Fairness in Reinforcement Learning with Bisimulation Metrics

[openreview] [pdf]

Abstract Ensuring long-term fairness is crucial when developing automated decision making systems, specifically in dynamic and sequential environments. By maximizing their reward without consideration of fairness, AI agents can introduce disparities in their treatment of groups or individuals. In this paper, we establish the connection between bisimulation metrics and group fairness in reinforcement learning. We propose a novel approach that leverages bisimulation metrics to learn reward functions and observation dynamics, ensuring that learners treat groups fairly while reflecting the original problem. We demonstrate the effectiveness of our method in addressing disparities in sequential decision making problems through empirical evaluation on a standard fairness benchmark consisting of lending and college admission scenarios.

675Non-Adversarial Inverse Reinforcement Learning via Successor Feature Matching

[openreview] [pdf]

Abstract In inverse reinforcement learning (IRL), an agent seeks to replicate expert demonstrations through interactions with the environment. Traditionally, IRL is treated as an adversarial game, where an adversary searches over reward models, and a learner optimizes the reward through repeated RL procedures. This game-solving approach is both computationally expensive and difficult to stabilize. In this work, we propose a novel approach to IRL by direct policy optimization: exploiting a linear factorization of the return as the inner product of successor features and a reward vector, we design an IRL algorithm by policy gradient descent on the gap between the learner and expert features. Our non-adversarial method does not require learning a reward function and can be solved seamlessly with existing actor-critic RL algorithms. Remarkably, our approach works in state-only settings without expert action labels, a setting which behavior cloning (BC) cannot solve. Empirical results demonstrate that our method learns from as few as a single expert demonstration and achieves improved performance on various control tasks.

676You Can Train from Scratch: Further Discussion on the Long Range Arena

[openreview] [pdf]

Abstract Despite their success, Transformers suffer from quadratic complexity in the sequence length, limiting their applicability to long-range dependency problems and making them expensive to train and run. After many proposals to address this issue, the Long Range Arena (LRA) was suggested as a benchmark to evaluate the performance of new models in long-range dependency modeling tasks. The Transformer and its variants performed poorly on this benchmark, and a new series of architectures such as State Space Models (SSMs) gained some traction, greatly outperforming Transformers in the LRA. Recent work has shown that with a denoising pretraining phase, Transformers can achieve competitive results in the LRA with these new architectures. In this work, we show that one can achieve the same result without a separate pretraining phase, using other training techniques. This reduces the computational burden of training and eliminates the risk of representation collapse during fine-tuning. We argue that LRA tasks are very positional and provide evidence that short-range dependencies account for a significant portion of the performance. This explains prior differences in LRA accuracy between the Transformer and new architectures, which have better positional and local biases. Our training techniques alleviate these differences up to a point, and rotary embeddings add further improvements by including these positional biases. Given these insights, LRA results should be interpreted with caution, and should be analyzed given the model’s inductive biases and the nature of the tasks.

677ContextGNN: Beyond Two-Tower Recommendation Systems

[openreview] [pdf]

Abstract Recommendation systems predominantly utilize two-tower architectures, which evaluate user-item rankings through the inner product of their respective embeddings. However, one key limitation of two-tower models is that they learn a pair-agnostic representation of users and items. In contrast, pair-wise representations either scale poorly due to their quadratic complexity or are too restrictive on the candidate pairs to rank. To address these issues, we introduce Context-based Graph Neural Networks (ContextGNNs), a novel deep learning architecture for link prediction in recommendation systems. The method employs a pair-wise representation technique for familiar items situated within a user’s local subgraph, while leveraging two-tower representations to facilitate the recommendation of exploratory items. A final network then predicts how to fuse both pair-wise and two-tower recommendations into a single ranking of items. We demonstrate that ContextGNN is able to adapt to different data characteristics and outperforms existing methods, both traditional and GNN-based, on a diverse set of practical recommendation tasks, improving performance by 20% on average.

678Everyone Deserves Recourse: Feasible Recourse Paths Using Data Augmentation

[openreview] [pdf]

Abstract Decisions made using machine learning models can negatively impact individuals in critical applications such as healthcare and finance by denying essential services or access to opportunity. Algorithmic recourse supplements a negative AI decision by providing rejected individuals with advice on the changes they can make to their profiles, so that they may eventually achieve the desired outcome. Most existing recourse methods provide single-step changes by using counterfactual explanations. These counterfactual explanations are computed assuming a fixed (not learned) distance function. Further, few works consider providing more realistic multi-step changes in the form of recourse paths. However, such methods may fail to provide any recourse path for some individuals or provide paths that might not be feasible, since intermediate steps needed to reach the counterfactual explanation may not be realizable. We introduce a framework for learning an optimal distance function and threshold to compute multi-step recourse paths for all. First, we formalize the problem of finding multi-step recourse paths. Given a set of feasible transitions, we propose a data-driven framework for learning the optimal distance and threshold for each step with PAC (Probably Approximately Correct) guarantees. Finally, we provide a data augmentation algorithm to ensure that a solution exists for all individuals. Experiments on several datasets show that the proposed method learns feasible recourse paths for all individuals.

679Map to Optimal: Adapting Graph Out-of-Distribution in Test Time

[openreview] [pdf]

Abstract Based on topological proximity message passing, graph neural networks (GNNs) can quickly model data patterns on graphs. However, at test time, when the node feature and topological structure of the graph data are out-of-distribution (OOD), the performance of pre-trained GNNs will be hindered. Existing test-time methods either fine-tune the pre-trained model or overlook the discrepancy between the prior knowledge in pre-trained models and the test graph. We propose a novel self-supervised test-time adaptation paradigm GOAT (https://anonymous.4open.science/r/GOAT-5C0E), through graph augmentation-to-augmentation strategy, that enables a simple adapter can mitigate the distribution gap of training data and test-time data. GOAT reduces generalization error for node classification in various pre-trained settings through experiments on six benchmark datasets spanning three distinct real-world OOD scenarios. Remarkably, GOAT outperforms state-of-the-art test-time methods, and our empirical study further demonstrates the interpretability of the OOD representation generated from our method.

680Markovian Compression: Looking to the Past Helps Accelerate the Future

[openreview] [pdf]

Abstract This paper deals with distributed optimization problems that use compressed communication to achieve efficient performance and mitigate the communication bottleneck. We propose a family of compression schemes in which operators transform vectors fed to their input according to a Markov chain, i.e., the stochasticity of the compressors depends on previous iterations. Intuitively, this should accelerate the convergence of optimization methods, as considering previous iterations seems more natural and robust. The compressors are implemented in the vanilla Quantized Stochastic Gradient Descent (QSGD) algorithm. To further improve efficiency and convergence rate, we apply the momentum acceleration method. We prove convergence results for our algorithms with Markovian compressors and show theoretically that the accelerated method converges faster than the basic version. The analysis covers non-convex, Polyak-Lojasiewicz (PL), and strongly convex cases. Experiments are conducted to demonstrate the applicability of the results to distributed data-parallel optimization problems. Practical results demonstrate the superiority of methods utilizing our compressors design over several existing optimization algorithms.

681Improved Risk Bounds with Unbounded Losses for Transductive Learning

[openreview] [pdf]

Abstract In the transductive learning setting, we are provided with a labeled training set and an unlabeled test set, with the objective of predicting the labels of the test points. This framework differs from the standard problem of fitting an unknown distribution with a training set drawn independently from this distribution. In this paper, we primarily improve the generalization bounds in transductive learning. Specifically, we develop two novel concentration inequalities for the suprema of empirical processes sampled without replacement for unbounded functions, marking the first discussion of the generalization performance of unbounded functions in the context of sampling without replacement. We further provide two valuable applications of our new inequalities: on one hand, we firstly derive fast excess risk bounds for empirical risk minimization in transductive learning under unbounded losses. On the other hand, we establish high-probability bounds on the generalization error for graph neural networks when using stochastic gradient descent which improve the current state-of-the-art results.

682Trajectory-level Data Generation with Better Alignment for Offline Imitation Learning

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) relies heavily on densely precise reward signals, which are labor-intensive and challenging to obtain in many real-world scenarios. To tackle this challenge, offline imitation learning (IL) extracts optimal policies from expert demonstrations and datasets without reward labels. However, the scarcity of expert data and the abundance of suboptimal trajectories within the dataset impede the application of supervised learning methods like behavior cloning (BC). While previous research has focused on learning importance weights for BC or reward functions to integrate with offline RL algorithms, these approaches often result in suboptimal policy performance due to training instabilities and inaccuracies in learned weights or rewards. To address this problem, we introduce Trajectory-level Data Generation with Better Alignment (TDGBA), an algorithm that leverages alignment measures between unlabeled trajectories and expert demonstrations to guide a diffusion model in generating highly aligned trajectories. With these trajectories, BC can be directly applied to extract optimal polices without the need for weight or reward learning. Moreover, to ensure high fidelity and diversity in the generated trajectories and to make the learning more stable, the implicit expert preference that can fully exploit the unlabeled data is employed in the training of the diffusion model. Experimental results on the D4RL benchmarks demonstrate that TDGBA significantly outperforms state-of-the-art offline IL methods. Additionally, the analysis of the generated trajectories shows the effectiveness of incorporating the diffusion model and implicit expert preference for trajectory-level data generation.

683GuideCO: Training Objective-Guided Diffusion Solver with Imperfect Data for Combinatorial Optimization

[openreview] [pdf]

Abstract Combinatorial optimization (CO) problems have widespread applications in science and engineering but they present significant computational challenges. Recent advancements in generative models, particularly diffusion models, have shown promise in bypassing traditional optimization solvers by directly generating near-optimal solutions. However, we observe an exponential scaling law between the optimality gap and the amount of training data needed for training diffusion-based solvers. Notably, the performance of existing diffusion solvers relies on both quantity and quality of training data: they perform well with abundant high quality training data labeled by exact or near-optimal solvers, while suffering when high-quality labels are scarce or unavailable. To address the challenge, we propose GuideCO, an objective-guided diffusion solver for combinatorial optimization, which can be trained on imperfectly labelled datasets. GuideCO is a two-stage generate-then-decode framework, featuring an objective-guided diffusion model that is further reinforced by classifier-free guidance for generating high-quality solutions on any given problem instance. Experiments demonstrate the improvements of GuideCO against baselines when trained on imperfect data, in a range of combinatorial optimization benchmark tasks such as TSP (Traveling Salesman Problem) and MIS (Maximum Independent Set).

684POMDIFFUSER: LONG-MEMORY MEETS LONG- PLANNING FOR POMDPS

[openreview] [pdf]

Abstract Effective long-term planning in complex environments benefits from not only leveraging immediate information but also utilizing past experiences. Drawing inspiration from how humans use long-term memory in decision-making, we propose the POMDiffuser framework, an approach to planning in partially observable environments. While conventional Diffuser models often memorize specific environments, POMDiffuser explores the potential of learning to plan from memory, with the aim of generalizing to new scenarios. By incorporating a memory mechanism in POMDP scenarios, our model extends diffusion-based planning models into the realm of meta-learning with carefully designed tasks that require the diffusion planner to demonstrate both long-term planning and memory utilization. We investigated existing diffusion-based models, focusing on their applicability, computational efficiency, and performance trade-offs.

685Adversarial Attack Robust dataset pruning

[openreview] [pdf]

Abstract Dataset pruning, while effective for reducing training data size, often leads to models vulnerable to adversarial attacks. This paper introduces a novel approach to create adversarially robust coresets. We first theoretically analyze how existing pruning methods result in non-smooth loss surfaces, increasing susceptibility to attacks. To address this, we propose two key innovations: (1) a Frequency-Selective Excitation Network (FSE-Net) that dynamically selects important frequency components, smoothing the loss surface while reducing storage requirements, and (2) a “Jointentropy” score for selecting stable and informative samples. Our method significantly outperforms state-of-the-art pruning algorithms across various adversarial attacks and pruning ratios. On CIFAR-10, our approach achieves up to 58.19% accuracy under AutoAttack with an 80% pruning ratio, compared to 42.98% for previous methods. Moreover, our frequency pruning technique improves robustness even on full datasets, demonstrating its potential for enhancing model security while reducing computational costs.

686Natural Policy Gradient for Average Reward Non-Stationary RL

[openreview] [pdf]

Abstract We consider the problem of non-stationary reinforcement learning (RL) in the infinite-horizon average-reward setting. We model it by a Markov Decision Process with time-varying rewards and transition probabilities, with a variation budget of ΔT\Delta_T. Existing non-stationary RL algorithms focus on model-based and model-free value-based methods. Policy-based methods, however, despite their flexibility in practice, are not theoretically well understood in non-stationary RL. We propose the first model-free policy-based algorithm, Non-Stationary Natural Actor-Critic (NS-NAC), a policy gradient method with a novel interpretation of learning rates as adapting factors. We present a dynamic regret of O~(S12A12ΔT19T89)\mathcal{\tilde{O}} (|\mathcal{S}|^{\frac{1}{2}}|\mathcal{A}|^{\frac{1}{2}}\Delta_T^{\frac{1}{9}}T^{\frac{8}{9}} ), where TT is the time horizon, and S|\mathcal{S}|, A|\mathcal{A}| are, respectively, the size of the state and action space. The regret analysis relies on adapting the Lyapunov function based analysis to dynamic environments and characterizing the effects of simultaneous changes in policy and the environment on estimates of the value function and average reward.

687Lasso Bandit with Compatibility Condition on Optimal Arm

[openreview] [pdf]

Abstract We consider a stochastic sparse linear bandit problem where only a sparse subset of context features affects the expected reward function, i.e., the unknown reward parameter has sparse structure. In the existing Lasso bandit literature, the compatibility conditions together with additional diversity conditions on the context features are imposed to achieve regret bounds that only depend logarithmically on the ambient dimension dd. In this paper, we demonstrate that even without the additional diversity assumptions, thecompatibility condition on the optimal armis sufficient to derive a regret bound that depends logarithmically on dd, and our assumption is strictly weaker than those used in the lasso bandit literature under the single-parameter setting. We propose an algorithm that adapts the forced-sampling technique and prove that the proposed algorithm achieves O(polylogdT)\mathcal{O}(\text{poly}\log dT) regret under the margin condition. To our knowledge, the proposed algorithm requires the weakest assumptions among Lasso bandit algorithms under the single-parameter setting that achieve O(polylogdT)\mathcal{O}(\text{poly}\log dT) regret. Through numerical experiments, we confirm the superior performance of our proposed algorithm.

688Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning

[openreview] [pdf]

Abstract In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.

689TabWak: A Watermark for Tabular Diffusion Models

[openreview] [pdf]

Abstract Synthetic data offers alternatives for data augmentation and sharing. Till date, it remains unknown how to use watermarking techniques to trace and audit synthetic tables generated by tabular diffusion models to mitigate potential misuses. In this paper, we design TabWak, the first watermarking method to embed invisible signatures that control the sampling of Gaussian latent codes used to synthesize table rows via the diffusion backbone. TabWak has two key features. Different from existing image watermarking techniques, TabWak uses self-cloning and shuffling to embed the secret key in positional information of random seeds that control the Gaussian latents, allowing to use different seeds at each row for high inter-row diversity and enabling row-wise detectability. To further boost the robustness of watermark detection against post-editing attacks, TabWak uses a valid-bit mechanism that focuses on the tail of the latent code distribution for superior noise resilience. We provide theoretical guarantees on the row diversity and effectiveness of detectability. We evaluate TabWak on five datasets against baselines to show that the quality of watermarked tables remains nearly indistinguishable from non-watermarked tables while achieving high detectability in the presence of strong post-editing attacks, with a 100% true positive rate at a 0.1% false positive rate on synthetic tables with fewer than 300 rows. Our code is available at the following anonymized repositoryhttps://anonymous.4open.science/r/TabWak-4E65/.

690Influence-based Attributions can be Manipulated

[openreview] [pdf]

Abstract Influence Functions are a standard tool for attributing predictions to training data in a principled manner and are widely used in applications such as data valuation and fairness. In this work, we present realistic incentives to manipulate influence-based attributions and investigate whether these attributions can be \textit{systematically} tampered by an adversary. We show that this is indeed possible for logistic regression models trained on ResNet feature embeddings and standard tabular fairness datasets and provide efficient attacks with backward-friendly implementations. Our work raises questions on the reliability of influence-based attributions in adversarial circumstances.

691Safe Meta-Reinforcement Learning via Dual-Method-Based Policy Adaptation: Near-Optimality and Anytime Safety Guarantee

[openreview] [pdf]

Abstract This paper studies the safe meta-reinforcement learning (safe meta-RL) problem where anytime safety is ensured during the meta-test. We develop a safe meta-RL framework that consists of two modules, safe policy adaptation and safe meta-policy training, and propose efficient algorithms for the two modules. Beyond existing safe meta-RL analyses, we prove the anytime safety guarantee of policy adaptation and provide a lower bound of the expected total reward of the adapted policies compared with the optimal policies, which shows that the adapted policies are nearly optimal. Our experiments demonstrate three key advantages over existing safe meta-RL methods: (i) superior optimality, (ii) anytime safety guarantee, and (iii) high computational efficiency.

692Eligibility Traces for Confounding Robust Off-Policy Evaluation: A Causal Approach

[openreview] [pdf]

Abstract A unifying theme in Artificial Intelligence is learning an effective policy to control an agent in an unknown environment in order to optimize a certain performance measure. Off-policy methods can significantly improve the sample efficiency during training since they allow an agent to learn from observed trajectories generated by different behavior policies, without directly deploying the target policies in the underlying environment. This paper studies off-policy evaluation from biased offline data where (1) unobserved confounding bias cannot be ruled out a priori; or (2) the observed trajectories do not overlap with intended behaviors of the learner, i.e., the target and behavior policies do not share a common support. Specifically, we first extend the Bellman’s equation to derive effective closed-form bounds over value functions from the observational distribution contaminated with unobserved confounding and no-overlap. Second, we propose two novel algorithms that use eligibility traces to estimate these bounds from finite observational data. Compared to other partial identification methods for off-policy evaluation in sequential environments, these methods are model-free and do not rely on additional parametric knowledge about the system dynamics in the underlying environment.

693Numerical Pitfalls in Policy Gradient Updates

[openreview] [pdf]

Abstract Numerical instability, such as gradient explosion, is a fundamental problem in practical deep reinforcement learning (DRL) algorithms. Beyond anecdotal debugging heuristics, there is a lack of systematic understanding of the causes for numerical sensitivity that leads to exploding gradient failures in practice. In this work, we demonstrate that the issue arises from the ill-conditioned density ratio in the surrogate objective that comes from importance sampling, which can take excessively large values during training. Perhaps surprisingly, while various policy optimization methods such as TRPO and PPO prevent excessively large policy updates, their optimization constraints on KL divergence and probability ratio cannot guarantee numerical stability. This also explains why gradient explosion often occurs during DRL training, even with code-level optimizations. We also discuss several potential approaches to ensure numerical stability and the challenges associated with them.

694Retrospective Learning from Interactions

[openreview] [pdf]

Abstract Multi-turn language interactions naturally include implicit feedback signals. For example, if a listener responds in an unexpected way to an instruction, the instructor may rephrase it, express frustration, or pivot to an alternative task. These signals are task-independent and occupy a relatively constrained subspace of language, allowing a language model to identify them even if it fails on the actual task. This holds the promise of continually learning and improving from interactions without additional annotations. We introduceReSpect, a method to learn from signals in past interactions via retrospection. We deployReSpectin a new multimodal interaction scenario, where humans instruct an LLM to solve an abstract reasoning task with a combinatorial solution space. Through thousands of interactions with humans, we show howReSpectgradually improves task completion rate from 31% to 82%, all without any external annotation.

695Policy Design in Long-run Welfare Dynamics

[openreview] [pdf]

Abstract We study a stochastic dynamic model of long-term welfare in a population. Individuals in our model have welfare that improves with intervention and deteriorates in the absence of treatment. The planner can treat one individual at each time step. We contrast two fundamental policies in our model. The utilitarian policy greedily maximizes welfare improvement at each step. The Rawlsian policy intervenes on the individual of lowest welfare. Although hugely influential as a normative proposal, Rawlsian policies have been criticized for failing to optimize social welfare. We prove that, surprisingly, in a meaningful range of parameters Rawlsian policy has greater long-run utility than the utilitarian policy even though it is inferior on short time horizons. Specifically, this is true provided that treatment effects satisfy a weak homogeneity assumption, and the welfare dynamics satisfy a rich-get-richer and poor-get-poorer condition. We extend our results with a comprehensive comparison of different policies under different parameter regimes. Through semi-synthetic simulation studies, we evaluate various policies in cases where the assumptions of our theorems do not hold. Our results illustrate that comparing policies based on short-term evaluations can lead to misleading conclusions.

696Learning with Real-time Improving Predictions in Online MDPs

[openreview] [pdf]

Abstract In this paper, we introduce the Decoupling Optimistic Online Mirror Descent (DOOMD) algorithm, a novel online learning approach designed for episodic Markov Decision Processes with real-time improving predictions. Unlike conventional methods that employ a fixed policy throughout each episode, our approach allows for continuous updates of both predictions and policies within an episode. To achieve this, the DOOMD algorithm decomposes decision-making across states, enabling each state to execute an individual sub-algorithm that considers both immediate and long-term effects on future decisions. We theoretically establish a sub-linear regret bound for the algorithm, providing a guarantee on the worst-case performance.

697The Discretization Complexity Analysis of Consistency Models under Variance Exploding Forward Process

[openreview] [pdf]

Abstract Consistency models, a new class of one-step generative models, have shown state-of-the-art performance in one-step generation and achieve competitive performance compared to multi-step diffusion models. The most challenging part of consistency models is the training process, which discretizes the diffusion process and trains a consistency function to map any point at any discretized timepoint of the diffusion process to the data distribution. Despite the empirical success, only a few works focus on the discretization complexity of consistency models. However, the setting of those works is far away from the empirical consistency models with good performance, suffers from large discretization complexity, and fails to explain the empirical success of consistency models. To bridge the gap between theory and application, we analyze consistency models with two key properties: (1) variance exploding forward process and (2) gradually decay discretization stepsize, which are both widely used in empirical consistency models. Under the above realistic setting, we make the first step to explain the empirical success of consistency models and achieve the state-of-the-art discretization complexity for consistency models, which is competitive with the results of diffusion models. After obtaining the results of the one-step sampling method of consistency models, we further analyze a multi-step consistency sampling algorithm proposed by \citet{song2023consistency} and show that this algorithm improves the discretization complexity compared with one-step generation, which matches the empirical observation.

698Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

[openreview] [pdf]

Abstract Learning from human feedback plays an important role in aligning generative models, such as large language models (LLM). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain--contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm namely robust contextual dueling bandits (\algo), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an O~(dT+dC)\tilde O(d\sqrt{T}+dC) regret bound, where TT is the number of rounds, dd is the dimension of the context, and 0CT 0 \le C \le T is the total number of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without (C=0C=0) adversarial feedback. To the best of our knowledge, our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, we conduct experiments to evaluate our proposed algorithm against various types of adversarial feedback. Experimental results demonstrate its superiority over the state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.

699Bidirectional Consistency Models

[openreview] [pdf]

Abstract Diffusion models (DMs) are capable of generating remarkably high-quality samples by iteratively denoising a random vector, a process that corresponds to moving along the probability flow ordinary differential equation (PF ODE). Interestingly, DMs can also invert an input image to noise by moving backward along the PF ODE, a key operation for downstream tasks such as interpolation and image editing. However, the iterative nature of this process restricts its speed, hindering its broader application. Recently, Consistency Models (CMs) have emerged to address this challenge by approximating the integral of the PF ODE, largely reducing the number of iterations. Yet, the absence of an explicit ODE solver complicates the inversion process. To resolve this, we introduce Bidirectional Consistency Model (BCM), which learns asingleneural network that enables bothforward and backwardtraversal along the PF ODE, efficiently unifying generation and inversion tasks within one framework. We can train BCM from scratch or tune it using a pre-trained consistency model, which reduces the training cost and increases scalability. We demonstrate that BCM enables one-step generation and inversion while also allowing the use of additional steps to enhance generation quality or reduce reconstruction error. We further showcase BCM’s capability in downstream tasks, such as interpolation, inpainting, and blind restoration of compressed images. Notably, when the number of function evaluations (NFE) is constrained, BCM surpasses domain-specific restoration methods, such as I2^2SB and Palette, in a fully zero-shot manner, offering an efficient alternative for inversion problems.

700Feedback Favors the Generalization of Neural ODEs

[openreview] [pdf]

Abstract The well-known generalization problem hinders the application of artificial neural networks in continuous-time prediction tasks with varying latent dynamics. In sharp contrast, biological systems can neatly adapt to evolving environments benefiting from real-time feedback mechanisms. Inspired by the feedback philosophy, we present feedback neural networks, showing that a feedback loop can flexibly correct the learned latent dynamics of neural ordinary differential equations (neural ODEs), leading to a prominent generalization improvement. The feedback neural network is a novel two-DOF neural network, which possesses robust performance in unseen scenarios with no loss of accuracy performance on previous tasks. A linear feedback form is presented to correct the learned latent dynamics firstly, with a convergence guarantee. Then, domain randomization is utilized to learn a nonlinear neural feedback form. Finally, extensive tests including trajectory prediction of a real irregular object and model predictive control of a quadrotor with various uncertainties, are implemented, indicating significant improvements over state-of-the-art model-based and learning-based methods.

701Beyond the Known: Decision Making with Counterfactual Reasoning Decision Transformer

[openreview] [pdf]

Abstract Decision Transformer (DT) plays a crucial role in modern reinforcement learning, leveraging offline datasets to achieve impressive results across various domains. However, DT requires high-quality, comprehensive data to perform optimally. In real-world applications, such ideal data is often lacking, with the underrepresentation of optimal behaviours posing a significant challenge. This limitation highlights the difficulty of relying on offline datasets for training, as suboptimal data can hinder performance. To address this, we propose the Counterfactual Reasoning Decision Transformer (CRDT), a novel framework inspired by counterfactual reasoning. CRDT enhances DT’s ability to reason beyond known data by generating and utilizing counterfactual experiences, enabling improved decision-making in out-of-distribution scenarios. Extensive experiments across continuous and discrete action spaces, including environments with limited data, demonstrate that CRDT consistently outperforms conventional DT approaches. Additionally, reasoning counterfactually allows the DT agent to obtain stitching ability, allowing it to combine suboptimal trajectories. These results highlight the potential of counterfactual reasoning to enhance RL agents’ performance and generalization capabilities.

702Adversarial Policy Optimization for Preference-based Reinforcement Learning

[openreview] [pdf]

Abstract In this paper, we study offline preference-based reinforcement learning (PbRL), where learning is based on pre-collected preference feedback over pairs of trajectories. While offline PbRL has demonstrated remarkable empirical success, existing theoretical approaches face challenges in ensuring conservatism under uncertainty, requiring computationally intractable confidence set constructions. We address this limitation by proposing Adversarial Preference-based Policy Optimization (APPO), a computationally efficient algorithm for offline PbRL that guarantees sample complexity bounds without relying on explicit confidence sets. By framing PbRL as a two-player game between a policy and a model, our approach enforces conservatism in a tractable manner. Using standard assumptions on function approximation and bounded trajectory concentrability, we derive sample complexity bound. To our knowledge, APPO is the first offline PbRL algorithm to offer both statistical efficiency and practical applicability. Experimental results on continuous control tasks demonstrate that APPO effectively learns from complex datasets, showing comparable performance with existing state-of-the-art methods.

703Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

[openreview] [pdf]

Abstract The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy-Tokenized Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, within each transformer block, we compute an averaging token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 49% reduction compared to DiT and a 34% reduction compared to PixArt-α\alpha). The visual exhibition of Qihoo-T2X is available athttps://qihoo-t2x.github.io/.

704Learning Utilities from Demonstrations in Markov Decision Processes

[openreview] [pdf]

Abstract Our goal is to extract useful knowledge from demonstrations of behavior in sequential decision-making problems. Although it is well-known that humans commonly engage inrisk-sensitivebehaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume arisk-neutralagent. Beyond introducing model misspecification, these models do not directly capture the risk attitude of the observed agent, which can be crucial in many applications. In this paper, we propose a novel model of behavior in Markov Decision Processes (MDPs) that explicitly represents the agent’s risk attitude through autilityfunction. We then define the Utility Learning (UL) problem as the task of inferring the observed agent’s risk attitude, encoded via a utility function, from demonstrations in MDPs, and we analyze the partial identifiability of the agent’s utility. Furthermore, we devise two provably efficient algorithms for UL in a finite-data regime, and we analyze their sample complexity. We conclude with proof-of-concept experiments that empirically validate both our model and our algorithms.

705FOSP: Fine-tuning Offline Safe Policy through World Models

[openreview] [pdf]

Abstract Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the deployment of vision-based robotic tasks through online fine-tuning an offline pretrained policy. To facilitate effective fine-tuning, we introduce model-based RL, which is known for its data efficiency. Specifically, our method employs in-sample optimization to improve offline training efficiency while incorporating reachability guidance to ensure safety. After obtaining an offline safe policy, safe policy expansion approach is leveraged for online fine-tuning. The performance of our method is validated on simulation benchmarks with five vision-only tasks and through real-world robot deployment using limited data. It demonstrates that our approach significantly improves the generalization of offline policies to unseen safety-constrained scenarios. To the best of our knowledge, this is the first work to explore offline-to-online RL for safe generalization tasks. The videos are available athttps://sites.google.com/view/safefinetune/home.

706Deconstructing Denoising Diffusion Models for Self-Supervised Learning

[openreview] [pdf]

Abstract In this study, we examine the representation learning abilities of Denoising Diffusion Models (DDM) that were originally purposed for image generation. Our philosophy is to deconstruct a DDM, gradually transforming it into a classical Denoising Autoencoder (DAE). This deconstructive process allows us to explore how various components of modern DDMs influence self-supervised representation learning. We observe that only a very few modern components are critical for learning good representations, while many others are nonessential. Our study ultimately arrives at an approach that is highly simplified and to a large extent resembles a classical DAE. We hope our study will rekindle interest in a family of classical methods within the realm of modern self-supervised learning.

707Efficient Diffusion Models for Symmetric Manifolds

[openreview] [pdf]

Abstract We present a framework for designing efficient diffusion models on symmetric Riemannian manifolds, which include the torus, sphere, special orthogonal group, and unitary group. While diffusion models on symmetric manifolds have gained significant attention, existing approaches often rely on the manifolds’ heat kernels, which lack closed-form expressions and result in exponential-in-dimension per-iteration runtimes during training. We introduce a new diffusion model for symmetric-space manifolds, leveraging a projection of Euclidean Brownian motion to bypass explicit heat kernel computations. Our training algorithm minimizes a novel objective function derived via Ito’s Lemma, with efficiently computable gradients, allowing each iteration to run in polynomial time for symmetric manifolds. Additionally, the symmetries of the manifold ensure the diffusion satisfies an “average-case” Lipschitz condition, enabling accurate and efficient sample generation. These improvements enhance both the training runtime and sample accuracy for key cases of symmetric manifolds, helping to bridge the gap between diffusion models on symmetric manifolds and Euclidean space.

708Pretraining Decision Transformers with Reward Prediction for In-Context Structured Bandit Learning

[openreview] [pdf]

Abstract In this paper, we study the multi-task structured bandit problem where the goal is to learn a near-optimal algorithm that minimizes cumulative regret. The tasks share a common structure and the algorithm exploits the shared structure to minimize the cumulative regret for an unseen but related test task. We use a transformer as a decision-making algorithm to learn this shared structure so as to generalize to the test task. The prior work of pretrained decision transformers like \dpt\ requires access to the optimal action during training which may be hard in several scenarios. Diverging from these works, our learning algorithm does not need the knowledge of optimal action per task during training but predicts a reward vector for each of the actions using only the observed offline data from the diverse training tasks. Finally, during inference time, it selects action using the reward predictions employing various exploration strategies in-context for an unseen test task. We show that our model outperforms other SOTA methods like \dpt, and Algorithmic Distillation (\ad) over a series of experiments on several structured bandit problems (linear, bilinear, latent, non-linear). Interestingly, we show that our algorithm, without the knowledge of the underlying problem structure, can learn a near-optimal policy in-context by leveraging the shared structure across diverse tasks. We further extend the field of pre-trained decision transformers by showing that they can leverage unseen tasks with new actions and still learn the underlying latent structure to derive a near-optimal policy. We validate this over several experiments to show that our proposed solution is very general and has wide applications to potentially emergent online and offline strategies at test time. Finally, we theoretically analyze the performance of our algorithm and obtain generalization bounds in the in-context multi-task learning setting.

709Is Memorization Actually Necessary for Generalization?

[openreview] [pdf]

Abstract Memorization is the ability of deep models to associate training data with seemingly random labels. Even though memorization may not align with a model’s ability to generalize, recent work by~\citet{feldman2020longtail} has demonstrated that memorization is in fact \textit{necessary} for generalization. However, upon closer inspection, we find that their methodology has three limitations. First, the definition of memorization is imprecise, leading to contradictory results. Second, their proposed algorithm used for \textit{approximating} the leave-one-out test (the gold standard for calculating memorization scores) suffers from a high approximation error. Three, the authors induce a distribution shift when calculating marginal utility, leading to flawed results. Having accounted for these errors, we re-evaluate the role of memorization on generalization. We show that most memorization thresholds (the value that dictates whether a point is memorized or not) do not have a statistically significant impact on model accuracy, contrary to what was previously reported. In light of these findings, future researchers are encouraged to design techniques that can accurately approximate memorization scores.

710Enhancing Graph Invariant Learning from a Negative Inference Perspective

[openreview] [pdf]

Abstract The out-of-distribution (OOD) generalization challenge is a longstanding problem in graph learning. Through studying the fundamental cause of data distribution shift, i.e., the changes of environments, significant progress has been achieved in addressing this issue. However, we observe that existing works still fail to effectively address complex environment shifts. Previous practices place excessive attention on extracting causal subgraphs, inevitably treating spurious subgraphs as environment variables. While spurious subgraphs are controlled by environments, the space of environment changes encompass more than the scale of spurious subgraphs. Therefore, existing efforts have a limited inference space for environments, leading to failure under severe environment changes. To tackle this issue, we propose a negative inference graph OOD framework (NeGo) to broaden the inference space for environment factors. Inspired by the successful practice of prompt learning in capturing underlying semantics and causal associations in large language models, we design a negative prompt environment inference to extract underlying environment information. We further introduce the environment-enhanced invariant subgraph learning method to effectively exploit inferred environment embedding, ensuring the robust extraction of causal subgraph in the environment shifts. Lastly, we conduct a comprehensive evaluation of NeGo on real-world datasets and synthetic datasets across domains. NeGo outperforms baselines on nearly all datasets, which verify the effectiveness of our framework. Our source code is available at \url{https://anonymous.4open.science/r/NeGo-E4C1}.

711Fixing Data Augmentations for Out-of-distribution Detection

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection methods, especially post-hoc methods, rely on off-the-shelf pre-trained models. Existing literature shows how OOD and ID performance are correlated, i.e. stronger models with better ID performance tend to perform better in OOD detection. However, significant performance discrepancies exist between model versions, sometimes exceeding the impact of the OOD detection methods themselves. In this study, we systematically investigated this issue and identified two main factors—label smoothing and mixup—that, while improving in-distribution accuracy, lead to a decline in OOD detection performance. We provide empirical and theoretical explanations for this phenomenon and propose a solution that enhances OOD Detection while maintaining strong in-distribution performance. Code will be released upon acceptance.

712Single-Step Diffusion Model-Based Generative Model Inversion Attacks

[openreview] [pdf]

Abstract Generative model inversion attacks (MIAs) have garnered increasing attention for their ability to reconstruct synthetic samples that closely resemble private training data, exposing significant privacy risks in machine learning models. The success of generative MIAs is primarily attributed to image priors learned by generative adversarial networks (GANs) on public auxiliary data, which help constrain the optimization space during the inversion process. However, GAN-based generative MIAs still face limitations, particularly regarding the instability during model inversion optimization and the fidelity of reconstructed samples, indicating substantial room for improvement. In this paper, we address these challenges by exploring generative MIAs based on diffusion models, which offer superior generative performance compared to GANs. Specifically, we replace the GAN generator in existing generative MIAs with a single-step generator distilled from pretrained diffusion models, constraining the search space to the manifold of the generator during the inversion process. In addition, we leverage generative model inversion techniques to investigate privacy leakage issues in widely used large-scale multimodal models, particularly CLIP, highlighting the inherent privacy risks in these models. Our extensive experiments demonstrate that single-step diffusion models-based MIAs significantly outperform their GAN-based counterparts, achieving substantial improvements in traditional metrics and greatly enhancing the visual fidelity of reconstructed samples. This research uncovers vulnerabilities in CLIP models and opens new research directions in generative MIAs.

713Rethinking Reward Model Evaluation: Are We Barking up the Wrong Tree?

[openreview] [pdf]

Abstract Reward Models (RMs) are crucial for aligning language models with human preferences. Currently, the evaluation of RMs depends on measuring accuracy against a validation set of manually annotated preference data. Although this method is straightforward and widely adopted, the relationship between RM accuracy and downstream policy performance remains under-explored. In this work, we conduct experiments in a synthetic setting to investigate how differences in RM measured by accuracy translate into gaps in optimized policy performance. Our findings reveal that while there is a weak positive correlation between accuracy and downstream performance, policies optimized towards RMs with similar accuracy can exhibit quite different performance. Moreover, we discover that the way of measuring accuracy significantly impacts its ability to predict the final policy performance. Through the lens of Regressional Goodhart’s effect, we identify the existence of exogenous variables impacting the relationship between RM quality measured by accuracy and policy model capability. This underscores the inadequacy of relying solely on accuracy to reflect their impact on policy optimization.

714On the Relation Between Linear Diffusion and Power Iteration

[openreview] [pdf]

Abstract Recently, diffusion models have gained popularity due to their impressive generative abilities. These models learn the implicit distribution given by the training dataset, and sample new data by transforming random noise through the reverse process, which can be thought of as gradual denoising. In this work, we examine the generation process as a ``correlation machine’', where random noise is repeatedly enhanced in correlation with the implicit given distribution. To this end, we explore the linear case, where the optimal denoiser is known to be the PCA projection. This enables us to connect the theory of diffusion models to the spiked covariance model, where the dependence of the denoiser on the noise level and the amount of training data can be expressed analytically, in the rank-1 case. In a series of numerical experiments, we extend this result to general low rank data, and show that low frequencies emerge earlier in the generation process, where the denoising basis vectors are more aligned to the true data with a rate depending on their eigenvalues. This model allows us to show that the linear diffusion model converges in mean to the leading eigenvector of the underlying data, similarly to the prevalent Power Iteration method. Finally, we empirically demonstrate the applicability of our findings beyond the linear case, in the Jacobians of a deep, non-linear denoiser, used in general image generation tasks.

715On Extending Direct Preference Optimization to Accommodate Ties

[openreview] [pdf]

Abstract We derive and investigate two DPO variants that explicitly model the possibility of declaring a tie in pair-wise comparisons. We replace the Bradley-Terry model in DPO with two well-known modeling extensions, by Rao and Kupper and by Davidson, that assign probability to ties as alternatives to clear preferences. Our experiments in neural machine translation and summarization show that explicitly labeled ties can be added to the datasets for these DPO variants without the degradation in task performance that is observed when the same tied pairs are presented to DPO. We find empirically that the inclusion of ties leads to stronger regularization with respect to the reference policy as measured by KL divergence, and we see this even for DPO in its original form. These findings motivate and enable the inclusion of tied pairs in preference optimization as opposed to simply discarding them.

716One to All: Individual Reweighting for User-Oriented Fairness in Recommender Systems

[openreview] [pdf]

Abstract Recommender systems often manifest biases toward a small user group, resulting in pronounced disparities in recommendation performance, i.e., the User-Oriented Fairness (UOF) issue. Existing research on UOF faces three major limitations, and no single approach effectively addresses all of them. Limitation 1: Post-processing methods fail to address the root cause of the UOF issue. Limitation 2: Some in-processing methods rely heavily on unstable user similarity calculations under severe data sparsity problems. Limitation 3: Other in-processing methods overlook the disparate treatment of individual users within user groups. In this paper, we propose a novel Individual Reweighting for User-Oriented Fairness framework, namely IR-UOF, to address all the aforementioned limitations. IR-UOF serves as a versatile solution applicable across various backbone recommendation models to achieve UOF. The motivation behind IR-UOF is to introduce an in-processing strategy that addresses the UOF issue at the individual level without the need to explore user similarities. We conduct extensive experiments on three real-world datasets using four backbone recommendation models to demonstrate the effectiveness of IR-UOF in mitigating UOF and improving recommendation fairness.

717A Causal Theoretical Framework for Open Set Domain Adaptation

[openreview] [pdf]

Abstract Open Set Domain Adaptation (OSDA) faces two critical challenges: the emergence of unknown classes in the target domain and changes in observed distributions across domains. Although numerous studies have proposed advanced algorithms, recent experimental results demonstrate that the classical Empirical Risk Minimization (ERM) approach still delivers state-of-the-art performance. However, few theories can effectively explain this disputed phenomenon. To address the theoretical gap, we focus on constructing a causal theoretical framework for OSDA. We formulate the novel concepts of the Fully Informative Causal Invariance Model (FICIM) and the Partially Informative Causal Invariance Model (PICIM). Subsequently, We derive an OSDA theoretical bound to prove that the ERM performs well when the source domain follows FICIM, while it performs poorly when the source domain follows PICIM. The different results may be attributed to the varying amounts of available information when bounding the target domain’s stable expected risk. Finally, across different datasets, we conduct extensive experiments on the FICIM and PICIM source domains to validate the effectiveness of our theoretical results.

718Spatial-aware decision-making with ring attractors in Reinforcement Learning systems

[openreview] [pdf]

Abstract This paper explores the integration of ring attractors, a mathematical model inspired by neural circuit dynamics, into the reinforcement learning (RL) action selection process. Ring attractors, as specialized brain-inspired structures that encode spatial information and uncertainty, offer a biologically plausible mechanism to improve learning speed and predictive performance. They do so by explicitly encoding the action space, facilitating the organization of neural activity, and enabling the distribution of spatial representations across the neural network in the context of deep RL. The application of ring attractors in the RL action selection process involves mapping actions to specific locations on the ring and decoding the selected action based on neural activity. We investigate the application of ring attractors by both building them as exogenous models and integrating them as part of a Deep Learning policy algorithm. Our results show a significant improvement in state-of-the-art models for the Atari 100k benchmark. Notably, our integrated approach improves the performance of state-of-the-art models by half, representing a 53% increase over selected baselines.

719CFG++: Manifold-constrained Classifier Free Guidance for Diffusion Models

[openreview] [pdf]

Abstract Classifier-free guidance (CFG) is a fundamental tool in modern diffusion models for text-guided generation. Although effective, CFG has notable drawbacks. For instance, DDIM with CFG lacks invertibility, complicating image editing; furthermore, high guidance scales, essential for high-quality outputs, frequently result in issues like mode collapse. Contrary to the widespread belief that these are inherent limitations of diffusion models, this paper reveals that the problems actually stem from the off-manifold phenomenon associated with CFG, rather than the diffusion models themselves. More specifically, inspired by the recent advancements of diffusion model-based inverse problem solvers (DIS), we reformulate text-guidance as an inverse problem with a text-conditioned score matching loss and develop CFG++, a novel approach that tackles the off-manifold challenges inherent in traditional CFG. CFG++ features a surprisingly simple fix to CFG, yet it offers significant improvements, including better sample quality for text-to-image generation, invertibility, smaller guidance scales, reduced etc. Furthermore, CFG++ enables seamless interpolation between unconditional and conditional sampling at lower guidance scales, consistently outperforming traditional CFG at all scales. Moreover, CFG++ can be easily integrated into the high-order diffusion solvers and naturally extends to distilled diffusion models. Experimental results confirm that our method significantly enhances performance in text-to-image generation, DDIM inversion, editing, and solving inverse problems, suggesting a wide-ranging impact and potential applications in various fields that utilize text guidance. Project Page:https://cfgpp-diffusion.github.io/anon

720Efficient Perplexity Bound and Ratio Matching in Discrete Diffusion Language Models

[openreview] [pdf]

Abstract While continuous diffusion models excel in modeling continuous distributions, their application to categorical data has been less effective. Recent work has shown that ratio-matching throughscore-entropywithin a continuous-time discrete Markov chain (CTMC) framework serves as a competitive alternative to autoregressive models in language modeling. To enhance this framework, we first introduce three new theorems concerning the KL divergence between the data and learned distribution. Our results serve as the discrete counterpart to those established for continuous diffusion models and allow us to derive an improved upper bound of the perplexity. Second, we empirically show that ratio-matching performed by minimizing thedenoising cross-entropybetween the clean and corrupted data enables models to outperform those utilizing score-entropy with up to 10% lower perplexity/generative-perplexity, and 15% faster training steps. To further support our findings, we introduce and evaluate a novel CTMC transition-rate matrix that allows prediction refinement, and derive the analytic expression for its matrix exponential which facilitates the computation of conditional ratios thus enabling efficient training and generation.

721Elephant in the Room: Unveiling the Pitfalls of Human Proxies in Alignment

[openreview] [pdf]

Abstract The demand for regulating the behavior of large language models (LLMs) has ignited research on alignment algorithms, the essence of which is to align LLMs’ generations with human preferences. Due to infeasibility of humans directly participating in the training or generation of LLMs, existing alignment algorithms choose to align with human preferences carried by proxies, i.e., preference data or reward models. However, whether these human proxies faithfully represent human preferences remain under-explored. We categorize human proxies into two levels based on the degree to which they directly embody human preferences: Level-1 Proxy (preference data) and Level-2 Proxy (reward models). We empirically examine the faithfulness of both levels of proxies and its impacts on alignment performance. We notice that current algorithms tend to overlook the faithfulness of these proxies in reflecting human preferences; many works even directly use reward models as their automatic evaluators without any correlation verification. Current literature of alignment overly focuses on optimizing algorithms, rendering the faithfulness of human proxies an “elephant in the room”—something extremely important yet largely overlooked. According to experimental results, we unveil potential risks of using inferiorhuman proxies’‘, aiming to arouse attention to this hugeelephant’’ in alignment research. We summarize existing pitfalls from different angles and provide a re-labeled preference dataset and insights about reward model usage to facilitate the healthy development of alignment\footnote{This work contains examples that potentially implicate stereotypes, associations, and other harms that could be offensive to individuals in certain social groups.}.

722Looking into User’s Long-term Interests through the Lens of Conservative Evidential Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) been increasingly employed in modern recommender systems to capture users’ evolving preferences, leading to continuously improved recommendations. In this paper, we propose a novel evidential conservative Q-learning framework (ECQL) that learns an effective and conservative recommendation policy by integrating evidence-based uncertainty and conservative learning. ECQL conducts evidence-aware explorations to discover items that are located beyond current observations but reflect users’ long-term interests. It offers an uncertainty-aware conservative view on policy evaluation to discourage deviating too much from users’ current interests. Two central components of ECQL include a uniquely designed sequential state encoder and a novel conservative evidential-actor-critic (CEAC) module. The former generates the current state of the environment by aggregating historical information and a sliding window that contains the current user interactions as well as newly recommended items from RL exploration that may represent future interests. The latter performs an evidence-based rating prediction by maximizing the conservative evidential Q-value and leverages an uncertainty-aware ranking score to explore the item space for a more diverse and valuable recommendation. Experiments on multiple real-world dynamic datasets demonstrate the state-of-the-art performance of ECQL and its capability to capture users’ long-term interests.

723Latent Trajectory: A New Framework for Actor-Critic Reinforcement Learning with Uncertainty Quantification

[openreview] [pdf]

Abstract Uncertainty quantification for deep neural networks is crucial for building reliable modern AI models. This challenge is particularly pronounced in deep reinforcement learning, where agents continuously learn from their interactions with stochastic environments, and the uncertainty of the value function is a key concern for ensuring reliable and robust RL applications. The complexity increases in actor-critic methods, as the training process alternates between optimizing the actor and critic networks, whose optimization nature makes the uncertainty of the value function hard to be quantified. To address this issue, we introduce a novel approach to RL training that conceptualizes transition trajectories as latent variables. Building on this framework, we propose an adaptive Stochastic Gradient Markov Chain Monte Carlo (SGMCMC) algorithm for training deep actor-critic models. This new training method allows for the implicit integration of latent transition trajectories, resulting in a trajectory-independent training process. We provide theoretical guarantees for the convergence of our algorithm and offer empirical evidence showing improvements in both performance and robustness of the deep actor-critic model under our Latent Trajectory Framework (LTF). Furthermore, this framework enables accurate uncertainty quantification for the value function of the RL system, paving the way for more reliable and robust RL applications.

724Diverse Preference Learning for Capabilities and Alignment

[openreview] [pdf]

Abstract As LLMs increasingly impact society, their ability to represent diverse perspectives is critical. However, recent studies reveal that alignment algorithms such as RLHF and DPO significantly reduce the diversity of LLM outputs. Not only do aligned LLMs generate text with repetitive structure and word choice, they also approach problems in more uniform ways, and their responses reflect a narrower range of societal perspectives. We attribute this problem to the KL divergence regularizer employed in preference learning algorithms. This causes the model to overweight majority opinions and sacrifice diversity in exchange for optimal reward. To address this, we propose Diverse Preference Learning, which decouples the entropy and cross-entropy terms in the KL penalty — allowing for fine-grained control over LLM generation diversity. From a capabilities perspective, LLMs trained using Diverse Preference Learning attain higher accuracy on difficult repeated sampling tasks and produce outputs with greater semantic and lexical diversity. From an alignment perspective, they are capable of representing a wider range of societal viewpoints and display improved logit calibration. Notably, Diverse Preference Learning resembles, but is a Pareto improvement over standard temperature scaling.

725Utilizing Explainable Reinforcement Learning to Improve Reinforcement Learning: A Theoretical and Systematic Framework

[openreview] [pdf]

Abstract Reinforcement learning (RL) faces two challenges: (1) The RL agent lacks explainability. (2) The trained RL agent is, in many cases, non-optimal and even far from optimal. To address the first challenge, explainable reinforcement learning (XRL) is proposed to explain the decision-making of the RL agent. In this paper, we demonstrate that XRL can also be used to address the second challenge, i.e., improve RL performance. Our method has two parts. The first part provides a two-level explanation for why the RL agent is not optimal by identifying the mistakes made by the RL agent. Since this explanation includes the mistakes of the RL agent, it has the potential to help correct the mistakes and thus improve RL performance. The second part formulates a constrained bi-level optimization problem to learn how to best utilize the two-level explanation to improve RL performance. In specific, the upper level learns how to use the high-level explanation to shape the reward so that the corresponding policy can maximize the cumulative ground truth reward, and the lower level learns the corresponding policy by solving a constrained RL problem formulated using the low-level explanation. We propose a novel algorithm to solve this constrained bi-level optimization problem, and theoretically guarantee that the algorithm attains global optimality. We use MuJoCo experiments to show that our method outperforms state-of-the-art baselines.

726Thinking Forward and Backward: Effective Backward Planning with Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) have exhibited remarkable reasoning and planning capabilities. Most prior work in this area has used LLMs to reason through steps from an initial to a goal state or criterion, thereby effectively reasoning in a forward direction. Nonetheless, many planning problems exhibit an inherent asymmetry such that planning backward from the goal is significantly easier --- for example, if there are bottlenecks close to the goal. We take inspiration from this observation and demonstrate that this bias holds for LLM planning as well: planning performance in one direction correlates with the planning complexity of the problem in that direction. However, our experiments also reveal systematic biases which lead to poor planning in the backward direction. With this knowledge, we propose a backward planning algorithm for LLMs that first flips the problem and then plans forward in the flipped problem. This helps avoid the backward bias, generate more diverse candidate plans, and exploit asymmetries between the forward and backward directions in planning problems --- we find that combining planning in both directions with self-verification improves the overall planning success rates by 4-24% in three planning domains.

727Task Characteristic and Contrastive Contexts for Improving Generalization in Offline Meta-Reinforcement Learning

[openreview] [pdf]

Abstract Context-based offline meta-reinforcement learning (meta-RL) methods typically extract contexts summarizing task information from historical trajectories to achieve adaptation to unseen target tasks. Nevertheless, previous methods may lack generalization and suffer from ineffective adaptation. Our key insight to counteract this issue is that they fail to capture both task characteristic and task contrastive information when generating contexts. In this work, we propose a framework called task characteristic and contrastive contexts for offline meta-RL (TCMRL), which consists of a task characteristic extractor and a task contrastive loss. More specifically, the task characteristic extractor aims at identifying transitions within a trajectory, that are characteristic of a task, when generating contexts. Meanwhile, the task contrastive loss favors the learning of task information that distinguishes tasks from one another by considering interrelations among transitions of trajectory subsequences. Contexts that include both task characteristic and task contrastive information provide a comprehensive understanding of the tasks themselves and implicit relationships among tasks. Experiments in meta-environments show the superiority of TCMRL over previous offline meta-RL methods in generating more generalizable contexts, and achieving efficient and effective adaptation to unseen target tasks.

728CADO: Cost-Aware Diffusion Models for Combinatorial Optimization via RL Fine-tuning

[openreview] [pdf]

Abstract Recent advancements in Machine Learning (ML) have demonstrated significant potential in addressing Combinatorial Optimization (CO) problems through data-driven approaches. Heatmap-based methods, which generate solution heatmaps in a single step and employ an additional decoder to derive solutions for CO tasks, have shown promise due to their scalability for large-scale problems. Traditionally, these complex models are trained using imitation learning with optimal solutions, often leveraging diffusion models. However, our research has identified several limitations inherent in these imitation learning approaches within the context of CO tasks. To overcome these challenges, we propose a 2-phase training framework for diffusion models in CO, incorporating Reinforcement Learning (RL) fine-tuning. Our methodology integrates cost information and the post-process decoder into the training process, thereby enhancing the solver’s capacity to generate effective solutions. We conducted extensive experiments on well-studied combinatorial optimization problems, specifically the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), ranging from small-scale instances to large-scale scenarios. The results demonstrate the significant efficacy of our RL fine-tuning framework, surpassing previous state-of-the-art methods in performance.

729DRoP: Distributionally Robust Pruning

[openreview] [pdf]

Abstract In the era of exceptionally data-hungry models, careful selection of the training data is essential to mitigate the extensive costs of deep learning. Data pruning offers a solution by removing redundant or uninformative samples from the dataset, which yields faster convergence and improved neural scaling laws. However, little is known about its impact on classification bias of the trained models. We conduct the first systematic study of this effect and reveal that existing data pruning algorithms can produce highly biased classifiers. We present theoretical analysis of the classification risk in a mixture of Gaussians to argue that choosing appropriate class pruning ratios, coupled with random pruning within classes has potential to improve worst-class performance. We thus propose DRoP, a distributionally robust approach to pruning and empirically demonstrate its performance on standard computer vision benchmarks. In sharp contrast to existing algorithms, our proposed method continues improving distributional robustness at a tolerable drop of average performance as we prune more from the datasets.

730Seeking Global Flat Minima in Federated Domain Generalization via Constrained Adversarial Augmentation

[openreview] [pdf]

Abstract Federated domain generalization (FedDG) aims at equipping the federally trained model with the domain generalization ability when the model meets new clients with domain shifts. Among factors that possibly indicate generalization, the loss landscape flatness of the trained model is an intuitive, viable, and widely studied one. However, pursuing the flatness of the global model in the FedDG setting is not trivial due to the restriction to preserve data privacy. To address this issue, we propose GFM, a novel algorithm designed to seek Global Flat Minima of the global model. Specifically, GFM leverages a global model-constrained adversarial data augmentation strategy, creating a surrogate for global data within each local client, which allows for split sharpness-aware minimization to approach global flat minima. GFM is compatible with federated learning without compromising data privacy restrictions, and theoretical analysis further supports its rationality by demonstrating that the objective of GFM serves as an upper bound on the robust risk of the global model on global data distribution. Extensive experiments on multiple FedDG benchmarks demonstrate that GFM consistently outperforms previous FedDG and federated learning approaches.

731The Utility and Complexity of In- and Out-of-Distribution Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning, the process of selectively removing data from trained models, is increasingly crucial for addressing privacy concerns and knowledge gaps post-deployment. Despite this importance, existing approaches are often heuristic and lack formal guarantees. In this paper, we analyze the fundamental utility, time, and space complexity trade-offs of approximate unlearning, providing rigorous certification analogous to differential privacy. For in-distribution data, we show that a surprisingly simple and general procedure—empirical risk minimization with output perturbation—achieves tight unlearning-utility-complexity trade-offs, addressing a previous theoretical gap on the separation from unlearning ``for free" via differential privacy. However, such techniques fail out-of-distribution, where unlearning time complexity can exceed that of retraining, even for a single sample. To address this, we propose a new robust and noisy gradient descent variant that provably amortizes unlearning time complexity without compromising utility.

732DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction

[openreview] [pdf]

Abstract Quantifying the uncertainty in the factual parametric knowledge of Large Language Models (LLMs), especially in a black-box setting, poses a significant challenge. Existing methods, which gauge a model’s uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. Models might respond consistently to the origin query with a wrong answer, yet respond correctly to varied questions from different perspectives about the same query, and vice versa. In this paper, we propose a novel method, DiverseAgentEntropy, for evaluating a model’s uncertainty using multi-agent interaction under the assumption that if a model is certain, it should consistently recall the answer to the original query across a diverse collection of questions about the same original query. We further implement an abstention policy to withhold responses when uncertainty is high. Our method offers a more accurate prediction of the model’s reliability by detecting hallucinations, improving upon self-consistency-based uncertainty methods by 2.5%. Additionally, it demonstrates that existing models often fail to consistently retrieve the correct answer to the same query under diverse varied questions.

733Dynamic Multi-product Selection and Pricing under Preference Feedback

[openreview] [pdf]

Abstract In this study, we investigate the problem of dynamic multi-product selection and pricing by introducing a novel framework based on acensored multinomial logit(C-MNL) choice model. In this model, sellers present a set of products with prices, and buyers filter out products priced above their valuation, purchasing at most one product from the remaining options based on their preferences. The goal is to maximize seller revenue by dynamically adjusting product offerings and prices, while learning both product valuations and buyer preferences through purchase feedback. To achieve this, we propose a Lower Confidence Bound (LCB) pricing strategy. By combining this pricing strategy with either an Upper Confidence Bound (UCB) or Thompson Sampling (TS) product selection approach, our algorithms achieve regret bounds of O~(d32T)\tilde{O}(d^{\frac{3}{2}}\sqrt{T}) and O~(d2T)\tilde{O}(d^{2}\sqrt{T}), respectively. Finally, we validate the performance of our methods through simulations, demonstrating their effectiveness.

734Q-based Variational Inverse Reinforcement Learning

[openreview] [pdf]

Abstract The development of safe and beneficial AI requires that systems can learn and act in accordance with human preferences. However, explicitly specifying these preferences by hand is often infeasible. Inverse reinforcement learning (IRL) addresses this challenge by inferring preferences, represented as reward functions, from expert behavior. We introduce Q-based Variational IRL (QVIRL), a novel Bayesian IRL method that recovers a posterior distribution over rewards from expert demonstrations via primarily learning a variational distribution over Q-values. Unlike previous approaches, QVIRL combines scalability with uncertainty quantification, important for safety-critical applications. We demonstrate QVIRL’s strong performance in apprenticeship learning across various tasks, including classical control problems and safe navigation in the Safety Gymnasium suite, where the method’s uncertainty quantification allows us to produce safer policies.

735Learning from Preferences and Mixed Demonstrations in General Settings

[openreview] [pdf]

Abstract Reinforcement learning is a general method for learning in sequential settings, but it can often be difficult to specify a good reward function when the task is complex. In these cases, preference feedback or expert demonstrations can be used instead. However, existing approaches utilising both together are either ad-hoc or rely on domain-specific properties. Building upon previous work, we develop a novel theoretical framework for learning from human data. Based on this we introduce LEOPARD: Learning Estimated Objectives from Preferences And Ranked Demonstrations. LEOPARD can simultaneously learn from a broad range of data, including negative/failed demonstrations, to effectively learn reward functions in general domains. We find that when a limited amount of human feedback is available, LEOPARD outperforms the current standard practice of pre-training on demonstrations and finetuning on preferences. Furthermore, we show that LEOPARD learns faster when given many types of feedback, rather than just a single one.

736Identifying and Addressing Delusions for Target-Directed Decision Making

[openreview] [pdf]

Abstract We are interested in target-directed agents, which produce targets during decision-time planning, to guide their behaviors and achieve better generalization during evaluation. Improper training of these agents can result in delusions: the agent may come to hold false beliefs about the targets, which cannot be properly rejected, leading to unwanted behaviors and damaging out-of-distribution generalization. We identify different types of delusions by using intuitive examples in carefully controlled environments, and investigate their causes. We demonstrate how delusions can be addressed for agents trained by hindsight relabeling, a mainstream approach in for training target-directed RL agents. We validate empirically the effectiveness of the proposed solutions in correcting delusional behaviors and improving out-of-distribution generalization.

737Provable unlearning in topic modeling and downstream tasks

[openreview] [pdf]

Abstract Machine unlearning algorithms are increasingly important as legal concerns arise around the provenance of training data, but verifying the success of unlearning is often difficult. Provable guarantees for unlearning are often limited to supervised learning settings. In this paper, we provide the first theoretical guarantees for unlearning in the pre-training and fine-tuning paradigm by studying topic models, simple bag-of-words language models that can be adapted to solve downstream tasks like retrieval and classification. First, we design a provably effective unlearning algorithm for topic models that incurs a computational overhead independent of the size of the original dataset. Our analysis additionally quantifies the deletion capacity of the model -- i.e., the number of examples that can be unlearned without incurring a significant cost in model performance. Finally, we formally extend our analyses to account for adaptation to a given downstream task. In particular, we design an efficient algorithm to perform unlearning after fine-tuning the topic model via a linear head. Notably, we show that it is easier to unlearn pre-training data from models that have been fine-tuned to a particular task, and one can unlearn this data without modifying the base model.

738Emphasizing Discriminative Features for Dataset Distillation in Complex Scenarios

[openreview] [pdf]

Abstract Dataset distillation has demonstrated strong performance on simple datasets like CIFAR, MNIST, and TinyImageNet but struggles to achieve similar results in more complex scenarios. In this paper, we propose a novel approach that \textbf{e}mphasizes the \textbf{d}iscriminative \textbf{f}eatures (obtained by Grad-CAM) for dataset distillation, called \textbf{EDF}. Our approach is inspired by a key observation: in simple datasets, high-activation areas typically occupy most of the image, whereas in complex scenarios, the size of these areas is much smaller. Unlike previous methods that treat all pixels equally when synthesizing images, EDF uses Grad-CAM activation maps to enhance high-activation areas. From a supervision perspective, we downplay supervision signals that have lower losses, as they contain common patterns. Additionally, to help the DD community better explore complex scenarios, we build the Complex Dataset Distillation (Comp-DD) benchmark by meticulously selecting sixteen subsets, eight easy and eight hard, from ImageNet-1K. Notably, EDF consistently outperforms SOTA results in complex scenarios, such as ImageNet-1K subsets. Hopefully, more researchers will be inspired and encouraged to enhance the practicality and efficacy of DD. Our code and benchmark will be made public.

739InverseBench: Benchmarking Plug-and-Play Diffusion Models for Scientific Inverse Problems

[openreview] [pdf]

Abstract Plug-and-play diffusion models have emerged as a promising research direction for solving inverse problems. However, current studies primarily focus on natural image restoration, leaving the performance of these algorithms in scientific inverse problems largely unexplored. To address this gap, we introduce \textsc{InverseBench}, a unified framework that evaluates diffusion models across five distinct scientific inverse problems. These problems present unique structural challenges that differ from existing benchmarks, arising from critical scientific applications such as black hole imaging, seismology, optical tomography, medical imaging, and fluid dynamics. With \textsc{InverseBench}, we benchmark 15 inverse problem algorithms that use plug-and-play diffusion models against strong, domain-specific baselines, offering valuable new insights into the strengths and weaknesses of existing algorithms. We open-source the datasets, pre-trained models, and the codebase to facilitate future research and development.

740Novelty Unlocking with Multiobjective Generative Models: Batch Diversity of Human Motions

[openreview] [pdf]

Abstract Current generative models have shown potential performance in many tasks, which typically focus on generating samples that closely adhere to a given distribution, often overlooking the requirement to produce optimal diverse solutions in a batch diversity. Recognizing that maintaining ``diversity" has been a longstanding challenge in multiobjective optimization, we were inspired to introduce a multiobjective optimization approach to enhance diversity in a single pass. This paper utilizes the in-betweening human motion generation task as an example and introduces the multiobjective generative models to demonstrate the effectiveness of the proposed method in producing diverse and smooth human motion sequences. The resulting method, termed the \textit{Multiobjective Generation Framework with In-Betweening Motion Model} (MGF-IMM), frames the human motion in-betweening task as a bi-objective optimization problem. The designed in-betweening motion model is then integrated into a nondominated sorting-based optimization framework to address this bi-objective optimization problem. Through comprehensive qualitative and quantitative experiments, MGF-IMM has demonstrated state-of-the-art performance, surpassing the latest methods and validating its superiority in generating diverse in-betweening human motions.

741A Versatile Influence Function for Data Attribution with Non-Decomposable Loss

[openreview] [pdf]

Abstract Influence function, a technique rooted in robust statistics, has been adapted in modern machine learning for a novel application: data attribution---quantifying how individual training data points affect a model’s predictions. However, the common derivation of influence functions in the data attribution literature is limited to loss functions that decompose into a sum of individual data point losses, with the most prominent examples known as M-estimators. This restricts the application of influence functions to more complex learning objectives, which we refer to as non-decomposable losses, such as contrastive or ranking losses, where a unit loss term depends on multiple data points and cannot be decomposed further. In this work, we bridge this gap by revisiting the general formulation of influence function from robust statistics, which extends beyond M-estimators. Based on this formulation, we propose a novel method, the Versatile Influence Function (VIF), that can be straightforwardly applied to machine learning models trained with any non-decomposable loss. In comparison to the classical approach in statistics, the proposed VIF is designed to fully leverage the power of auto-differentiation, hereby eliminating the need for case-specific derivations of each loss function. We demonstrate the effectiveness of VIF across three examples: Cox regression for survival analysis, node embedding for network analysis, and listwise learning-to-rank for information retrieval. In all cases, the influence estimated by VIF closely resembles the results obtained by brute-force leave-one-out retraining, while being up to 1000 times faster to compute. We believe VIF represents a significant advancement in data attribution, enabling efficient influence-function-based attribution across a wide range of machine learning paradigms, with broad potential for practical use cases.

742Compressed Decentralized Learning with Error-Feedback under Data Heterogeneity

[openreview] [pdf]

Abstract Decentralized learning distributes the training process across multiple nodes, enabling collaborative model training without relying on a central server. Each node performs local training using its own data, with model updates exchanged directly between connected nodes within a given network topology. Various algorithms have been developed within this decentralized learning framework and have been proven to converge under specific assumptions. However, two key challenges remain: 1) ensuring robust performance with both a high degree of gradient compression and data heterogeneity, and 2) providing a general convergence upper bound under commonly used assumptions. To address these challenges, we propose theDiscounted Error-Feedback Decentralized Parallel Stochastic Gradient Descent (DEFD-PSGD)algorithm, which efficiently manages both high levels of gradient compression and data heterogeneity, without sacrificing communication efficiency. The core idea is to introduce controllable residual error feedback that effectively balances the impact of gradient compression and data heterogeneity. Additionally, we develop novel proof techniques to derive a convergence upper bound under relaxed assumptions. Finally, we present experimental results demonstrating that DEFD-PSGD outperforms other state-of-the-art decentralized learning algorithms, particularly in scenarios involving high compression and significant data heterogeneity.

743Empowering Teachers with Enhanced Knowledge via Variable Scale Distillation Framework

[openreview] [pdf]

Abstract Knowledge distillation, a widely used model compression technique, enables a smaller student network to replicate the performance of a larger teacher network by transferring knowledge, typically in the form of softened class probabilities or feature representations. However, current approaches often fail to maximize the teacher’s feature extraction capabilities, as they treat the semantic information transfer between teacher and student as equal. This paper presents a novel framework that addresses this limitation by enhancing the teacher’s learning process through the Variable Scale Distillation Framework. Central to our approach is the Rescale Block, which preserves scale consistency during hierarchical distillation, allowing the teacher to extract richer, more informative features. In extensive experiments on the CIFAR100 dataset, our method consistently outperforms state-of-the-art distillation techniques, achieving an average accuracy improvement of 2.12%. This demonstrates the effectiveness of our approach in fully leveraging the teacher’s capacity to guide the student, pushing the boundaries of knowledge distillation.

744PaI is getting competitive by training longer

[openreview] [pdf]

Abstract The success of iterative pruning methods in achieving state-of-the-art sparse networks has largely been attributed to improved mask identification and an implicit regularization induced by pruning. We challenge this hypothesis and instead posit that their increased training epochs enable improved optimization. To verify this, we show that pruning at initialization (PaI) is significantly boosted by increased training epochs with repeating (cyclic) learning rate schedules akin to iterative pruning, even outperforming standard iterative pruning methods. The dominant mechanism how this is achieved, as we conjecture, can be attributed to a better exploration of the loss landscape leading to a lower training loss. However, at high sparsity, increased training alone is not enough for competitive performance. A strong coupling between learnt parameter initialization and mask seems to be required. Standard methods obtain this coupling via expensive pruning-training iterations, starting from a dense network. To achieve this with sparse training instead, we propose SCULPT-ing, i.e., cyclic training of any sparse mask followed by a single pruning step to couple the parameters and the mask, which is able to match the performance of state-of-the-art iterative pruning methods in the high sparsity regime at reduced computational cost.

745LLMs Can Plan Only If We Tell Them

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated significant capabilities in natural language processing and reasoning, yet their effectiveness in autonomous planning has been under debate. While existing studies have utilized LLMs with external feedback mechanisms or in controlled environments for planning, these approaches often involve substantial computational and development resources due to the requirement for careful design and iterative backprompting. Moreover, even the most advanced LLMs like GPT-4 struggle to match human performance on standard planning benchmarks, such as the Blocksworld, without additional support. This paper investigates whether LLMs can independently generate long-horizon plans that rival human baselines. Our novel enhancements help achieve state-of-the-art results in planning benchmarks out-competing prior methods and human baselines all autonomously.

746Model Extrapolation Expedites Alignment

[openreview] [pdf]

Abstract As the alignment training of large language models (LLMs) usually requires expensive computational resources, exploring more efficient alignment methods to reduce training overhead has always been an important and compelling research challenge. Inspired by prior work onmodel interpolation, we present a simple method calledExPO (model extrapolation)to expedite the alignment of LLMs with human preferences. Based on our observation that interpolating the weights between existing DPO/RLHF models and their initial SFT checkpoints usually produces new models with intermediate performance, we propose to treat a partially-trained model M1\mathcal{M}_1 (corresponding to the intermediate-performing model) as the interpolated result between the initial SFT checkpoint M0\mathcal{M}_0 and a hypothetical better-aligned model M2\mathcal{M}_2. Thus, we can obtain the hypothetical M2\mathcal{M}_2 by simply extrapolating the model weights along the direction from M0\mathcal{M}_0 to M1\mathcal{M}_1, which consequently saves the additional training overhead for M1\mathcal{M}_1 to reach better alignment performance. We validate our hypothesis through controlled experiments, demonstrating that ExPO can boost a DPO model trained with only 20% steps to outperform the fully-trained one. Additionally, we show that ExPO can also notably improve existing open-source LLMs (ranging from 1.8B to 70B parameters), as evidenced by evaluations on the mainstream LLM benchmarks AlpacalEval 2.0 and MT-Bench, which further highlights ExPO’s utility and potential in enabling more efficient LLM alignment.

747Diffusion models for Gaussian distributions: Exact solutions and Wasserstein errors

[openreview] [pdf]

Abstract Diffusion or score-based models recently showed high performance in image generation. They rely on a forward and a backward stochastic differential equations (SDE). The sampling of a data distribution is achieved by solving numerically the backward SDE or its associated flow ODE. Studying the convergence of these models necessitates to control four different types of error: the initialization error, the truncation error, the discretization and the score approximation. In this paper, we study theoretically the behavior of diffusion models and their numerical implementation when the data distribution is Gaussian. In this restricted framework where the score function is a linear operator, we derive the analytical solutions of the backward SDE and the probability flow ODE. We prove that these solutions and their discretizations are all Gaussian processes, which allows us to compute exact Wasserstein errors induced by each error type for any sampling scheme. Monitoring convergence directly in the data space instead of relying on Inception features, our experiments show that the recommended numerical schemes from the diffusion models literature are also the best sampling schemes for Gaussian distributions.

748Channel-wise Influence: Estimating Data Influence for Multivariate Time Series

[openreview] [pdf]

Abstract The influence function, a robust statistics technique, is an effective post-hoc method that measures the impact of modifying or removing training data on model parameters, offering valuable insights into model interpretability without requiring costly retraining. It would provide extensions like increasing model performance, improving model generalization, and offering interpretability. Recently, Multivariate Time Series (MTS) analysis has become an important yet challenging task, attracting significant attention. However, there is no preceding research on the influence functions of MTS to shed light on the effects of modifying the channel of MTS. Given that each channel in an MTS plays a crucial role in its analysis, it is essential to characterize the influence of different channels. To fill this gap, we propose a channel-wise influence function, which is the first method that can estimate the influence of different channels in MTS, utilizing a first-order gradient approximation. Additionally, we demonstrate how this influence function can be used to estimate the influence of a channel in MTS. Finally, we validated the accuracy and effectiveness of our influence estimation function in critical MTS analysis tasks, such as MTS anomaly detection and MTS forecasting. According to abundant experiments on real-world datasets, the original influence function performs worse than our method and even fails for the channel pruning problem, which demonstrates the superiority and necessity of the channel-wise influence function in MTS analysis.

749Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View

[openreview] [pdf]

Abstract Training language models currently requires pre-determining a fixed compute budget because the typical cosine learning rate schedule depends on the total number of steps. In contrast, the Warmup-Stable-Decay (WSD) schedule uses a constant learning rate to produce a main branch of iterates that can in principle continue indefinitely without a pre-specified compute budget. Then, given any compute budget, one can branch out from the main branch at a proper at any time with a rapidly decaying learning rate to produce a strong model. Empirically, WSD generates an intriguing, non-traditional loss curve: the loss remains elevated during the stable phase but sharply declines during the decay phase. Towards explaining this phenomenon, we conjecture that pretraining loss exhibits a river valley landscape, which resembles a deep valley with a river at its bottom. Under this assumption, we show that during the stable phase, the iterate undergoes large oscillations due to the high learning rate, yet it progresses swiftly along the river. During the decay phase, the rapidly dropping learning rate minimizes the iterate’s oscillations, moving it closer to the river and revealing true optimization progress. Therefore, the sustained high learning rate phase and fast decaying phase are responsible for progress in the river and the mountain directions, respectively, and are both critical. Our analysis predicts phenomenons consistent with empirical observations and shows that this landscape can naturally emerge from pretraining on a simple bi-gram dataset. Inspired by the theory, we introduce WSD-S, a variant of WSD that reuses previous checkpoints’ decay phases and keeps only one main branch, where we resume from a decayed checkpoint. WSD-S empirically outperforms WSD and Cyclic-Cosine in obtaining multiple pretrained language model checkpoints across various compute budgets in a single run for parameters scaling from 0.1B to 1.2B.

750Addressing Label Shift in Distributed Learning via Entropy Regularization

[openreview] [pdf]

Abstract We address the challenge of minimizingtrue riskin multi-node distributed learning. These systems are frequently exposed to both inter-node and intra-nodelabel shifts, which present a critical obstacle to effectively optimizing model performance while ensuring that data remains confined to each node. To tackle this, we propose the Versatile Robust Label Shift (VRLS) method, which enhances the maximum likelihood estimation of the test-to-train label density ratio. VRLS incorporates Shannon entropy-based regularization and adjusts the density ratio during training to better handle label shifts at the test time. In multi-node learning environments, VRLS further extends its capabilities by learning and adapting density ratios across nodes, effectively mitigating label shifts and improving overall model performance. Experiments conducted on MNIST, Fashion MNIST, and CIFAR-10 demonstrate the effectiveness of VRLS, outperforming baselines by up to 20% in imbalanced settings. These results highlight the significant improvements VRLS offers in addressing label shifts. Our theoretical analysis further supports this by establishing high-probability bounds on estimation errors.

751Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

[openreview] [pdf]

Abstract Large language models (LLMs), optimized through human feedback, have rapidly emerged as a leading paradigm for developing intelligent conversational assistants. However, despite their strong performance across many benchmarks, LLM-based agents might still lack conversational skills such as disambiguation -- when they are faced with ambiguity, they often overhedge or implicitly guess users’ true intents rather than asking clarification questions. Under task-specific settings, high-quality conversation samples are often limited, constituting a bottleneck for LLMs’ ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO), that enables data-efficient dialogue policy learning in multi-turn conversation modeling. We demonstrate ACT’s efficacy under in data-efficient tuning scenarios, even when there is no action label available, using multiple real-world conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for complex SQL generation towards data analysis agents. Additionally, we propose evaluating LLMs’ ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard tuning approaches like supervised fine-tuning and DPO.

752Theory on Mixture-of-Experts in Continual Learning

[openreview] [pdf]

Abstract Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.

753Learning in complex action spaces without policy gradients

[openreview] [pdf]

Abstract Conventional wisdom suggests that policy gradient methods are better suited to complex action spaces than action-value methods. However, foundational studies have shown equivalences between these paradigms in small and finite action spaces (O’Donoghue et al., 2017; Schulman et al., 2017a). This raises the question of why their computational applicability and performance diverge as the complexity of the action space increases. We hypothesize that the apparent superiority of policy gradients in such settings stems not from intrinsic qualities of the paradigm, but from universal principles that can also be applied to action-value methods to serve similar functionality. We identify three such principles and provide a framework for incorporating them into action-value methods. To support our hypothesis, we instantiate this framework in what we term QMLE, for Q-learning with maximum likelihood estimation. Our results show that QMLE can be applied to complex action spaces with a controllable computational cost that is comparable to that of policy gradient methods, all without using policy gradients. Furthermore, QMLE demonstrates strong performance on the DeepMind Control Suite, even when compared to the state-of-the-art methods such as DMPO and D4PG.

754CycleVTON: Improving Diffusion-Based Virtual Try-On with Cycle-Consistent Training

[openreview] [pdf]

Abstract We present CycleVTON, a cycle-consistent diffusion-based virtual try-on framework. Unlike existing methods that rely on a single try-on network, our model consists of two conjugated networks. In addition to the regular try-on network, we design a clothing extraction network that extracts the clothing worn by the person and standardizes it into a front-facing format. These two networks are symmetrical, enabling alignment between the generated dressed human and real images of dressed human, as well as between the extracted clothing and its front-facing ground truth. This cycle-consistent optimization strategy allows for enhanced retention of clothing textures and structures, ensuring a more realistic and accurate clothing generation in virtual try-on scenarios. Moreover, the conjugated network structure not only supports traditional virtual try-on but also allows flexible clothing extraction and clothing exchange between different individuals. The experiments on VITON-HD demonstrate the effectiveness of our approach.

755DIAR: Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation

[openreview] [pdf]

Abstract We propose a novel offline reinforcement learning (offline RL) approach, introducing the Diffusion-model-guided Implicit Q-learning with Adaptive Revaluation (DIAR) framework. We address two key challenges in offline RL: out-of-distribution samples and long-horizon problems. We leverage diffusion models to learn state-action sequence distributions and incorporate value functions for more balanced and adaptive decision-making. DIAR introduces an Adaptive Revaluation mechanism that dynamically adjusts decision lengths by comparing current and future state values, enabling flexible long-term decision-making. Furthermore, we address Q-value overestimation by combining Q-network learning with a value function guided by a diffusion model. The diffusion model generates diverse latent trajectories, enhancing policy robustness and generalization. As demonstrated in tasks like Maze2D, AntMaze, and Kitchen, DIAR consistently outperforms state-of-the-art algorithms in long-horizon, sparse-reward environments.

756Forgetting Order of Continual Learning: What is Learned First is Forgotten Last

[openreview] [pdf]

Abstract Catastrophic forgetting poses a significant challenge in continual learning, where models often forget previous tasks when trained on new data. Our empirical analysis reveals a strong correlation between catastrophic forgetting and the learning speed of examples: examples learned early are rarely forgotten, while those learned later are more susceptible to forgetting. We demonstrate that replay-based continual learning methods can leverage this phenomenon by focusing on mid-learned examples for rehearsal. We introduce Goldilocks, a novel replay buffer sampling method that filters out examples learned too quickly or too slowly, keeping those learned at an intermediate speed. Goldilocks improves existing continual learning algorithms, leading to state-of-the-art performance across several image classification tasks.

757SFW sampling for diffusion models via external conditioning

[openreview] [pdf]

Abstract Score-based generative models (SBM), also known as diffusion models, are the de facto state of the art for image synthesis. Despite their unparalleled performance, SBMs have recently been in the spotlight for being tricked into creating not-safe-for-work (NSFW) content, such as violent images and non-consensual nudity. This article proposes a safe-for-work (SFW) sampler for SBMs implementing a Conditional Trajectory Correction step that guides the samples away from undesired regions in the ambient space using external multimodal models as the source of conditioning. Furthermore, using Contrastive Language Image Pre-training (CLIP), our method admits user-defined NSFW classes, which can vary in different settings. Our experiments on the text-to-image SBM Stable Diffusion validate that the proposed SFW sampler effectively reduces the generation of explicit content, as assessed via independent NSFW detectors. Furthermore, the proposed correction comes at a minor cost in image quality and has an almost null effect on samples that do not need correction. Our study confirms the suitability of the SFW sampler towards aligned SBM models.

758Teaching Transformers Causal Reasoning through Axiomatic Training

[openreview] [pdf]

Abstract For text-based AI systems to interact in the real world, causal reasoning is an essential skill. Since active interventions are costly to execute, we study to what extent an agent can learn causal reasoning from symbolic demonstrations of causal axioms. Specifically, we consider an axiomatic training setup where an agent learns from multiple demonstrations of a causal axiom (or rule), rather than incorporating the axiom as an inductive bias or inferring it from data values. A key question is whether the agent would learn to generalize from the axiom demonstrations to new scenarios. For example, if a transformer model is trained on demonstrations of the causal transitivity axiom over small graphs, would it generalize to applying the transitivity axiom over large graphs? Our results, based on a novel axiomatic training scheme, indicate that such generalization is possible. For the transitivity axiom, we find that a 67 million parameter transformer model, when trained on linear causal chains (along with some noisy variations) can generalize well to new kinds of graphs, including longer causal chains, causal chains with reversed order, and graphs with branching; even when it is not explicitly trained for such settings. We extend axiomatic training to a harder task of inferring causation from correlation statements and find similar generalization. On both tasks, our model performs at par (or even better) than many larger language models such as GPT-4, Gemini Pro, and Phi-3. Overall, our axiomatic training framework provides a new paradigm of learning causal reasoning in language models that can be extended to arbitrary axioms, as long as sufficient demonstrations can be generated.

759Lookahead Shielding for Regular Safety Properties in Reinforcement Learning

[openreview] [pdf]

Abstract To deploy reinforcement learning (RL) systems in real-world scenarios we need to consider requirements such as safety and constraint compliance, rather than blindly maximizing for reward. In this paper we develop a lookahead shielding framework for RL with regular safety properties, which on the contrary to prior shielding methodologies requires minimal prior knowledge. At each environment step our framework aims to satisfy the regular safety property for a bounded horizon with high-probability, for the tabular setting we provide provable guarantees. We compare our setup to some common algorithms developed for the constrained Markov decision process (CMDP), and we demonstrate the effectiveness and scalability of our framework by extensively evaluating our framework in both tabular and deep RL environments.

760Target-Oriented Soft-Robust Inverse Reinforcement Learning

[openreview] [pdf]

Abstract In imitation learning, when the learning agent is at a state that is outside the demonstration of the expert, it could be difficult for her to choose an action. To overcome this challenge, inverse reinforcement learning (IRL) learns a parameterized reward function based on which we can generalize the expert’s behavior to those states that are unseen in the demonstration. However, on the one hand, there could be multiple reward functions that can explain the expert’s behavior, leading to reward ambiguity in IRL. On the other hand, though we often consider the transition kernel of the expert to be known to the agent, sometimes the transition kernel of the agent is different from the expert’s and is unknown, leading to transition kernel ambiguity in IRL. Drawing on the notion of soft-robust optimization, we build a target-oriented soft-robust IRL (SRIRL) model where the performance of the output policy strikes a flexible balance between risk aversion and expected return maximization towards reward uncertainty in IRL. Moreover, by employing the robust satisficing framework, our SRIRL is also robust to transition kernel ambiguity in IRL. In our target-oriented SRIRL, we keep a target for the performance of the output policy that balances expected return and risk, and we minimize the constraint violation incurred by the difference between the ambiguous transition kernel and the empirical one. We derive tractable reformulation for SRIRL, and we design tailored first-order methods for SRIRL. Numerical results showcase the soft robustness towards reward uncertainty and the robustness against transition kernel ambiguity of SRIRL, as well as the stronger scalability of our first-order methods compared to a state-of-the-art commercial solver.

761Uncertainty-Based Abstention in LLMs Improves Safety and Reduces Hallucinations

[openreview] [pdf]

Abstract A major barrier to the practical deployment of large language models (LLMs) is their lack of reliability. Three situations where this is particularly apparent are correctness, hallucinations when given unanswerable questions, and safety where responses are harmful or offensive. In all three cases, models should ideally abstain from responding---much like humans refrain from answering questions when uncertain. Inspired by analogous approaches in classification, this study explores the feasibility and efficacy of LLMs abstaining when uncertain in the domain of question-answering. We investigate two kinds of uncertainties, statistical uncertainty metrics and a distinct verbalized measure, termed as In Dialogue Uncertainty (InDU), measuring hedge words such as `I don’t know’ in responses. Using these uncertainty measures combined with models with and without reinforcement learning with human feedback (RLHF), we show in all three situations, abstention based on the right kind of uncertainty measure can boost the reliability of LLMs. By abstaining for a few highly uncertain samples we improve correctness by up to 8%, avoid 50% of hallucinations by correctly identifying unanswerable questions, and in particular increase safety by 70-99% with almost no additional computational overhead.

762ON EXTRAPOLATION IN MATERIAL PROPERTY REGRESSION

[openreview] [pdf]

Abstract Deep learning methods have yielded exceptional performances in material property regression (MPR). However, most existing methods operate under the assumption that the training and test are independent and identically distributed (i.i.d.). This overlooks the importance of extrapolation - predicting material properties beyond the range of training data - which is essential for advanced material discovery, as researchers strive to identify materials with exceptional properties that exceed current capabilities. In this paper, we address this gap by introducing a comprehensive benchmark comprising seven tasks specifically designed to evaluate extrapolation in MPR. We critically evaluate existing methods including deep imbalanced regression (DIR) and regression data augmentation (DA) methods, and reveal their limitations in extrapolation tasks. To address these issues, we propose the Matching-based EXtrapolation (MEX) framework, which reframes MPR as a material-property matching problem to alleviate the inherent complexity of the direct material-to-label mapping paradigm for better extrapolation. Our experimental results show that MEX outperforms all existing methods on our benchmark and demonstrates exceptional capability in identifying promising materials, underscoring its potential for advancing material discovery.

763Combating the Generalization-Forgetting Trade-off in Continual Learning: A Cautious Passive Low-Rank Approach

[openreview] [pdf]

Abstract Large Language Models (LLMs) have shown remarkable capabilities through wide-scale pre-training on a wide range of domains. However, they often suffer from catastrophic forgetting when learning sequential tasks. In this paper, we propose a novel parameter-efficient approach for continual learning in LLMs, which empirically explores the role of different effective layerwise ranks, leveraging lower ranks to mitigate catastrophic forgetting of previous tasks and higher ranks to enhance generalization on new tasks. By employing a subspace similarity metric that evaluates the orthogonality of low-rank subspaces between tasks, we gradually increase the rank of layerwise matrices for each new task, minimizing interference with previously learned tasks while enhancing generalization. Experimental results on standard continual learning benchmarks and challenging math benchmarks demonstrate that our method outperforms existing state-of-the-art approaches, effectively mitigating forgetting, improving task performance, and maintaining strong generalization to unseen tasks in a memory-efficient manner.

764Mitigating Generative Privacy Risks of Diffusion Models via Mixed Self-Synthesized Data Fine-tuning

[openreview] [pdf]

Abstract Diffusion models (DMs) have demonstrated exceptional performance across various generative tasks, yet they also face significant security and privacy concerns, such as Membership Inference Attacks (MIAs), where adversaries attempt to determine whether specific images were part of the DM’s training set. These threats present serious risks, particularly as pre-trained DMs are increasingly accessible online. To address these privacy concerns, we begin by investigating how fine-tuning DMs on a manipulated self-synthesized dataset affects their generative privacy risks, and have the following observations: (1) DMs fine-tuned solely on self-synthesized clean images are more vulnerable to privacy attacks (2) DMs fine-tuned on perturbed self-synthesized images become more robust against privacy attacks but exhibit degraded image generation quality. Based on the observations, we propose MixSyn, a simple and effective framework designed to mitigate privacy risks by fine-tuning DMs on a mixed self-synthesized dataset, which is a mixture of clean and perturbed synthetic images. Extensive experimental results demonstrate that our method significantly mitigates the generative privacy risks of DMs while preserving their original image generation quality.

765Generalization Gradient Descent

[openreview] [pdf]

Abstract We propose a new framework for evaluating the relationship between features and generalization via a theoretical analysis of the out-of-distribution (OOD) generalization problem, in which we simultaneously use two mathematical methods: a generalization ratio that quantitatively characterizes the degree of generalization, and a generalization decision process (GDP) that formalizes the relationship of loss between seen and unseen domains. By combining the concepts of informativeness and variation in the generalization ratio, we intuitively associate them with OOD problems to derive the generalization inequality. We then introduce it to the GDP to select the best loss from seen domains to gradient descent for backpropagation. In the case where the classifier is defined by fully connected neural network, the entire system is trained with backpropagation. There is no need for any model selection criterion or operating on gradients during training. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generalization ability.

766Provable Post-Deployment Deterioration Monitoring

[openreview] [pdf]

Abstract Data distribution often changes when deploying a machine learning model into a new environment, but not all shifts degrade model performance, making interventions like retraining unnecessary. This paper addresses model post-deployment deterioration (PDD) monitoring in the context of unlabeled deployment distributions. We formalize unsupervised PDD monitoring within the model disagreement framework where deterioration is detected if an auxiliary model, performing well on training data, shows significant prediction disagreement with the deployed model on test data. We propose D-PDDM, a principled monitoring algorithm achieving low false positive rates under non-deteriorating shifts and provide sample complexity bounds for high true positive rates under deteriorating shifts. Empirical results on both standard benchmark and a real-world large-scale healthcare dataset demonstrate the effectiveness of the framework in addition to its viability as an alert mechanism for existing high-stakes ML pipelines.

767Weighted-Rank Contrastive Regression for Robust Learning on Imbalance Social Media Popularity Prediction

[openreview] [pdf]

Abstract Social Media Popularity Prediction (SMPP) is the task of forecasting the level of engagement a social media post will receive. It is crucial for understanding audience engagement and enabling targeted marketing strategies. However, the inherent imbalance in real-world social media data, where certain popularity levels are underrepresented, poses a significant challenge. In this study, we leveraged the recent success of contrastive learning and its growing integration into regression tasks by introducing a Weighted-Rank CR loss to address the data imbalance challenges. Experiments on the Social Media Prediction Dataset demonstrated that our method outperformed the vanilla approach and the current state-of-the-art contrastive regression approach Rank-N-Contrast.

768Endless Jailbreaks with Bijection Learning

[openreview] [pdf]

Abstract Despite extensive safety training, LLMs are vulnerable to adversarial inputs. In this work, we introduce a simple but powerful attack paradigm, bijection learning, that yields a practically endless set of jailbreak prompts. We exploit language models’ advanced reasoning capabilities to teach them invertible languages (bijections) in context, pass encoded queries to the model to bypass built-in safety mechanisms, and finally decode responses back into English, yielding helpful replies to harmful requests. Our approach proves effective on a wide range of frontier language models and harm categories. Bijection learning is an automated and universal attack that grows stronger with scale: larger models with more advanced reasoning capabilities are more susceptible to bijection learning jailbreaks despite stronger safety mechanisms.

769Global Convergence of Policy Gradient in Average Reward MDPs

[openreview] [pdf]

Abstract We present the first comprehensive finite-time global convergence analysis of policy gradient for infinite horizon average reward Markov decision processes (MDPs). Specifically, we focus on ergodic tabular MDPs with finite state and action spaces. Our analysis shows that the policy gradient iterates converge to the optimal policy at a sublinear rate of O(1T),O({\frac{1}{T}}), which translates to O(log(T))O({\log(T)}) regret, where TT represents the number of iterations. Performance bounds for discounted reward MDPs cannot be easily extended to average reward MDPs as the bounds grow proportional to the fifth power of the effective horizon. Recent work on such extensions make a smoothness assumption that has not been verified. Thus, our primary contribution is in providing the first complete proof that the policy gradient algorithm converges globally for average-reward MDPs, without such an assumption. We also obtain the corresponding finite-time performance guarantees. In contrast to the existing discounted reward performance bounds, our performance bounds have an explicit dependence on constants that capture the complexity of the underlying MDP. Motivated by this observation, we reexamine and improve the existing performance bounds for discounted reward MDPs. We also present simulations which empirically validate the result.

770Entropy-Based Aggregation for Fair and Effective Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) enables collaborative model training across distributed devices while preserving data privacy. Nonetheless, the heterogeneity of edge devices often leads to inconsistent performance of the globally trained models, resulting in unfair outcomes among users. Existing federated fairness algorithms strive to enhance fairness but often fall short in maintaining the overall performance of the global model, typically measured by the average accuracy across all clients. To address this issue, we propose a novel algorithm that leverages entropy-based aggregation combined with model and gradient alignments to simultaneously optimize fairness and global model performance. Our method employs a bi-level optimization framework, where we derive an analytic solution to the aggregation probability in the inner loop, making the optimization process computationally efficient. Additionally, we introduce an innovative alignment update and an adaptive strategy in the outer loop to further balance global model’s performance and fairness. Theoretical analysis indicates that our approach guarantees convergence even in non-convex FL settings and demonstrates significant fairness improvements in generalized regression and strongly convex models. Empirically, our approach surpasses state-of-the-art federated fairness algorithms, ensuring consistent performance among clients while improving the overall performance of the global model.

771FedMAP: Unlocking Potential in Personalized Federated Learning through Bi-Level MAP Optimization

[openreview] [pdf]

Abstract Federated Learning (FL) enables collaborative training of machine learning (ML) models on decentralized data while preserving data privacy. However, data across clients often differs significantly due to class imbalance, feature distribution skew, sample size imbalance, and other phenomena. Using information from these not identically distributed (non-IID) datasets causes challenges in training. Existing FL methods based on a single global model cannot effectively capture client data variations, resulting in suboptimal performance. Personalized FL (PFL) techniques were introduced to adapt to the local data distribution of each client and utilize the data from other clients. They have shown promising results in addressing these challenges. We propose FedMAP, a novel Bayesian PFL framework which applies Maximum A Posteriori (MAP) estimation to effectively mitigate various non-IID data issues, by means of a parametric prior distribution, which is updated during aggregation. We provide a theoretical foundation illustrating FedMAP’s convergence properties. In particular, we prove that the prior updates in FedMAP correspond to gradient descent iterations for a linear combination of envelope functions associated with the local losses. This differs from previous FL approaches, that aim at minimizing a weighted average of local loss functions and often face challenges with heterogeneous data distributions, resulting in reduced client performance and slower convergence in non-IID settings. Finally, we show, through evaluations of synthetic and real-world datasets, that FedMAP achieves better performance than the existing methods. Moreover, we offer a robust, ready-to-use framework to facilitate practical deployment and further research.

772Direct Preference Optimization With Unobserved Preference Heterogeneity

[openreview] [pdf]

Abstract RLHF has emerged as a pivotal step in aligning language models with human objectives and values. It typically involves learning a reward model from human preference data and then using reinforcement learning to update the generative model accordingly. Conversely, Direct Preference Optimization (DPO) directly optimizes the generative model with preference data, skipping reinforcement learning. However, both RLHF and DPO assume uniform preferences, overlooking the reality of diverse human annotators. This paper presents a new method to align generative models with varied human preferences. We propose an Expectation-Maximization adaptation to DPO, generating a mixture of models based on latent preference types of the annotators. We then introduce a min-max regret ensemble learning model to produce a single generative method to minimize worst-case regret among annotator subgroups with similar latent factors. Our algorithms leverage the simplicity of DPO while accommodating diverse preferences. Experimental results validate the effectiveness of our approach in producing equitable generative policies.

773STABLE DIFFUSION MODELS ARE SECRETLY GOOD AT VISUAL IN-CONTEXT LEARNING

[openreview] [pdf]

Abstract Large language models (LLM) in natural language processing (NLP) have demonstrated great potential for in-context learning (ICL) -- the ability to leverage a few set of example prompts to adapt to various tasks without having to explicitly update model weights. ICL has recently been explored for the visual domain with promising early outcomes. These approaches involve specialized training and/or additional data which complicate the process and limit its generalizability. In this work, we show that off-the-shelf Stable Diffusion models can be re-purposed for visual in-context learning (V-ICL). Specifically, we formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture that explicitly incorporates context between the query and example prompts. Without any additional fine-tuning, we show that this re-purposed Stable Diffusion model is able to adapt to six different tasks: foreground segmentation, single object detection, semantic segmentation, keypoint detection, edge detection, and colorization. For example, the proposed approach improves the mean intersection over union (mIoU) for the foreground segmentation task on Pascal-5i dataset by 8.9% and 3.2% over recent methods such as Visual Prompting and IMProv, respectively. Additionally, we show that the proposed method is able to effectively leverage multiple prompts through ensembling to infer the task better and further improve the performance across all tasks.

774Jacobian Descent for Multi-Objective Optimization

[openreview] [pdf]

Abstract Many optimization problems require balancing multiple conflicting objectives. As gradient descent is limited to single-objective optimization, we introduce its direct generalization: Jacobian descent (JD). This algorithm iteratively updates parameters using the Jacobian matrix of a vector-valued objective function, in which each row is the gradient of an individual objective. While several methods to combine gradients already exist in the literature, they are generally hindered when the objectives conflict. In contrast, we propose projecting gradients to fully resolve conflict while ensuring that they preserve an influence proportional to their norm. We prove significantly stronger convergence guarantees with this approach, supported by our empirical results. Our method also enables instance-wise risk minimization (IWRM), a novel learning paradigm in which the loss of each training example is considered a separate objective. Applied to simple image classification tasks, IWRM exhibits promising results compared to the direct minimization of the average loss. Additionally, we outline an efficient implementation of JD using the Gramian of the Jacobian matrix to reduce time and memory requirements.

775Towards Off-Road Autonomous Driving via Planner Guided Policy Optimization

[openreview] [pdf]

Abstract Off-road autonomous driving poses significant challenges such as navigating diverse terrains, avoiding obstacles, and maneuvering through ditches. Addressing these challenges requires effective planning and adaptability, making it a long-horizon planning and control problem. Traditional model-based control techniques like Model Predictive Path Integral (MPPI) require dense sampling and accurate modeling of the vehicle-terrain interaction, both of which are computationally expensive, making effective long-horizon planning in real-time intractable. Reinforcement learning (RL) methods operate without this limitation and are computationally cheaper at deployment. However, exploration in obstacle-dense and challenging terrains is difficult, and typical RL techniques struggle to navigate in these terrains. To alleviate the limitations of MPPI, we propose a hierarchical autonomy pipeline with a low-frequency high-level MPPI planner and a high-frequency low-level RL controller. To tackle RL’s exploration challenge, we propose a teacher-student paradigm to learn an end-to-end RL policy, capable of real-time execution and traversal through challenging terrains. The teacher policy is trained using dense planning information from an MPPI planner while the student policy learns to navigate using visual inputs and sparse planning information. In this framework, we introduce a new policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. We demonstrate our performance in a realistic off-road simulator against various RL and imitation learning methods.

776WMAdapter: Adding WaterMark Control to Latent Diffusion Models

[openreview] [pdf]

Abstract Watermarking is essential for protecting the copyright of AI-generated images. We propose WMAdapter, a diffusion model watermark plugin that embeds user-specified watermark information seamlessly during the diffusion generation process. Unlike previous methods that modify diffusion modules to incorporate watermarks, WMAdapter is designed to keep all diffusion components intact, resulting in sharp, artifact-free images. To achieve this, we introduce two key innovations: (1) We develop a contextual adapter that conditions on the content of the cover image to generate adaptive watermark embeddings. (2) We implement an additional finetuning step and a hybrid finetuning strategy that suppresses noticeable artifacts while preserving the integrity of the diffusion components. Empirical results show that WMAdapter provides strong flexibility, superior image quality, and competitive watermark robustness.

777Ctrl123: Consistent Novel View Synthesis via Closed-Loop Transcription

[openreview] [pdf]

Abstract Based on the success of large image diffusion models, multi-view diffusion models have demonstrated remarkable zero-shot capability in novel view synthesis (NVS). However, the pioneering work Zero123 struggles to maintain consistency across generated multiple views. While recent modifications in model and training design have improved multi-view consistency, they often introduce new limitations, such as restricted fixed view generation or reliance on additional conditions. These constraints hinder the broader application of multi-view diffusion models in downstream tasks like 3D reconstruction. We identify the root cause of inconsistency as the excessive diversity inherent in generative models utilized for the NVS task. To address this, we aim to utilize the stronger supervise information to better alignment with ground truth images to constrain the diversity, and propose Ctrl123, aclosed-looptranscription-based multi-view diffusion method that enforces alignment in the CLIP patch feature space. Extensive experiments demonstrate that Ctrl123 excels inarbitrarynovel view generation, significantly improving multi-view consistency compared to existing methods.

778xTED: Cross-Domain Adaptation via Diffusion-Based Trajectory Editing

[openreview] [pdf]

Abstract Reusing pre-collected data from different domains is an appealing solution for decision-making tasks that have insufficient data in the target domain but are relatively abundant in other related domains. Existing cross-domain policy transfer methods mostly aim at learning domain correspondences or corrections to facilitate policy learning, such as learning domain/task-specific discriminators, representations, or policies. This design philosophy often results in heavy model architectures or task/domain-specific modeling, lacking flexibility. This reality makes us wonder: can we directly bridge the domain gaps universally at the data level, instead of relying on complex downstream cross-domain policy transfer models? In this study, we propose theCross-DomainTrajectoryEDiting (xTED) framework that employs a specially designed diffusion model for cross-domain trajectory adaptation. Our proposed model architecture effectively captures the intricate dependencies among states, actions, and rewards, as well as the dynamics patterns within target data. By utilizing the pre-trained diffusion as a prior, source domain trajectories can be transformed to match with target domain properties while preserving original semantic information. This process implicitly corrects underlying domain gaps, enhancing state realism and dynamics reliability in the source data, and allowing flexible incorporation with various downstream policy learning methods. Despite its simplicity, xTED demonstrates superior performance in extensive simulation andreal-robot experiments.

779Enhancing Logits Distillation with Plug&Play Kendall’sτRanking Loss

[openreview] [pdf]

Abstract Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the output of the student model to precisely match the soft labels provided by the teacher model. However, the optimization process of KL divergence is challenging for the student and prone to suboptimal points. Also, we demonstrate that the gradients provided by KL divergence depend on channel scale and thus tend to overlook low-probability channels. The mismatch in low-probability channels also results in the neglect of inter-class relationship information, making it difficult for the student to further enhance performance. To address this issue, we propose an auxiliary ranking loss based on Kendall’s τ\tau Coefficient, which can be plug-and-play in any logit-based distillation method, providing inter-class relationship information and balancing the attention to low-probability channels. We show that the proposed ranking loss is less affected by channel scale, and its optimization objective is consistent with that of KL divergence. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that the proposed ranking loss can be plug-and-play on various baselines and enhance their performance.

780Diff-BBO: Diffusion-Based Inverse Modeling for Black-Box Optimization

[openreview] [pdf]

Abstract Black-box optimization (BBO) aims to optimize an objective function by iteratively querying a black-box oracle in a sample-efficient way. While prior studies focus on forward approaches to learn surrogates for the unknown objective function, they struggle with steering clear of out-of-distribution and invalid inputs. Recently, inverse modeling approaches that map objective space to the design space with conditional diffusion models have demonstrated impressive capability in learning the data manifold. They have shown promising performance in offline BBO tasks. However, these approaches require a pre-collected dataset. How to design the acquisition function for inverse modeling to actively query new data remains an open question. In this work, we propose diffusion-based inverse modeling for black-box optimization (Diff-BBO), an inverse approach leveraging diffusion models for online BBO problem. Instead of proposing candidates in the design space, Diff-BBO employs a novel acquisition function Uncertainty-aware Exploration (UaE) to propose objective function values. Subsequently, we employ a conditional diffusion model to generate samples based on these proposed values within the design space. We demonstrate that using UaE results in optimal optimization outcomes, supported by both theoretical and empirical evidence.

781How does Your RL Agent Explore? An Optimal Transport Analysis of Occupancy Measure Trajectories

[openreview] [pdf]

Abstract The rising successes of RL are propelled by combining smart algorithmic strategies and deep architectures to optimize the distribution of returns and visitations over the state-action space. A quantitative framework to compare the learning processes of these eclectic RL algorithms is currently absent but desired in practice. We address this gap by representing the learning process of an RL algorithm as a sequence of policies generated during training, and then studying the policy trajectory induced in the manifold of occupancy measures. Using an optimal transport-based metric, we measure the length of the paths induced by the policy sequence yielded by an RL algorithm between an initial policy and a final optimal policy. Hence, we first define theEffort of Sequential Learning(ESL). ESL quantifies the relative distance that an RL algorithm travels compared to the shortest path from the initial to the optimal policy. Further, we connect the dynamics of policies in the occupancy measure space and regret, another metric to understand the suboptimality of an RL algorithm, by defining theOptimal Movement Ratio(OMR). OMR assesses the fraction of movements in the occupancy measure space that effectively reduce an analogue of regret. Finally, we derive approximation guarantees to estimate ESL and OMR with finite number of samples and without access to an optimal policy. Through empirical analyses across various environments and algorithms, we demonstrate that ESL and OMR provide insights into the exploration processes of RL algorithms and hardness of different tasks in discrete and continuous MDPs.

782NoisyTraj: Robust Trajectory Prediction with Noisy Observations

[openreview] [pdf]

Abstract Trajectory prediction aims to forecast an agent’s future trajectories based on its historical observed trajectories, which is a critical task for various applications such as autonomous driving, robotics, and surveillance systems. Most existing trajectory prediction methods assume that the observed trajectories collected for forecasting are clean. However, in real-world scenarios, noise is inevitably introduced into the observations due to errors from sensors, detection, and tracking processes, resulting in the collapse of the existing approaches. Therefore, it is essential to perform robust trajectory prediction based on noisy observations, which is a more practical scenario. In this paper, we propose NoisyTraj, a noise-agnostic approach capable of tackling the problem of trajectory prediction with arbitrary types of noisy observations. Specifically, we put forward a mutual information-based mechanism to denoise the original noisy observations. This mechanism optimizes the produced trajectories to exhibit a pattern that closely resembles the clean trajectory pattern while deviating from the noisy one. Considering that the trajectory structure may be destroyed through the only optimization of mutual information, we introduce an additional reconstruction loss to preserve the structure information of the produced observed trajectories. Moreover, we further propose a ranking loss based on the intuitive idea that prediction performance using denoised trajectories should surpass that using the original noisy observations, thereby further enhancing performance. Because NoisyTraj does not rely on any specific module tailored to particular noise distributions, it can handle arbitrary types of noise in principle. Additionally, our proposed NoisyTraj can be easily integrated into existing trajectory prediction models. Extensive experiments conducted on the ETH/UCY and Stanford Drone datasets (SDD) demonstrate that NoisyTraj significantly improves the accuracy of trajectory prediction with noisy observations, compared to the baselines.

783Active Fine-Tuning of Generalist Policies

[openreview] [pdf]

Abstract Pre-trained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks. This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning. However, as soon as several tasks need to be learned, we must decidewhich tasks should be demonstrated and how often?We study this multi-task problem and explore an interactive framework in which the agentadaptivelyselects the tasks to be demonstrated. We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy. We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness to efficiently fine-tune neural policies in complex and high-dimensional environments.

784Scrutinize What We Ignore: Reining In Task Representation Shift Of Context-Based Offline Meta Reinforcement Learning

[openreview] [pdf]

Abstract Offline meta reinforcement learning (OMRL) has emerged as a promising approach for interaction avoidance and strong generalization performance by leveraging pre-collected data and meta-learning techniques. Previous context-based approaches predominantly rely on the intuition that alternating optimization between the context encoder and the policy can lead to performance improvements, as long as the context encoder follows the principle of maximizing the mutual information between the task variable MM and its latent representation ZZ (I(Z;M)I(Z;M)) while the policy adopts the standard offline reinforcement learning (RL) algorithms conditioning on the learned task representation. Despite promising results, the theoretical justification of performance improvements for such intuition remains underexplored. Inspired by the return discrepancy scheme in the model-based RL field, we find that the previous optimization framework can be linked with the general RL objective of maximizing the expected return, thereby explaining performance improvements. Furthermore, after scrutinizing this optimization framework, we find it ignores the variation of the task representation in the alternating optimization process, which weakens the condition necessary for monotonic performance improvements, and may therefore violate the monotonicity. We name this issue \underline{task representation shift} and theoretically prove that the monotonic performance improvements can be guaranteed with appropriate context encoder updates. We use different settings to rein in the task representation shift on three widely adopted training objectives concerning maximizing I(Z;M)I(Z;M) across different data qualities. Empirical results show that reining in the task representation shift can indeed improve performance. Our work opens up a new avenue for OMRL, leading to a better understanding between the task representation and performance improvements.

785A Dual-Fusion Cognitive Diagnosis Framework for Open Student Learning Environments

[openreview] [pdf]

Abstract Cognitive diagnosis model (CDM) is a fundamental and upstream component in intelligent education. It aims to infer students’ mastery levels based on historical response logs. However, existing CDMs usually follow the ID-based embedding paradigm, which could often diminish the effectiveness of CDMs in open student learning environments. This is mainly because they can hardly directly infer new students’ mastery levels or utilize new exercises or knowledge without retraining. Textual semantic information, due to its unified feature space and easy accessibility, can help alleviate this issue. Unfortunately, directly incorporating semantic information may not benefit CDMs, since it does not capture response-relevant features and thus discards the individual characteristics of each student. To this end, this paper proposes a dual-fusion cognitive diagnosis framework (DFCD) to address the challenge of aligning two different modalities, i.e., textual semantic features and response-relevant features. Specifically, in DFCD, we first propose the exercise-refiner and concept-refiner to make the exercises and knowledge concepts more coherent and reasonable via large language models. Then, DFCD encodes the refined features using text embedding models to obtain the semantic information. For response-related features, we propose a novel response matrix to fully incorporate the information within the response logs. Finally, DFCD designs a dual-fusion module to merge the two modal features. The ultimate representations possess the capability of inference in open student learning environments and can be also plugged in existing CDMs. Extensive experiments across real-world datasets show that DFCD achieves superior performance by integrating different modalities and strong adaptability in open student learning environments.

786Almost sure convergence of stochastic Hamiltonian descent methods

[openreview] [pdf]

Abstract Gradient normalization and soft clipping are two popular techniques for tackling instability issues and improving convergence of stochastic gradient descent (SGD) with momentum. In this article, we study these types of methods through the lens of dissipative Hamiltonian systems. Gradient normalization and certain types of soft clipping algorithms can be seen as (stochastic) implicit-explicit Euler discretizations of dissipative Hamiltonian systems, where the kinetic energy function determines the type of clipping that is applied. We make use of dynamical systems theory to show in a unified way that all of these schemes converge to stationary points of the objective function, almost surely, in several different settings: a) for LL-smooth objective functions, when the variance of the stochastic gradients is possibly infinite b) under the (L0,L1)(L_0,L_1)-smoothness assumption, for heavy-tailed noise with bounded variance and c) for (L0,L1)(L_0,L_1)-smooth functions in the empirical risk minimization setting, when the variance is possibly infinite but the expectation is finite.

787Leveraging Variable Sparsity to Refine Pareto Stationarity in Multi-Objective Optimization

[openreview] [pdf]

Abstract Gradient-based multi-objective optimization (MOO) is essential in modern machine learning, with applications in e.g., multi-task learning, federated learning, algorithmic fairness and reinforcement learning. In this work, we first reveal some limitations of Pareto stationarity, a widely accepted first-order condition for Pareto optimality, in the presence of sparse function-variable structures. Next, to account for such sparsity, we propose a novel solution concept termed Refined Pareto Stationarity (RPS), which we prove is always sandwiched between Pareto optimality and Pareto stationarity. We give an efficient partitioning algorithm to automatically mine the function-variable dependency and substantially trim non-optimal Pareto stationary solutions. Then, we show that gradient-based descent algorithms in MOO can be enhanced with our refined partitioning. In particular, we propose Multiple Gradient Descent Algorithm with Refined Partition (RP-MGDA) as an example method that converges to RPS, while still enjoying a similar per-step complexity and convergence rate. Lastly, we validate our approach through experiments on both synthetic examples and realistic application scenarios where distinct function-variable dependency structures appear. Our results highlight the importance of exploiting function-variable structure in gradient-based MOO, and provide a seamless enhancement to existing approaches.

788Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

[openreview] [pdf]

Abstract In Open-Set Domain Adaptation (OSDA) we wish to perform classification in a target domain which contains a novel class along with kk non-novel classes. This work formally studies OSDA under the assumption that classes are separable, and the supports of source and target domains coincide, while other aspects of the distribution may change. We develop a simple and scalable method that attains robustness to distribution shift and is guaranteed to solve the problem, while showing that it cannot be solved under weaker conditions that have been studied for OSDA in the past, particularly in the presence of covariate shift. We formally define the realistic assumptions within the scope of OSDA problem that the previous literature has either overlooked or not explicitly addressed. In a thorough empirical evaluation on both image and text data, we observe that existing OSDA methods are not robust to the distribution shifts we consider. The results demonstrate the efficacy of joint representation learning for classification of known classes and detection of novel ones using principled methods. We find that optimizing these two objectives in unison leads to mutual improvements in task performance contrary to what might be expected when objectives are considered independently. Our rigorous empirical study also examines how OSDA performance under distribution shift is affected by parameters of the problem such as the size of novel class. Taken together, our observations emphasize the importance of formalizing assumptions under which OSDA methods operate and to develop appropriate methodology that are capable of scaling with large datasets and models for different scenarios of OSDA.

789Transformers versus LSTMs for electronic trading

[openreview] [pdf]

Abstract The rapid advancement of artificial intelligence has seen widespread application of long short-term memory (LSTM), a type of recurrent neural network (RNN), in time series forecasting. Despite the success of Transformers in natural language processing (NLP), which prompted interest in their efficacy for time series prediction, their application in financial time series forecasting is less explored compared to the dominant LSTM models. This study investigates whether Transformer-based models can outperform LSTMs in financial time series forecasting. It involves a comparative analysis of various LSTM-based and Transformer-based models on multiple financial prediction tasks using high-frequency limit order book data. A novel LSTM-based model named DLSTM is introduced alongside a newly designed Transformer-based model tailored for financial predictions. The findings indicate that Transformer-based models exhibit only a marginal advantage in predicting absolute price sequences, whereas LSTM-based models demonstrate superior and more consistent performance in predicting differential sequences such as price differences and movements.

790Unified Framework for Causal Discovery and Long-term Forecasting in Non-stationary Environments

[openreview] [pdf]

Abstract Non-stationary data is prevalent in various real-world domains such as climate science, economics, and neuroscience, presenting significant challenges for tasks like forecasting and causal discovery from observational data. Existing approaches often operate under the assumption that the data is stationary. In this work, we introduce a unified framework that combines long-term forecasting and causal discovery with non-linear relations in a non-stationary setting. Specifically, we assume that the nonlinear causal relations in the observed space can be transformed into linear relations in the latent space via projections. In addition, we model the non-stationarity in the system as arising from time-varying causal relations. The proposed model demonstrates that adopting a causal perspective for long-term forecasting not only addresses the limitations of each task but also makes the causal process identifiable, enhances interpretability, and provides more reliable predictions. Moreover, our approach reformulates causal discovery into a scalable, non-parametric deep learning problem. Through experiments on both synthetic and real-world datasets, we show that our framework outperforms baseline methods in both forecasting and causal discovery, underscoring the benefits of this integrated approach.

791TLXML: Task-Level Explanation of Meta-Learning via Influence Functions

[openreview] [pdf]

Abstract The scheme of adaptation via meta-learning is seen as an ingredient for solving the problem of data shortage or distribution shift in real-world applications, but it also brings the new risk of inappropriate updates of the model in the user environment, which increases the demand for explainability. Among the various types of XAI methods, establishing a method of explanation based on past experience in meta-learning requires special consideration due to its bi-level structure of training, which has been left unexplored. In this work, we propose influence functions for explaining meta-learning that measure the sensitivities of training tasks to adaptation and inference. We also argue that the approximation of the Hessian using the Gauss-Newton matrix resolves computational barriers peculiar to meta-learning. We demonstrate the adequacy of the method through experiments on task distinction and task distribution distinction using image classification tasks with MAML and Prototypical Network.

792No Preference Left Behind: Group Distributional Preference Optimization

[openreview] [pdf]

Abstract Preferences within a group of people are not uniform but follow a distribution. While existing alignment methods like Direct Preference Optimization (DPO) attempt to steer models to reflect human preferences, they struggle to capture the distributional pluralistic preferences within a group. These methods often skew toward dominant preferences, overlooking the diversity of opinions, especially when conflicting preferences arise. To address this issue, we propose Group Distribution Preference Optimization (GDPO), a novel framework that aligns language models with the distribution of preferences within a group by incorporating the concept of beliefs that shape individual preferences. GDPO calibrates a language model using statistical estimation of the group’s belief distribution and aligns the model with belief-conditioned preferences, offering a more inclusive alignment framework than traditional methods. In experiments using both synthetic controllable opinion generation and real-world movie review datasets, we show that DPO fails to align with the targeted belief distributions, while GDPO consistently reduces this alignment gap during training. Additionally, our evaluation metrics demonstrate that GDPO outperforms existing approaches in aligning with group distributional preferences, marking a significant advance in pluralistic alignment.

793Cluster-Segregate-Perturb (CSP): A Model-agnostic Explainability Pipeline for Spatiotemporal Land Surface Forecasting Models

[openreview] [pdf]

Abstract Satellite images are increasingly valuable for modeling regional climate change. Earth surface forecasting is one task that combines satellite imagery and meteorological data to understand how climate evolves over time. However, understanding the complex relationship between meteorological variables and land surface changes remains a challenge. Our paper introduces a pipeline that integrates principles from perturbation-based techniques like LIME and global explainability techniques methods like PDP, addressing the limitations of these techniques in high-dimensional spatiotemporal models. This pipeline facilitates analyses such as marginal sensitivity, correlation, and lag analysis, etc for complex land forecasting models. Using ConvLSTM for surface forecasting, we analyzed influence of variables like temperature, pressure, and precipitation on the NDVI of the surface predictions. Our study in EarthNet2021 Dataset (primarily consists of samples from the European Alps region, collected during the spring to fall seasons) revealed that precipitation had the greatest impact, followed by temperature, while pressure has little to no direct effect on NDVI. Additionally, interesting nonlinear correlations between meteorological variables and NDVI have been uncovered.

794HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models

[openreview] [pdf]

Abstract Safety guard models that detect malicious queries aimed at large language models (LLMs) are essential for ensuring the secure and responsible deployment of LLMs in real-world applications. However, deploying existing safety guard models with billions of parameters alongside LLMs on mobile devices is impractical due to substantial memory requirements and latency. To reduce this cost, we distill a large teacher safety guard model into a smaller one using a labeled dataset of instruction-response pairs with binary harmfulness labels. Due to the limited diversity of harmful instructions in the existing labeled dataset, naively distilled models tend to underperform compared to larger models. To bridge the gap between small and large models, we proposeHarmAug, a simple yet effective data augmentation method that involves jailbreaking an LLM and prompting it to generate harmful instructions. Given a prompt such as, “Make a single harmful instruction prompt that would elicit offensive content”, we add an affirmative prefix (e.g., “I have an idea for a prompt:”) to the LLM’s response. This encourages the LLM to continue generating the rest of the response, leading to sampling harmful instructions. Another LLM generates a response to the harmful instruction, and the teacher model labels the instruction-response pair. We empirically show that our HarmAug outperforms other relevant baselines. Moreover, a 435-million-parameter safety guard model trained with HarmAug achieves an F1 score comparable to larger models with over 7 billion parameters, and even outperforms them in AUPRC, while operating at less than 25% of their computational cost. Ourcode,safety guard model, andsynthetic datasetare publicly available.

795Preference Elicitation for Offline Reinforcement Learning

[openreview] [pdf]

Abstract Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in various environments.

796Flashback: Understanding and Mitigating Forgetting in Federated Learning

[openreview] [pdf]

Abstract In Federated Learning (FL), forgetting, or the loss of knowledge across rounds, hampers algorithm convergence, especially in the presence of severe data heterogeneity among clients. This study explores the nuances of this issue, emphasizing the critical role of forgetting leading to FL’s inefficient learning within heterogeneous data contexts. Knowledge loss occurs in both client-local updates and server-side aggregation steps; addressing one without the other fails to mitigate forgetting. We introduce a metric to measure forgetting granularly, ensuring distinct recognition amid new knowledge acquisition. Based on this, we propose Flashback, a novel FL algorithm with a dynamic distillation approach that regularizes the local models and effectively aggregates their knowledge. The results from extensive experimentation across different benchmarks show that Flashback mitigates forgetting and outperforms other state-of-the-art methods, achieving faster round-to-target accuracy by converging in 6 to 16 rounds, being up to 27×\times faster.

797Auction-Based Regulation for Artificial Intelligence

[openreview] [pdf]

Abstract In an era of “moving fast and breaking things”, regulators have moved slowly to pick up the safety, bias, and legal pieces left in the wake of broken Artificial Intelligence (AI) deployment. Since AI models, such as large language models, are able to push misinformation and stoke division within our society, it is imperative for regulators to employ a framework that mitigates these dangers and ensures user safety. While there is much-warranted discussion about how to address the safety, bias, and legal woes of state-of-the-art AI models, the number of rigorous and realistic mathematical frameworks to regulate AI safety is lacking. We take on this challenge, proposing an auction-based regulatory mechanism that provably incentivizes model-building agents (i) to deploy safer models and (ii) to participate in the regulation process. We provably guarantee, via derived Nash Equilibria, that each participating agent’s best strategy is to submit a model safer than a prescribed minimum-safety threshold. Empirical results show that our regulatory auction boosts safety and participation rates by 20% and 15% respectively, outperforming simple regulatory frameworks that merely enforce minimum safety standards.

798Fair Anomaly Detection For Imbalanced Groups

[openreview] [pdf]

Abstract Anomaly detection (AD) has been widely studied for decades in many real-world applications, including fraud detection in finance, and intrusion detection for cybersecurity, etc. Due to the imbalanced nature between protected and unprotected groups and the imbalanced distributions of normal examples and anomalies, the learning objectives of most existing anomaly detection methods tend to solely concentrate on the dominating unprotected group. Thus, it has been recognized by many researchers about the significance of ensuring model fairness in anomaly detection. However, the existing fair anomaly detection methods tend to erroneously label most normal examples from the protected group as anomalies in the imbalanced scenario where the unprotected group is more abundant than the protected group. This phenomenon is caused by the improper design of learning objectives, which statistically focus on learning the frequent patterns (i.e., the unprotected group) while overlooking the under-represented patterns (i.e., the protected group). To address these issues, we propose FADIG, a fairness-aware anomaly detection method targeting the imbalanced scenario. It consists of a fairness-aware contrastive learning module and a rebalancing autoencoder module to ensure fairness and handle the imbalanced data issue, respectively. Moreover, we provide the theoretical analysis that shows our proposed contrastive learning regularization guarantees group fairness. Empirical studies demonstrate the effectiveness and efficiency of FADIG across multiple real-world datasets.

799Understanding and Mitigating Distribution Shifts for Machine Learning Force Fields

[openreview] [pdf]

Abstract Machine Learning Force Fields (MLFFs) are a promising alternative to expensive ab initio quantum mechanical molecular simulations. Given the diversity of chemical spaces that are of interest and the cost of generating new data, it is important to understand how MLFFs generalize beyond their training distributions. Our diagnostic experiments on real-world datasets reveal common distribution shifts that pose significant challenges, including for large foundation models trained on extensive datasets. Based on these observations, we hypothesize that current supervised training methods inadequately regularize MLFFs, resulting in overfitting and learning poor representations of out-of-distribution systems. Based on our observations, we propose two new methods as initial steps for mitigating distribution shifts for MLFFs. Our methods focus on test-time refinement strategies that incur minimal computational cost. The first strategy, based on spectral graph theory, modifies the edges of test graphs to align with graph structures seen during training. It can be applied to any existing pre-trained model to mitigate connectivity distribution shifts. Our second strategy improves representations for out-of-distribution systems at test-time by taking gradient steps using an auxiliary objective. We demonstrate that our test-time refinement strategies can reduce force errors by an order of magnitude on out-of-distribution systems, suggesting that MLFFs are capable of modeling diverse chemical spaces, but are not being effectively trained to do so. Our experiments establish clear benchmarks for evaluating the generalization capabilities of the next generation of MLFFs.

800Innovative Thinking, Infinite Humor: Humor Research of Large Language Models through Structured Thought Leaps

[openreview] [pdf]

Abstract Humor is a culturally nuanced aspect of human language that presents challenges for understanding and generation, requiring participants to possess good creativity and strong associative thinking. Similar to reasoning tasks like solving math problems, humor generation requires continuous reflection and revision to foster creative thinking, rather than relying on a sudden flash of inspiration like Creative Leap-of-Thought (CLoT) paradigm. Although CLoT can realize the ability of remote association generation, this paradigm fails to emphasize the importance of rationales between those seemingly unrelated concepts. Therefore, in this paper, we propose a systematic way of thinking about generating humor and based on it, we built Creative Leap of Structured Thought (CLoST) frame. First, a reward model is necessary achieve the purpose of being able to correct errors, since there is currently no expert model of humor and a usable rule to determine whether a piece of content is humorous. Judgement-oriented instructions are designed to improve the capability of a model, and we also propose an open-domain instruction evolutionary method to fully unleash the potential. Then, through reinforcement learning, the model learns to hone its rationales of the thought chain and refine the strategies it uses. Thus, it learns to recognize and correct its mistakes, and finally generate the most humorous and creative answer. These findings deepen our understanding of the creative capabilities of LLMs and provide ways to enhance LLMs’ creative abilities for cross-domain innovative applications.

801Revisiting Large-Scale Non-convex Distributionally Robust Optimization

[openreview] [pdf]

Abstract Distributionally robust optimization (DRO) is a powerful technique to train robust machine learning models that perform well under distribution shifts. Compared with empirical risk minimization (ERM), DRO optimizes the expected loss under the worst-case distribution in an uncertainty set of distributions. This paper revisits the important problem of DRO with non-convex smooth loss functions. For this problem, Jin et al. (2021) showed that its dual problem is generalized (L0,L1)(L_0, L_1)-smooth condition and gradient noise satisfies the affine variance condition, designed an algorithm of mini-batch normalized gradient descent with momentum, and proved its convergence and complexity. In this paper, we show that the dual problem and the gradient noise satisfy simpler yet more precise partially generalized smoothness condition and partially affine variance condition by studying the optimization variable and dual variable separately, which further yields much simpler algorithm design and convergence analysis. We develop a double stochastic gradient descent with clipping (D-SGD-C) algorithm that converges to an ϵ\epsilon-stationary point with O(ϵ4)\mathcal O(\epsilon^{-4}) gradient complexity, which matches with results in Jin et al. (2021). Our algorithm does not need to use momentum, and the proof is much simpler, thanks to the more precise characterization of partially generalized smoothness and partially affine variance noise. We further design a variance-reduced method that achieves a lower gradient complexity of O(ϵ3)\mathcal O(\epsilon^{-3}). Our theoretical results and insights are further verified numerically on a number of tasks, and our algorithms outperform the existing DRO method (Jin et al., 2021).

802Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration

[openreview] [pdf]

Abstract Studying how to fine-tune offline reinforcement learning (RL) pre-trained policy is profoundly significant for enhancing the sample efficiency of RL algorithms. However, directly fine-tuning pre-trained policies often results in sub-optimal performance. This is primarily due to the distribution shift between offline pre-training and online fine-tuning stages. Specifically, the distribution shift limits the acquisition of effective online samples, ultimately impacting the online fine-tuning performance. In order to narrow down the distribution shift between offline and online stages, we proposed Q conditioned state entropy (QCSE) as intrinsic reward. Specifically, QCSE maximizes the state entropy of all samples individually, considering their respective Q values. This approach encourages exploration of low-frequency samples while penalizing high-frequency ones, and implicitly achieves State Marginal Matching (SMM), thereby ensuring optimal performance, solving the asymptotic sub-optimality of constraint-based approaches. Additionally, QCSE can seamlessly integrate into various RL algorithms, enhancing online fine-tuning performance. To validate our claim, we conduct extensive experiments, and observe significant improvements with QCSE ( about 10.9% for CQL and 8% for Cal-QL). Furthermore, we extended experimental tests to other algorithms, affirming the generality of QCSE.

803Out-of-distribution Generalization for Total Variation based Invariant Risk Minimization

[openreview] [pdf]

Abstract Invariant risk minimization is an important general machine learning framework that has recently been interpreted as a total variation model (IRM-TV). However, how to improve out-of-distribution (OOD) generalization in the IRM-TV setting remains unsolved. In this paper, we propose a novel OOD generalization approach for IRM-TV, named OOD-TV-IRM, based on its theoretical analysis. The key idea is to deploy an autonomous TV penalty that depends on the invariant feature extractor. We construct the autonomous TV penalty using a neural network with another set of parameters, which can be learned via an adversarial scheme against the parameters of the invariant feature extractor. Experimental results show that OOD-TV-IRM outperforms IRM-TV in most situations.

804Prompt Optimization with Human Feedback

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable performances in various tasks. However, the performances of LLMs heavily depend on the input prompt. This has given rise to a number of recent works on prompt optimization. However, the previous works often require the availability of a numeric score to assess the quality of every prompt. Unfortunately, when a human user interacts with a black-box LLM, it is often infeasible and unreliable to attain such a score. Instead, it is usually significantly easier and more reliable to obtain preference feedback from a human user, i.e., showing the user the responses generated from a pair of prompts and asking the user which one is preferred. Therefore, in this paper, we study the problem of prompt optimization with human feedback (POHF), in which we aim to optimize the prompt for a black-box LLM using only human preference feedback. By drawing inspirations from dueling bandits, we design a theoretically principled strategy to select a pair of prompts to query for preference feedback in every iteration, and hence introduce our algorithm named automated POHF (APOHF). We apply our APOHF algorithm to a variety of tasks, including optimizing user instructions, prompt optimization for text-to-image generative models, and response optimization with human feedback (i.e., further refining the response using a variant of our APOHF). The results demonstrate that our APOHF can efficiently find a good prompt using a small number of preference feedback instances.

805Dreamguider: Improved Training free Diffusion-based Conditional Generation

[openreview] [pdf]

Abstract Diffusion models have emerged as a formidable tool for training-free conditional generation. However, a key hurdle in inference-time guidance techniques is the need for compute-heavy backpropagation through the diffusion network for estimating the guidance direction. Moreover, these techniques often require handcrafted parameter tuning on a case-by-case basis. Although some recent works have introduced minimal compute methods for linear inverse problems, a generic lightweight guidance solution to both linear and non-linear guidance problems is still missing. To this end, we propose Dreamguider, a method that enables inference-time guidance without compute-heavy backpropagation through the diffusion network. The key idea is to regulate the gradient flow through a time-varying factor. Moreover, we propose an empirical guidance scale that works for a wide variety of tasks, hence removing the need for handcrafted parameter tuning. We further introduce an effective lightweight augmentation strategy that significantly boosts the performance during inference-time guidance. We present experiments using Dreamguider on multiple tasks across multiple datasets and models to show the effectiveness of the proposed modules. To facilitate further research, we will make the code public after the review process.

806Mirror Descent Actor Critic via Bounded Advantage Learning

[openreview] [pdf]

Abstract Regularization is a core component of recent Reinforcement Learning (RL) algorithms. Mirror Descent Value Iteration (MDVI) uses both Kullback-Leibler divergence and entropy as regularizers in its value and policy updates. Despite its empirical success in discrete action domains and strong theoretical garantees, the performance improvement of a MDVI-based method over the entropy-only-regularized RL is limited in continuous action domains. In this study, we propose Mirror Descent Actor Critic (MDAC) as an actor-critic style instantiation of MDVI for continuous action domains, and show that its empirical performance is significantly boosted by bounding the values of actor’s log-density terms in the critic’s loss function. Further, we relate MDAC to Advantage Learning by recalling that the actor’s log-probability is equal to the regularized advantage function in tabular cases, and theoretically show that the error of optimal policy misspecification is decreased by bounding the advantage terms.

[openreview] [pdf]

Abstract Transformer models have achieved remarkable results in the field of Natural Language Processing (NLP) with the introduction of breakthrough large language models like GPT and LLaMA recently. Motivated by their ability to capture long-range dependencies, researchers have successfully adapted these models to the task of time series forecasting. However, despite their potential, effectiveness of applying these pre-trained time series transformer models in the target domain is limited due to the need for hyper-parameter optimisation to match the characteristics of the target domain. This paper presents a novel algorithm that uses parameter efficient fine-tuning such as Low Rank Adaptation (LoRA) coupled with Limited Discrepancy Search (LDS) to efficiently auto fine-tune pre-trained time series transformers for a given target domain. Our approach helps in making informed design choices involving LoRA tunable hyper-parameters with strong performance-cost trade-offs that are highly transferable across different target domains. Our experiments demonstrate that autotune efficiently identifies the optimal configuration of LoRA hyper-parameters, achieving an average MASE improvement of 5.21% across all datasets and 4.76% for out-of-domain datasets compared to zero shot pre-trained models, with improvements as high as 20.59% for one of the out-of-domain datasets.

808Understanding Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing

[openreview] [pdf]

Abstract Structured State Space Models (SSMs) have emerged as alternatives to transformers, addressing the challenges of processing long sequences. While SSMs are often regarded as effective in capturing long-term dependencies, we theoretically demonstrate that they suffer from a strong recency bias. Our empirical findings reveal that this bias impairs the models’ ability to recall distant information and introduces robustness issues. We conducted scaling experiments and discovered that deeper structures in SSMs facilitate the learning of long contexts. However, our theoretical analysis reveal that as SSMs increase in depth, they exhibit a tendency toward over-smoothing, resulting in token representations becoming increasingly indistinguishable. This over-smoothing phenomenon ultimately constrains the scalability of SSMs to achieve improved performance. Collectively, these findings highlight important limitations of SSMs and underscore the need for further research to address these challenges in long-range sequence modeling.

809Evaluating and Explaining the Severity of Distribution Shifts: Illustration with Tabular Text Classification

[openreview] [pdf]

Abstract After deploying a machine learning model, distribution shifts may emerge in real-world data. When dealing with unlabeled data, it can be challenging to accurately assess the impact of these drifts on the model’s performance, for any type and intensity of shift. In that case, decisions such as updating the model for every benign shift would not be cost-efficient. In this paper, we introduce the Error Classifier, an error assessment method that addresses two tasks: unsupervised performance estimation and error detection on out-of-distribution data. The Error Classifier computes the probability that the model will fail based on detected fault patterns. Further, we employ a sampling-based approximation of Shapley values, with the Error Classifier as value function, in order to explain why a shift is predicted as severe, in terms of feature values. As explanation methods can sometimes disagree, we suggest evaluating the consistency of explanations produced by our technique and different ones. We focus on classification and illustrate the relevance of our method in a bimodal context, on tabular datasets with text fields. We measure our method against a selection of 15 baselines from various domains, on 7 datasets with a variety of shifts, and 2 multimodal fusion strategies for the classification models. Lastly, we show the usefulness of our explanation algorithm on instances affected by various types of shifts.

810Incorporating Visual Correspondence into Diffusion Model for Visual Try-On

[openreview] [pdf]

Abstract Diffusion models have shown preliminary success in virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion model. To alleviate this issue, we novelly propose to explicitly capitalize on visual correspondence as the prior to tame diffusion process instead of simply feeding the whole garment into UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in garment to the ones over target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal map of target person. The correspondence mimics the way of putting clothing on human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take the advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets.

811Semantic-Aware Diffusion Model for Sequential Recommendation

[openreview] [pdf]

Abstract Sequential recommendation aims to predict the next click for a particular user based on their historical interacted item sequences. Recently, diffusion-based methods have achieved the state-of-the-art performance in sequential recommendation. However, they fail to effectively utilize the rich semantic information embedded in items during the diffusion process to accurately guide the generation, leading to sub-optimal results. To address this limitation, we designed SDRec, aSemantic-awareDiffusion model for sequentialRecommendation. Our model introduces a novel architecture, the Semantic Fusion Layer, which leverages the embedding table from the encoder to incorporate item semantics into the diffusion process through an attention mechanism. Together with the well-designed contrastive and generative losses, SDRec effectively utilizes the item semantics in diffusion model, unleashing the potential of sequential recommendation. Our experiments show that SDRec has over 10% relative gain with superior efficiency compared with existing methods.

812Differentiable Solver Search for fast diffusion sampling

[openreview] [pdf]

Abstract Diffusion-based models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal and reveals a compact search space comprised of timestep and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify the optimal solver. Equipped with the searched solver, our rectified flow models, SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-256×256256\times256 with only 10 steps. Meanwhile, our DDPM model, DiT-XL/2, reaches a FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin. Moreover, our searched solver demonstrates its generality across various model architectures, resolutions, and model sizes.

813Distilling Reinforcement Learning into Single-Batch Datasets

[openreview] [pdf]

Abstract Dataset distillation compresses a large dataset into a small, often one-batch, synthetic dataset such that learning on the synthetic dataset approximates learning on the large dataset. Training on the distilled dataset can be performed in as little as one step of gradient descent. We demonstrate that distillation is generalizable to different tasks by distilling reinforcement learning environments into one-batch supervised learning datasets. This demonstrates not only distillation’s ability to compress a reinforcement learning task but also its ability to transform one learning modality (reinforcement learning) into another (supervised learning). We present a novel extension of proximal policy optimization for meta-learning and use it in distillation of both a multi-dimensional extension of the classic cart-pole problem and several Atari games. We demonstrate distillation’s ability to compress complex RL environments into one-step supervised learning, explore RL distillation’s generalizability across learner architectures, and demonstrate distilling an environment into the smallest-possible synthetic dataset.

814Multi-Session Budget Optimization for Forward Auction-based Federated Learning

[openreview] [pdf]

Abstract Auction-based Federated Learning (AFL) has emerged as an important research field in recent years. The prevailing strategies for FL data consumers (DCs) assume that the entire team of the required data owners (DOs) for an FL task must be assembled before training can commence. In practice, a DC can trigger the FL training process multiple times. DOs can thus be gradually recruited over multiple FL model training sessions. Existing bidding strategies for AFL DCs are not designed to handle such scenarios. Therefore, the problem of multi-session AFL remains open. To address this problem, we propose the Multi-session Budget Optimization Strategy for forward Auction-based Federated Learning (MultiBOS-AFL). Based on hierarchical reinforcement learning, MultiBOS-AFL jointly optimizes inter-session budget pacing and intra-session bidding for AFL DCs, with the objective of maximizing the total utility. Extensive experiments on six benchmark datasets show that it significantly outperforms seven state-of-the-art approaches. On average, MultiBOS-AFL achieves 12.28% higher utility, 14.52% more data acquired through auctions for a given budget, and 1.23% higher test accuracy achieved by the resulting FL model compared to the best baseline. To the best of our knowledge, it is the first budget optimization decision support method with budget pacing capability designed for DCs in multi-session forward auction-based FL.

815What’s New in My Data? Novelty Exploration via Contrastive Generation

[openreview] [pdf]

Abstract Fine-tuning is widely used to adapt language models for specific goals, often leveraging real-world data such as patient records, customer-service interactions, or web content in languages not covered in pre-training. These datasets are typically massive, noisy, and often confidential, making their direct inspection challenging. However, understanding them is essential for guiding model deployment and informing decisions about data cleaning or suppressing any harmful behaviors learned during fine-tuning. In this study, we introduce the task of novelty discovery through generation, which aims to identify novel properties of a fine-tuning dataset by generating examples that illustrate these properties. Our approach - Contrastive Generative Exploration (CGE) - assumes no direct access to the data but instead relies on a pre-trained model and the same model after fine-tuning. By contrasting the predictions of these two models, CGE can generate examples that highlight novel characteristics of the fine-tuning data. However, this simple approach may produce examples that are too similar to one another, failing to capture the full range of novel phenomena present in the dataset. We address this by introducing an iterative version of CGE, where the previously generated examples are used to update the pre-trained model, and this updated model is then contrasted with the fully fine-tuned model to generate the next example, promoting diversity in the generated outputs. Our experiments demonstrate the effectiveness of CGE in detecting novel content, such as toxic language, as well as new natural and programming languages. Furthermore, we show that CGE remains effective even when models are fine-tuned using differential privacy techniques.

816Rectified Diffusion: Straightness Is Not Your Need in Rectified Flow

[openreview] [pdf]

Abstract Diffusion models have greatly improved visual generation but are hindered by slow generation speed due to the computationally intensive nature of solving generative ODEs. Rectified flow, a widely recognized solution, improves generation speed by straightening the ODE path. Its key components include: 1) using the diffusion form of flow-matching, 2) employing v\boldsymbol v-prediction, and 3) performing rectification (a.k.a. reflow). In this paper, we argue that the success of rectification primarily lies in using a pretrained diffusion model to obtain matched pairs of noise and samples, followed by retraining with these matched noise-sample pairs. Based on this, components 1) and 2) are unnecessary. Furthermore, we highlight that straightness is not an essential training target for rectification; rather, it is a specific case of flow-matching models. The more critical training target is to achieve a first-order approximate ODE path, which is inherently curved for models like DDPM and Sub-VP. Building on this insight, we propose Rectified Diffusion, which generalizes the design space and application scope of rectification to encompass the broader category of diffusion models, rather than being restricted to flow-matching models. We validate our methods on Stable Diffusion v1-5 and Stable Diffusion XL. Our methods not only greatly simplifies the training procedure of rectified flow-based previous works~(e.g., InstaFlow) but also achieves superior performance with even lower training cost.

817Exploiting Hidden Symmetry to Improve Objective Perturbation for DP linear learners with a nonsmoothℓ1-norm

[openreview] [pdf]

Abstract Objective Perturbation (OP) is a classic approach to differentially private (DP) convex optimization with smooth loss functions but is less understood for nonsmooth cases. In this work, we study how to apply OP to DP linear learners under loss functions with an implicit 1\ell_1-norm structure, such as max(0,x)\max(0,x) as a motivating example. We propose to first smooth out the hidden 1\ell_1-norm by convolution, and then invoke standard OP. Convolution has many advantages that distinguish itself from Moreau Envelope, such as approximating from above and a higher degree of hyperparameters. These advantages, in conjunction with the symmetry of 1\ell_1-norm, result in tighter pointwise approximation, which further facilitates tighter analysis of generalization risks by using pointwise bounds. Under mild assumptions on groundtruth distributions, the proposed OP-based algorithm is found to be rate-optimal, and can achieve the excess generalization risk O(1n+dln(1/δ)nε)\mathcal{O}(\frac{1}{\sqrt{n}}+\frac{\sqrt{d\ln(1/\delta)}}{n\varepsilon}). Experiments demonstrate the competitive performance of the proposed method to Noisy-SGD.

818Scaling Laws for Diffusion Transformers

[openreview] [pdf]

Abstract Diffusion transformers (DiT) have already achieved appealing synthesis and scaling properties in content recreation, \emph{e.g.,} image and video generation.However, scaling laws of DiT are less explored, which usually offer precise predictions regarding optimal model size and data requirements given a specific compute budget.Therefore, experiments across a broad range of compute budgets, from \texttt{1e17} to \texttt{6e18} FLOPs are conducted to confirm the existence of scaling laws in DiT \emph{for the first time}. Concretely, the loss of pretraining DiT also follows a power-law relationship with the involved compute.Based on the scaling law, we can not only determine the optimal model size and required data but also accurately predict the text-to-image generation loss given a model with 1B parameters and a compute budget of \texttt{1e21} FLOPs.Additionally, we also demonstrate that the trend of pretraining loss matches the generation performances (\emph{e.g.,} FID), even across various datasets, which complements the mapping from compute to synthesis quality and thus provides a predictable benchmark that assesses model performance and data quality at a reduced cost.

819Diffusion Transformers for Tabular Data Time Series Generation

[openreview] [pdf]

Abstract Tabular data generation has recently attracted a growing interest due to its different application scenarios. However, generating time series of tabular data, where each element of the series depends on the others, remains a largely unexplored domain. This gap is probably due to the difficulty of jointly solving different problems, the main of which are the heterogeneity of tabular data (a problem common to non-time-dependent approaches) and the variable length of a time series. In this paper, we propose a Diffusion Transformers (DiTs) based approach for tabular data series generation. Inspired by the recent success of DiTs in image and video generation, we extend this framework to deal with heterogeneous data and variable-length sequences. Using extensive experiments on six datasets, we show that the proposed approach outperforms previous work by a large margin. Our code will be made public after this article is accepted.

820Spatiotemporal Backward Inconsistency Learning Gives STGNNs Icing on the Cake

[openreview] [pdf]

Abstract Spatiotemporal prediction models facilitate various smart-city applications across various domains,such as traffic and climate. While current advancements in these models emphasize leveraging cutting-edge technologies to enhance spatiotemporal learning, they often operate under the implicit assumption of spatiotemporal feature consistency between inputs and labels, overlooking the critical issue of input-label inconsistency. In this study, we introduce a universal spatiotemporal backward inconsistency learning module capable of seamless integration into a variety of models, offering a notable performance boost by explicitly modeling label features to address input-label inconsistency. Our approach includes the development of a spatiotemporal residual theory, advocating for a holistic spatiotemporal learning that encompasses both forward spatiotemporal learning to capture input data’s spatiotemporal features for generating base predictions, akin to existing STNNs, and a backward process to learn residuals that rectify input-label inconsistency, thereby refining the base predictions. Based on this theory, we design the Spatio-Temporal Backward Inconsistency Learning Module (STBIM) for this backward correction process, comprising a residual learning module for decoupling inconsistency information from input representations and label representations, and a residual propagation module for smoothing residual terms to facilitate stable learning. The generated prediction correction term is used to enhance the prediction accuracy. Experimental results on 11 datasets from the traffic and atmospheric domains, combined with 15 spatiotemporal prediction models, demonstrate the broad positive impact of the proposed STBIM. The code is available athttps://anonymous.4open.science/r/ICLR2025-2598.

821AutoRegressive Knowledge Base Completion

[openreview] [pdf]

Abstract Despite their large sizes, many Knowledge Graphs (KGs) remain highly incomplete. This problem has motivated numerous approaches to complete\textit{complete} the KGs by embedding them in a latent space to find the missing links. Although these methods show promising performance, a general limitation is that the scores given to possible links are uncalibrated and cannot be interpreted across different queries. Hence, we say they are local\textit{local} as they relate to a specific context. This limitation makes it non-trivial to deduce the truth value of the links and to answer complex queries. Another limitation is that their learning depends on negative sampling, which is challenging due to the Open World Assumption (OWA).To solve this problem, we propose a novel auto-regressive generative model that learns a joint distribution of the entities and relations of the KG without resorting to negative sampling. This distribution can be used to infer the probability that a link is sampled from the KG, which allows us to return a global\textit{global} score that is interpretable in different contexts. Moreover, our method has the additional advantage that it offers probabilistic semantics for complex reasoning and knowledge base completion, achieving state-of-the-art performance on link prediction with consistent scores across the entire KG.

822Text-to-Model: Text-Conditioned Neural Network Diffusion for Train-Once-for-All Personalization

[openreview] [pdf]

Abstract Generative artificial intelligence (GenAI) has made significant progress in understanding world knowledge and generating content from human languages across various modalities, like text-to-text large language models, text-to-image stable diffusion, and text-to-video Sora. While in this paper, we investigate the capability of GenAI for text-to-model generation, to see whether GenAI can comprehend hyper-level knowledge embedded within AI itself parameters. Specifically, we study a practical scenario termed train-once-for-all personalization, aiming to generate personalized models for diverse end-users and tasks using text prompts. Inspired by the recent emergence of neural network diffusion, we present Tina, a text-conditioned neural network diffusion for train-once-for-all personalization. Tina leverages a diffusion transformer model conditioned on task descriptions embedded using a CLIP model. Despite the astronomical number of potential personalized tasks (e.g., 1.73×10131.73\times10^{13}), by our design, Tina demonstrates remarkable in-distribution and out-of-distribution generalization even trained on small datasets (1000\sim 1000). We further verify whether and how \Tina understands world knowledge by analyzing its capabilities under zero-shot/few-shot image prompts, different numbers of personalized classes, prompts of natural language descriptions, and predicting unseen entities.

823State Space Models are Provably Comparable to Transformers in Dynamic Token Selection

[openreview] [pdf]

Abstract Deep neural networks based on state space models (SSMs) are attracting significant attention in sequence modeling since their computational cost is significantly smaller than that of Transformers. While the capabilities of SSMs have been demonstrated through experiments in various tasks, theoretical understanding of SSMs is still limited. In particular, most theoretical studies discuss the capabilities of SSM layers without nonlinear layers, and there is a lack of discussion on their combination with nonlinear layers. In this paper, we explore the capabilities of SSMs combined with fully connected neural networks, and show that they are comparable to Transformers in extracting the essential tokens depending on the input. As concrete examples, we consider two synthetic tasks, which are challenging for a single SSM layer, and demonstrate that SSMs combined with nonlinear layers can efficiently solve these tasks. Furthermore, we study the nonparametric regression task, and prove that the ability of SSMs is equivalent to that of Transformers in estimating functions belonging to a certain class.

824Improving Diffusion-based Data Augmentation with Inversion Circle Interpolation

[openreview] [pdf]

Abstract Data Augmentation (DA), i.e., synthesizing faithful and diverse samples to expand the original training set, is a prevalent and effective strategy to improve various visual recognition tasks. With the powerful image generation ability, diffusion-based DA has shown strong performance gains on different benchmarks. In this paper, we analyze today’s diffusion-based DA methods, and argue that they can- not take account of both faithfulness and diversity, which are two critical keys for generating high-quality samples and boosting final classification performance. To this end, we propose a novel Diffusion-based Inversion Interpolation DA method: Diff-II. Specifically, Diff-II consists of three main steps: 1) Category concepts learning: Learning concept embeddings for each category. 2) Inversion interpolation: Calculating the inversion for each image, and conducting random circle interpolation for two randomly sampled inversions from the same category. 3) Two-stage denoising: Using different prompts to generate synthesized images in a coarse-to-fine manner. Extensive experiments on multiple image classification tasks (e.g., few-shot, long-tailed, and out-of-distribution classification) have demonstrated its effectiveness over state-of-the-art diffusion-based DA methods.

825Single Teacher, Multiple Perspectives: Teacher Knowledge Augmentation for Enhanced Knowledge Distillation

[openreview] [pdf]

Abstract Do diverse perspectives help students learn better? Multi-teacher knowledge distillation, which is a more effective technique than traditional single-teacher methods, supervises the student from different perspectives (i.e., teacher). While effective, multi-teacher, teacher ensemble, or teaching assistant-based approaches are computationally expensive and resource-intensive, as they require training multiple teacher networks. These concerns raise a question: can we supervise the student with diverse perspectives using only a single teacher? We, as the pioneer, demonstrate TeKAP, a novel teacher knowledge augmentation technique that generates multiple synthetic teacher knowledge by perturbing the knowledge of a single pretrained teacher i.e., Teacher Knowledge Augmentation via Perturbation, at both the feature and logit levels. These multiple augmented teachers simulate an ensemble of models together. The student model is trained on both the actual and augmented teacher knowledge, benefiting from the diversity of an ensemble without the need to train multiple teachers. TeKAP significantly reduces training time and computational resources, making it feasible for large-scale applications and easily manageable. Experimental results demonstrate that our proposed method helps existing state-of-the-art knowledge distillation techniques achieve better performance, highlighting its potential as a cost-effective alternative. The source code can be found in the supplementary.

826Reset Method based on the Theory of Manifold Optimization on Real Manifolds

[openreview] [pdf]

Abstract Manifold optimization is prominent in the fields of applied mathematics, statistics, machine learning, and in particular, deep learning. By leveraging the intrinsic geometric properties of manifolds, constrained optimization problems can be transformed into unconstrained optimization problems on certain manifolds. An innovative method, Reset Method, is introduced that combines manifold optimization and standard methods (SGD, Adam and AdamW), aiming to enhance the improvement of precision. The efficacy of our proposed method is corroborated by extensive deep learning experiments, providing visible higher precision.

827Beyond Finite Data: Towards Data-free Out-of-distribution Generalization via Extrapolation

[openreview] [pdf]

Abstract Out-of-distribution (OOD) generalization is a favorable yet challenging property for deep neural networks. The core challenges lie in the limited availability of source domains that help models learn an invariant representation from the spurious features. Various domain augmentation have been proposed but largely rely on interpolating existing domains and frequently face difficulties in creating truly “novel” domains. Humans, on the other hand, can easily extrapolate novel domains, thus, an intriguing question arises: How can neural networks extrapolate like humans and achieve OOD generalization? We introduce a novel approach to domain extrapolation that leverages reasoning ability and the extensive knowledge encapsulated within large language models (LLMs) to synthesize entirely new domains. Starting with the class of interest, we query the LLMs to extract relevant knowledge for these novel domains. We then bridge the gap between the text-centric knowledge derived from LLMs and the pixel input space of the model using text-to-image generation techniques. By augmenting the training set of domain generalization datasets with high-fidelity, photo-realistic images of these new domains, we achieve significant improvements over all existing methods, as demonstrated in both single and multi-domain generalization across various benchmarks. With the ability to extrapolate any domains for any class, our method has the potential to learn a generalized model for any task without any data. To illustrate, we put forth a much more difficult setting termed, data-free domain generalization, that aims to learn a generalized model in the absence of any collected data. Our empirical findings support the above argument and our methods exhibit commendable performance in this setting, even surpassing the supervised setting by approximately 1-2% on datasets such as VLCS.

828Direct Advantage Estimation in Partially Observable Environments

[openreview] [pdf]

Abstract Direct Advantage Estimation (DAE) was recently shown to improve sample-efficiency of deep reinforcement learning algorithms. However, DAE assumes full observability of the environment, which may be restrictive in realistic settings. In the present work, we first show that DAE can be extended to partially observable domains with minor modifications. Secondly, we address the increased computational cost due to the need to approximate the transition probabilities through the use of discrete latent dynamics models. Finally, we empirically evaluate the proposed method using the Arcade Learning Environments, and show that it is scalable and sample-efficient.

829Client2Vec: Improving Federated Learning by Distribution Shifts Aware Client Indexing

[openreview] [pdf]

Abstract Federated Learning (FL) is a privacy-preserving distributed machine learning paradigm. Nonetheless, the substantial distribution shifts among clients pose a considerable challenge to the performance of current FL algorithms. To mitigate this challenge, various methods have been proposed to enhance the FL training process. This paper endeavors to tackle the issue of data heterogeneity from another perspective---by improving FL algorithms prior to the actual training stage. Specifically, we introduce the Client2Vec mechanism, which generates a unique client index for each client before the commencement of FL training. Subsequently, we leverage the generated client index to enhance the subsequent FL training process. To demonstrate the effectiveness of the proposed Client2Vec method, we conduct three case studies that assess the impact of the client index on the FL training process. These case studies encompass enhanced client sampling, model aggregation, and local training. Extensive experiments conducted on diverse datasets and model architectures show the efficacy of Client2Vec across all three case studies. Our code will be publicly available.

830Model-based RL as a Minimalist Approach to Horizon-Free and Second-Order Bounds

[openreview] [pdf]

Abstract Learning a transition model via Maximum Likelihood Estimation (MLE) followed by planning inside the learned model is perhaps the most standard and simplest Model-based Reinforcement Learning (RL) framework. In this work, we show that such a simple Model-based RL scheme, when equipped with optimistic and pessimistic planning procedures, achieves strong regret and sample complexity bounds in online and offline RL settings. Particularly, we demonstrate that under the conditions where the trajectory-wise reward is normalized between zero and one and the transition is time-homogenous, it achieves nearly horizon-free and second-order bounds. Nearly horizon-free means that our bounds have no polynomial dependence on the horizon of the Markov Decision Process. A second-order bound is a type of instance-dependent bound that scales with respect to the variances of the returns of the policies which can be small when the system is nearly deterministic and (or) the optimal policy has small values. We highlight that our algorithms are simple, fairly standard, and indeed have been extensively studied in the RL literature: they learn a model via MLE, build a version space around the MLE solution, and perform optimistic or pessimistic planning depending on whether operating in the online or offline mode. These algorithms do not rely on additional specialized algorithmic designs such as learning variances and performing variance-weighted learning and thus can easily leverage non-linear function approximations. The simplicity of the algorithms also implies that our horizon-free and second-order regret analysis is actually standard and mainly follows the general framework of optimism/pessimism in the face of uncertainty.

831A Hypothesis on Black Swan in Unchanging Environments

[openreview] [pdf]

Abstract Black swan events are statistically rare occurrences that carry extremely high risks. A typical view of defining black swan events is heavily assumed to originate from an unpredictable time-varying environments; however, the community lacks a comprehensive definition of black swan events. To this end, this paper challenges that the standard view is incomplete and claims that high-risk, statistically rare events can also occur in unchanging environments due to human misperception of their value and likelihood, which we call as spatial black swan event. We first carefully categorize black swan events, focusing on spatial black swan events, and mathematically formalize the definition of black swan events. We hope these definitions can pave the way for the development of algorithms to prevent such events by rationally correcting human perception.

832Exploratory Preference Optimization: Provably Sample-Efficient Exploration in RLHF with General Function Approximation

[openreview] [pdf]

Abstract This paper investigates a basic question in reinforcement learning from human feedback (RLHF) from a theoretical perspective: how to efficiently explore in an online manner under preference feedback and general function approximation. We take the initial step towards a theoretical understanding of this problem by proposing a novel algorithm,Exploratory Preference Optimization(XPO). This algorithm is elegantly simple---requiring only a one-line modification to (online) Direct Preference Optimization (DPO; Rafailov et al., 2023)---yet provides the strongest known provable guarantees. XPO augments the DPO objective with a novel and principledexploration bonus, enabling the algorithm to strategically explore beyond the support of the initial model and preference feedback data. We prove that XPO is provably sample-efficient and converges to a near-optimal policy under natural exploration conditions, regardless of the initial model’s coverage. Our analysis builds on the observation that DPO implicitly performs a form ofBellman error minimization. It synthesizes previously disparate techniques from language modeling and theoretical reinforcement learning in a serendipitous fashion through the lens ofKL-regularized Markov decision processes.

833Transformers Learn Bayesian Networks Autoregressively In-Context

[openreview] [pdf]

Abstract Transformers have achieved tremendous successes in various fields, notably excelling in tasks involving sequential data like natural language processing. Despite their achievements, there is limited understanding of the theoretical capabilities of transformers. In this paper, we theoretically investigate the capability of transformers to autoregressively learn Bayesian networks in-context. Specifically, we consider a setting where a set of independent samples generated from a Bayesian network are observed and form a context. We show that, there exists a simple transformer model that can (i) estimate the conditional probabilities of the Bayesian network according to the context, and (ii) autoregressively generate a new sample according to the Bayesian network with estimated conditional probabilities. We further demonstrate in extensive experiments that such a transformer does not only exist in theory, but can also be effectively obtained through training. Our analysis showcases the potential of transformers to effectively learn complicated probabilistic models, and contributes to a better understanding of the success of large language models.

834Evaluating Ranking Loss Functions in Performance Predictor for NAS

[openreview] [pdf]

Abstract Performance evaluation is a critical but compute-intensive procedure in neural architecture search (NAS). To alleviate evaluation costs, performance predictors have been widely adopted to predict architecture performance directly. Recent studies have introduced ranking loss functions into predictors to focus on the architecture rankings instead of absolute accuracy, thus enhancing the ranking ability of performance predictors. Despite the successful application of ranking loss functions, the lack of comprehensive measure metrics and different experimental configurations make a fair comparison among these loss functions a huge challenge. Additionally, some well-known ranking loss functions have not been thoroughly examined in the context of performance predictors. In this paper, we conduct the first study for 11 ranking loss functions containing the existing and the novel ones by comparing their effectiveness in performance predictors under various settings. We find that: (i) The choice of ranking loss function has a major influence on the performance of predictors; (ii) the quality of the architectures searched by the predictor-based NAS methods is closely correlated with the predictor’s performance on top-centered rank metrics, rather than traditional metrics like Kendall Tau. We believe these results and insights can serve as recommendations for the optimal loss function to employ in predictors across various search spaces and experimental conditions.

835Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers

[openreview] [pdf]

Abstract Large language models have been successful at tasks involving basic forms of in-context reasoning, such as generating coherent language, as well as storing vast amounts of knowledge. At the core of the Transformer architecture behind such models are feed-forward and attention layers, which are often associated to knowledge and reasoning, respectively. In this paper, we study this distinction empirically and theoretically in a controlled synthetic setting where certain next-token predictions involve both distributional and in-context information. We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning. Our theoretical analysis identifies gradient noise as a key factor behind this discrepancy. Finally, we illustrate how similar disparities emerge in pre-trained models through ablations on the Pythia model family on simple reasoning tasks.

836Learning Randomized Algorithms with Transformers

[openreview] [pdf]

Abstract Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks’ computation and predictions.

837Overcoming label shift in targeted federated learning

[openreview] [pdf]

Abstract Federated learning enables multiple actors to collaboratively train models without sharing private data. This unlocks the potential for scaling machine learning to diverse applications. Existing algorithms for this task are well-justified when clients and the intended target domain share the same distribution of features and labels, but this assumption is often violated in real-world scenarios. One common violation is label shift, where the label distributions differ across clients or between clients and the target domain, which can significantly degrade model performance. To address this problem, we propose FedPALS, a novel model aggregation scheme that adapts to label shifts by leveraging knowledge of the target label distribution at the central server. Our approach ensures unbiased updates under stochastic gradient descent, ensuring robust generalization across clients with diverse, label-shifted data. Extensive experiments on image classification demonstrate that FedPALS consistently outperforms standard baselines by aligning model aggregation with the target domain. Our findings reveal that conventional federated learning methods suffer severely in cases of extreme client sparsity, highlighting the critical need for target-aware aggregation. FedPALS offers a principled and practical solution to mitigate label distribution mismatch, ensuring models trained in federated settings can generalize effectively to label-shifted target domains.

838Differentiable Reasoning about Knowledge Graphs with Reshuffled Embeddings

[openreview] [pdf]

Abstract Knowledge graph (KG) embedding methods learn geometric representations of entities and relations to predict plausible missing knowledge. These representations are typically assumed to capture rule-like inference patterns. However, our theoretical understanding of the kinds of inference patterns that can be captured in this way remains limited. Ideally, KG embedding methods should be expressive enough such that for any set of rules, there exists an embedding that exactly captures these rules. This principle has been studied within the framework of region-based embeddings, but existing models are severely limited in the kinds of rule bases that can be captured. We argue that this stems from the use of representations that correspond to the Cartesian product of two-dimensional regions. As an alternative, we propose RESHUFFLE, a simple model based on ordering constraints that can faithfully capture a much larger class of rule bases than existing approaches. Moreover, the embeddings in our framework can be learned by a Graph Neural Network (GNN), which effectively acts as a differentiable rule base. This has some practical advantages, e.g. ensuring that embeddings can be easily updated as new knowledge is added to the KG. At the same time, since the resulting representations can be used similarly to standard KG embeddings, our approach is significantly more efficient than existing approaches to differentiable reasoning. The GNN-based formulation also allows us to study how bounded inference can be captured. We show in particular that bounded reasoning with arbitrary sets of closed path rules can be captured in this way.

839Beyond Auto-Regression: Fast LLMs via Self-Distillation Through Time

[openreview] [pdf]

Abstract Autoregressive (AR) Large Language Models (LLMs) have demonstrated significant success across numerous tasks. However, the AR modeling paradigm presents certain limitations; for instance, contemporary autoregressive LLMs are trained to generate one token at a time, which can result in noticeable latency. Recent advancements have indicated that search and repeated sampling can enhance performance in various applications, such as theorem proving, code generation, and alignment, by utilizing greater computational resources during inference. In this study, we demonstrate that diffusion language models are capable of generating at least 32 tokens simultaneously, while exceeding the performance of AR models in text quality and on the LAMBADA natural language understanding benchmark. This outcome is achieved through a novel distillation method for discrete diffusion models, which reduces the number of inference steps by a factor of 32-64. Practically, our models, even without caching, can generate tokens at a rate that is up to 8 times faster than AR models employing KV-caching, and we anticipate further improvements with the inclusion of caching. Moreover, we demonstrate the efficacy of our approach for diffusion language models with up to 860M parameters.

840Scalable Decentralized Learning with Teleportation

[openreview] [pdf]

Abstract Decentralized SGD can run with low communication costs, but its sparse communication characteristics deteriorate the convergence rate, especially when the number of nodes is large. In decentralized learning settings, communication is assumed to occur on only a given topology, while in many practical cases, the topology merely represents a preferred communication pattern, and connecting to arbitrary nodes is still possible. Previous studies have tried to alleviate the convergence rate degradation in these cases by designing topologies with large spectral gaps. However, the degradation is still significant when the number of nodes is substantial. In this work, we propose TELEPORTATION. TELEPORTATION activates only a subset of nodes, and the active nodes fetch the parameters from previous active nodes. Then, the active nodes update their parameters by SGD and perform gossip averaging on a relatively small topology comprising only the active nodes. We show that by activating only a proper number of nodes, TELEPORTATION can completely alleviate the convergence rate degradation. Furthermore, we propose an efficient hyperparameter-tuning method to search for the appropriate number of nodes to be activated. Experimentally, we showed that TELEPORTATION can train neural networks more stably and achieve higher accuracy than Decentralized SGD.

841Learning Task Belief Similarity with Latent Dynamics for Meta-Reinforcement Learning

[openreview] [pdf]

Abstract Meta-reinforcement learning requires utilizing prior task distribution information obtained during exploration to rapidly adapt to unknown tasks. The efficiency of an agent’s exploration hinges on accurately identifying the current task. Recent Bayes-Adaptive Deep RL approaches often rely on reconstructing the environment’s reward signal, which is challenging in sparse reward settings, leading to suboptimal exploitation. Inspired by bisimulation metrics, which robustly extracts behavioral similarity in continuous MDPs, we propose SimBelief—a novel meta-RL framework via measuring similarity of task belief in Bayes-Adaptive MDP (BAMDP). SimBelief effectively extracts common features of similar task distributions, enabling efficient task identification and exploration in sparse reward environments. We introduce latent task belief metric to learn the common structure of similar tasks and incorporate it into the real task belief. By learning the latent dynamics across task distributions, we connect shared latent task belief features with specific task features, facilitating rapid task identification and adaptation. Our method outperforms state-of-the-art bselines on sparse reward MuJoCo and panda-gym tasks.

842Leveraging Additional Information in POMDPs with Guided Policy Optimization

[openreview] [pdf]

Abstract Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. While additional information, such as that available in simulations, can enhance training, effectively leveraging it remains an open problem. To address this, we introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. The guider takes advantage of supplementary information while ensuring alignment with the learner’s policy, which is primarily trained via Imitation Learning (IL). We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in IL approaches. Our approach includes two practical variants, GPO-penalty and GPO-clip, and empirical evaluations show strong performance across various tasks, including continuous control with partial observability and noise, and memory-based challenges, significantly outperforming existing methods.

843Causal-aware Graph Neural Architecture Search under Distribution Shifts

[openreview] [pdf]

Abstract Graph neural architecture search (Graph NAS) has emerged as a promising approach for autonomously designing graph neural network architectures by leveraging the correlations between graphs and architectures. However, the existing methods fail to generalize under distribution shifts that are ubiquitous in real-world graph scenarios, mainly because the graph-architecture correlations they exploit might be spurious and varying across distributions. In this paper, we propose to handle the distribution shifts in the graph architecture search process by discovering and exploiting the causal relationship between graphs and architectures to search for the optimal architectures that can generalize under distribution shifts. The problem remains unexplored with the following critical challenges: 1) how to discover the causal graph-architecture relationship that has stable predictive abilities across distributions, 2) how to handle distribution shifts with the discovered causal graph-architecture relationship to search the generalized graph architectures. To address these challenges, we propose a novel approach, Causal-aware Graph Neural Architecture Search (CARNAS), which is able to capture the causal graph-architecture relationship during the architecture search process and discover the generalized graph architecture under distribution shifts. Specifically, we propose Disentangled Causal Subgraph Identification to capture the causal subgraphs that have stable prediction abilities across distributions. Then, we propose Graph Embedding Intervention to intervene on causal subgraphs within the latent space, ensuring that these subgraphs encapsulate essential features for prediction while excluding non-causal elements. Additionally, we propose Invariant Architecture Customization to reinforce the causal invariant nature of the causal subgraphs, which are utilized to tailor generalized graph architectures. Extensive experiments on synthetic and real-world datasets demonstrate that our proposed CARNAS achieves advanced out-of-distribution generalization ability by discovering the causal relationship between graphs and architectures during the search process.

844Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning

[openreview] [pdf]

Abstract Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein techniques such as reinforcement learning can obtain broad coverage over states and actions. However, high-fidelity physics simulators are fundamentally misspecified approximations to reality, making direct zero-shot transfer challenging, especially in tasks where precise and forceful manipulation is necessary. This makes real-world fine-tuning of policies pretrained in simulation an attractive approach to robot learning. However, exploring the real-world dynamics with standard RL fine-tuning techniques is to inefficient for many real-world applications. This paper introduces Simulation-Guided Fine-Tuning, a general framework which leverages the structure of the simulator to guide exploration, substantially accelerating adaptation to the real-world. We demonstrate our approach across several manipulation tasks in the real world, learning successful policies for problems that are challenging to learn using purely real-world data. We further provide theoretical backing for the paradigm.

845HelpSteer2-Preference: Complementing Ratings with Preferences

[openreview] [pdf]

Abstract Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than the other, when adequately matched for data. This is primarily because these approaches require data collected in different (but incompatible) formats, meaning that adequately matched data is not available in existing public datasets. To tackle this problem, we release preference annotations (designed for Bradley-Terry training) to complement existing ratings (designed for Regression style training) in the HelpSteer2 dataset. To improve data interpretability, preference annotations are accompanied with human-written justifications. Using this data, we conduct the first head-to-head comparison of Bradley-Terry and Regression models when adequately matched for data. Based on insights derived from such a comparison, we propose a novel approach to combine Bradley-Terry and Regression reward modeling. A Llama-3.1-70B-Instruct model tuned with this approach scores 94.1 on RewardBench, emerging top of more than 140 reward models as of 1 Oct 2024. We also demonstrate the effectiveness of this reward model at aligning models to follow instructions in RLHF. We open-source this dataset (CC-BY-4.0 license) and openly release the trained Reward Model.

846DiNO-Diffusion: Scaling Medical Diffusion Models via Self-Supervised Pre-Training

[openreview] [pdf]

Abstract Diffusion models (DMs) require large annotated datasets for training, limiting their applicability in medical imaging where datasets are typically smaller and sparsely annotated. We introduce DiNO-Diffusion, a self-supervised method for training DMs that conditions the generation process on image embeddings extracted from DiNO, a pretrained vision transformer. By not relying on annotations, our training leverages over 868k unlabelled images from public chest X-Ray (CXR) datasets. DiNO-Diffusion shows comprehensive manifold coverage, with FID scores as low as 4.7, and emerging properties when evaluated in downstream tasks, allowing to generate semantically-diverse synthetic datasets even from small data pools, demonstrating up to 20% AUC increase in classification performance when used for data augmentation. Results suggest that DiNO-Diffusion could facilitate the creation of large datasets for flexible training of downstream AI models from limited amount of real data, while also holding potential for privacy preservation. Additionally, DiNO-Diffusion demonstrates zero-shot segmentation performance of up to 84.4% Dice score when evaluating lung lobe segmentation, evidencing good CXR image-anatomy alignment akin to textual descriptors on vanilla DMs. Finally, DiNO-Diffusion can be easily adapted to other medical imaging modalities or state-of-the-art diffusion models, allowing large-scale, multi-domain image generation pipelines for medical imaging.

847FederatedQ-Learning with Reference-Advantage Decomposition: Almost Optimal Regret and Logarithmic Communication Cost

[openreview] [pdf]

Abstract In this paper, we consider model-free federated reinforcement learning for tabular episodic Markov decision processes. Under the coordination of a central server, multiple agents collaboratively explore the environment and learn an optimal policy without sharing their raw data. Despite recent advances in federated QQ-learning algorithms achieving near-linear regret speedup with low communication cost, existing algorithms only attain suboptimal regrets compared to the information bound. We propose a novel model-free federated QQ-Learning algorithm, termed FedQ-Advantage. Our algorithm leverages reference-advantage decomposition for variance reduction and adopts three novel designs: separate event-triggered communication and policy switching, heterogeneous communication triggering conditions, and optional forced synchronization. We prove that our algorithm not only requires a lower logarithmic communication cost but also achieves an almost optimal regret, reaching the information bound up to a logarithmic factor and near-linear regret speedup compared to its single-agent counterpart when the time horizon is sufficiently large.

848Learn With Imagination: Safe Set Guided State-wise Constrained Policy Optimization

[openreview] [pdf]

Abstract Deep reinforcement learning (RL) excels in various control tasks, yet the absence of safety guarantees hampers its real-world applicability. In particular, explorations during learning usually results in safety violations, while the RL agent learns from those mistakes. On the other hand, safe control techniques ensure persistent safety satisfaction but demand strong priors on system dynamics, which is usually hard to obtain in practice. To address these problems, we present Safe Set Guided State-wise Constrained Policy Optimization (S-3PO), a pioneering algorithm generating state-wise safe optimal policies with zero training violations, i.e., learning without mistakes. S-3PO first employs a safety-oriented monitor with black-box dynamics to ensure safe exploration. It then enforces a unique cost for the RL agent to converge to optimal behaviors within safety constraints. S-3PO outperforms existing methods in high-dimensional robotics tasks, managing state-wise constraints with zero training violation. This innovation marks a significant stride towards real-world safe RL deployment.

849Universal generalization guarantees for Wasserstein distributionally robust models

[openreview] [pdf]

Abstract Distributionally robust optimization has emerged as an attractive way to train robust machine learning models, capturing data uncertainty and distribution shifts. Recent statistical analyses have proved that generalization guarantees of robust models based on the Wasserstein distance have generalization guarantees that do not suffer from the curse of dimensionality. However, these results are either approximate, obtained in specific cases, or based on assumptions difficult to verify in practice. In contrast, we establish exact generalization guarantees that cover a wide range of cases, with arbitrary transport costs and parametric loss functions, including deep learning objectives with nonsmooth activations. We complete our analysis with an excess bound on the robust objective and an extension to Wasserstein robust models with entropic regularizations.

[openreview] [pdf]

Abstract We present Bayesian Binary Search (BBS), a novel probabilistic variant of the classical binary search/bisection algorithm. BBS leverages machine learning/statistical techniques to estimate the probability density of the search space and modifies the bisection step to split based on probability density rather than the traditional midpoint, allowing for the learned distribution of the search space to guide the search algorithm. Search space density estimation can flexibly be performed using supervised probabilistic machine learning techniques (e.g., Gaussian process regression, Bayesian neural networks, quantile regression) or unsupervised learning algorithms (e.g., Gaussian mixture models, kernel density estimation (KDE), maximum likelihood estimation (MLE)). We demonstrate significant efficiency gains of using BBS on both simulated data across a variety of distributions and in a real-world binary search use case of probing channel balances in the Bitcoin Lightning Network, for which we have deployed the BBS algorithm in a production setting.

851Temporal Source Recovery for Time-Series Source-Free Unsupervised Domain Adaptation

[openreview] [pdf]

Abstract Source-Free Unsupervised Domain Adaptation (SFUDA) has gained popularity for its ability to adapt pretrained models to target domains without accessing source domains, ensuring source data privacy. While SFUDA is well-developed in visual tasks, its application to Time-Series SFUDA (TS-SFUDA) remains limited due to the challenge of transferring crucial temporal dependencies across domains. Although a few researchers begin to explore this area, they rely on specific source domain designs, which are impractical as source data owners cannot be expected to follow particular pretraining protocols. To solve this, we propose Temporal Source Recovery (TemSR), a framework that transfers temporal dependencies for effective TS-SFUDA without requiring source-specific designs. TemSR features a recovery process that leverages masking, recovery, and optimization to generate a source-like distribution with recovered source temporal dependencies. To ensure effective recovery, we further design segment-based regularization to restore local dependencies and anchor-based recovery diversity maximization to enhance the diversity of the source-like distribution. The source-like distribution is then adapted to the target domain using traditional UDA techniques. Extensive experiments across multiple TS tasks demonstrate the effectiveness of TemSR, even surpassing existing TS-SFUDA method that requires source domain designs.

852Decomposed Learning and Grokking

[openreview] [pdf]

Abstract Grokking is a delayed transition from memorisation to generalisation in neural networks. It poses challenges for efficient learning, particularly in structured tasks and small-data regimes. This paper explores grokking in modular arithmetic, explicitly focusing on modular division with a modulus of 97. We introduce a novel learning method called Decomposed Learning, which leverages Singular Value Decomposition (SVD) to modify the weight matrices of neural networks. Decomposed learning reduces or avoids grokking by changing the representation of the weight matrix, AA, into the product of three matrices UU, Σ\Sigma and VTV^T, promoting the discovery of compact, generalisable representations early in the learning process. Through empirical evaluations on the modular division task, we show that Decomposed Learning significantly reduces the effect of grokking and, in some cases, eliminates it. Moreover, Decomposed Learning can reduce the parameters required for practical training, enhancing model efficiency and generalisation. These results suggest that our SVD-based method provides a practical and scalable solution for mitigating grokking, with implications for broader transformer-based learning tasks.

853Continual Learning: Less Forgetting, More OOD Generalization via Adaptive Contrastive Replay

[openreview] [pdf]

Abstract Machine learning models often suffer from catastrophic forgetting of previously learned knowledge when learning new classes. Various methods have been proposed to mitigate this issue. However, rehearsal-based learning, which retains samples from previous classes, typically achieves good performance but tends to memorize specific instances, struggling with Out-of-Distribution (OOD) generalization. This often leads to high forgetting rates and poor generalization. Surprisingly, the OOD generalization capabilities of these methods have been largely unexplored. In this paper, we highlight this issue and propose a simple yet effective strategy inspired by contrastive learning and data-centric principles to address it. We introduce Adaptive Contrastive Replay (ACR), a method that employs dual optimization to simultaneously train both the encoder and the classifier. ACR adaptively populates the replay buffer with misclassified samples while ensuring a balanced representation of classes and tasks. By refining the decision boundary in this way, ACR achieves a balance between stability and plasticity. Our method significantly outperforms previous approaches in terms of OOD generalization, achieving an improvement of 13.41% on Split CIFAR-100, 9.91% on Split Mini-ImageNet, and 5.98% on Split Tiny-ImageNet.

854Offline Reinforcement Learning with Closed-loop Policy Evaluation and Diffusion World-Model Adaptation

[openreview] [pdf]

Abstract Generative models, particularly diffusion models, have been utilized as world models in offline reinforcement learning (RL) to generate synthetic data, enhancing policy learning efficiency. Current approaches either train diffusion models once before policy learning begins or rely on online interactions for alignment. In this paper, we propose a novel offline RL algorithm, Adaptive Diffusion World Model for Policy Evaluation (ADEPT), which integrates closed-loop policy evaluation with world model adaptation. It employs an uncertainty-penalized diffusion model to iteratively interact with the target policy for evaluation. The uncertainty of the world model is estimated by comparing the output generated with different noises, which is then used to constrain out-of-distribution actions. During policy training, the diffusion model performs importance-sampled updates to progressively align with the evolving policy. We analyze the performance of the proposed method and provide an upper bound on the return gap between our method and the real environment under an optimal policy. The results shed light on various key factors affecting learning performance. Evaluations on the D4RL benchmark demonstrate significant improvement over state-of-the-art baselines, especially when only sub-optimal demonstrations are available -- thus requiring improved alignment between the world model and offline policy evaluation.

855Mitigating Unobserved Confounding via Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Learning Conditional average treatment effect estimation from observational data is a challenging task due to the existence of unobserved confounders. Previous methods mostly focus on assuming the Ignorability assumption ignoring the unobserved confounders or overlooking the impact of an a priori knowledge on the generation process of the latent variable, which can be quite impractical in real-world scenarios. Motivated by the recent advances in the latent variable modeling, we propose to capture the unobserved latent space using diffusion model, and accordingly to estimate the causal effect. More concretely, we build on the reverse diffusion process for the unobserved confounders as a Markov chain conditioned on an apriori knowledge. In order to implement our model in a feasible way, we derive the variational bound in closed form. In the experiments, we compare our model with the state-of-the-art methods based on both synthetic and real-world datasets, demonstrating consistent improvements of our model.

856Learning to Achieve Goals with Belief State Transformers

[openreview] [pdf]

Abstract We introduce the “Belief State Transformer”, a next-token predictor that takes both a prefix and suffix as inputs, with a novel objective of predicting both the next token for the prefix and the previous token for the suffix. The Belief State Transformer effectively learns to solve challenging problems that conventional forward-only transformers struggle with, in a domain-independent fashion. Key to this success is learning a compact belief state that captures all relevant information necessary for accurate predictions. Empirical ablations show that each component of the model is essential in difficult scenarios where standard Transformers fall short. For the task of story writing with known prefixes and suffixes, our approach outperforms the Fill-in-the-Middle method for reaching known goals and demonstrates improved performance even when the goals are unknown.Altogether, the Belief State Transformer enables more efficient goal-conditioned decoding, better test-time inference, and high-quality text representations on small scale problems.

857ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning

[openreview] [pdf]

Abstract Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Despite advancements, the extension of video lengths remains constrained by computational resources. Most existing video synthesis models are limited to generating short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies across common temporal model architectures respectively, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we trained ExSVD, an extended model based on Stable Video Diffusion model. Our approach enhances the model’s capacity to generate up to 5×5\times its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn’t compromise the model’s innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.

858On Minimizing Adversarial Counterfactual Error in Adversarial Reinforcement Learning

[openreview] [pdf]

Abstract Deep Reinforcement Learning (DRL) policies are critically vulnerable to adversarial noise in observations, posing severe risks in safety-critical scenarios. For example, a self-driving car receiving manipulated sensory inputs about traffic signs could lead to catastrophic outcomes. Existing strategies to fortify RL algorithms against such adversarial perturbations generally fall into two categories: (a) using regularization methods that enhance robustness by incorporating adversarial loss terms into the value objectives, and (b) adopting “maximin” principles, which focus on maximizing the minimum value to ensure robustness. While regularization methods reduce the likelihood of successful attacks, their effectiveness drops significantly if an attack does succeed. On the other hand, maximin objectives, although robust, tend to be overly conservative. To address this challenge, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), which naturally balances optimizing value and robustness against adversarial attacks. To optimize ACoE in a scalable manner in model-free settings, we propose a theoretically justified surrogate objective known as Cumulative-ACoE (C-ACoE). The core idea of optimizing C-ACoE is utilizing the belief about the underlying true state given the adversarially perturbed observation. Our empirical evaluations demonstrate that our method outperforms current state-of-the-art approaches for addressing adversarial RL problems across all established benchmarks (MuJoCo, Atari, and Highway) used in the literature.

859General Framework for Off-Policy Learning with Partially-Observed Reward

[openreview] [pdf]

Abstract Off-policy learning (OPL) in contextual bandits aims to learn a decision-making policy that maximizes the target rewards by using only historical interaction data collected under previously developed policies. Unfortunately, when rewards are only partially observed, the effectiveness of OPL degrades severely. Well-known examples of such partial rewards include explicit ratings in content recommendations, conversion signals on e-commerce platforms that are partial due to delay, and the issue of censoring in medical problems. One possible solution to deal with such partial rewards is to use secondary rewards, such as dwelling time, clicks, and medical indicators, which are more densely observed. However, relying solely on such secondary rewards can also lead to poor policy learning since they may not align with the target reward. Thus, this work studies a new and general problem of OPL where the goal is to learn a policy that maximizes the expected target reward by leveraging densely observed secondary rewards as supplemental data. We then propose a new method called Hybrid Policy Optimization for Partially-Observed Reward (HyPeR), which effectively uses the secondary rewards in addition to the partially observed target reward to achieve effective OPL despite the challenging scenario. We also discuss a case where we aim to optimize not only the expected target reward but also the expected secondary rewards to some extent; counter-intuitively, we will show that leveraging the two objectives is in fact advantageous also for the optimization of only the target reward. Along with statistical analysis of our proposed methods, empirical evaluations on both synthetic and real-world data show that HyPeR outperforms existing methods in various scenarios.

860Ambient Diffusion Posterior Sampling: Solving Inverse Problems with Diffusion Models Trained on Corrupted Data

[openreview] [pdf]

Abstract We provide a framework for solving inverse problems with diffusion models learned from linearly corrupted data. Firstly, we extend the Ambient Diffusion framework to enable training directly from measurements corrupted in the Fourier domain. Subsequently, we train diffusion models for MRI with access only to Fourier subsampled multi-coil measurements at acceleration factors R=2,4,6,8=2, 4, 6, 8. Secondly, we propose Ambient Diffusion Posterior Sampling\textit{Ambient Diffusion Posterior Sampling} (A-DPS), a reconstruction algorithm that leverages generative models pre-trained on one type of corruption (e.g. image inpainting) to perform posterior sampling on measurements from a different forward process (e.g. image blurring). For MRI reconstruction in high acceleration regimes, we observe that A-DPS models trained on subsampled data are better suited to solving inverse problems than models trained on fully sampled data. We also test the efficacy of A-DPS on natural image datasets (CelebA, FFHQ, and AFHQ) and show that A-DPS can sometimes outperform models trained on clean data for several image restoration tasks in both speed and performance.

861Variational Mode Decomposition and Linear Embeddings are What You Need For Time-Series Forecasting

[openreview] [pdf]

Abstract Time-series forecasting often faces challenges due to data volatility, which can lead to inaccurate predictions. Variational Mode Decomposition (VMD) has emerged as a promising technique to mitigate volatility by decomposing data into distinct modes, enhancing forecast accuracy. This study integrates VMD with linear models to develop a robust forecasting framework. Our approach is evaluated on 13 diverse datasets, including ETTm2, WindTurbine, M4, and 10 air quality datasets from Southeast Asian cities. The effectiveness of the VMD strategy is assessed by comparing Root Mean Squared Error (RMSE) values from models utilizing VMD against those without it. Additionally, we benchmark linear-based models against well-known neural network architectures such as LSTM, BLSTM, and RNN. The results demonstrate a significant reduction in RMSE across nearly all models following VMD application. Notably, the Linear + VMD model achieved the lowest average RMSE in univariate forecasting at 0.619. In multivariate forecasting, the DLinear + VMD model consistently outperformed others, attaining the lowest RMSE across all datasets with an average of 0.019. These findings underscore the effectiveness of combining VMD with linear models for superior time-series forecasting.

862How to Get Your LLM to Generate Challenging Problems for Evaluation

[openreview] [pdf]

Abstract The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further necessitates the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduceCHASE, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70% when we scale up the context size to 50k tokens.

863A Computation and Communication Efficient Projection-free Algorithm for Decentralized Constrained Optimization

[openreview] [pdf]

Abstract Decentralized constrained optimization problems arise in numerous real-world applications, where a major challenge lies in the computational complexity of projecting onto complex sets, especially in large-scale systems. The projection-free method, Frank-Wolfe (FW), is popular for the constrained optimization problem with complex sets due to its efficiency in tackling the projection process. However, when applying FW methods to decentralized constrained finite-sum optimization problems, previous studies provide suboptimal incremental first-order oracle (IFO) bounds in both convex and non-convex settings. In this paper, we propose a stochastic algorithm named Decentralized Variance Reduction Gradient Tracking Frank-Wolfe (DVRGTFW\texttt{DVRGTFW}), which incorporates the techniques of variance reduction, gradient tracking, and multi-consensus in the FW update to obtain tight bounds. We present a novel convergence analysis, diverging from previous decentralized FW methods, and demonstrating O~(n+nmLε1)\tilde{\mathcal{O}}(n+\sqrt{\frac{n}{m}}L\varepsilon^{-1}) and O(nmL2ε2)\mathcal{O}(\sqrt{\frac{n}{m}}L^2\varepsilon^{-2}) IFO complexity bounds in convex and non-convex settings, respectively. To the best of our knowledge, these bounds are the best achieved in the literature to date. Besides, in the non-convex case, DVRGTFW\texttt{DVRGTFW} achieves O(L2ε21λ2(W))\mathcal{O}(\frac{L^2\varepsilon^{-2}}{\sqrt{1-\lambda_2(W)}}) communication complexity which is closed to the lower bound Ω(Lε21λ2(W))\Omega(\frac{L\varepsilon^{-2}}{\sqrt{1-\lambda_2(W)}}). Empirical results validate the convergence properties of DVRGTFW\texttt{DVRGTFW} and highlight its superior performance over other related methods.

864On last-iterate convergence of distributed Stochastic Gradient Descent algorithm with momentum

[openreview] [pdf]

Abstract Distributed Stochastic Gradient optimization algorithms are studied extensively to address challenges in centralized approaches, such as data privacy, communication load, and computational efficiency, especially when dealing with large datasets. However, convergence theory research for these algorithms has been limited, particularly for distributed momentum-based SGD (mSGD) algorithms. Current theoretical work on distributed mSGD algorithms primarily focuses on establishing time-average convergence theory, whereas last-iterate convergence—considered a stronger and more practical definition than time-average convergence—has yet to be thoroughly explored. In this paper, we aim to establish the last-iterate convergence theory for a class of distributed mSGD algorithms with a decaying learning rate. First, we propose a general framework for distributed mSGD algorithms. Within this framework and under general conditions, we have proven the last-iterate convergence of the gradient of the loss function for a class of distributed mSGD algorithms. Furthermore, we have estimated the corresponding last-iterate convergence rate under supplementary conditions. Moreover, we theoretically prove that in the early stage, the adding of a momentum term can make the iterations converge more rapidly to a neighborhood of the stationary point. Some experiments are provided to illustrate the theoretical findings.

865From Conflicts to Convergence: A Zeroth-order Method for Multi-Objective Learning

[openreview] [pdf]

Abstract Multi-objective learning (MOL) is a popular paradigm for learning problems under multiple criteria, where various dynamic weighting algorithms (e.g., MGDA and MODO) have been formulated to find an updated direction for avoiding conflicts among objectives. Recently, increasing endeavors have struggled to tackle the black-box MOL when the gradient information of objectives is unavailable or difficult to be attained. Albeit the impressive success of zeroth-order method for single-objective black-box learning, the corresponding MOL algorithm and theoretical understanding are largely absent. Unlike single-objective problems, the errors of MOL introduced by zeroth-order gradients can simultaneously affect both the gradient estimation and the gradient coefficients λ\lambda, leading to further error amplification. To address this issue, we propose a Stochastic Zeroth-order Multiple Objective Descent algorithm (SZMOD), which leverages function evaluations to approximate gradients and develops a new decomposition strategy to handle the complicated black-box multi-objective optimization. Theoretically, we provide convergence and generalization guarantees for SZMOD in both general non-convex and strongly convex settings. Our results demonstrate that the proposed SZMOD enjoys a promising generalization bound of O(n12)\mathcal{O}(n^{-\frac{1}{2}}), which is comparable to the existing results of first-order methods requiring additional gradient information. Experimental results validate our theoretical analysis.

866ParetoFlow: Guided Flows in Multi-Objective Optimization

[openreview] [pdf]

Abstract In offline multi-objective optimization (MOO), we leverage an offline dataset of designs and their associated labels to simultaneously minimize multiple objectives. This setting more closely mirrors complex real-world problems compared to single-objective optimization. Recent works mainly employ evolutionary algorithms and Bayesian optimization, with limited attention given to the generative modeling capabilities inherent in such data. In this study, we explore generative modeling in offline MOO through flow matching, noted for its effectiveness and efficiency. We introduce a \textit{ParetoFlow} method, specifically designed to guide flow sampling to approximate the Pareto front. Traditional predictor~(classifier) guidance is inadequate for this purpose because it models only a single objective. In response, we propose a \textit{multi-objective predictor guidance} module that assigns each sample a weight vector, representing a weighted distribution across multiple objective predictions. A local filtering scheme is introduced to address non-convex Pareto fronts. These weights uniformly cover the entire objective space, effectively directing sample generation towards the Pareto front. Since distributions with similar weights tend to generate similar samples, we introduce a \textit{neighboring evolution} module to foster knowledge sharing among neighboring distributions. This module generates offspring from these distributions, and selects the most promising one for the next iteration. Our method achieves state-of-the-art performance across various tasks. Our code is available.

867Do LLMs estimate uncertainty well in instruction-following?

[openreview] [pdf]

Abstract Large language models (LLMs) could be valuable personal AI agents across various domains, provided they can precisely follow user instructions. However, recent studies have shown significant limitations in LLMs’ instruction-following capabilities, raising concerns about their reliability in high-stakes applications. Accurately estimating LLMs’ uncertainty in adhering to instructions is critical to mitigating deployment risks. We present, to our knowledge, the first systematic evaluation of the uncertainty estimation abilities of LLMs in the context of instruction-following. Our study identifies key challenges with existing instruction-following benchmarks, where multiple factors are entangled with uncertainty stems from instruction-following, complicating the isolation and comparison across methods and models. To address these issues, we introduce a controlled evaluation setup with two benchmark versions of data, enabling a comprehensive comparison of uncertainty estimation methods under various conditions. Our findings show that existing uncertainty methods struggle, particularly when models make subtle errors in instruction following. While internal model states provide some improvement, they remain inadequate in more complex scenarios. The insights from our controlled evaluation setups provide a crucial understanding of LLMs’ limitations and potential for uncertainty estimation in instruction-following tasks, paving the way for more trustworthy AI agents.

868ConceptPrune: Concept Editing in Diffusion Models via Skilled Neuron Pruning

[openreview] [pdf]

Abstract While large-scale text-to-image diffusion models have demonstrated impressive image-generation capabilities, there are significant concerns about their potential misuse for generating unsafe content, violating copyright, and perpetuating societal biases. Recently, the text-to-image generation community has begun addressing these concerns by editing or unlearning undesired concepts from pre-trained models. However, these methods often involve data-intensive and inefficient fine-tuning or utilize various forms of token remapping, rendering them susceptible to adversarial jailbreaks. In this paper, we present a simple and effective training-free approach, ConceptPrune, wherein we first identify critical regions within pre-trained models responsible for generating undesirable concepts, thereby facilitating straightforward concept unlearning via weight pruning. Experiments across a range of concepts including artistic styles, nudity, and object erasure demonstrate that target concepts can be efficiently erased by pruning a tiny fraction, approximately 0.12% of total weights, enabling multi-concept erasure and robustness against various white-box and black-box adversarial attacks.

869Beyond Markov Assumption: Improving Sample Efficiency in MDPs by Historical Augmentation

[openreview] [pdf]

Abstract Under the Markov assumption of Markov Decision Processes (MDPs), an optimal stationary policy does not need to consider history and is no worse than any non-stationary or history-dependent policy. Therefore, existing Deep Reinforcement Learning (DRL) algorithms usually model sequential decision-making as an MDP and then try to optimize a stationary policy by single-step state transitions. However, such optimization is often faced with sample inefficiency when the causal relationships of state transitions are complex. To address the above problem, this paper investigates if augmenting the states with their historical information can simplify the complex causal relationships in MDPs and thus improve the sample efficiency of DRL. First, we demonstrate that a complex causal relationship of single-step state transitions may be inferred by a simple causal function of the historically augmented states. Then, we propose a convolutional neural network architecture to learn the representation of the current state and its historical trajectory. The main idea of this representation learning is to compress the high-dimensional historical trajectories into a low-dimensional space. In this way, we can extract the simple causal relationships from historical information and avoid the overfitting caused by high-dimensional data. Finally, we formulate Historical Augmentation Aided Actor-Critic (HA3C) algorithm by adding the learned representations to the actor-critic method. The experiment on standard MDP tasks demonstrates that HA3C outperforms current state-of-the-art methods in terms of both sample efficiency and performance.

870Analytic DAG Constraints for Differentiable DAG Learning

[openreview] [pdf]

Abstract Recovering underlying Directed Acyclic Graph (DAG) structures from observational data presents a formidable challenge due to the combinatorial nature of the DAG-constrained optimization problem. Recently, researchers have identified gradient vanishing as one of the primary obstacles in differentiable DAG learning and have proposed several DAG constraints to mitigate this issue. By developing the necessary theory to establish a connection between analytic functions and DAG constraints, we demonstrate that analytic functions from the set {f(x)=c0+i=1cixic00;i>0,ci>0;r=limici/ci+1>0}\{f(x) = c_0 + \sum_{i=1}c_ix^i|c_0 \geqslant 0; \forall i > 0, c_i > 0; r = \lim_{i\rightarrow \infty}c_{i}/c_{i+1} > 0\} can be employed to formulate effective DAG constraints. Furthermore, we establish that this set of functions is closed under several functional operators, including differentiation, summation, and multiplication. Consequently, these operators can be leveraged to create novel DAG constraints based on existing ones. Using these properties, we designed a series of DAG constraints and designed an efficient algorithm to evaluate these DAG constraints. Experiments on various settings show that our DAG constraints outperform previous state-of-the-arts approaches.

871DiffImp: Efficient Diffusion Model for Probabilistic Time Series Imputation with Bidirectional Mamba Backbone

[openreview] [pdf]

Abstract Probabilistic time series imputation has been widely applied in real-world scenarios due to its ability to estimate uncertainty of imputation results. Meanwhile, denoising diffusion probabilistic models (DDPMs) have achieved great success in probabilistic time series imputation tasks with its power to model complex distributions. However, current DDPM-based probabilistic time series imputation methodologies are confronted with two types of challenges: 1) \textit{ The backbone modules of the denoising parts are not capable of achieving sequence modeling with low time complexity.} 2) \textit{ The architecture of denoising modules can not handle the inter-variable and bidirectional dependencies in the time series imputation problem effectively.} To address the first challenge, we integrate the computational efficient state space model, namely Mamba, as the backbone denosing module for DDPMs. To tackle the second challenge, we carefully devise several SSM-based blocks for bidirectional modeling and inter-variable relation understanding. Experimental results demonstrate that our approach can achieve state-of-the-art time series imputation results on multiple datasets, different missing scenarios and missing ratios.

872Autoencoders for Anomaly Detection are Unreliable

[openreview] [pdf]

Abstract Autoencoders are frequently used for anomaly detection, both in the unsupervised and semi-supervised settings. They rely on the assumption that when trained using the reconstruction loss, they will be able to reconstruct normal data more accurately than anomalous data. Some recent works have posited that this assumption may not always hold, but little has been done to study the validity of the assumption in theory. In this work we prove that this assumption indeed does not hold, and show that anomalies, lying far away from normal data, can be perfectly reconstructed in practice. We extend the understanding of autoencoders for anomaly detection by showing how they can perfectly reconstruct out of bounds, or interpolate undesirably, and note how this can be dangerous in safety critical applications. We connect theory to practice by showing that the proven behavior in linear autoencoders also occurs when applying non-linear autoencoders on both tabular data and real-world image data, the two primary application areas of autoencoders for anomaly detection.

873Finally Rank-Breaking Conquers MNL Bandits: Optimal and Efficient Algorithms for MNL Assortment

[openreview] [pdf]

Abstract We address the problem of active online assortment optimization problem with preference feedback, which is a framework for modeling user choices and subsetwise utility maximization. The framework is useful in various real-world applications including ad placement, online retail, recommender systems, and fine-tuning language models, amongst many others. The problem, although has been studied in the past, lacks an intuitive and practical solution approach with simultaneously efficient algorithm and optimal regret guarantee. E.g., popularly used assortment selection algorithms often require the presence of a ``strong reference" which is always included in the choice sets, further they are also designed to offer the same assortments repeatedly until the reference item gets selected---all such requirements are quite unrealistic for practical applications. In this paper, we designed efficient algorithms for the problem of regret minimization in assortment selection with \emph{Plackett Luce} (PL) based user choices. We designed a novel concentration guarantee for estimating the score parameters of the PL model using `\emph{Pairwise Rank-Breaking}', which builds the foundation of our proposed algorithms. Moreover, our methods are practical, provably optimal, and devoid of the aforementioned limitations of the existing methods.

874CRAFT: Time Series Forecasting with Cross-Future Behavior Awareness

[openreview] [pdf]

Abstract Time series forecasting is the crucial infrastructure in the field of e-commerce, providing technical support for consumer behavior analysis, sales trends forecasting, etc. E-commerce allows consumers to reserve in advance. These pre-booking features reflect future sales trends and can increase the certainty of time series forecasting issues. In this paper, we define these features as Cross-Future Behavior, which occurs before the current time but takes effect in the future. To increase the performance of time series forecasting, we leverage these features and propose the CRoss-Future Behavior Awareness based Time Series Forecasting method (CRAFT). The core idea of CRAFT is to utilize the trend of cross-future behavior to mine the trend of time series data to be predicted. Specifically, to settle the sparse and partial flaws of cross-future behavior, CRAFT employs the Koopman Predictor Module to extract the key trend and the Internal Trend Mining Module to supplement the unknown area of the cross-future behavior matrix. Then, we introduce the External Trend Guide Module with a hierarchical structure to acquire more representative trends from higher levels. Finally, we apply the demand-constrained loss to calibrate the distribution deviation of prediction results. We conduct experiments on real-world dataset. Experiments on both offline large-scale dataset and online A/B test demonstrate the effectiveness of CRAFT. Our dataset and code will be released after formal publication.

875Leveraging Semantic and Positional Uncertainty for Trajectory Prediction

[openreview] [pdf]

Abstract Given a time horizon with historical movement data and environmental context, trajectory prediction aims to forecast the future motion of dynamic entities, such as vehicles and pedestrians. A key challenge in this task arises from the dynamic and noisy nature of real-time maps. This noise primarily stems from two resources: (1) positional errors due to sensor inaccuracies or environmental occlusions, and (2) cognitive errors resulting from incorrect scene understanding. In an attempt to solve this problem, we propose a new framework that estimates two kinds of uncertainty, \ie, positional uncertainty and semantic uncertainty simultaneously, and explicitly incorporates both uncertainties into the trajectory prediction process. In particular, we introduce a dual-head structure to independently perform semantic prediction twice and positional prediction twice, and further extract the prediction variance as the uncertainty indicator in an end-to-end manner. The uncertainty is then directly concatenated with the semantic and positional predictions to enhance the trajectory estimation. To validate the effectiveness of our uncertainty-aware approach, we evaluate it on the real-world driving dataset, \ie, nuScenes. Extensive experiments on 3 mapping estimation and 2 trajectory approaches show that the proposed method (1) effectively captures map noise through both positional and semantic uncertainties, and (2) seamlessly integrates and enhances existing trajectory prediction methods on multiple evaluation metrics, \ie, minADE, minFDE, and MR.

876Actionable Inverse Classification with Action Fairness Guarantees

[openreview] [pdf]

Abstract Machine learning (ML) classifiers are increasingly used in critical decision-making domains such as finance, healthcare, and the judiciary. However, their interpretability and fairness remain significant challenges, often leaving users without clear guidance on how to improve unfavourable outcomes. This paper introduces an actionable ML framework that provides minimal, explainable modifications to input data to change classification results. We also propose a novel concept of “action fairness,” which ensures that users from different subgroups incur similar costs when altering their classification outcomes. Our approach identifies the nearest decision boundary point to a given query, allowing for the determination of minimal cost actions. We demonstrate the effectiveness of this method using real-world credit assessment data, showing that our solution not only improves the fairness of classifier outcomes but also enhances their usability and interpretability.

877Better Instruction-Following Through Minimum Bayes Risk

[openreview] [pdf]

Abstract General-purpose LLM judges capable of human-level evaluation provide not only a scalable and accurate way of evaluating instruction-following LLMs but also new avenues for supervising and improving their performance. One promising way of leveraging LLM judges for supervision is through Minimum Bayes Risk (MBR) decoding, which uses a reference-based evaluator to select a high-quality output from amongst a set of candidate outputs. In the first part of this work, we explore using MBR decoding as a method for improving the test-time performance of instruction-following LLMs. We find that MBR decoding with reference-based LLM judges substantially improves over greedy decoding, best-of-N decoding with reference-free judges and MBR decoding with lexical and embedding-based metrics on AlpacaEval and MT-Bench. These gains are consistent across LLMs with up to 70B parameters, demonstrating that smaller LLM judges can be used to supervise much larger LLMs. Then, seeking to retain the improvements from MBR decoding while mitigating additional test-time costs, we explore iterative self-training on MBR-decoded outputs. We find that self-training using Direct Preference Optimisation leads to significant performance gains, such that the self-trained models with greedy decoding generally match and sometimes exceed the performance of their base models with MBR decoding.

878On-Policy Fine-grained Knowledge Feedback for Hallucination Mitigation

[openreview] [pdf]

Abstract Hallucination occurs when large language models (LLMs) exhibit behavior that deviates from the boundaries of their knowledge during the response generation process. Previous learning-based methods focus on detecting knowledge boundaries and finetuning models with instance-level feedback, but they suffer from inaccurate signals due to off-policy data sampling and coarse-grained feedback. In this paper, we introduce \textit{\b{R}einforcement \b{L}earning \b{f}or \b{H}allucination} (RLFH), a fine-grained feedback-based online reinforcement learning method for hallucination mitigation. Unlike previous learning-based methods, RLFH enables LLMs to explore the boundaries of their internal knowledge and provide on-policy, fine-grained feedback on these explorations. To construct fine-grained feedback for learning reliable generation behavior, RLFH decomposes the outcomes of large models into atomic facts, provides statement-level evaluation signals, and traces back the signals to the tokens of the original responses. Finally, RLFH adopts the online reinforcement algorithm with these token-level rewards to adjust model behavior for hallucination mitigation. For effective on-policy optimization, RLFH also introduces an LLM-based fact assessment framework to verify the truthfulness and helpfulness of atomic facts without human intervention. Experiments on HotpotQA, SQuADv2, and Biography benchmarks demonstrate that RLFH can balance their usage of internal knowledge during the generation process to eliminate the hallucination behavior of LLMs.

879Effective and Efficient Time-Varying Counterfactual Prediction with State-Space Models

[openreview] [pdf]

Abstract Time-varying counterfactual prediction (TCP) from observational data supports the answer of when and how to assign multiple sequential treatments, yielding importance in various applications. Despite the progress achieved by recent advances, e.g., LSTM or Transformer based causal approaches, their capability of capturing interactions in long sequences remains to be improved in both prediction performance and running efficiency. In parallel with the development of TCP, the success of the state-space models (SSMs) has achieved remarkable progress toward long-sequence modeling with saved running time. Consequently, studying how Mamba simultaneously benefits the effectiveness and efficiency of TCP becomes a compelling research direction. In this paper, we propose to exploit advantages of the SSMs to tackle the TCP task, by introducing a counterfactual Mamba model with Covariate-based Decorrelation towards Selective Parameters (Mamba-CDSP). Motivated by the over-balancing problem in TCP of the direct covariate balancing methods, we propose to de-correlate between the current treatment and the representation of historical covariates, treatments, and outcomes, which can mitigate the confounding bias while preserve more covariate information. In addition, we show that the overall de-correlation in TCP is equivalent to regularizing the selective parameters of Mamba over each time step, which leads our approach to be effective and lightweight. We conducted extensive experiments on both synthetic and real-world datasets, demonstrating that Mamba-CDSP not only outperforms baselines by a large margin, but also exhibits prominent running efficiency.

880Incorporating Human Preferences into Interpretable Reinforcement Learning with Tree Policies

[openreview] [pdf]

Abstract Interpretable reinforcement learning (RL) seeks to create agents that are efficient, transparent, and understandable to the populations that they impact. A significant gap in current approaches is the underutilization of human feedback, which is typically employed only for post-hoc evaluation. We propose to center the needs of end users by incorporating the feedback that would be obtained in a user study directly into the training of interpretable RL algorithms. Our approach involves preference learning, where we learn preferences over high-level features that are not directly optimizable during the RL training process. We introduce an evolutionary algorithm that leverages user feedback to guide training toward interpretable decision-tree policies that are better-aligned with human preferences. We demonstrate the effectiveness of our method through experiments using synthetic preference data. Our results show an improvement in preference alignment compared to baselines, yielding policies that are more aligned with underlying user preferences but does so with sample efficiency in the number of user queries, thereby decreasing the burden on the user in providing such data.

881Reward Learning from Multiple Feedback Types

[openreview] [pdf]

Abstract Learning rewards from preference feedback has become an important tool in the alignment of agentic models. Preference-based feedback, often implemented as a binary comparison between multiple completions, is an established method to acquire large-scale human feedback. However, human feedback in other contexts is often much more diverse. Such diverse feedback can better support the goals of a human annotator, and the simultaneous use of multiple sources might be mutually informative for the learning process or carry type-dependent biases for the reward learning process. Despite these potential benefits, learning from different feedback types has yet to be explored extensively. In this paper, we bridge this gap by enabling experimentation and evaluating multi-type feedback in a wide set of environments. We present a process to generate high-quality simulated feedback of six different types. Then, we implement reward models and downstream RL training for all six feedback types. Based on the simulated feedback, we investigate the use of types of feedback across five RL environments and compare them to pure preference-based baselines. We show empirically that diverse types of feedback can be utilized simultaneously and lead to improved reward modeling performance. This work is the first strong indicator of the potential of true multi-type feedback for RLHF.

882Manifold Learning via Foliations, and Knowledge Transfer

[openreview] [pdf]

Abstract Understanding how real data is distributed in high dimensional spaces is the key to many tasks in machine learning. We want to provide a natural geometric structure on the space of data employing a deep ReLU neural network trained as a classifier. Through the data information matrix (DIM), a variation of the Fisher information matrix, the model will discern a singular foliation structure on the space of data. We show that the singular points of such foliation are contained in a measure zero set, and that a local regular foliation exists almost everywhere. Experiments show that the data is correlated with leaves of such foliation. Moreover we show the potential of our approach for knowledge transfer by analyzing the spectrum of the DIM to measure distances between datasets.

883Towards Generalization under Topological Shifts: A Diffusion PDE Perspective

[openreview] [pdf]

Abstract The capability of generalization is a cornerstone for the success of modern learning systems. For non-Euclidean data that particularly involves topological features, one important aspect neglected by prior studies is how learning-based models generalize under topological shifts. This paper makes steps towards understanding the generalization of graph neural networks operated on varying topologies through the lens of diffusion PDEs. Our analysis first reveals that the upper bound of the generalization error yielded by local diffusion equation models, which are intimately related to message passing over observed structures, would exponentially grow w.r.t. topological shifts. In contrast, extending the diffusion operator to a non-local counterpart that learns latent structures from data can in principle control the generalization error under topological shifts even when the model accommodates observed structures. On top of these results, we propose Advective Diffusion Transformer inspired by advective diffusion equations serving as a physics-inspired continuous model that synthesizes observed and latent structures for graph learning. The model demonstrates superiority in various downstream tasks across information networks, molecular screening and protein interactions.

884Diversity-Enhanced and Classification-Aware Prompt Learning for Few-Shot Learning via Stable Diffusion

[openreview] [pdf]

Abstract Recent text-to-image generative models have exhibited an impressive ability to generate fairly realistic images from some text prompts. In this work, we explore to leverage off-the-shelf text-to-image generative models to train non-specific downstream few-shot classification model architectures using synthetic dataset to classify real images. Current approaches use hand-crafted or model-generated text prompts of text-to-image generative models to generated desired synthetic images, however, they have limited capability of generating diversity images. Especially, their synthetic datasets has relatively limited relevance to the downstream classification tasks. This makes them fairly hard to guarantee training models from synthetic images are efficient in practice. To address this issue, we propose a method capable of adaptively learning proper text prompts for the off-the-shelf diffusion model to generate diverse and classification-aware synthetic images. Our approach shows notable improvements in various classification datasets, with results comparable to existing prompt designing methods. We find that replacing data generation strategy of existing zero/few-shot methods with proposed method could consistly improves downstream classification performance across different network architectures, demostrating its model-agnostic characteristic for few-shot learning. This makes it possible to train an efficient downstream few-shot learning models from synthetic images generated by proposed method for real problems.

885Learn out of the box: optimizing both diversity and performance in Offline Reinforcement Learning

[openreview] [pdf]

Abstract In offline reinforcement learning, most existing methods have focused primarily on optimizing performance, often neglecting the promotion of diverse behaviors. While some approaches generate diverse behaviors from well-constructed, heterogeneous datasets, their effectiveness is significantly reduced when applied to less diverse data. To address this, we introduce a novel intrinsic reward mechanism that encourages behavioral diversity, irrespective of the dataset’s heterogeneity. By maximizing the mutual information between actions and policies under each state, our approach enables agents to learn a variety of behaviors, including those not explicitly represented in the data. Although performing out-of-distribution actions can lead to risky outcomes, we mitigate this risk by incorporating the ensemble-diversified actor-critic (EDAC) method to estimate Q-value uncertainty, preventing agents from adopting suboptimal behaviors. Through experiments using the D4RL benchmarks on MuJoCo tasks, we demonstrate that our method achieves behavioral diversity while maintaining performance across environments constructed from both heterogeneous and homogeneous datasets.

886OptionZero: Planning with Learned Options

[openreview] [pdf]

Abstract Planning with options -- a sequence of primitive actions -- has been shown effective in reinforcement learning within complex environments. Previous studies have focused on planning with predefined options or learned options through expert demonstration data. Inspired by MuZero, which learns superhuman heuristics without any human knowledge, we propose a novel approach, named OptionZero. OptionZero incorporates an option network into MuZero, providing autonomous discovery of options through self-play games. Furthermore, we modify the dynamics network in MuZero to provide environment transitions when using options, allowing searching deeper under the same simulation constraints. Empirical experiments conducted in 26 Atari games demonstrate that OptionZero outperforms MuZero, achieving a 131.58% improvement in mean human-normalized score. Our behavior analysis shows that OptionZero not only learns options but also acquires strategic skills tailored to different game characteristics. Our findings show promising directions for discovering and using options in planning.

887Delay-Aware Reinforcement Learning: Insights From Delay Distributional Perspective

[openreview] [pdf]

Abstract Although deep reinforcement learning (DRL) has achieved great success across various domains, the presence of random delays in real-world scenarios (e.g., remote control) poses a significant challenge to its practicality. Existing delay-aware DRLs mainly focus on state augmentation with historical memory, ensuring that the actions taken are aligned with the true state. However, these approaches still rely on the conventional expected QQ value. In contrast, to model delay uncertainty, we aim to go beyond the expected value and propose a distributional DRL to represent the distribution of this QQ value. Based on the delay distribution, we further propose a correction mechanism for the distributional QQ value, enabling the agent to learn accurate returns in delayed environments. Finally, we apply these techniques to design the delay-aware distributional actor-critic (DADAC) DRL framework, in which the critic is the corrected distributional value function. Experimental results demonstrate that compared to the state-of-the-art delay-aware DRL methods, the proposed DADAC exhibits substantial performance advantages in handling random delays in the MuJoCo continuous control tasks. The corresponding source code is available athttps://anonymous.4open.science/r/DADAC.

888BOND: Aligning LLMs with Best-of-N Distillation

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models.

889Task Facet Learning: A Structured Approach to Prompt Optimization

[openreview] [pdf]

Abstract Given a task in the form of a basic description and its training examples, prompt optimization is the problem of synthesizing the given information into a text prompt for a large language model. Humans solve this problem by also considering the different facets that define a task (e.g., counter-examples, explanations, analogies) and including them in the prompt. However, it is unclear whether existing algorithmic approaches, based on iteratively editing a given prompt or automatically selecting a few in-context examples, can cover the multiple facets required to solve a complex task. In this work, we view prompt optimization as that of learning multiple facets of a task from a set of training examples. We exploit structure in the prompt optimization problem and break down a prompt into loosely coupled semantic sections. The proposed algorithm, UniPrompt, (1) clusters the input space and uses clustered batches so that each batch likely corresponds to a different facet of the task, and (2) utilizes a feedback mechanism to propose adding, editing or deleting a section, which in turn is aggregated over a batch to capture generalizable facets. Empirical evaluation on multiple datasets and a real-world task shows that prompts generated using UniPrompt obtain higher accuracy than human-tuned prompts and those from state-of-the-art methods. In particular, our algorithm can generate long, complex prompts that existing methods are unable to generate.

890Elucidating the Design Choice of Probability Paths in Flow Matching for Forecasting

[openreview] [pdf]

Abstract Flow matching has recently emerged as a powerful paradigm for generative modeling, and has been extended to probabilistic time series forecasting in latent spaces. However, the impact of the specific choice of probability path model on forecasting performance remains under-explored. In this work, we demonstrate that forecasting spatio-temporal data with flow matching is highly sensitive to the selection of the probability path model. Motivated by this insight, we propose a novel probability path model designed to improve forecasting performance. Our empirical results across various dynamical system benchmarks show that our model achieves faster convergence during training and improved predictive performance compared to existing probability path models. Importantly, our approach is efficient during inference, requiring only a few sampling steps. This makes our proposed model practical for real-world applications and opens new avenues for probabilistic forecasting.

891Toward Principled Transformers for Knowledge Tracing

[openreview] [pdf]

Abstract Knowledge tracing aims to reason about changes in students’ knowledge and to predict students’ performance in educational learning settings. We propose knowledge tracing set transformers (KTSTs), a straightforward model class for knowledge tracing prediction tasks. This model class is conceptually simpler than previous state-of-the-art approaches, which are overly complex due to domain-inspired components, and which are in part based on suboptimal design choices and flawed evaluation. In contrast, for KTSTs we propose principled set representations of student interactions and a simplified variant of learnable modification of attention matrices for positional information in a student’s learning history. While being largely domain-agnostic, the proposed model class thus accounts for characteristic traits of knowledge tracing tasks. In extensive empirical experiments on standardized benchmark datasets, KTSTs establish new state-of-the-art performance.

892Bayesian Learning of Adaptive Koopman Operator with Application to Robust Motion Planning for Autonomous Trucks

[openreview] [pdf]

Abstract Koopman theory has recently been shown to enable an efficient data-driven approach for modeling physical systems, offering a linear framework despite underlying nonlinear dynamics. It is, however, not clear how to account for uncertainty or temporal distributional shifts within this framework, both commonly encountered in real-world autonomous driving with changing weather conditions and time-varying vehicle dynamics. In this work, we introduce Bayesian learning of adaptive Koopman operator to address these limitations. Specifically, we propose a Bayesian Koopman operator that incorporates uncertainty quantification, enabling more robust predictions. To tackle distributional shifts, we propose an online adaptation mechanism, ensuring the operator remains responsive to changes in system dynamics. Additionally, we apply the architecture to motion planning and show that it gives fast and precise predictions. By leveraging uncertainty awareness and real-time updates, our planner generates dynamically accurate trajectories and makes more informed decisions. We evaluate our method on real-world truck dynamics data under varying weather conditions—such as wet roads, snow, and ice—where uncertainty and dynamic shifts are prominent, as well as in other simulated environments. The results demonstrate our method’s ability to deliver accurate, uncertainty-aware open-loop predictions for dynamic systems.

893OccSora: 4D Occupancy Generation Models as World Simulators for Autonomous Driving

[openreview] [pdf]

Abstract Understanding the evolution of 3D scenes is important for effective autonomous driving. While conventional methods model the scene development with the motion of individual instances, world models emerge as a generative framework to describe the general scene dynamics. However, most existing methods adopt an autoregressive framework to perform next-token prediction, which suffer from inefficiency to model long-term temporal evolutions.To address this, we propose a diffusion-based 4D occupancy generation model, OccSora, to simulate the development of the 3D world for autonomous driving. We employ a 4D scene tokenizer to obtain compact discrete spatial-temporal representations for 4D occupancy input and achieve high-quality reconstruction for long-sequence occupancy videos. We then learn a diffusion transformer on the spatial-temporal representations and generate 4D occupancy conditioned on a trajectory prompt. We conduct extensive experiments on the widely used nuScenes dataset with Occ3D occupancy annotations. OccSora can generate 16s videos with authentic 3D layout and temporal consistency, demonstrating its ability to understand the spatial and temporal distributions of driving scenes. With trajectory-aware 4D generation, OccSora has the potential to serve as a world simulator for the decision-making of autonomous driving.

894Data-adaptive Differentially Private Prompt Synthesis for In-Context Learning

[openreview] [pdf]

Abstract Large Language Models (LLMs) rely on the contextual information embedded in examples/demonstrations to perform in-context learning (ICL). To mitigate the risk of LLMs potentially leaking private information contained in examples in the prompt, we introduce a novel data-adaptive differentially private algorithm calledAdaDPSynto generate synthetic examples from the private dataset and then use these synthetic examples to perform ICL. The objective of AdaDPSyn is to adaptively adjust the noise level in the data synthesis mechanism according to the inherent statistical properties of the data, thereby preserving high ICL accuracy while maintaining formal differential privacy guarantees. A key innovation in AdaDPSyn is thePrecision-Focused Iterative Radius Reductiontechnique, which dynamically refines the aggregation radius - the scope of data grouping for noise addition - based on patterns observed in data clustering, thereby minimizing the amount of additive noise. We conduct extensive experiments on standard benchmarks and compare AdaDPSyn with DP few-shot generation algorithm (Tang et al., 2023). The experiments demonstrate that AdaDPSyn not only outperforms DP few-shot generation, but also maintains high accuracy levels close to those of non-private baselines, providing an effective solution for ICL with privacy protection.

895Advantage Alignment Algorithms

[openreview] [pdf]

Abstract Artificially intelligent agents are increasingly being integrated into human decision-making: from large language model (LLM) assistants to autonomous vehicles. These systems often optimize their individual objective, leading to conflicts, particularly in general-sum games where naive reinforcement learning agents empirically converge to Pareto-suboptimal Nash equilibria. To address this issue, opponent shaping has emerged as a paradigm for finding socially beneficial equilibria in general-sum games. In this work, we introduce Advantage Alignment, a family of algorithms derived from first principles that perform opponent shaping efficiently and intuitively. We achieve this by aligning the advantages of interacting agents, increasing the probability of mutually beneficial actions when their interaction has been positive. We prove that existing opponent shaping methods implicitly perform Advantage Alignment. Compared to these methods, Advantage Alignment simplifies the mathematical formulation of opponent shaping, reduces the computational burden and extends to continuous action domains. We demonstrate the effectiveness of our algorithms across a range of social dilemmas, achieving state-of-the-art cooperation and robustness against exploitation.

896Adversarial Guided Diffusion Models for Adversarial Purification

[openreview] [pdf]

Abstract Diffusion model (DM) based adversarial purification (AP) has proven to be a powerful defense method that can remove adversarial perturbations and generate a purified example without threats. In principle, the pre-trained DMs can only ensure that purified examples conform to the same distribution of the training data, but it may inadvertently compromise the semantic information of input examples, leading to misclassification of purified examples. Recent advancements introduce guided diffusion techniques to preserve semantic information while removing the perturbations. However, these guidances often rely on distance measures between purified examples and diffused examples, which can also preserve perturbations in purified examples. To further unleash the robustness power of DM-based AP, we propose an adversarial guided diffusion model (AGDM) by introducing a novel adversarial guidance that contains sufficient semantic information but does not explicitly involve adversarial perturbations. The guidance is modeled by an auxiliary neural network obtained with adversarial training, considering the distance in the latent representations rather than at the pixel-level values. Extensive experiments are conducted on CIFAR-10, CIFAR-100 and ImageNet to demonstrate that our method is effective for simultaneously maintaining semantic information and removing the adversarial perturbations. In addition, comprehensive comparisons show that our method significantly enhances the robustness of existing DM-based AP, with an average robust accuracy improved by up to 7.30% on CIFAR-10. The code will be available upon acceptance.

897A General Aggregation Federated Learning Intervention Algorithm based ondo-Calculus

[openreview] [pdf]

Abstract This article explores federated long-tail learning (Fed-LT) tasks, which involve clients with private and heterogeneous data that exhibit a long-tail distribution. We propose two methods: (a) Client Re-weighted Prior Analyzer (CRePA), which balances the global model’s performance on tail and non-tail categories and enhances performance on tail categories while maintaining it on non-tail categories. (b) Federated Long-Tail Causal Intervention Model (FedLT-CI) computes clients’ causal effects on the global model’s performance in the tail and enhances the interpretability of Fed-LT. CRePA achieves state-of-the-art performance, and FedLT-CI improves tail performance significantly without affecting non-tail performance. Extensive experiments indicate that CRePA achieved SOTA performance compared to other baselines on CIFAR-10-LT and CIFAR-100-LT. Applying the FedLT-CI to all baselines significantly improved tail performance without affecting non-tail performance.

898Multi Task Inverse Reinforcement Learning for Common Sense Reward

[openreview] [pdf]

Abstract One of the challenges in applying reinforcement learning in a complex real-world environment lies in providing the agent with a sufficiently detailed reward function. Any misalignment between the reward and the desired behavior can result in unwanted outcomes. This may lead to issues like “reward hacking” where the agent maximizes rewards by unintended behavior. In this work, we propose to disentangle the reward into two distinct parts. A simple task-specific reward, outlining the particulars of the task at hand, and an unknown common-sense reward, indicating the expected behavior of the agent within the environment. We then explore how this common-sense reward can be learned from expert demonstrations. We first show that inverse reinforcement learning, even when it succeeds in training an agent, does not learn a useful reward function. That is, training a new agent with the learned reward does not impair the desired behaviors. We then demonstrate that this problem can be solved by training simultaneously on multiple tasks. That is, multi-task inverse reinforcement learning can learn a useful reward function.

899WardropNet: Traffic Flow Predictions via Equilibrium-Augmented Learning

[openreview] [pdf]

Abstract When optimizing transportation systems, anticipating traffic flows is a central element. Yet, computing such traffic equilibria remains computationally expensive. Against this background, we introduce a novel combinatorial optimization augmented neural network architecture that allows for fast and accurate traffic flow predictions. We propose WardropNet, a neural network that combines classical layers with a subsequent equilibrium layer: the first ones inform the latter by predicting the parameterization of the equilibrium problem’s latency functions. Using supervised learning we minimize the difference between the actual traffic flow and the predicted output. We show how to leverage a Bregman divergence fitting the geometry of the equilibria, which allows for end-to-end learning. WardropNet outperforms pure learning-based approaches in predicting traffic equilibria for realistic and stylized traffic scenarios. On realistic scenarios, WardropNet improves on average for time-invariant predictions by up to 72% and for time-variant predictions by up to 23% over pure learning-based approaches.

900Point Cloud Dataset Distillation

[openreview] [pdf]

Abstract This study introduces dataset distillation (DD) tailored for 3D data, particularly point clouds. DD aims to substitute large-scale real datasets with a small set of synthetic samples while preserving model performance. Existing methods mainly focus on structured data such as images. However, adapting DD for unstructured point clouds poses challenges due to their diverse orientations and resolutions in 3D space. To address these challenges, we theoretically demonstrate the importance of matching rotation-invariant features between real and synthetic data for 3D distillation. We further propose a plug-and-play point cloud rotator to align the point cloud to a canonical orientation, facilitating the learning of rotation-invariant features by all point cloud models. Furthermore, instead of optimizing fixed-size synthetic data directly, we devise a point-wise generator to produce point clouds at various resolutions based on the sampled noise amount. Compared to conventional DD methods, the proposed approach, termed DD3D, enables efficient training on low-resolution point clouds while generating high-resolution data for evaluation, thereby significantly reducing memory requirements and enhancing model scalability. Extensive experiments validate the effectiveness of DD3D in shape classification and part segmentation tasks across diverse scenarios, such as cross-architecture and cross-resolution settings.

901Dominant Shuffle: An Incredibly Simple but Exceptionally Effective Data Augmentation Method for Time-Series Prediction

[openreview] [pdf]

Abstract Frequency-domain data augmentation (DA) has shown strong performance in time-series prediction due to its ability to preserve data-label consistency. However, we observed that existing frequency-domain augmentations introduce excessive variability, leading to out-of-distribution samples that may be harmful to model performance. To address this, we introduced two simple modifications to frequency-domain DA. First, we limit perturbations to dominant frequencies with larger magnitudes, which capture the main periodicities and trends of the signal. Second, instead of using complicated random perturbations, we simply shuffle the dominant frequency components, which preserves the original structure while avoiding external noise. With the two simple modifications, we proposed dominant shuffle—a simple yet highly effective data augmentation technique for time-series prediction. Our method is remarkably simple, requiring only a few lines of code, yet exceptionally effective, consistently and significantly improving model performance. Extensive experiments on short-term, long term, few-shot and cold-start prediction tasks with eight state-of-the-art models, nine existing augmentation methods and twelve datasets demonstrate that dominant shuffle consistently boosts model performance with substantial gains, outperforming existing augmentation techniques.

902Enhanced Diffusion Sampling via Extrapolation with Multiple ODE Solutions

[openreview] [pdf]

Abstract Diffusion probabilistic models (DPMs), while effective in generating high-quality samples, often suffer from high computational costs due to the iterative sampling process. To address this, we propose an enhanced ODE-based sampling method for DPMs inspired by Richardson extrapolation, which has been shown to reduce numerical error and improve convergence rates. Our method, termed RX-DPM, utilizes numerical solutions obtained over multiple denoising steps, leveraging the multiple ODE solutions to extrapolate the denoised prediction in DPMs. This significantly enhances the accuracy of estimations for the final sample while preserving the number of function evaluations (NFEs). Unlike standard Richardson extrapolation, which assumes uniform discretization of the time grid, we have developed a more general formulation tailored to arbitrary time step scheduling, guided by the local truncation error derived from a baseline sampling method. The simplicity of our approach facilitates accurate estimation of numerical solutions without additional computational overhead, and allows for seamless and convenient integration into various DPMs and solvers. Additionally, RX-DPM provides explicit error estimates, effectively illustrating the faster convergence achieved as the order of the leading error term increases. Through a series of experiments, we demonstrate that the proposed method effectively enhances the quality of generated samples without requiring additional sampling iterations.

903Double Check My Desired Return: Transformer with Value Validation for Offline RL

[openreview] [pdf]

Abstract Recently, there has been increasing interest in applying Transformers to offline reinforcement learning (RL). Existing methods typically frame offline RL as a sequence modeling problem and learn actions via Supervised learning (RvS). However, RvS-trained Transformers struggle to align actual returns with desired target returns, especially when dealing with underrepresented returns in the dataset (interpolation) or missed higher returns that could be achieved by stitching sub-optimal trajectories (extrapolation). In this work, we propose a novel method that Double Checks the Transformer with value validation for Offline RL (Doctor). Doctor integrates the strengths of supervised learning (SL) and temporal difference (TD) learning by jointly optimizing the action prediction and value function. SL stabilizes the prediction of actions conditioned on target returns, while TD learning adds stitching capability to the Transformer. During inference, we introduce a double-check mechanism. We sample actions around desired target returns and validate them with value functions. This mechanism ensures better alignment between the predicted action and the desired target return and is beneficial for further online exploration and fine-tuning. We evaluate Doctor on the D4RL benchmark in both offline and offline-to-online settings, demonstrating that Doctor does much better in return alignment, either within the dataset or beyond the dataset. Furthermore, Doctor performs on par with or outperforms existing RvS-based and TD-based offline RL methods on the final performance.

904Decentralized Sporadic Federated Learning: A Unified Algorithmic Framework with Convergence Guarantees

[openreview] [pdf]

Abstract Decentralized federated learning (DFL) captures FL settings where both (i) model updates and (ii) model aggregations are exclusively carried out by the clients without a central server. Existing DFL works have mostly focused on settings where clients conduct a fixed number of local updates between local model exchanges, overlooking heterogeneity and dynamics in communication and computation capabilities. In this work, we propose Decentralized Sporadic Federated Learning (DSpodFL\texttt{DSpodFL}), a DFL methodology built on a generalized notion ofsporadicityin both local gradient and aggregation processes. DSpodFL\texttt{DSpodFL} subsumes many existing decentralized optimization methods under a unified algorithmic framework by modeling the per-iteration (i) occurrence of gradient descent at each client and (ii) exchange of models between client pairs as arbitrary indicator random variables, thus capturingheterogeneous and time-varyingcomputation/communication scenarios. We analytically characterize the convergence behavior of DSpodFL\texttt{DSpodFL} for both convex and non-convex models and for both constant and diminishing learning rates, under mild assumptions on the communication graph connectivity, data heterogeneity across clients, and gradient noises. We show how our bounds recover existing results from decentralized gradient descent as special cases. Experiments demonstrate that DSpodFL\texttt{DSpodFL} consistently achieves improved training speeds compared with baselines under various system settings.

905Discrete Bregman Divergence

[openreview] [pdf]

Abstract The Bregman divergence, which is generated from a convex function, is commonly used as a pseudo-distance for comparing vectors or functions in continuous spaces. In contrast, defining an analog of the Bregman divergence for discrete spaces is nontrivial. Iyer and Bilmes (2012a) considered Bregman divergences on discrete domains using submodular functions as generating functions, the discrete analogs of convex functions. In this paper, we further generalize this framework to cases where the generating function is neither submodular nor supermodular, thus increasing the flexibility and representational capacity of the resulting divergence, which we term the discrete Bregman divergence. Additionally, we introduce a learnable form of this divergence using permutation-invariant neural networks (NNs) and demonstrate through experiments that it effectively captures key structural properties in discrete data, outperforming existing methods on tasks such as clustering. This work addresses the challenge of defining meaningful divergences in discrete settings and provides a new tool for tasks requiring structure-preserving distance measures.

906Fundamental Limits of Prompt Tuning Transformers: Universality, Capacity and Efficiency

[openreview] [pdf]

Abstract We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models. Our key contributions are prompt tuning onsingle-headtransformers with only asingleself-attention layer: (i) is universal, and (ii) supports efficient (even nearly-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH). Statistically, we prove that prompt tuning on such simplest possible transformers are universal approximators for sequence-to-sequence Lipschitz functions. In addition, we provide an exponential-in-dLdL and -in-(1/ϵ)(1/\epsilon) lower bound on the required soft-prompt tokens for prompt tuning to memorize any dataset with 1-layer, 1-head transformers. Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of thesoft-prompt-inducedkeys and queries, and provide an upper bound criterion. Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH. Within this criterion, we showcase our theory by proving the existence of almost-linear time prompt tuning inference algorithms. These fundamental limits provide important necessary conditions for designing expressive and efficient prompt tuning methods for practitioners.

907Advantage-Guided Distillation for Preference Alignment in Small Language Models

[openreview] [pdf]

Abstract Alignment techniques such as RLHF enable LLMs to generate outputs that align with human preferences and play an essential role in their effectiveness. However, their impact often diminishes when applied to smaller language models, likely due to the limited capacity of these models. Instead of directly applying existing alignment techniques to smaller models, we propose to utilize a well-aligned teacher LLM to guide the alignment process for these models, thereby facilitating the transfer of the teacher’s knowledge of human preferences to the student model. To achieve this, we first explore a straightforward approach, Dual-Constrained Knowledge Distillation (DCKD), that employs knowledge distillation with two KL-divergence constraints from the aligned teacher to the unaligned student. To further enhance the contrastive effect, we then propose Advantage-Guided Distillation for Preference Alignment (ADPA), which leverages an advantage function from the aligned teacher to deliver more nuanced, distribution-level reward signals for the student’s alignment. Our experimental results demonstrate that these two approaches appreciably improve the alignment of smaller language models and narrow the performance gap with their larger counterparts.

908Exploring the Causal Mechanisms: Towards Robust and Explainable Algorithm Selection

[openreview] [pdf]

Abstract Algorithm selection aims to identify the optimal performing algorithm before execution. Existing techniques typically focus on the observed correlations between algorithm performance and meta-features. However, little research has explored the underlying mechanisms of algorithm selection, specifically what characteristics an algorithm must possess to effectively tackle problems with certain feature values. This gap not only limits the explainability but also makes existing models vulnerable to data bias and distribution shift. This paper introduces causality to describe this mechanism, proposing a novel modeling paradigm that aligns more closely with the fundamental logic of algorithm selection. By leveraging causal relationships to characterize the algorithm feature distribution conditioned on problem features, our approach enhances robustness against marginal distribution changes and allows for finer-grained predictions through the reconstruction of optimal algorithm features, with the final decision relying on differences between reconstructed and rejected algorithm features. Furthermore, we demonstrate that, the learned causal graph and the proposed counterfactual calculations offer our approach with both model-level and instance-level explainability. Extensive experiments on the ASlib benchmark validate the advantages of the proposed model in terms of robustness and explainability. The code will make publicly available after the review process.

909Unlocking Video-LLM via Agent-of-Thoughts Distillation

[openreview] [pdf]

Abstract This paper tackles the problem of video question answering (VideoQA), a task that often requires multi-step reasoning and a profound understanding of spatial-temporal dynamics. While large generative video-language models perform well on benchmarks, they often lack explainability and spatial-temporal grounding. In this paper, we proposeAgent-of-ThoughtsDistillation (AoTD), a method that enhances generative models by incorporating automatically generated Chain-of-Thoughts (CoTs) into the instruction-tuning process. Specifically, we leverage an agent-based system to decompose complex questions into sub-tasks, and address them with specialized vision models, the intermediate results are then treated as reasoning chains. We also introduce a verification mechanism using a large language model (LLM) to ensure the reliability of generated CoTs. Extensive experiments demonstrate that AoTD improves the performance on multiple-choice and open-ended benchmarks.

910R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) provides a paradigm for aligning large language models (LLMs) with human preferences. This involves the initial training of a reward model based on pairwise human feedback. The reward model is subsequently utilized in reinforcement learning to assess the scores of each generated sentence as a whole, further guiding the optimization of LLMs. However, current approaches have a significant shortcoming: They allocate a single, sparse, and delayed reward to an entire sequence of output. This may overlook some significant individual contributions of each token towards the desired outcome. To overcome this limitation, our paper proposes a novel reward redistribution method called R3HF, which facilitates a more fine-grained, token-level reward allocation. Specifically, our method treats the reward prediction task of the reward model as a regression problem. As a result, the redistributed rewards are computed by evaluating the specific contribution of each token to the reward model’s output. This detailed approach improves the model’s understanding of language nuances, leading to more precise enhancements in its performance. Our method is crafted to integrate seamlessly with most current techniques while incurring minimal computational costs. Through comprehensive experiments across diverse datasets and tasks, we have verified the effectiveness and superiority of our approach.

911Spreading Out-of-Distribution Detection on Graphs

[openreview] [pdf]

Abstract Node-level out-of-distribution (OOD) detection on graphs has received significant attention from the machine learning community. However, previous approaches are evaluated using unrealistic benchmarks that consider only randomly selected OOD nodes, failing to reflect the interactions among nodes. In this paper, we introduce a new challenging task to model the interactions of OOD nodes in a graph, termed spreading OOD detection, where a newly emerged OOD node spreads its property to neighboring nodes. We curate realistic benchmarks by employing the epidemic spreading models that simulate the spreading of OOD nodes on the graph. We also showcase a ``Spreading COVID-19" dataset to demonstrate the applicability of spreading OOD detection in real-world scenarios. Furthermore, to effectively detect spreading OOD samples under the proposed benchmark setup, we present a new approach called energy distribution-based detector (EDBD), which includes a novel energy-aggregation scheme. EDBD is designed to mitigate undesired mixing of OOD scores between in-distribution (ID) and OOD nodes. Our extensive experimental results demonstrate the superiority of our approach over state-of-the-art methods in both spreading OOD detection and conventional node-level OOD detection tasks across seven benchmark datasets.

912Double Descent Meets Out-of-Distribution Detection: Theoretical Insights and Empirical Analysis of the role of model complexity

[openreview] [pdf]

Abstract While overparameterization is known to benefit generalization, its impact on Out-Of-Distribution (OOD) detection is less understood. This paper investigates the influence of model complexity in OOD detection. We propose an expected OOD risk metric to evaluate classifiers confidence on both training and OOD samples. Leveraging Random Matrix Theory, we derive bounds for the expected OOD risk of binary least-squares classifiers applied to Gaussian data. We show that the OOD risk depicts an infinite peak, when the number of parameters is equal to the number of samples, which we associate with the double descent phenomenon. Our experimental study on different OOD detection methods across multiple neural architectures extends our theoretical insights and highlights a double descent curve. Our observations suggest that overparameterization does not necessarily lead to better OOD detection. Using the Neural Collapse framework, we provide insights to better understand this behavior. To facilitate reproducibility, our code will be made publicly available upon publication.

913Open-World Test-Time Training: Self-Training with Contrastive Learning

[openreview] [pdf]

Abstract Traditional test-time training (TTT) methods, while addressing domain shifts, often assume a consistent class set that limits their applicability in real-world scenarios with infinite variety. Open-World Test-Time Training (OWTTT) addresses the challenge of generalizing deep learning models to unknown target domain distributions, especially in the presence of strong Out-of-Distribution (OOD) data. Existing TTT methods often struggle to maintain performance when confronted with strong OOD data. In OWTTT, the primary focus has been on distinguishing between strong and weak OOD data. However, during the early stages of TTT, initial feature extraction is hampered by interference from strong OOD and corruptions, leading to reduced contrast and premature classification of certain classes as strong OOD. To handle this problem, we introduce Open World Dynamic Contrastive Learning (OWDCL), an innovative approach that leverage contrastive learning to augment positive sample pairs. This strategy not only enhances contrast in the early stages but also significantly enhances model robustness in later stages. In comparison datasets, our OWDCL model achieves state-of-the-art performance.

914Breaking through Data Scarcity: Knowledge Transfer in Offline Reinforcement Learning

[openreview] [pdf]

Abstract We focus on knowledge transfer in offline reinforcement learning (RL), which aims to significantly improve the learning of an optimal policy in a target task based on a pre-collected dataset without further interactions with the environment. Data scarcity and high-dimensional feature spaces seriously pose challenges to offline RL in many real-world applications, and knowledge transfer offers a promising solution. We propose a novel and comprehensive knowledge transfer framework for offline RL, which carefully considers the relationship between the target and source tasks within the linear Markov decision process (MDP) framework. This enables efficient knowledge transfer from related source tasks to enhance learning in the target task and effectively address data scarcity concerns in offline RL. Our main contributions include establishing a relationship with the learning process between the target task and source task, introducing an effective and robust knowledge transfer technique to reduce the suboptimality of the learned policy, and demonstrating the significant effectiveness of the knowledge transfer framework through detailed theoretical analysis. Our work significantly contributes to the advancement of offline RL by providing a practical and robust framework for knowledge transfer facilitating more efficient and effective data utilization in various applications.

915BAYESIAN EXPERIMENTAL DESIGN VIA CONTRASTIVE DIFFUSIONS

[openreview] [pdf]

Abstract Bayesian Optimal Experimental Design (BOED) is a powerful tool to reduce the cost of running a sequence of experiments. When based on the Expected Information Gain (EIG), design optimization corresponds to the maximization of some intractable expectedcontrastbetween prior and posterior distributions. Scaling this maximization to high dimensional and complex settings has been an issue due to BOED inherent computational complexity. In this work, we introduce anexpected posteriordistribution with cost-effective sampling properties and provide a tractable access to the EIG contrast maximization via a new EIG gradient expression. Diffusion-based samplers are used to compute the dynamics of the expected posterior and ideas from bi-level optimization are leveraged to derive an efficient joint sampling-optimization loop, without resorting to lower bound approximations of the EIG. The resulting efficiency gain allows to extend BOED to the well-tested generative capabilities of diffusion models. By incorporating generative models into the BOED framework, we expand its scope and its use in scenarios that were previously impractical. Numerical experiments and comparison with state-of-the-art methods show the potential of the approach.

916Zero-shot Novel View Synthesis via Adaptive Modulating Video Diffusion Process

[openreview] [pdf]

Abstract By harnessing the potent generative capabilities of pre-trained large video diffusion models, we propose a new novel view synthesis paradigm that operates \textit{without} the need for training. The proposed method adaptively modulates the diffusion sampling process with the given views to enable the creation of visually pleasing results from single or multiple views of static scenes or monocular videos of dynamic scenes. Specifically, built upon our theoretical modeling, we iteratively modulate the score function with the given scene priors represented with warped input views to control the video diffusion process. Moreover, by theoretically exploring the boundary of the estimation error, we achieve the modulation in an adaptive fashion according to the view pose and the number of diffusion steps. Extensive evaluations on both static and dynamic scenes substantiate the significant superiority of our method over state-of-the-art methods both quantitatively and qualitatively. The source code can be found on the anonymous webpage:https://github.com/PAPERID5494/VD_NVS. We also refer reviewers to the Supplementary Material for the video demo.

917FedGO : Federated Ensemble Distillation with GAN-based Optimality

[openreview] [pdf]

Abstract For federated learning in practical settings, a significant challenge is the considerable diversity of data across clients. To tackle this data heterogeneity issue, it has been recognized that federated ensemble distillation is effective. Federated ensemble distillation requires an unlabeled dataset on the server, which could either be an extra dataset the server already possesses or a dataset generated by training a generator through a data-free approach. Then, it proceeds by generating pseudo-labels for the unlabeled data based on the predictions of client models and training the server model using this pseudo-labeled dataset. Consequently, the efficacy of ensemble distillation hinges on the quality of these pseudo-labels, which, in turn, poses a challenge of appropriately assigning weights to client predictions for each data point, particularly in scenarios with data heterogeneity. In this work, we suggest a provably near-optimal weighting method for federated ensemble distillation, inspired by theoretical results in generative adversarial networks (GANs). Our weighting method utilizes client discriminators, trained at the clients based on a generator distributed from the server and their own datasets. Our comprehensive experiments on various image classification tasks illustrate that our method significantly improves the performance over baselines, under various scenarios with and without extra server dataset. Furthermore, we provide an extensive analysis of additional communication cost, privacy leakage, and computational burden caused by our weighting method.

918Greedy Learning to Optimize with Convergence Guarantees

[openreview] [pdf]

Abstract Learning to optimize is an approach that leverages training data to accelerate the solution of optimization problems. Many approaches use unrolling to parametrize the update step and learn optimal parameters. Although L2O has shown empirical advantages over classical optimization algorithms, memory restrictions often greatly limit the unroll length and learned algorithms usually do not provide convergence guarantees. In contrast, we introduce a novel method employing a greedy strategy that learns iteration-specific parameters by minimizing the function value at the next iteration. This enables training over significantly more iterations while maintaining constant memory usage. We parameterize the update such that parameter learning corresponds to solving a convex optimization problem at each iteration. In particular, we explore preconditioned gradient descent with multiple parametrizations including a novel convolutional preconditioner. With our learned algorithm, convergence in the training set is proven even when the preconditioner is neither symmetric nor positive definite. Convergence on a class of unseen functions is also obtained, ensuring robust performance and generalization beyond the training data. We test our learned algorithms on two inverse problems, image deblurring and Computed Tomography, on which learned convolutional preconditioner demonstrates improved empirical performance over classical optimization algorithms such as Nesterov’s Accelerated Gradient Method and the quasi-Newton method L-BFGS.

919Procedural Fairness Through Addressing Social Determinants of Opportunity

[openreview] [pdf]

Abstract Social determinants of opportunityare variables that, while not directly pertaining to any specific individual, capture key aspects of contexts and environments that have direct causal influences on certain attributes of an individual, e.g., environmental pollution in an area affects individual’s health condition, and educational resources in an neighborhood influence individual’s academic preparedness. Previous algorithmic fairness literature often overlookssocial determinants of opportunity, leading to implications for procedural fairness and structural justice that are incomplete and potentially even inaccurate. We propose a modeling framework that explicitly incorporatessocial determinants of opportunityand their causal influences on individual-level attributes of interest. To demonstrate theoretical perspectives and practical applicability of our framework, we consider college admissions as a running example. Specifically, for three mainstream admission procedures that have historically been implemented or are still in use today, we distinguish and draw connections between the outcome of admission decision-making and the underlying distribution of academic preparedness in the applicant population. Our findings suggest that mitigation strategies centering solely around protected features may introduce new procedural unfairness when addressing existing discrimination. Considering both individual-level attributes andsocial determinants of opportunityfacilitates a more comprehensive explication of benefits and burdens experienced by individuals from diverse demographic backgrounds as well as contextual environments, which is essential for understanding and achieving procedural fairness effectively and transparently.

920Topology-aware Graph Diffusion Model with Persistent Homology

[openreview] [pdf]

Abstract Generating realistic graphs presents challenges in estimating accurate distribution of graphs in an embedding space while preserving structural characteristics such as topology. However, existing graph generation methods primarily focus on approximating the joint distribution of graph nodes and edges, overlooking topology-wise similarity hindering accurate representation of global graph structures such as connected components and loops. To address this issue, we propose a topology-aware diffusion-based graph generation method that aims to closely resemble the structural characteristics of the original graph by leveraging persistent homology from topological data analysis (TDA). Specifically, we suggest a novel loss function, Persistence Diagram Matching (PDM) loss, which ensures the generated graphs to closely match the topology of the original graphs, enhancing their fidelity and preserving essential homological properties. Also, we introduce a novel topology-aware attention to enhance the self-attention module in the denoising network. Through comprehensive experiments, we demonstrate the effectiveness of our approach not only by exhibiting high generation performance across various metrics, but also by demonstrating a closer alignment with the distribution of topological features observed in the original graphs. In addition, application to real brain network data showcases its versatility and potential for complex and real graph application.

921METHODS OF IMPROVING LLM TRAINING STABILITY

[openreview] [pdf]

Abstract Training stability of large language models (LLMs) is an important research topic. Reproducing training instabilities can be costly, so we use a small language model with 830M parameters and experiment with higher learning rates to force models to diverge, as in Wortsman et al. (2024). One of the sources of training instability is the growth of logits in attention layers Dehghani et al. (2023). We extend the focus of the previous work [Dehghani et al. (2023),Wortsman et al. (2024)] and look not only at the magnitude of the logits but at all outputs of linear layers in the Transformer block. We observe that with a high learning rate the L2 norm of all linear layer outputs grow with each training step and the model diverges. Specifically we observe that QKV, Proj and FC2 layers have the largest growth of the output magnitude. This prompts us to explore several options: 1) apply layer normalization not only after QK layers (as it is done in [Dehghani et al. (2023), Wortsman et al. (2024)]) but after Proj and FC2 layers too; 2) apply layer normalization after the QKV layer (and remove pre normalization). 3) apply QK layer normalization together with softmax capping. We show that with the last two methods we can increase learning rate by 1.5x (without model divergence) in comparison to an approach based on QK layer normalization only Dehghani et al. (2023). Also we observe significant perplexity improvements for all three methods in comparison to the baseline model.

922LSTR: Long-Short Range Aggregation for Trajectory Prediction at Intersection Scenarios

[openreview] [pdf]

Abstract Trajectory prediction is crucial for practical applications, encompassing navigation for autonomous vehicles and the implementation of safety systems based on the Internet of Vehicles (IoV). Most existing methods significantly rely on comprehensive map information, employing robust rule constraints to incrementally predict trajectories within the driver’s local decision-making context. However, in environments characterized by weak rule enforcement, such as urban intersections, these approaches neglect the disparity between the driver’s overarching intentions and current behaviors.Recognizing the characteristics of intersection traffic flow—macroscopically organized yet microscopically disordered, exhibiting highly heterogeneous conditions—this paper presents a novel model termed Long-short Range Aggregation for Trajectory Prediction in Intersections (LSTR). This model anchors the vehicle’s local decision-making process to long-range intentions. Specifically, LSTR predicts the vehicle’s destination via a global intention inference module and models its long-range driving intentions through clustering to extract macroscopic traffic flow patterns. This long-range intention subsequently informs the short-range local interaction behaviors captured by the local behavior decision module. Ultimately, the fused features from these two modules are analyzed using a multi-modal decoder to interpret the various motion patterns, resulting in the trajectory prediction outcomes.We rigorously validate the proposed framework across multiple intersection scenarios utilizing real-world datasets, including inD, roundD, and a subset of WOMD. Experimental results demonstrate that our model outperforms numerous benchmarks without relying on additional information such as HD maps of intersections.

923Learning Symmetries through Loss Landscape

[openreview] [pdf]

Abstract Incorporating equivariance as an inductive bias into deep learning architectures, to take advantage of the data symmetry, has been successful in multiple applications such as chemistry and dynamical systems. The build of equivariance architecture, particularly w.r.t. roto-translations, is crucial for effectively modeling geometric graphs and molecules, where the understanding of 3D structures enhances generalization. However, despite their potential, equivariant models often pose challenges due to their high computational complexity. In this paper, we study the capabilities of unconstrained models (which do not build equivariance into the architecture) and how they generalize compared to equivariant models. We show that unconstrained models can learn approximate symmetries by minimizing additional simple equivariance loss. By formulating equivariance as a new learning objective, we can control the level of approximate equivariance in the model. Our method achieves competitive performance compared to equivariant baselines while being 10x faster at inference and 2.5x at training.

924Do Think Tags Really Help LLMs Plan? A Critical Evaluation of ReAct-Style Prompting

[openreview] [pdf]

Abstract The reasoning abilities of Large Language Models (LLMs) remain a topic of debate, which are critically tested in sequential decision-making problems. ReAct, a recently popular method has gained popularity for claiming to enhance LLM reasoning abilities while directly prompting them by interleaving reasoning trace with action execution"``\textit{interleaving reasoning trace with action execution}" in text-based planning domains such as AlfWorld and WebShop. However, given the different components of ReAct-style prompting, it remains unclear what the source of improvement in LLM performance is. In this paper, we critically examine the claims of ReAct-style prompting for sequential decision-making problems. By introducing systematic variations to the input prompt, we perform a sensitivity analysis along the original claims of ReAct. Contrary to these claims and common use-cases that utilize ReAct-style prompting, we find that the performance is minimally influenced by the interleaved reasoning trace or by the content of these generated reasoning traces. Instead, the performance of LLMs is primarily driven by the unreasonably high degree of similarity between input example tasks and queries, implicitly forcing the prompt designer to provide instance-specific examples which significantly increases the cognitive burden on the human. Our empirical results, on the same suite of domains as ReAct, show that the perceived reasoning abilities of LLMs stem from the exemplar-query similarity and approximate retrieval rather than any inherent reasoning abilities.

925In-Context Transfer Learning: Demonstration Synthesis by Transferring Similar Tasks

[openreview] [pdf]

Abstract In-context learning (ICL) is an effective approach to help large language models (LLMs) adapt to various tasks by providing demonstrations of the target task. Considering the high cost of labeling demonstrations, many methods propose synthesizing demonstrations from scratch using LLMs. However, the quality of the demonstrations synthesized from scratch is limited by the capabilities and knowledge of LLMs. To address this, inspired by transfer learning, we propose In-Context Transfer Learning (ICTL), which synthesizes target task demonstrations by transferring labeled demonstrations from similar source tasks. ICTL consists of two steps: source sampling and target transfer. First, we define an optimization objective, which minimizes transfer error to sample source demonstrations similar to the target task. Then, we employ LLMs to transfer the sampled source demonstrations to match the definition and format of the target task. Experiments on Super-NI show that ICTL outperforms synthesis from scratch by 2.0% on average, demonstrating the effectiveness of our method.

926MixMax: Distributional Robustness in Function Space via Optimal Data Mixtures

[openreview] [pdf]

Abstract Machine learning models are often required to perform well across several pre-defined settings, such as a set of user groups. Worst-case performance is a common metric to capture this requirement, and is the objective of group distributionally robust optimization (group DRO). Unfortunately, these methods struggle when the loss is non-convex in the parameters, or the model class is non-parametric. Here, we make a classical move to address this: we reparameterize group DRO from parameter space to function space, which results in a number of advantages. First, we show that group DRO over the space of bounded functions admits a minimax theorem. Second, for cross-entropy and mean squared error, we show that the minimax optimal mixture distribution is the solution of a simple convex optimization problem. Thus, provided one is working with a model class of universal function approximators, group DRO can be solved by a convex optimization problem followed by a classical risk minimization problem. We call our method MixMax. In our experiments, we found that MixMax matched or outperformed the standard group DRO baselines, and in particular, MixMax improved the performance of XGBoost over the only baseline, data balancing, for variations of the ACSIncome and CelebA annotations datasets.

927Cross-Entropy Is All You Need To Invert the Data Generating Process

[openreview] [pdf]

Abstract Supervised learning has become a cornerstone of modern machine learning, yet a comprehensive theory explaining its effectiveness remains elusive. Empirical phenomena, such as neural analogy-making and the linear representation hypothesis, suggest that supervised models can learn interpretable factors of variation in a linear fashion. Recent advances in self-supervised learning, particularly nonlinear Independent Component Analysis, have shown that these methods can recover latent structures by inverting the data generating process. We extend these identifiability results to parametric instance discrimination, then show how insights transfer to the ubiquitous setting of supervised learning with cross-entropy minimization. We prove that even in standard classification tasks, models learn representations of ground-truth factors of variation up to a linear transformation. We corroborate our theoretical contribution with a series of empirical studies. First, using simulated data matching our theoretical assumptions, we demonstrate successful disentanglement of latent factors. Second, we show that on DisLib, a widely-used disentanglement benchmark, simple classification tasks recover latent structures up to linear transformations. Finally, we reveal that models trained on ImageNet encode representations that permit linear decoding of proxy factors of variation. Together, our theoretical findings and experiments offer a compelling explanation for recent observations of linear representations, such as superposition in neural networks. This work takes a significant step toward a cohesive theory that accounts for the unreasonable effectiveness of supervised deep learning.

928Torque-Aware Momentum

[openreview] [pdf]

Abstract Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.

929Bandit Learning in Matching Markets with Indifference

[openreview] [pdf]

Abstract A rich line of recent works studies how participants in matching markets learn their unknown preferences through iterative interactions with each other. Two sides of participants in the market can be respectively formulated as players and arms in the bandit problem. To ensure market stability, the objective is to minimize the stable regret of each player. Though existing works provide significant theoretical upper bounds for players’ stable regret, the results heavily rely on the assumption that each participant has a strict preference ranking. However, in real applications, multiple candidates (e.g., workers in the labor market and students in school admission) usually demonstrate comparable performance levels, making it challenging for participants (e.g. employers and schools) to differentiate and rank their preferences. To deal with the potential indifferent preferences, we propose an adaptive exploration algorithm based on arm-guided Gale-Shapley (AE-AGS). We show that its stable regret is of order O(NKlogT/Δ2)O(NK \log T / \Delta^2), where NN is the number of players, KK the number of arms, TT the total time horizon, and Δ\Delta the minimum non-zero preference gap. To the best of our knowledge, this is the first polynomial regret bound applicable to the more general indifference setting, and it is only O(N)O(N) worse than the state-of-the-art result in the strict preference setting. Extensive experiments demonstrate the algorithm’s effectiveness in handling such complex situations and its consistent superiority over baselines.

930OccVAR: Scalable 4D Occupancy Prediction via Next-Scale Prediction

[openreview] [pdf]

Abstract In this paper, we propose OCCVAR, a generative occupancy world model that simulates the movement of the ego vehicle and the evolution of the surrounding environment. Different from visual generation, the occupancy world model should capture the fine-grained 3D geometry and dynamic evolution of the 3D scenes, posing great challenges for the generative models. Recent approaches based on autoregression (AR) have demonstrated the potential to predict vehicle movement and future occupancy scenes simultaneously from historical observations, but they typically suffer from the inefficiency and temporal degradation in long-time generation. To holistically address the efficiency and quality issues, we propose a spatial-temporal transformer via temporal next-scale prediction, aiming at predicting the 4D occupancy scenes from coarse to fine scales. To model the dynamic evolution of the scene, we incorporate the ego movement before the tokenized occupancy sequence, enabling the prediction of ego movement and controllable scene generation. To model the fine-grained 3D geometry, OCCVAR utilizes a muitli-scale scene tokenizer to capture the hierarchical information of the 3D scene. Experiments show that OCCVAR is capable of high-quality occupancy reconstruction, long-time generation and fast inference speed compared to prior works.

931DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On

[openreview] [pdf]

Abstract In this paper, we introduce DiffusionTrend, a pioneering approach for virtual fashion try-on that forgoes the need for training diffusion models, thereby offering simple, conventional pose virtual try-on services with significantly reduced computational overhead. By leveraging advanced diffusion models, DiffusionTrend harnesses latents rich with prior information to capture the nuances of garment details. Throughout the diffusion denoising process, these details are seamlessly integrated into the model image generation, expertly directed by a precise garment mask crafted by a lightweight and compact CNN. Although our DiffusionTrend model initially demonstrates suboptimal metric performance, our exploratory approach offers several significant advantages: (1) It circumvents the need for resource-intensive training of diffusion models on large datasets. (2) It eliminates the necessity for various complex and user-unfriendly model inputs. (3) It delivers a visually compelling virtual try-on experience, underscoring the potential of training-free diffusion models for future research within the community. Overall, this initial foray into the application of untrained diffusion models in virtual try-on technology paves the way for further exploration and refinement in this innovative field.

932Achieving Optimal Breakdown for Byzantine-Robust Gossip

[openreview] [pdf]

Abstract Distributed approaches have many computational benefits, but they are vulnerable to attacks from a subset of devices transmitting incorrect information. This paper investigates Byzantine-resilient algorithms in a decentralized setting, where devices communicate directly with one another. ~We investigate the notion of \emph{breakdown point}, and show an upper bound on the number of adversaries that decentralized algorithms can tolerate. We introduce CG+\mathrm{CG}^+, an algorithm at the intersection of ClippedGossip\mathrm{ClippedGossip} and NNA\mathrm{NNA}, two popular approaches for robust decentralized learning. CG+\mathrm{CG}^+ meets our upper bound, and thus obtains optimal robustness guarantees, whereas neither of the existing two does. We provide experimental evidence for this gap by presenting an attack tailored to sparse graphs which breaks NNA\mathrm{NNA} but against which CG+\mathrm{CG}^+ is robust.

933Context-Aware Online Recommendation with Bayesian Incentive Compatibility

[openreview] [pdf]

Abstract Recommender systems play a crucial role in internet economies by connecting users with relevant products or services. However, designing effective recommender systems faces two key challenges: (1) the exploration-exploitation tradeoff in balancing new product exploration against exploiting known preferences, and (2) context-aware Bayesian incentive compatibility in accounting for users’ heterogeneous preferences and self-interested behaviors. This paper formalizes these challenges into a Context-aware Bayesian Incentive-Compatible Recommendation Problem (CBICRP). To address the CBICRP, we propose a two-stage algorithm (RCB) that integrates incentivized exploration with an efficient offline learning component for exploitation. In the first stage, our algorithm explores available products while maintaining context-aware Bayesian incentive compatibility to determine sufficient sample sizes. The second stage employs inverse proportional gap sampling integrated with arbitrary efficient machine learning method to ensure sublinear regret. Theoretically, we prove that RCB achieves O(KdT)O(\sqrt{KdT}) regret and satisfies Bayesian incentive compatibility (BIC). Empirically, we validate RCB’s strong incentive gain, sublinear regret, and robustness through simulations and a real-world application on personalized warfarin dosing. Our work provides a principled approach for incentive-aware recommendation in online preference learning settings.

934Online learning meets Adam: The Road of Interpretable Adaptive Optimizer Design

[openreview] [pdf]

Abstract This paper explores the theoretical foundations of Adam, a widely used adaptive optimizer. Building on recent developments in non-convex optimization and online learning, particularly the discounted-to-nonconvex conversion framework, we present two aspects of results: First, we introduce clip-free FTRL, a novel variant of the classical Follow-the-Regularized-Leader (FTRL) algorithm. Unlike scale-free FTRL and the recently proposed β\beta-FTRL, our clip-free variant eliminates the need for clipping operations, aligning more closely with Adam’s practical implementation. This modification provides deeper theoretical insights into Adam’s empirical success and aligns the theoretical framework with practical implementations. By incorporating a refined analysis, our second result establishes a theoretical guarantee for the Last Iterate Convergence (LIC) under the proposed discounts-to-nonconvex conversion algorithm in LIC, which differs from the previous guarantee that has convergence evenly distributed in all iterations. Additionally, we extend this result to provide the last iterate convergence guarantee for the popular β\beta-FTRL algorithm under the same framework. However, the derived last iterate convergence of β\beta-FTRL reveals a persistent fixed error, potentially suggesting either limitations in popular online learning methods or the need for additional assumptions about the objective function.

935In-Context Reinforcement Learning From Suboptimal Historical Data

[openreview] [pdf]

Abstract Large-scale transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context Reinforcement Learning (RL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL instances, and then fix and use this transformer to create an action policy for new RL instances. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT), which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.

936Propagation Alone is Enough for Graph Contrastive Learning

[openreview] [pdf]

Abstract Graph contrastive learning has recently gained substantial attention, leading to the development of various methodologies. In this work, we reveal that a simple training-free propagation method PROP achieves competitive results over dedicatedly designed GCL methods across a diverse set of benchmarks. We elucidate the underlying rationale for PROP’s effectiveness by drawing connections between the propagation operator and established unsupervised learning algorithms. To investigate the reasons for the suboptimal performance of existing GCL methods, we decouple the propagation and transformation phases of graph neural networks. Our findings indicate that GCL inadequately learns effective transformation weights while exhibiting potential for solid propagation learning. In light of these insights, we enhance PROP with learnable propagation, introducing a novel GCL method termed PROPGCL. The effectiveness of PROPGCL is demonstrated through comprehensive evaluations.

937Flow of Reasoning: Training LLMs for Divergent Problem Solving with Minimal Examples

[openreview] [pdf]

Abstract The ability to generate diverse solutions to a given problem is a hallmark of human creativity. This divergent reasoning is also crucial for machines, enhancing their robustness and enabling them to assist humans in many applications such as scientific discovery. However, existing approaches to multi-step reasoning with large language models (LLMs) have mostly focused only on the reasoning accuracy, without further discovering more diverse valid solutions. For example, supervised fine-tuning can improve LLM reasoning quality, but requires extensive supervised data to capture the full range of possible solutions. Reinforcement learning aims to find limited highest-reward solutions while neglecting the solution diversity. To fill this gap, we propose Flow of Reasoning (FoR), an efficient diversity-seeking LLM finetuning method aimed at improving reasoning quality and diversity with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow on a DAG-structured reasoning graph. This formulation allows us to incorporate and adapt principled GFlowNet approaches, for finetuning LLMs to sample diverse reasoning paths with probabilities proportional to the (unnormalized) reward of target problems. Extensive experiments show that, with limited training examples (e.g., 15 examples), FoR enables the discovery of diverse, creative, high-quality solutions, greatly outperforming a wide range of existing inference and training methods across five challenging puzzle-solving tasks, including BlocksWorld (embodied reasoning), Game24 (math puzzle solving), PrOntoQA (logical reasoning), Rubik’s Cube (spatial reasoning), and 1D-ARC (abstraction reasoning).

938From Global Assessment to Local Selection: Efficiently Solving Traveling Salesman Problems of All Sizes

[openreview] [pdf]

Abstract The Traveling Salesman Problem (TSP) is a well-known combinatorial optimization problem with broad real-world applications. Recent advancements in neural network-based TSP solvers have shown promising results. Nonetheless, these models often struggle to efficiently solve both small- and large-scale TSPs using the same set of pre-trained model parameters, limiting their practical utility. To address this issue, we introduce a novel neural TSP solver named GELD, built upon our proposed broad global assessment and refined local selection framework. Specifically, GELD integrates a lightweight Global-view Encoder (GE) with a heavyweight Local-view Decoder (LD) to enrich embedding representation while accelerating the decision-making process. Moreover, GE incorporates a novel low-complexity attention mechanism, allowing GELD to achieve low inference latency and scalability to larger-scale TSPs. Additionally, we propose a two-stage training strategy that utilizes training instances of different sizes to bolster GELD’s generalization ability. Extensive experiments conducted on both synthetic and real-world datasets demonstrate that GELD outperforms seven state-of-the-art models considering both solution quality and inference speed. Furthermore, GELD can be employed as a post-processing method to exchange affordable computing time for significantly improved solution quality, capable of solving TSPs with up to 744,710 nodes without relying on divide-and-conquer strategies.

939Experimental Design for Nonstationary Optimization

[openreview] [pdf]

Abstract Traditional methods for optimizing neural networks often struggle when used to train networks in settings where the data distributions change, and plasticity preservation methods have been shown to improve performance in such settings (e.g. continual learning and reinforcement learning). With the growing inter- est in nonstationary optimization and plasticity research, there is also a growing need to properly define experimental design and hyperparameter search protocols to enable principled research. Each new proposed work typically adds several new hyperparameters makes many more design decisions such as hyperparame- ter selection protocols, evaluation protocols, and types of tasks examined. While innovation in experiment design is important, it is also necessary to (1) question whether those innovations are leading to the best progress and (2) have standard- ized practices that make it easier to directly compare to prior works. In this paper, we first perform an extensive empirical study of over 27,000 trials looking at the performance of different methods and hyperparameters across different settings and architectures used in the literature to provide an evaluation of these methods and the hyperparameters they use under similar experimental conditions. We then examine several core experiment design choices made by the community, affirm- ing some while providing evidence against others, and provide concrete recom- mendations and analysis that can be used to guide future research.

940Turn-by-Turn Driving Navigation: Leveraging Sequence Model for Real-time Audio Instructions

[openreview] [pdf]

Abstract Turn-by-turn (TBT) navigation systems are integral to modern driving experiences, providing real-time audio instructions to guide drivers safely to destinations. However, existing audio instruction policy often rely on rule-based approaches that struggle to balance informational content with cognitive load, potentially leading to driver confusion or missed turns in complex environments. To overcome these difficulties, we first model the generation of audio instructions as a multi-task learning problem by decomposing the audio content into combinations of modular elements. Then, we propose a novel deep learning framework that leverages the powerful spatiotemporal information processing capabilities of Transformers and the strong multi-task learning abilities of Mixture of Experts (MoE) to generate real-time, context-aware audio instructions for TBT driving navigation. A cloud-edge collaborative architecture is implemented to handle the computational demands of the model, ensuring scalability and real-time performance for practical applications. Experimental results in the real world demonstrate that the proposed method significantly reduces the yaw rate compared to traditional methods, delivering clearer and more effective audio instructions. This is the first large-scale application of deep learning in driving audio navigation, marking a substantial advancement in intelligent transportation and driving assistance technologies.

941Learning Neural Networks with Distribution Shift: Efficiently Certifiable Guarantees

[openreview] [pdf]

Abstract We give the first provably efficient algorithms for learning neural networks with respect to distribution shift. We work in the Testable Learning with Distribution Shift framework (TDS learning) of Klivans et al. (2024), where the learner receives labeled examples from a training distribution and unlabeled examples from a test distribution and must either output a hypothesis with low test error or reject if distribution shift is detected. No assumptions are made on the test distribution.All prior work in TDS learning focuses on classification, while here we must handle the setting of nonconvex regression. Our results apply to real-valued networks with arbitrary Lipschitz activations and work whenever the training distribution has strictly sub-exponential tails. For training distributions that are bounded and hypercontractive, we give a fully polynomial-time algorithm for TDS learning one hidden-layer networks with sigmoid activations. We achieve this by importing classical kernel methods into the TDS framework using data-dependent feature maps and a type of kernel matrix that couples samples from both train and test distributions.

942Convergence Analysis of the Wasserstein Proximal Algorithm beyond Convexity

[openreview] [pdf]

Abstract The proximal algorithm is a powerful tool to minimize nonlinear and nonsmooth functionals in a general metric space. Motivated by the recent progress in studying the training dynamics of the noisy gradient descent algorithm on two-layer neural networks in the mean-field regime, we provide in this paper a simple and self-contained analysis for the convergence of the general-purpose Wasserstein proximal algorithm without assuming geodesic convexity on the objective functional. Under a natural Wasserstein analog of the Euclidean Polyak-{\L}ojasiewicz inequality, we show that the proximal algorithm achieves an unbiased and linear convergence rate. Our convergence rate improves upon existing rates of the proximal algorithm for solving Wasserstein gradient flows under strong geodesic convexity. We also extend our analysis to the inexact proximal algorithm for geodesically semiconvex objectives. In our numerical experiments, proximal training demonstrates a faster convergence rate than the noisy gradient descent algorithm on mean-field neural networks.

943Towards Black-Box Membership Inference Attack for Diffusion Models

[openreview] [pdf]

Abstract Given the rising popularity of AI-generated art and the associated copyright con- cerns, identifying whether an artwork was used to train a diffusion model is an important research topic. The work approaches this problem from the membership inference attack (MIA) perspective. We first identify the limitation of applying existing MIA methods for proprietary diffusion models: the required access of internal U-nets. To address the above problem, we introduce a novel member- ship inference attack method that uses only the image-to-image variation API and operates without access to the model’s internal U-net. We validate our method using DDIM and Stable Diffusion setups and further extend both our approach and existing algorithms to the Diffusion Transformer architecture. Our experimental results consistently outperform previous methods.

944REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability

[openreview] [pdf]

Abstract Understanding the agent’s learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent’s decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D-environments or with uncomplicated transition dynamics. Understanding the agent’s learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent’s learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent’s performance in the test. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a more clear and robust explanation of the agent’s learning process. The experiments demonstrate that explanations derived from this framework can effectively help optimize the training tasks, resulting in improved learning efficiency and final performance.

945Bridging Lottery Ticket and Grokking: Understanding Grokking from Inner Structure of Networks

[openreview] [pdf]

Abstract Grokking is the intriguing phenomenon of delayed generalization: networks initially memorize training data with perfect accuracy but poor generalization, then transition to a generalizing solution with continued training. While reasons for this delayed generalization, such as weight norms and sparsity, have been discussed, the influence of network structure, particularly the role of subnetworks, still needs to be explored. In this work, we link the grokking phenomenon to the lottery ticket hypothesis to investigate the impact of inner network structures. We demonstrate that using lottery tickets obtained at the generalizing phase (`grokking tickets’) significantly reduces delayed generalization on various tasks, including multiple modular arithmetic, polynomial regression, sparse parity, and MNIST. For example, lottery tickets accelerate the grokking (transition from memorization to generalization) up to \emph{1/65} compared to dense networks in modular addition. Through a series of controlled experiments, our findings reveal that neither small weight norms nor sparsity alone account for the reduction of delayed generalization; instead, the presence of a good subnetwork structure is crucial. Analyzing the transition from memorization to generalization, we observe that rapid changes in subnetwork structures, measured by the Jaccard distance, strongly correlate with improvements in test accuracy. We further show that pruning techniques can accelerate the grokking process, transforming a memorizing network into a generalizing one without updating the weights. By demonstrating that good subnetworks are key to achieving generalization and that pruning can expedite this process, we provide new insights into the mechanisms underlying neural network generalization.

946LLM Cascade with Multi-Objective Optimal Consideration

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated exceptional capabilities in understanding and generating natural language. However, their high deployment costs often pose a barrier to practical applications, especially. Cascading local and server models offers a promising solution to this challenge. While existing studies on LLM cascades have primarily focused on the performance-cost trade-off, real-world scenarios often involve more complex requirements. This paper introduces a novel LLM Cascade strategy with Multi-Objective Optimization, enabling LLM cascades to consider additional objectives (e.g., privacy) and better align with the specific demands of real-world applications while maintaining their original cascading abilities. Extensive experiments on three benchmarks validate the effectiveness and superiority of our approach.

947Universal Concavity-Aware Descent Rate for Optimizers

[openreview] [pdf]

Abstract Many machine learning problems involve a challenging task of calibrating parameters in a computational model to fit the training data; this task is especially challenging for non-convex problems. Many optimization algorithms have been proposed to assist in calibrating these parameters, each with its respective advantages in different scenarios, but it is often difficult to determine the scenarios for which an algorithm is best suited. To contend with this challenge, much work has been done on proving the rate at which these optimizers converge to their final solution, however the wide variety of such convergence rate bounds, each with their own different assumptions, convergence metrics, tightnesses, and parameters (which may or may not be known to the practitioner) make comparing these convergence rates difficult. To help with this problem, we present a minmax-optimal algorithm and, by comparison to it, give a single descent bound which is applicable to a very wide family of optimizers, tasks, and data (including all of the most prevalent ones), which also puts special emphasis on being tight even in parameter subspaces in which the cost function is concave.

948Almost Sure Convergence of Average Reward Temporal Difference Learning

[openreview] [pdf]

Abstract Tabular average reward Temporal Difference (TD) learning is perhaps the simplest and the most fundamental policy evaluation algorithm in average reward reinforcement learning. After at least 25 years since its discovery, we are finally able to provide a long-awaited almost sure convergence analysis. Namely, we are the first to prove that, under very mild conditions, tabular average reward TD converges almost surely to a sample path dependent fixed point. Key to this success is a new general stochastic approximation result concerning nonexpansive mappings with Markovian and additive noise, built on recent advances in stochastic Krasnoselskii-Mann iterations.

949Differentiable Integer Linear Programming

[openreview] [pdf]

Abstract Machine learning (ML) techniques have shown great potential in generating high-quality solutions for integer linear programs (ILPs). However, existing methods typically rely on asupervised learningparadigm, leading to (1)expensive training costdue to repeated invocations of traditional solvers to generate training labels, and (2)plausible yet infeasible solutionsdue to the misalignment between the training objective (minimizing prediction loss) and the inference objective (generating high-quality solutions). To tackle this challenge, we proposeDiffILO(DifferentiableIntegerLinear ProgrammingOptimization), anunsupervised learning paradigm for learning to solve ILPs. Specifically, through a novel probabilistic modeling, DiffILO reformulates ILPs---discrete and constrained optimization problems---into continuous, differentiable (almost everywhere), and unconstrained optimization problems. This reformulation enables DiffILO to simultaneously solve ILPs and train the model via straightforward gradient descent, providing two major advantages. First, it significantly reduces the training cost, as the training process does not need the aid of traditional solvers at all. Second, it facilitates the generation of feasible and high-quality solutions, as the modellearns to solve ILPsin an end-to-end manner, thus aligning the training and inference objectives. Experiments on commonly used ILP datasets demonstrate that DiffILO not only achieves an average training speedup of 13.2 times compared to supervised methods, but also outperforms them by generating heuristic solutions with significantly higher feasibility ratios and much better solution qualities.

950RouteFinder: Towards Foundation Models for Vehicle Routing Problems

[openreview] [pdf]

Abstract This paper introduces RouteFinder, a comprehensive foundation model framework to tackle different Vehicle Routing Problem (VRP) variants. Our core idea is that a foundation model for VRPs should be able to represent variants by treating each as a subset of a generalized problem equipped with different attributes. We propose a unified VRP environment capable of efficiently handling any attribute combination. The RouteFinder model leverages a modern transformer-based encoder and global attribute embeddings to improve task representation. Additionally, we introduce two reinforcement learning techniques to enhance multi-task performance: mixed batch training, which enables training on different variants at once, and multi-variant reward normalization to balance different reward scales. Finally, we propose efficient adapter layers that enable fine-tuning for new variants with unseen attributes. Extensive experiments on 24 VRP variants show RouteFinder achieves competitive results. Our code is openly available.

951A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization

[openreview] [pdf]

Abstract Most decentralized optimization algorithms are handcrafted. While endowed with strong theoretical guarantees, these algorithms generally target a broad class of problems, thereby not being adaptive or customized to specific problem features. This paper studies data-driven decentralized algorithms trained to exploit problem features to boost convergence. Existing learning-to-optimize methods typically suffer from poor generalization or prohibitively vast search spaces. In addition, they face more challenges in decentralized settings where nodes must reach consensus through neighborhood communications without global information. To resolve these challenges, this paper first derives the necessary conditions that successful decentralized algorithmic rules need to satisfy to achieve both optimality and consensus. Based on these conditions, we propose a novelMathematics-inspiredLearning-to-optimize framework forDecentralizedoptimization (MiLoDo). Empirical results demonstrate that MiLoDo-trained algorithms outperform handcrafted algorithms and exhibit strong generalizations. Algorithms learned via MiLoDo in 100 iterations perform robustly when running 100,000 iterations during inferences. Moreover, MiLoDo-trained algorithms on synthetic datasets perform well on problems involving real data, higher dimensions, and different loss functions.

952Graph Supervised Contrastive Learning for Geodemographics

[openreview] [pdf]

Abstract Geodemographic analysis is essential for understanding population characteristics and addressing socio-economic disparities across regions. However, limited research has been conducted on modelling changes in demographic data over time using Graph Neural Networks (GNNs). In this study, we address this gap by leveraging GNNs to model correlations between the 2011 census data (England & Wales), observing changes over time, and the Output Area Classification 2021, which reflects socio-economic differences between Output Areas. We propose a novel framework that utilises Supervised Contrastive Learning on graphs to obtain robust OA embeddings, with a particular focus on improving the model’s performance for minority classes. To evaluate the effectiveness of our framework, we conducted two downstream tasks based on the 2021 OA embeddings. Our results demonstrate that the proposed approach provides valuable insights for geodemographic analysis and offers policymakers a useful tool for assessing socio-economic transitions over time, and planning ahead on the basis of it.

953Effective LLM Knowledge Learning Requires Rethinking Generalization

[openreview] [pdf]

Abstract Large language models (LLMs) are trained on a substantial amount of documents that contain extensive world knowledge. However, it is still not well-understood how knowledge is acquired via autoregressive pre-training and extracted via question-answering. This lack of understanding greatly hinders effective knowledge learning, especially for continued pre-training on up-to-date information, as this evolving information often does not have diverse repetitions like foundational knowledge. In this paper, we focus on understanding and improving LLM knowledge learning. We found and verified that knowledge learning for LLMs can be deemed as an implicit supervised task hidden in the autoregressive pre-training objective. Our findings suggest that knowledge learning for LLMs would benefit from methods designed to improve generalization ability for supervised tasks. Based on our analysis, we propose to diversify training documents’ formats as data augmentation to grow in-distribution samples. This data augmentation method does not present the risk of altering the facts embedded in documents as text paraphrasing. We also introduce sharpness-aware minimization as an effective optimization algorithm to better improve generalization. Moreover, we adapt our method to instruction tuning for generalization to various phrasings of questions. Extensive experiment results validate our findings and demonstrate our methods’ effectiveness in improving knowledge learning in both the continued pre-training and instruction tuning stages. This paper offers new perspectives and insights to interpret and design effective strategies for LLM knowledge learning.

[openreview] [pdf]

Abstract Research on Out-Of-Distribution (OOD) detection focuses mainly on building scores that efficiently distinguish OOD data from In Distribution (ID) data. On the other hand, Conformal Prediction (CP) uses non-conformity scores to construct prediction sets with probabilistic coverage guarantees. In other words, the former designs scores, while the latter designs probabilistic guarantees based on scores. Therefore, we claim that these two fields might be naturally intertwined. This work advocates for cross-fertilization between OOD and CP by formalizing their link and emphasizing two benefits of using them jointly. First, we show that in standard OOD benchmark settings, evaluation metrics can be overly optimistic due to the test dataset’s finite sample size. Based on the work of (Bates et al, 2022), we define newconformal AUROCandconformal FRP@TPRβ\betametrics, which are corrections that provide probabilistic conservativeness guarantees on the variability of these metrics. We show the effect of these corrections on two reference OOD and anomaly detection benchmarks, OpenOOD (Yang et al, 2022) and ADBench (Han et al. 2022). Second, we explore using OOD scores as non-conformity scores and show that they can improve the efficiency of the prediction sets obtained with CP.

955Last-Iterate Convergence Properties of Regret-Matching Algorithms in Games

[openreview] [pdf]

Abstract We study last-iterate convergence properties of algorithms for solving two-player zero-sum games based on Regret Matching+^+ (RM+^+). Despite their widespread use for solving real games, virtually nothing is known about their last-iterate convergence. A major obstacle to analyzing RM-type dynamics is that their regret operators lack Lipschitzness and (pseudo)monotonicity. We start by showing numerically that several variants used in practice, such as RM+^+, predictive RM+^+ and alternating RM+^+, all lack last-iterate convergence guarantees even on a simple 3×33\times 3 matrix game. We then prove that recent variants of these algorithms based on a smoothing technique, extragradient RM+^{+} and smooth Predictive RM+^+, enjoy asymptotic last-iterate convergence (without a rate), 1/t1/\sqrt{t} best-iterate convergence, and when combined with restarting, linear-rate last-iterate convergence. Our analysis builds on a new characterization of the geometric structure of the limit points of our algorithms, marking a significant departure from most of the literature on last-iterate convergence. We believe that our analysis may be of independent interest and offers a fresh perspective for studying last-iterate convergence in algorithms based on non-monotone operators.

956Questioning Simplicity Bias Assumptions

[openreview] [pdf]

Abstract The Simplicity Bias (SB) is the observation that the training of most commonly used neural network architectures with standard training techniques is biased toward learning simple functions. This phenomenon can be a benefit or drawback depending on the relative complexity of the desired function to be learnt. If the desired function is relatively simple it’s a positive. However, if there are simpler features that are highly predictive; commonly named shortcuts or spurious features, that are not present in the test environment, the SB can result in poor generalisation performance. Most existing works on mitigating the SB make various assumptions, either about the features present in the train and test domains or by assuming access to information about the test domain at train time. In this paper we review recent work on the SB and take a critical look at these assumptions.

957Monophilic Neighbourhood Transformers

[openreview] [pdf]

Abstract Graph neural networks (GNNs) have seen widespread application across diverse fields, including social network analysis, chemical research, and computer vision. Nevertheless, their efficacy is compromised by an inherent reliance on the homophily assumption, which posits that adjacent nodes should exhibit relevance or similarity. This assumption becomes a limitation when dealing with heterophilic graphs, where it is more common for dissimilar nodes to be connected. Addressing this challenge, recent research indicates that real-world graphs generally exhibit monophily, a characteristic where a node tends to be related to the neighbours of its neighbours. Inspired by this insight, we introduce Neighbourhood Transformers (NT), a novel approach that employs self-attention within every neighbourhood of the graph to generate informative messages for the nodes within, as opposed to the central node in conventional GNN frameworks. We develop a neighbourhood partitioning strategy equipped with switchable attentions, significantly reducing space consumption by over 95% and time consumption by up to 92.67% in NT. Experimental results on node classification tasks across 5 heterophilic and 5 homophilic graphs demonstrate that NT outperforms current state-of-the-art methods, showcasing their expressiveness and adaptability to different graph types. The code for this study will be made available following the publication of this manuscript.

958Dynamic Elimination For PAC Optimal Item Selection From Relative Feedback

[openreview] [pdf]

Abstract We study the problem of best-item identification from relative feedback where a learner adaptively plays subsets of items and receives stochastic feedback in the form of the best item in the set. We propose an algorithm - Dynamic Elimination (DE) - that dynamically prunes sub-optimal items from contention to efficiently identify the best item and show a strong sample complexity upper bound for it. We further formalize the notion of inferred updates to obtain estimates on item win rates without directly playing them by leveraging item correlation information. We propose the Dynamic Elimination by Correlation (DEBC) algorithm as an extension to DE with inferred updates. We show through extensive experiments that DE and DEBC significantly outperform all existing baselines across multiple datasets in various settings.

959TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning

[openreview] [pdf]

Abstract This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajectories are typically parameterized by trajectory generators such as Movement Primitives (MP), allowing for smooth and efficient exploration over long horizons while capturing high-level temporal correlations. However, ERL methods are often constrained to on-policy frameworks due to the difficulty of evaluating state-action values for entire action sequences, limiting their sample efficiency and preventing the use of more efficient off-policy architectures. TOP-ERL addresses this shortcoming by segmenting long action sequences and estimating the state-action values for each segment using a transformer-based critic architecture alongside an n-step return estimation. These contributions result in efficient and stable training that is reflected in the empirical results conducted on sophisticated robot learning environments. TOP-ERL significantly outperforms state-of-the-art RL methods. Thorough ablation studies additionally show the impact of key design choices on the model performance.

960How Far Are We from True Unlearnability?

[openreview] [pdf]

Abstract High-quality data plays an indispensable role in the era of large models, but the use of unauthorized data for model training greatly damages the interests of data owners. To overcome this threat, several unlearnable methods have been proposed, which generate unlearnable examples (UEs) by compromising the training availability of data. Clearly, due to unknown training purpose and the powerful representation learning capabilities of existing models, these data are expected to be unlearnable for various task models, i.e., they will not help improve the model’s performance. However, unexpectedly, we find that on the multi-task dataset Taskonomy, UEs still perform well in tasks such as semantic segmentation, failing to exhibit cross-task unlearnability. This phenomenon leads us to question: How far are we from attaining truly unlearnable examples? We attempt to answer this question from the perspective of model optimization. We observe the difference of convergence process between clean models and poisoned models on a simple model using the loss landscape and find that only a part of the critical parameter optimization paths show significant differences, implying a close relationship between the loss landscape and unlearnability. Consequently, we employ the loss landscape to explain the underlying reasons for UEs and propose Sharpness-Aware Learnability (SAL) for quantifying the unlearnability of parameters based on this explanation. Furthermore, we propose an Unlearnable Distance (UD) metric to measure the unlearnability of data based on the SAL distribution of parameters in clean and poisoned models. Finally, we conduct benchmark tests on mainstream unlearnable methods using the proposed UD, aiming to promote community awareness of the capability boundaries of existing unlearnable methods. The code is available athttps://github.com/MLsecurityLab/HowFarAreFromTrueUnlearnability.git.

961Characterizing the Training Dynamics of Private Fine-tuning with Langevin Diffusion

[openreview] [pdf]

Abstract We show that differentially private full fine-tuning (DP-FFT) can distort pre-trained backbone features based on both theoretical and empirical results. We identify the cause of the distortion as the misalignment between the pre-trained backbone and the randomly initialized linear head. We prove that a sequential fine-tuning strategy can mitigate the feature distortion: first-linear-probing-then-fine-tuning (DP-LP-FFT). A new approximation scheme allows us to derive approximate upper and lower bounds on the training loss of DP-LP and DP-FFT, in a simple but canonical setting of 2-layer neural networks with ReLU activation. Experiments on real-world datasets and architectures are consistent with our theoretical insights. We also derive new upper bounds for 2-layer linear networks without the approximation. Moreover, our theory suggests a trade-off of privacy budget allocation in multi-phase fine-tuning methods like DP-LP-FFT.

962On the Benefits of Attribute-Driven Graph Domain Adaptation

[openreview] [pdf]

Abstract Graph Domain Adaptation (GDA) addresses a pressing challenge in cross-network learning, particularly pertinent due to the absence of labeled data in real-world graph datasets. Recent studies attempted to learn domain invariant representations by eliminating structural shifts between graphs. In this work, we show that existing methodologies have overlooked the significance of the graph node attribute, a pivotal factor for graph domain alignment. Specifically, we first reveal the impact of node attributes for GDA by theoretically proving that in addition to the graph structural divergence between the domains, the node attribute discrepancy also plays a critical role in GDA. Moreover, we also empirically show that the attribute shift is more substantial than the topology shift, which further underscore the importance of node attribute alignment in GDA. Inspired by this finding, a novel cross-channel module is developed to fuse and align both views between the source and target graphs for GDA. Experimental results on a variety of benchmark verify the effectiveness of our method.

963Enhancing Multi-Objective Offline RL with Adaptive Preference Integration

[openreview] [pdf]

Abstract Multi-objective reinforcement learning (MORL) is crucial for real-world applications where multiple conflicting goals must be optimized, such as in healthcare or autonomous systems. Offline MORL extends these benefits by using pre-collected datasets, allowing for effective learning without continuous interaction with the environment. However, existing offline MORL algorithms often struggle with scaling across large preference spaces and handling unknown preferences during evaluation. To address these challenges, we propose the Preference-Attended Multi-Objective Decision Transformer (PA-MODT), a novel architecture that integrates a preference-attention block with a modular transformer structure. This design enables effective generalization over different preferences and trajectories, providing a more robust approach to generating optimal Pareto fronts. We tested PA-MODT on five D4MORL datasets with millions of trajectories representing various objectives and found that it consistently outperforms existing models, achieving Pareto fronts that align closely with behavioral policy. This demonstrates PA-MODT’s potential to effectively manage complex multi-objective reinforcement learning tasks.

964Optimal Algorithm for Max-Min Fair Bandit

[openreview] [pdf]

Abstract We consider a multi-player multi-armed bandit problem (MP-MAB) where NN players compete for KK arms in TT rounds. The reward distribution is heterogeneous where each player has a different expected reward for the same arm. When multiple players select the same arm, they collide and obtain zero reward. In this paper, we aim to find the max-min fairness matching that maximizes the reward of the player who receives the lowest reward. This paper improves the existing regret upper bound result of O(logTloglogT)O(\log T\log \log T) to achieve max-min fairness. More specifically, our decentralized fair elimination algorithm (DFE) deals with heterogeneity and collision carefully and attains a regret upper bounded of O((N2+K)logT/Δ)O((N^2+K)\log T / \Delta), where Δ\Delta is the minimum reward gap between max-min value and sub-optimal arms. We assume NKN\leq K to guarantee all players can select their arms without collisions. In addition, we also provide an Ω(maxN2,KlogT/Δ)\Omega(\max{N^2, K} \log T / \Delta) regret lower bound for this problem. This lower bound indicates that our algorithm is optimal with respect to key parameters, which significantly improves the performance of algorithms in previous work. Numerical experiments again verify the efficiency and improvement of our algorithms.

965CAN TRANSFORMERS IN-CONTEXT LEARN BEHAVIOR OF A LINEAR DYNAMICAL SYSTEM?

[openreview] [pdf]

Abstract We investigate whether transformers can learn to track a random process when given observations of a related process and parameters of the dynamical system that relates them as context. More specifically, we consider a finite-dimensional state-space model described by the state transition matrix FF, measurement matrices h1,,hNh_1, \dots, h_N, and the process and measurement noise covariance matrices QQ and RR, respectively; these parameters, randomly sampled, are provided to the transformer along with the observations y1,,yNy_1,\dots,y_N generated by the corresponding linear dynamical system. We argue that in such settings transformers learn to approximate the celebrated Kalman filter, and empirically verify this both for the task of estimating hidden states xN1,2,3,...,N^\hat{x_{N|1,2,3,...,N}} as well as for one-step prediction of the (N+1)st(N+1)^{st} observation, y^N+11,2,3,...,N\hat{y}_{N+1|1,2,3,...,N}. A further study of the transformer’s robustness reveals that its performance is retained even if the model’s parameters are partially withheld. In particular, we demonstrate that the transformer remains accurate at the considered task even in the absence of state transition and noise covariance matrices, effectively emulating operations of the Dual-Kalman filter.

966Dense Backpropagation Improves Routing for Sparsely-Gated Mixture-of-Experts

[openreview] [pdf]

Abstract Mixture of Experts (MoE) pretraining is more scalable than dense Transformer pretraining, because MoEs learn to route inputs to a sparse set of their feedforward parameters. However, this means that MoEs only receive a sparse backward update, leading to problems such as router load imbalance where some experts receive more tokens than others. We present a lightweight approximation method that gives the MoE a dense gradient while only sparsely activating its parameters. A key insight into the design of our method is that at scale, many tokens not routed to a given expert may nonetheless lie in the span of tokens that were routed to that expert, allowing us to create an approximation for the expert output of that token from existing expert outputs. Our dense backpropagation outperforms standard TopK routing across multiple MoE configurations without increasing runtime.

967Learning to Plan with Personalized Preferences

[openreview] [pdf]

Abstract Understanding and adapting to human preferences is essential for the effective integration of artificial agents into daily human life, particularly as AI becomes increasingly involved in collaboration and assistance roles. Previous studies on preference recognition in embodied intelligence have largely adopted a generalized yet non-personalized approach. To fill in this gap, our research focuses on empowering embodied agents to learn and adapt to individual preferences, a task complicated by the challenges of inferring these preferences from minimal observations and requiring robust few-shot generalization. To facilitate future study, we introduce PbP, an embodied environment that supports hundreds of diverse preferences ranging from complex action sequences to specific sub-actions. Our experiments on PbP reveal that while symbol-based approaches show promise in terms of effectiveness and scalability, accurately inferring implicit preferences and planning adaptive actions from limited data remain challenging. Nevertheless, preference serves as a valuable abstraction of human behaviors, and incorporating preference as a key intermediary step in planning can significantly enhance the personalization and adaptability of AI agents. We hope our findings can pave the way for future research on more efficient preference learning and personalized planning in dynamic environments.

968Learning Pattern-Specific Experts for Time Series Forecasting Under Patch-level Distribution Shift

[openreview] [pdf]

Abstract Time series forecasting, which aims to predict future values based on historical data, has garnered significant attention due to its broad range of applications. However, real-world time series often exhibit complex non-uniform distribution with varying patterns across segments, such as season, operating condition, or semantic meaning, making accurate forecasting challenging. Existing approaches, which typically train a single model to capture all these diverse patterns, often struggle with the pattern drifts between patches and may lead to poor generalization. To address these challenges, we propose TFPS, a novel architecture that leverages pattern-specific experts for more accurate and adaptable time series forecasting. TFPS employs a dual-domain encoder to capture both time-domain and frequency-domain features, enabling a more comprehensive understanding of temporal dynamics. It then uses subspace clustering to dynamically identify distinct patterns across data patches. Finally, pattern-specific experts model these unique patterns, delivering tailored predictions for each patch. By explicitly learning and adapting to evolving patterns, TFPS achieves significantly improved forecasting accuracy. Extensive experiments on real-world datasets demonstrate that TFPS outperforms state-of-the-art methods, particularly in long-term forecasting, through its dynamic and pattern-aware learning approach. The data and codes are available:https://anonymous.4open.science/r/TFPS-D001.

969Multi-Resolution Decomposable Diffusion Model for Non-Stationary Time Series Anomaly Detection

[openreview] [pdf]

Abstract Recently, generative models have shown considerable promise in unsupervised time series anomaly detection. Nonetheless, the task of effectively capturing complex temporal patterns and minimizing false alarms becomes increasingly challenging when dealing with non-stationary time series, characterized by continuously fluctuating statistical attributes and joint distributions. To confront these challenges, we underscore the benefits of multi-resolution modeling, which improves the ability to distinguish between anomalies and non-stationary behaviors by leveraging correlations across various resolution scales. In response, we introduce aMulti-ResolutionDecomposable DiffusionModel (MODEM), which integrates a coarse-to-fine diffusion paradigm with a frequency-enhanced decomposable network to adeptly navigate the intricacies of non-stationarity. Technically, the coarse-to-fine diffusion model embeds cross-resolution correlations into the forward process to optimize diffusion transitions mathematically. It then innovatively employs low-resolution recovery to guide the reverse trajectories of high-resolution series in a coarse-to-fine manner, enhancing the model’s ability to learn and elucidate underlying temporal patterns. Furthermore, the frequency-enhanced decomposable network operates in the frequency domain to extract globally shared time-invariant information and time-variant temporal dynamics for accurate series reconstruction. Extensive experiments conducted across five real-world datasets demonstrate that our proposed MODEM achieves state-of-the-art performance and can be generalized to other time series tasks. The code will be publicly available upon acceptance.

970Multiple-Frequencies Population-Based Training

[openreview] [pdf]

Abstract Reinforcement Learning’s high sensitivity to hyperparameters is a source of instability and inefficiency, creating significant challenges for practitioners. Hyperparameter Optimization (HPO) algorithms have been developed to address this issue, among them Population-Based Training (PBT) stands out for its ability to generate hyperparameters schedules in a single training run. PBT trains a population of agents, each with its own hyperparameters, frequently ranking them and replacing the worst performers with mutations of the best agents. These intermediate selection steps can cause PBT to focus on short-term improvements, leading it to get stuck in local optima and eventually fall behind vanilla Random Search over longer timescales. This paper studies how this greediness issue is connected to the choice ofevolution frequency, the rate at which the selection is done. We propose Multiple-Frequencies Population-Based Training (MF-PBT), a novel HPO algorithm that addresses greediness by employing sub-populations, each evolving at distinct frequencies. MF-PBT introduces a migration process to transfer information between sub-populations, with an asymmetric design to balance short and long-term optimization. Extensive experiments on the Brax suite demonstrate that MF-PBT improves sample efficiency and long-term performance, even without tuning hyperparameters. Code will be released.

971Offline-to-Online Reinforcement Learning with Classifier-Free Diffusion Generation

[openreview] [pdf]

Abstract Offline-to-online Reinforcement Learning (O2O RL) aims to perform online fine-tuning on an offline pre-trained policy to minimize costly online interactions. Existing methods have used offline data or online data to generate new data for data augmentation, which has led to performance improvement during online fine-tuning. However, they have not fully analyzed and utilized both types of data simultaneously. Offline data helps prevent agents from settling too early on suboptimal policies by providing diverse data, while online data improves training stability and speeds up convergence. In this paper, we propose a data augmentation approach, Classifier-Free Diffusion Generation (CFDG). Considering the differences between offline data and online data, we use conditional diffusion to generate both types of data for augmentation in the online phase, aiming to improve the quality of sample generation. Experimental results show that CFDG outperforms replaying the two data types or using a standard diffusion model to generate new data. Our method is versatile and can be integrated with existing offline-to-online RL algorithms. By implementing CFDG to popular methods IQL, PEX and APL, we achieve a notable 15% average improvement in empirical performance on the D4RL benchmark like MuJoCo and AntMaze.

972Auditing Data Controller Compliance with Data Withdrawal

[openreview] [pdf]

Abstract We study auditing total data withdrawal, the case in which a user requests the exclusion of their data from both the training and test data for some machine learning task. This approach is motivated by the need for comprehensive compliance with data privacy regulations and legal frameworks around the world. We conceptualize the task of auditing total data withdrawal as an optimization problem. Compliance verification is conducted under mild assumptions using a dedicated verification algorithm. We then evaluate this formulation over various datasets, architectures, and verification hyperparameters. Our verification algorithm serves as a tool for regulators to ensure auditable compliance and provides enhanced privacy guarantees for users.

973Real-World Benchmarks Make Membership Inference Attacks Fail on Diffusion Models

[openreview] [pdf]

Abstract Membership inference attacks (MIAs) on diffusion models have emerged as potential evidence of unauthorized data usage in training pre-trained diffusion models. These attacks aim to detect the presence of specific images in training datasets of diffusion models. Our study delves into the evaluation of state-of-the-art MIAs on diffusion models and reveals critical flaws and overly optimistic performance estimates in existing MIA evaluation. We introduce CopyMark, a more realistic MIA benchmark that distinguishes itself through the support for pre-trained diffusion models, unbiased datasets, and fair evaluation pipelines. Through extensive experiments, we demonstrate that the effectiveness of current MIA methods significantly degrades under these more practical conditions. Based on our results, we alert that MIA, in its current state, is not a reliable approach for identifying unauthorized data usage in pre-trained diffusion models. To the best of our knowledge, we are the first to discover the performance overestimation of MIAs on diffusion models and present a unified benchmark for more realistic evaluation.

974Online Decision Deferral under Budget Constraints

[openreview] [pdf]

Abstract Machine Learning (ML) models are increasingly used to support or substitute decision making. In applications where skilled experts are a limited resource, it is crucial to reduce their burden and automate decisions when the performance of an ML model is at least of equal quality. However, models are often pre-trained and fixed, while tasks arrive sequentially and their distribution may shift. In that case, the respective performance of the decision makers may change, and the deferral algorithm must remain adaptive. We propose a contextual bandit model of this online decision making problem. Our framework includes budget constraints and different types of partial feedback models. Beyond the theoretical guarantees of our algorithm, we propose efficient extensions that achieve remarkable performance on real-world datasets.

975Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales

[openreview] [pdf]

Abstract Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance. Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning from AI Feedback (RLAIF) introduce additional challenges. For instance, diverse preferences complicate the alignment process, and prediction errors in a trained reward model can become more severe as the LLM generates unseen outputs. These RL challenges create confusion about whether the probability of an action for a given state should be increased or decreased, similar to the noise in labels for classification tasks. In this work, we enhance the stability of the RL training procedure by adapting reverse cross-entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss. We demonstrate performance improvements across various tasks and scales. We conduct experiments in discrete action tasks (Atari games) and continuous action space tasks (MuJoCo benchmark and Box2D) using Symmetric A2C (SA2C) and Symmetric PPO (SPPO), with and without added noise. Notably, SPPO shows strong performance across different hyperparameters. Furthermore, we validate the benefits of the symmetric RL loss in the RLHF framework using PPO for natural language processing tasks, demonstrating improved performance in tasks such as IMDB positive sentiment and TL;DR summarization.

976Gradient based Causal Discovery with Diffusion Model

[openreview] [pdf]

Abstract Causal discovery from observational data is an important problem in many applied sciences. Incorporating a recently proposed smooth characterization of acyclicity, gradient-based causal discovery approaches search for a Directed Acyclic Graph (DAG) by optimizing various neural models. Although they show some inspiring results given certain assumptions satisfied, their capability of modeling complex nonlinear causal generative functions is still unsatisfactory. Motivated by recent advances in deep generative models, we propose to use diffusion models for causal discovery, and search for the DAG under continuous optimization frameworks. With flexible parameter configurations, diffusion model has the ability to represent various functions, and the proposed causal discovery approach are able to generate graphs with satisfactory accuracy on observational data generated by either linear or nonlinear causal models. This is evidenced by empirical results on both synthetic and real data.

977Zero-Order Diffusion Guidance for Inverse Problems

[openreview] [pdf]

Abstract We propose zero order diffusion guidance, a method that allows using a diffusion model to solve inverse problems without access to the gradients of the process we seek to invert. Our method employs a zero-order gradient estimator combined with a novel differentiable dimensionality reduction strategy to approximate true gradients during guidance while keeping the task computationally tractable in thousands of dimensions. We apply our method to model inversion and demonstrate how it can be used to reconstruct high-quality faces in a realistic scenario where the adversary has only black-box access to face embeddings. Across a range of inverse problems—including synthetic experiments and JPEG restoration—we show that access to gradients is not necessary for effective guidance. Our black-box method matches white-box performance, thus expanding the scope of inverse problems that can be solved with diffusion-based approaches.

978Extending Stability Analysis to Adaptive Optimization Algorithms Using Loss Surface Geometry

[openreview] [pdf]

Abstract Adaptive optimization algorithms, such as Adam Kingma & Ba (2015) and RM-SProp Tieleman & Hinton (2012), have become integral to training deep neu-ral networks, yet their stability properties and impact on generalization remain poorly understood Wilson et al. (2017). This paper extends linear stability anal-ysis to adaptive optimizers, providing a theoretical framework that explains their behavior in relation to loss surface geometry Wu et al. (2022); Jastrz˛ebski et al.(2019). We introduce a novel generalized coherence measure that quantifies the interaction between the adaptive preconditioner and the Hessian of the loss func-tion. This measure yields necessary and sufficient conditions for linear stability near stationary points, offering insights into why adaptive methods may converge to sharper minima with poorer generalization. Our analysis leads to practical guidelines for hyperparameter tuning, demon-strating how to improve the generalization performance of adaptive optimizers. Through extensive experiments on benchmark datasets and architectures, includ-ing ResNet He et al. (2016) and Vision Transformers Dosovitskiy et al. (2020), we validate our theoretical predictions, showing that aligning the adaptive precon-ditioner with the loss surface geometry through careful parameter selection can narrow the generalization gap between adaptive methods and SGD Loshchilov & Hutter (2018).

979Adaptive Algorithm for Non-Stationary Online Convex-Concave Optimization

[openreview] [pdf]

Abstract This paper addresses the problem of Online Convex-Concave Optimization, an extension of Online Convex Optimization to two-player time-varying convex-concave games. Our objective is to minimize the dynamic duality gap (D-DGap), a key performance metric that evaluates the players’ strategies against arbitrary comparator sequences. Existing algorithms struggle to achieve optimal performance, particularly in stationary or predictable environments. We propose a novel, modular algorithm comprising three key components: an Adaptive Module that adjusts to varying levels of non-stationarity, a Multi-Predictor Aggregator that selects the optimal predictor from multiple candidates, and an Integration Module that seamlessly combines the strengths of both. Our algorithm guarantees a minimax optimal D-DGap upper bound, up to a logarithmic factor, while also achieving a prediction error-based D-DGap bound. Empirical results further demonstrate the effectiveness and adaptability of the proposed method.

980Four eyes see more than two: Dataset Distillation with Mixture-of-Experts

[openreview] [pdf]

Abstract The ever-growing size of datasets in deep learning presents a significant challenge in terms of training efficiency and computational cost. Dataset distillation (DD) has emerged as a promising approach to address this challenge by generating compact synthetic datasets that retain the essential information of the original data. However, existing DD methods often suffer from performance degradation when transferring distilled datasets across different network architectures (i.e. the model utilizing distilled dataset for further training is different from the one used in dataset distillation). To overcome this limitation, we propose a novel mixture-of-experts framework for dataset distillation. Our goal focuses on promoting diversity within the distilled dataset by distributing the distillation tasks to multiple expert models. Each expert specializes in distilling a distinct subset of the dataset, encouraging them to capture different aspects of the original data distribution. To further enhance diversity, we introduce a distance correlation minimization strategy to encourage the experts to learn distinct representations. Moreover, during the testing stage (where the distilled dataset is used for training a new model), the mixup-based fusion strategy is applied to better leverage the complementary information captured by each expert. Through extensive experiments, we demonstrate that our framework effectively mitigates the issue of cross-architecture performance degradation in dataset distillation, particularly in low-data regimes, leading to more efficient and versatile deep learning models while being trained upon the distilled dataset.

981Capability Localization: Capabilities Can be Localized rather than Individual Knowledge

[openreview] [pdf]

Abstract Large scale language models have achieved superior performance in tasks related to natural language processing, however, it is still unclear how model parameters affect performance improvement. Previous studies assumed that individual knowledge is stored in local parameters, and the storage form of individual knowledge is dispersed parameters, parameter layers, or parameter chains, which are not unified. We found through fidelity and reliability evaluation experiments that individual knowledge cannot be localized. Afterwards, we constructed a dataset for decoupling experiments and discovered the potential for localizing data commonalities. To further reveal this phenomenon, this paper proposes a Commonality Neuron Localization (CNL) method, which successfully locates commonality neurons and achieves a neuron overlap rate of 96.42% on the GSM8K dataset. Finally, we have demonstrated through cross data experiments that commonality neurons are a collection of capability neurons that possess the capability to enhance performance.

982Decision Information Meets Large Language Models: The Future of Explainable Operations Research

[openreview] [pdf]

Abstract Operations Research (OR) is vital for decision-making in many industries. While recent OR methods have seen significant improvements in automation and efficiency through integrating Large Language Models (LLMs), they still struggle to produce meaningful explanations. This lack of clarity raises concerns about transparency and trustworthiness in OR applications. To address these challenges, we propose a comprehensive framework, Explainable Operations Research (EOR), emphasizing actionable and understandable explanations accompanying optimization. The core of EOR is the concept of Decision Information, which emerges from what-if analysis and focuses on evaluating the impact of complex constraints (or parameters) changes on decision-making. Specifically, we utilize bipartite graphs to quantify the changes in the OR model and adopt LLMs to improve the explanation capabilities. Additionally, we introduce the first industrial benchmark to rigorously evaluate the effectiveness of explanations and analyses in OR, establishing a new standard for transparency and clarity in the field.

983Towards Infinite-Long Prefix in Transformer

[openreview] [pdf]

Abstract Prompting and context-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks. They are empirically efficient and effective, matching the performance of full parameter fine-tuning, but the theoretical understandings are limited. In this paper, we aim to address this limitation by studying their ability from the perspective of prefix length. In particular, we provide a convergence guarantee for training an ultra-long prefix in a stylized setting using the Neural Tangent Kernel (NTK) framework. Based on this strong theoretical guarantee, we design and implement an algorithm that only needs to introduce and fine-tune a few extra trainable parameters instead of an infinite-long prefix in each layer of a transformer, and can approximate the prefix attention to a guaranteed polynomial-small error. Preliminary experimental results on vision, natural language, and math data show that our method achieves superior or competitive performance compared to existing methods like full parameters fine-tuning, P-Tuning V2, and LoRA. This demonstrates our method is promising for parameter-efficient fine-tuning.

984Think Twice Before You Act: Improving Inverse Problem Solving With MCMC

[openreview] [pdf]

Abstract Recent studies demonstrate that diffusion models can serve as a strong prior for solving inverse problems. A prominent example is Diffusion Posterior Sampling (DPS), which approximates the posterior distribution of data given the measure using Tweedie’s formula. Despite the merits of being versatile in solving various inverse problems without re-training, the performance of DPS is hindered by the fact that this posterior approximation can be inaccurate especially for high noise levels. Therefore, we propose Diffusion Posterior MCMC (DPMC), a novel inference algorithm based on Annealed MCMC to solve inverse problems with pretrained diffusion models. We define a series of intermediate distributions inspired by the approximated conditional distributions used by DPS. Through annealed MCMC sampling, we encourage the samples to follow each intermediate distribution more closely before moving to the next distribution at a lower noise level, and therefore reduce the accumulated error along the path. We test our algorithm in various inverse problems, including super resolution, Gaussian deblurring, motion deblurring, inpainting, and phase retrieval. Our algorithm outperforms DPS with less number of evaluations across nearly all tasks, and is competitive among existing approaches.

985Grounding Video Models to Actions through Goal Conditioned Exploration

[openreview] [pdf]

Abstract Large video models, pretrained on massive quantities of amount of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamic model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data is available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that uses trajectory level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show how our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations while without requiring any action annotations.

986Mitigating Overestimation in Offline Reinforcement Learning with Anomaly Detection

[openreview] [pdf]

Abstract Reinforcement Learning (RL) encounters substantial challenges in real-world applications, due to the time-consuming, costly, and risky nature of interacting with the environment. Offline Reinforcement Learning addresses this limitation by training models on static datasets, allowing an optimal policy to be learned from pre-collected data without requiring additional interactions with the environment. However, in this setting, when the agent queries actions outside the training data distribution, it can lead to overestimation of Q-values for OOD (Out-of-distribution) actions, ultimately hindering policy optimization. Previous works attempted to address this problem using explicit constraints such as penalty terms or support restriction. But these methods often fail to identify OOD actions or result in overly conservative Q-value estimates. We propose a novel solution that adjusts weights during training by using an anomaly detection model to identify the distribution of the offline dataset and employing anomaly scores to guide the offline RL process. Our method(RLAD) not only effectively mitigates the overestimation of OOD actions but also achieves near state-of-the-art performance on continuous D4RL tasks. Additionally, this framework is highly flexible, allowing for integration with various off-policy or offline RL algorithms and Anomaly Detection models to enhance performance.

987ENHANCING DIVERSITY AND ACCURACY IN PERSONALIZED TAG RECOMMENDATIONS: A HYBRID SEMANTIC AND CONTEXTUAL ANALYSIS APPROACH

[openreview] [pdf]

Abstract This paper introduces HYCOMB, a cascading Hybrid model that innovatively integrates Collaborative Filtering (CF), Content-Based Filtering (CB), and Context- Aware (CA) methods to address the challenge of data sparsity in tag recommendation systems. Unlike traditional models that rely heavily on user-item interactions, HYCOMB enhances recommendation diversity and interpretability by utilizing semantic clustering in CF to extract and analyze user sentiment from tags, adding a layer of nuanced understanding often missing in conventional systems. The CB component advances this by applying sophisticated NLP techniques to refine these recommendations based on item attributes, while the CA component incorporates movie synopses for deeper contextual understanding. Developed and tested using the MovieLens 20M dataset, our model demonstrates significant outperformance over baseline methods in terms of precision and recall, achieving scores of 0.813 and 0.364 respectively. Further, a newly introduced Overall Total Similarity metric that underscores its ability to deliver relevant and diverse recommendations. HYCOMB’s strategic amalgamation of CF, CB, and CA not only mitigates the effects of sparse data but also improves the precision and diversity of tag recommendations, reflecting a more accurate alignment with user preferences.

988FEDERATED COMPOSITIONAL OPTIMIZATION: THE IMPACT OF TWO-SIDED LEARNING RATES ON COMMUNICATION EFFICIENCY

[openreview] [pdf]

Abstract Compositional optimization (CO) has recently gained popularity due to its applications in distributionally robust optimization (DRO), meta-learning, reinforcement learning, and many other machine learning applications. The large-scale and distributed nature of data necessitates efficient federated learning (FL) algorithms for CO, but the compositional structure of the objective poses significant challenges. Current methods either rely on large batch gradients (which are impractical) or suffer from suboptimal communication efficiency. To address these challenges, we propose efficient FedAvg-type algorithms for solving non-convex CO in the FL setting. We first establish that standard FedAvg fails in solving the federated CO problems due to data heterogeneity, which amplifies bias in local gradient estimates. Our analysis establishes that either {\em additional communication} or {\em two-sided learning rate-based} algorithms are required to control this bias. To this end, we develop two algorithms for solving the federated CO problem. First, we propose FedDRO that utilizes the compositional problem structure to design a communication strategy that allows FedAvg to control the bias in the estimation of the compositional gradient, achieving O(ϵ2)\mathcal{O}(\epsilon^{-2}) sample and O(ϵ3/2)\mathcal{O}(\epsilon^{-3/2}) communication complexity. Then we propose DS-FedDRO, a two-sided learning rate algorithm, that eliminates the need for additional communication and achieves the optimal O(ϵ2)\mathcal{O}(\epsilon^{-2}) sample and O(ϵ1)\mathcal{O}(\epsilon^{-1}) communication complexity, highlighting the importance of two-sided learning rate algorithms for solving federated CO problems. The proposed algorithms avoid the need for large batch gradients and achieve linear speedup with the number of clients. We corroborate our theoretical findings with empirical studies on large-scale DRO problems.

989Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model

[openreview] [pdf]

Abstract Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity quality, and diversity of demonstrations. This paper explores improving offline-trained imitation learning models through online interactions with the environment. We introduce Policy Decorator, which uses a model-agnostic residual policy to refine large imitation learning models during online interactions. By implementing controlled exploration strategies, Policy Decorator enables stable, sample-efficient online learning. Our evaluation spans eight tasks across two benchmarks—ManiSkill and Adroit—and involves two state-of-the-art imitation learning models (Behavior Transformer and Diffusion Policy). The results show Policy Decorator effectively improves the offline-trained policies and preserves the smooth motion of imitation learning models, avoiding the erratic behaviors of pure RL policies. See ourproject pagefor videos.

990Balanced Hyperbolic Embeddings Are Natural Out-of-Distribution Detectors

[openreview] [pdf]

Abstract Out-of-distribution recognition forms an important and well-studied problem in computer vision, with the goal to filter out samples that do not belong to the distribution on which a network has been trained. The conclusion of this paper is simple: a good hierarchical hyperbolic embedding is preferred for discriminating in- and out-of-distribution samples. We introduce Balanced Hyperbolic Learning. We outline a hyperbolic class embedding algorithm that jointly optimizes for hierarchical distortion and balancing between shallow and wide subhierarchies. We can then use the class embeddings as hyperbolic prototypes for classification on in-distribution data. We outline how existing out-of-distribution scoring functions can be generalized to operate with hyperbolic prototypes. Empirical evaluations across 13 datasets and 13 scoring functions show that our hyperbolic embeddings outperform existing out-of-distribution approaches when trained on the same data with the same backbones. We also show that our hyperbolic embeddings outperform other hyperbolic approaches and naturally enable hierarchical out-of-distribution generalization.

991Understanding Synthetic Context Extension via Retrieval Heads

[openreview] [pdf]

Abstract Long-context LLMs are increasingly desired for a broad set of applications such as retrieval-augmented generation. The high cost for pretraining LLMs over long contexts has led to exploration of fine-tuning LLMs with synthetically generated data in a post-training stage. However, it remains unclear how and why fine-tuning on synthetic data transfers to long-context performance on realistic tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We explore synthetic data variants from the literature by varying the realism of the concept expression and context diversity of the data. We find that models trained on synthetic data fall short of the real data, but surprisingly, the mismatch can be interpreted and even predicted in terms of a special set of attention heads that are responsible for retrieval over long context, retrieval heads (Wu et al., 2024). The retrieval heads learned on synthetic data are mostly subsets of the retrieval heads learned on real data, and there is a strong correlation between the recall of heads learned and the downstream performance of a model. Furthermore, with attention knockout and activation patching, we mechanistically show that retrieval heads are not only necessary, but also provide fine-grained explanations for the performance gap between fine-tuning on synthetic and real data. Our results shed light on how to interpret the success and failure of synthetic data fine-tuning and how to create better synthetic data that can be transferred to realistic capabilities over long context.

992The Vital Role of Gradient Clipping in Byzantine-Resilient Distributed Learning

[openreview] [pdf]

Abstract Byzantine-resilient distributed machine learning seeks to achieve robust learning performance in the presence of misbehaving or adversarial workers. While state-of-the-art (SOTA) robust distributed gradient descent (Robust-DGD) methods were proven theoretically optimal, their empirical success has often relied on pre-aggregation gradient clipping. However, the currently considered static clipping strategy exhibits mixed results: improving robustness against some attacks while being ineffective or detrimental against others. We address this gap by proposing a principled adaptive clipping strategy, termed Adaptive Robust Clipping (ARC). We show that ARC consistently enhances the empirical robustness of SOTA Robust-DGD methods, while preserving the theoretical robustness guarantees. Our analysis shows that ARC provably improves the asymptotic convergence guarantee of Robust-DGD in the case when the model is well-initialized. We validate this theoretical insight through an exhaustive set of experiments on benchmark image classification tasks. We observe that the improvement induced by ARC is more pronounced in highly heterogeneous and adversarial settings.

993Inference-Time Alignment of Diffusion Models with Direct Noise Optimization

[openreview] [pdf]

Abstract In this work, we focus on the alignment problem of diffusion models with a continuous reward function, which represents specific objectives for downstream tasks, such as increasing darkness or improving the aesthetics of images. The central goal of the alignment problem is to adjust the distribution learned by diffusion models such that the generated samples maximize the target reward function. We propose a novel alignment approach, named Direct Noise Optimization (DNO), that optimizes the injected noise during the sampling process of diffusion models. By design, DNO operates at inference-time, and thus is tuning-free and prompt-agnostic, with the alignment occurring in an online fashion during generation. We rigorously study the theoretical properties of DNO and also propose variants to deal with non-differentiable reward functions. Furthermore, we identify that naive implementation of DNO occasionally suffers from the out-of-distribution reward hacking problem, where optimized samples have high rewards but are no longer in the support of the pretrained distribution. To remedy this issue, we leverage classical high-dimensional statistics theory to an effective probability regularization technique. We conduct extensive experiments on several important reward functions and demonstrate that the proposed DNO approach can achieve state-of-the-art reward scores within a reasonable time budget for generation.

994On Learning Representations for Tabular Dataset Distillation

[openreview] [pdf]

Abstract Dataset distillation generates a small set of information-rich instances from a large dataset, resulting in reduced storage requirements, privacy or copyright risks, and computational costs for downstream modeling, though much of the research has focused on the image data modality. We study tabular data distillation, which brings in novel challenges such as the inherent feature heterogeneity and the common use of non-differentiable learning models (such as decision tree ensembles and nearest-neighbor predictors). To mitigate these challenges, we present TDColER, a tabular data distillation framework via column embeddings-based representation learning. To evaluate this framework, we also present a tabular data distillation benchmark, TDBench. Based on an elaborate evaluation on TDBench, resulting in 226,200 distilled datasets and 541,980 models trained on them, we demonstrate that TDColER is able to boost the distilled data quality of off-the-shelf distillation schemes by 0.5-143% across 7 different tabular learning models.

995Decoupling Backdoors from Main Task: Toward the Effective and Durable Backdoors in Federated Learning

[openreview] [pdf]

Abstract Federated learning, as a distributed machine learning method, enables multiple participants to collaboratively train a central model without sharing their private data. However, this decentralized mechanism introduces new privacy and security concerns. Malicious attackers can embed backdoors into local models, which are inherited by the central global model through the federated aggregation process. While previous studies have demonstrated the effectiveness of backdoor attacks, the effectiveness and durability often rely on unrealistic assumptions, such as a large number of attackers and scaled malicious contributions. These assumptions arise because a sufficient number of attackers can neutralize the contributions of honest participants, allowing the backdoor to be successfully inherited by the central model. In this work, we attribute these backdoor limitations to the coupling between the main and backdoor tasks. To address these backdoor limitations, we propose a min-max backdoor attack framework that decouples backdoors from the main task, ensuring that these two tasks do not interfere with each other. The maximization phase employs the principle of universal adversarial perturbation to create triggers that amplify the performance disparity between poisoned and benign samples. These samples are then used to train a backdoor model in the minimization process. We evaluate the proposed framework in both image classification and semantic analysis tasks. Comparisons with four backdoor attack methods under five defense algorithms show that our method achieves good attack performance even if there is a small number of attackers and when the submitted model parameters are not scaled. In addition, even if attackers are completely removed in the training process, the implanted backdoors will not be dramatically weakened by the contributions of other honest participants.

996Divergence-enhanced Knowledge-guided Context Optimization for Visual-Language Prompt Tuning

[openreview] [pdf]

Abstract Prompt tuning vision-language models like CLIP has shown great potential in learning transferable representations for various downstream tasks. The main issue is how to mitigate the over-fitting problem on downstream tasks with limited training samples. While knowledge-guided context optimization (Yao et al.,2023; 2024) has been proposed by constructing consistency constraints to handle catastrophic forgetting in the pre-trained backbone, it also introduces a potential bias toward pre-training. This paper proposes a novel and simple Divergence-enhanced Knowledge-guided Prompt Tuning (DeKg) method to address this issue. The key insight is that the bias toward pre-training can be alleviated by encouraging the independence between the learnable and the crafted prompt. Specifically, DeKg employs the Hilbert-Schmidt Independence Criterion (HSIC) to regularize the learnable prompts, thereby reducing their dependence on prior general knowledge, and enabling divergence induced by target knowledge. Comprehensive evaluations demonstrate that DeKg serves as a plug-and-play module can seamlessly integrate with existing knowledge-guided methods and achieves superior performance in three challenging benchmarks.

997Editable Concept Bottleneck Models

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) have garnered much attention for their ability to elucidate the prediction process through a human-understandable concept layer. However, most previous studies focused on cases where the data, including concepts, are clean. In many scenarios, we always need to remove/insert some training data or new concepts from trained CBMs due to different reasons, such as privacy concerns, data mislabelling, spurious concepts, and concept annotation errors. Thus, the challenge of deriving efficient editable CBMs without retraining from scratch persists, particularly in large-scale applications. To address these challenges, we propose Editable Concept Bottleneck Models (ECBMs). Specifically, ECBMs support three different levels of data removal: concept-label-level, concept-level, and data-level. ECBMs enjoy mathematically rigorous closed-form approximations derived from influence functions that obviate the need for re-training. Experimental results demonstrate the efficiency and effectiveness of our ECBMs, affirming their adaptability within the realm of CBMs.

998Diffusion Attacker: Diffusion-Driven Prompt Manipulation for LLM Jailbreak

[openreview] [pdf]

Abstract Large Language Models can generate harmful content when prompted with carefully crafted inputs, a vulnerability known as LLM jailbreaking. As LLMs become more powerful, studying jailbreaking becomes a critical aspect of enhancing security and human value alignment. Currently, jailbreak is usually implemented by adding suffixes or using prompt templates, which suffers from low attack diversity. Inspired by diffusion models, this paper introduces the DiffusionAttacker, an end-to-end generative method for jailbreak rewriting. Our approach employs a seq2seq text diffusion model as a generator, conditioning on the original prompt and guiding the denoising process with a novel attack loss. This method preserves the semantic content of the original prompt while producing harmful content. Additionally, we leverage the Gumbel-Softmax technique to make the sampling process from the output distribution of the diffusion model differentiable, thereby eliminating the need for an iterative token search. Through extensive experiments on the Advbench and Harmbench, we show that DiffusionAttacker outperforms previous methods in various evaluation indicators including attack success rate (ASR), fluency, and diversity.

999Bad-PFL: Exploiting Backdoor Attacks against Personalized Federated Learning

[openreview] [pdf]

Abstract Data heterogeneity and backdoor attacks rank among the most significant challenges facing federated learning (FL). For data heterogeneity, personalized federated learning (PFL) enables each client to maintain a private personalized model to cater to client-specific knowledge. Meanwhile, vanilla FL has proven vulnerable to backdoor attacks. However, recent advancements in PFL community have demonstrated a potential immunity against such attacks. This paper explores this intersection further, revealing that existing federated backdoor attacks fail in PFL because backdoors about manually designed triggers struggle to survive in personalized models. To tackle this, we degisn Bad-PFL, which employs features from natural data as our trigger. As long as the model is trained on natural data, it inevitably embeds the backdoor associated with our trigger, ensuring its longevity in personalized models. Moreover, our trigger undergoes mutual reinforcement training with the model, further solidifying the backdoor’s durability and enhancing attack effectiveness. The large-scale experiments across three benchmark datasets demonstrate the superior performance of our attack against various PFL methods, even when equipped with state-of-the-art defense mechanisms.

1000Efficient Predictive Counterfactual Regret Minimization+Algorithm in Solving Extensive-Form Games

[openreview] [pdf]

Abstract Imperfect-information extensive-form games (IIGs) serve as a foundational model for capturing interactions among multiple agents in sequential settings with hidden information. A common objective of IIGs is to calculate a Nash equilibrium (NE). Counterfactual Regret Minimization (CFR) algorithms have been widely developed to learn an NE in two-player zero-sum IIGs. Among CFR algorithms, Predictive CFR+^+ (PCFR+^+) is powerful, usually achieving an extremely fast empirical convergence rate. However, PCFR+^+ suffers from the significant discrepancy between strategies represented by explicit accumulated counterfactual regrets across two consecutive iterations, which decreases the empirical convergence rate of PCFR+^+ in practice. To mitigate this significant discrepancy, we introduce a novel and effective variant of PCFR+^+, termed Pessimistic PCFR+^+ (P2PCFR+^+), minimizing the discrepancy between strategies represented by implicit and explicit accumulated regrets within the same iteration. We provide theoretical proof to show that P2PCFR+^+ exhibits a faster theoretical convergence rate than PCFR+^+. Experimental results demonstrate that P2PCFR+^+ outperforms other tested CFR variants.

1001Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs

[openreview] [pdf]

Abstract Multimodal large language models (MLLMs) offer a powerful mechanism for interpreting visual information. However, they often suffer from hallucinations, which impede the real-world usage of these models. Existing methods attempt to alleviate this issue by designing special decoding strategies that penalize the summary tokens. However, these methods lack analysis of the relationship between hallucination and summarization mechanism of LLMs. Interestingly, we find that penalizing summary tokens is not necessary: merely intervening the query-key parameters variance, without costing extra inference time, still alleviates hallucinations. Specifically, we explore the causes of hallucinations by analyzing localized self-attention patterns called ``anchor" tokens and define the attention localization degree of the model as token propagation probabilities. Our analysis reveals that over-propagation of anchor tokens occurs when the distribution of eigenvalues of the query and key matrices has a non-zero mean and a polarized variance, leading to excessive dependence on anchor tokens while neglecting vision information and describes the image content with hallucination. Based on the observation, we propose a versatile plug-and-play decoding strategy, Dynamic Token Propagation Mechanism (TAME), to alleviate excessive propagation by dynamically intervening the eigenspectrum variance of the attention weight, thereby alleviating hallucinations without relying on complex decoding strategies. Extensive experiments reveal a correlation between the eigenspectrum and hallucinations across various MLLMs, and show that TAME reduces the percentage of hallucinated objects.

1002Retrieval Augmented Imputation using Data Lake Tables

[openreview] [pdf]

Abstract Data imputation is an essential problem in many data science applications. Existing methods often struggle to impute missing values in scenarios where there is a lack of sufficient data redundancy. In this paper, leveraging large language models (LLMs) and data lakes, we propose a novel approach for retrieval-augmented imputation called RAI, utilizing fine-grained tuple-level retrieval instead of traditional coarse-grained table-based retrieval. RAI addresses the challenges of retrieving relevant tuples for missing value imputation from a data lake, where tuples have heterogeneous attributes, diverse values, and missing values. Rather than simply searching for similar tables, RAI employs a tuple encoder to learn meaningful representations for capturing tuple similarities and differences, enabling effective identification of candidate tuples. The retrieved results are further refined by a tuple reranker. We also introduce a new benchmark, mvBench, to advance further research. Extensive experiments demonstrate that RAI significantly outperforms existing methods. We conduct extensive experiments, demonstrating that RAI significantly outperforms state-of-the-art table-based retrieval-augmented imputation methods by 10.7%.

1003Learning from End User Data with Shuffled Differential Privacy

[openreview] [pdf]

Abstract We study a setting of collecting and learning from private data distributed across end users. In the shuffled model of differential privacy, the end users partially protect their data locally before sharing it, and their data is also anonymized during its collection to enhance privacy. This model has recently become a prominent alternative to central DP, which requires full trust in a central data curator, and local DP, where fully local data protection takes a steep toll on downstream accuracy.Our main technical result is a shuffled DP protocol for privately estimating the kernel density function of a distributed dataset, with accuracy essentially matching central DP. We use it to privately learn a classifier from the end user data, by learning a private density function per class. Moreover, we show that the density function itself can recover the semantic content of its class, despite having been learned in the absence of any unprotected data. Our experiments show the favorable downstream performance of our approach, and highlight key downstream considerations and trade-offs in a practical ML deployment of shuffled DP.

1004An Effective Manifold-based Optimization Method for Distributionally Robust Classification

[openreview] [pdf]

Abstract How to promote the robustness of existing deep learning models is a challenging problem for many practical classification tasks. Recently, Distributionally Robust Optimization (DRO) methods have shown promising potential to tackle this problem. These methods aim to construct reliable models by minimizing the worst-case risk within a local region (called ‘‘uncertainty set’’) around the empirical data distribution. However, conventional DRO methods tend to be overly pessimistic, leading to certain discrepancy between the real data distribution and the uncertainty set, which can degrade the classification performance. To address this issue, we propose a manifold-based DRO method that takes the geometric structure of training data into account for constructing the uncertainty set. Specifically, our method employs a carefully designed ‘‘game’’ that integrates contrastive learning with Jacobian regularization to capture the manifold structure, enabling us to solve DRO problems constrained by the data manifold. By utilizing a novel idea for approximating geodesic distance on manifolds, we also provide the theoretical guarantees for its robustness. Moreover, our proposed method is easy to implement in practice. We conduct a set of experiments on several popular benchmark datasets, where the results demonstrate our advantages in terms of accuracy and robustness.

1005Emergent properties with repeated examples

[openreview] [pdf]

Abstract We study the performance of transformers as a function of the number of repetitions of training examples with algorithmically generated datasets. On three problems of mathematics: the greatest common divisor, modular multiplication, and matrix eigenvalues, we show that for a fixed number of training steps, models trained on smaller sets of repeated examples outperform models trained on larger sets of single-use examples. We also demonstrate that {\em two-set training} - repeated use of a small random subset of examples, along normal sampling on the rest of the training set - provides for faster learning and better performance. This highlights that the benefits of repetition can outweigh those of data diversity. These datasets and problems provide a controlled setting to shed light on the still poorly understood interplay between generalization and memorization in deep learning.

1006Masked Diffusion Models are Secretly Time-Agnostic Masked Models and Exploit Inaccurate Categorical Sampling

[openreview] [pdf]

Abstract Masked diffusion models (MDMs) have emerged as a popular research topic for generative modeling of discrete data, thanks to their superior performance over other discrete diffusion models, and are rivaling the auto-regressive models (ARMs) for language modeling tasks. The recent effort in simplifying the masked diffusion framework further leads to alignment with continuous-space diffusion models and more principled training and sampling recipes. In this paper, however, we reveal that both training and sampling of MDMs are theoretically free from the time variable, arguably the key signature of diffusion models, and are instead equivalent to masked models. The connection on the sampling aspect is drawn by our proposed first-hitting sampler (FHS). Specifically, we show that the FHS is theoretically equivalent to MDMs’ original generation process while significantly alleviating the time-consuming categorical sampling and achieving a 20×\times speedup. In addition, our investigation raises doubts about whether MDMs can truly beat ARMs in text generation. We identify, for the first time, an underlying numerical issue, even with the commonly used 32-bit floating-point precision, which results in inaccurate categorical sampling. We show that it lowers the effective temperature both theoretically and empirically, and the resulting decrease in token diversity makes previous evaluations, which assess the generation quality solely through the incomplete generative perplexity metric, somewhat unfair.

1007XTraffic: A Dataset Where Traffic Meets Incidents with Explainability and More

[openreview] [pdf]

Abstract Long-separated research has been conducted on two highly correlated tracks: traffic and incidents. Traffic track witnesses complicating deep learning models, e.g., to push the prediction a few percent more accurate, and the incident track only studies the incidents alone, e.g., to infer the incident risk. We, for the first time, spatiotemporally aligned the two tracks in a large-scale region (16,972 traffic nodes) over the whole year of 2023: our XTraffic dataset includes traffic, i.e., time-series indexes on traffic flow, lane occupancy, and average vehicle speed, and incidents, whose records are spatiotemporally-aligned with traffic data, with seven different incident classes. Additionally, each node includes detailed physical and policy-level meta-attributes of lanes. Our data can revolutionalize traditional traffic-related tasks towards higher interpretability and practice: instead of traditional prediction or classification tasks, we conduct: (1) post-incident traffic forecasting to quantify the impact of different incidents on traffic indexes; (2) incident classification using traffic indexes to determine the incidents types for precautions measures; (3) global causal analysis among the traffic indexes, meta-attributes, and incidents to give high-level guidance of the interrelations of various factors; (4) local causal analysis within road nodes to examine how different incidents affect the road segments’ relations. The dataset is available athttps://anonymous.4open.science/r/XTraffic-E069.

1008Ensembles provably learn equivariance through data augmentation

[openreview] [pdf]

Abstract Recently, it was proved that group equivariance emerges in ensembles of neural networks as the result of full augmentation in the limit of infinitely wide neural networks (neural tangent kernel limit). In this paper, we extend this result significantly. We provide a proof that this emergence does not depend on the neural tangent kernel limit at all. We also consider stochastic settings, and furthermore general architectures. For the latter, we provide a simple sufficient condition on the relation between the architecture and the action of the group for our results to hold. We validate our findings through simple numeric experiments.

1009Dual Variance Reduction with Momentum for Imbalanced Black-Box Discrete Prompt Learning

[openreview] [pdf]

Abstract Black-box prompt learning has proven to be an effective approach for customizing large language models (LLMs) offered as services to address various downstream tasks. Within this domain, policy gradient-based methods have garnered substantial attention as a prominent approach for learning discrete prompts. However, the highly imbalanced data distribution in the real world limits the applicability of such approaches by influencing LLMs’ tendency to favor certain categories. To tackle the challenge posed by imbalanced data, this paper pioneers the integration of pairwise AUC loss into the policy gradient optimization of discrete text prompts and proposes learning discrete prompts with doubly policy gradient. Unfortunately, the doubly policy gradient estimation suffers from two variance components, resulting in unstable optimization. As a further improvement, we propose (1) a novel unbiased variance-reduced doubly policy gradient estimator and (2) incorporating the STORM variance reduction technique. Ultimately, we introduce a novel momentum-based discrete prompt learning method with doubly policy gradient (mDP-DPG). Crucially, we provide theoretical convergence guarantees for mDP-DPG within standard frameworks. The experimental results show that mDP-DPG surpasses baseline approaches across diverse imbalanced text classification datasets, emphasizing the advantages of our proposed approach for tackling data imbalance. Our code is available at the following URL:https://anonymous.4open.science/r/DPDPG-1ECB.

1010Open-World Planning via Lifted Regression with LLM-based Affordances for Embodied Agents

[openreview] [pdf]

Abstract Open-world planning is crucial for embodied AI agents that must make decisions with incomplete task-relevant knowledge. In fact, the main challenges lie in reasoning about objects and their affordances that are unknown to the agent. Large Language Models (LLMs), pre-trained on vast internet-scale data, have emerged as potential solutions for open-world planning. However, LLMs have limitations in long-horizon planning tasks and face problems related to interpretability, reliability, and cost-efficiency. Symbolic planning methods, on the other hand, offer structured and verifiable approaches to long-horizon tasks, but often struggle to generate feasible plans in an open-world setting. In this work, we propose a novel approach, called LLM-Regress, which combines the strengths of lifted symbolic regression planning with LLM-based affordances. The lifted representation allows us to generate plans capable of handling arbitrary unknown objects, while regression planning is the only planning paradigm that guarantees complete solutions using lifted representations. For such tasks, we leverage LLMs to supplement missing affordances knowledge for unknown objects. The regression nature of our approach enables the agent to focus on actions and objects relevant to the goal, thus avoiding the need for costly LLM calls for every decision. We evaluate our approach on the ALFWorld dataset and introduce a new ALFWorld-Afford dataset with higher planning complexity and more affordances types. The empirical results demonstrate that our method outperforms existing approaches in terms of success rates, planning duration, and number of LLM Tokens. Finally, we show that our approach is resilient to domain shifts in affordances and generalizes effectively to unseen tasks. This work underscores the importance of integrating symbolic reasoning with LLM knowledge for open-world decision-making in embodied AI.

1011It Helps to Take a Second Opinion: Teaching Smaller LLMs To Deliberate Mutually via Selective Rationale Optimisation

[openreview] [pdf]

Abstract Very large language models (LLMs) such as GPT-4 have shown the ability to handle complex tasks by generating and self-refining step-by-step rationales. Smaller language models (SLMs), typically with < 13B parameters, have been improved by using the data generated from very-large LMs through knowledge distillation. However, various practical constraints such as API costs, copyright, legal and ethical policies restrict using large (often opaque) models to train smaller models for commercial use. Limited success has been achieved at improving the ability of an SLM to explore the space of possible rationales and evaluate them by itself through self-deliberation. To address this, we propose COALITION, a trainable framework that facilitates interaction between two variants of the same SLM and trains them to generate and refine rationales optimized for the end-task. The variants exhibit different behaviors to produce a set of diverse candidate rationales during the generation and refinement steps. The model is then trained via Selective Rationale Optimization (SRO) to prefer generating rationale candidates that maximize the likelihood of producing the ground-truth answer. During inference, COALITION employs a controller to select the suitable variant for generating and refining the rationales. On five different datasets covering mathematical problems, commonsense reasoning, and natural language inference, COALITION outperforms several baselines by up to 5%. Our ablation studies reveal that cross-communication between the two variants performs better than using the single model to self-refine the rationales. We also demonstrate the applicability of COALITION for LMs of varying scales (4B to 14B parameters) and model families (Mistral, Llama, Qwen, Phi). We release the code for this work here.

1012Test Time Learning for Time Series Forecasting

[openreview] [pdf]

Abstract We propose the use of Test-Time Training (TTT) modules in a cascade architecture to enhance performance in long-term time series forecasting. Through extensive experiments on standard benchmark datasets, we demonstrate that TTT modules consistently outperform state-of-the-art models, including Mamba-based TimeMachine, particularly in scenarios involving extended sequence and prediction lengths. Our results show significant improvements, especially on larger datasets such as Electricity, Traffic, and Weather, underscoring the effectiveness of TTT in capturing long-range dependencies. Additionally, we explore various convolutional architectures within the TTT framework, showing that convolutional blocks as hidden layer architectures can achieve competitive results.

1013Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction

[openreview] [pdf]

Abstract Spatial intelligence is the ability of a machine to perceive, reason, and act in three dimensions within space and time. Recent advancements in large-scale auto-regressive models have demonstrated remarkable capabilities across various reasoning tasks. However, these models often struggle with fundamental aspects of spatial reasoning, particularly in answering questions like “Where am I?” and “What will I see?”. While some attempts have been done, existing approaches typically treat them as separate tasks, failing to capture their interconnected nature. In this paper, we present GST, a novel auto-regressive framework that jointly addresses spatial localization and view prediction. Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction. The proposed innovative camera tokenization method enables the model to learn the joint distribution of 2D projections and their corresponding spatial perspectives in an auto-regressive manner. This unified training paradigm demonstrates that joint optimization of pose estimation and novel view synthesis leads to improved performance in both tasks, for the first time, highlighting the inherent relationship between spatial awareness and visual prediction.

1014Black-Box Adversarial Attack on Dialogue Generation via Multi-Objective Optimization

[openreview] [pdf]

Abstract Transformer-based dialogue generation (DG) models are ubiquitous in modern conversational artificial intelligence (AI) platforms. These models, however, are susceptible to adversarial attacks, i.e., prompts that appear textually indiscernible from normal inputs but are maliciously crafted to make the models generate responses incoherent and irrelevant to the conversational context. Evaluating the adversarial robustness of DG models is thus crucial to their real-world deployment. Adversarial methods typically exploit gradient information and output logits (or probabilities) to effectively modify key input tokens, thereby achieving excellent attack performance. Nevertheless, such white-box approaches are impractical in real-world scenarios since the models’ internal parameters are typically inaccessible. While black-box methods, which exploit only input prompts and DG models’ output responses to craft adversarial attacks, offer a wider applicability, they often suffer from poor performance.In a human-machine conversation, good generated responses are expected to be semantically coherent and textually succinct. We thus formulate adversarial attack on DG models as a bi-objective optimization problem, where input prompts are modified in order to 1) minimize the response coherence, and 2) maximize the generation length. In this paper, we empirically demonstrate that optimizing either objective alone results in subpar performance. We then propose a dialogue generation attack framework (DGAttack) that employs multi-objective optimization to consider both objectives simultaneously when perturbing user prompts to craft adversarial inputs. Leveraging the exploration capability of multi-objective evolutionary algorithm due to its intrinsic diversity preservation, DGAttack successfully creates effective adversarial prompts in a true black-box manner, i.e., accessing solely DG models’ inputs and outputs. Experiments across four benchmark datasets and three language models (i.e., BART, DialoGPT, T5) demonstrate the excellent performance of DGAttack compared to existing white-box, gray-box, and black-box approaches. Especially, benchmarks with large language models (i.e., Llama 3.1 and Gemma 2) suggest that DGAttack is the state-of-the-art black-box adversarial attack on dialogue generation.

1015Long Tail Classification Through Cost Sensitive Loss Functions

[openreview] [pdf]

Abstract Class imbalance in the data introduces significant challenges in training machine models especially with long-tailed datasets. Specifically, it leads to biased models that overfit with respect to the dominant classes while under-performing on the minority classes. This, in turn, results in seemingly satisfactory yet biased overall results. Hence, the above biasing needs to be controlled such that the desired generalizability of the model is not entirely compromised. To that end, we introduce a novel Cost-Sensitive Loss (CSL) function designed to dynamically adjust class weights, and incorporate a reinforcement learning mechanism to optimize these adjustments. The proposed CSL function can be seamlessly integrated with existing loss functions, to enhance performance on imbalanced datasets, rendering them robust and scalable. We implemented the above CSL function in form of a framework which leverages reinforcement learning to optimally apply these adjustments over consecutive training epochs. Experimental Results on benchmark datasets demonstrate that our proposed approach significantly outperforms state-of-the-art methods. The results indicate that our approach can provide an optimal trade-off in the model accuracy and generalization with diverse kinds of imbalanced data.

1016Rare-Mark-Aware Next Event Prediction In Marked Event Streams

[openreview] [pdf]

Abstract In marked event streams, Marked Temporal Point Process (MTPP) is central to predicting when and what mark the next event will occur based on the history. In various real-world applications, the mark distribution is significantly imbalanced, i.e., some marks are frequent, and others are rare. We unveil that such imbalance can cause the rare mark missing issue when predicting the next event – frequent marks are dominant, and rare marks often have no chance. However, rare marks can be essential in some applications (e.g., the occurrence of a 7-magnitude earthquake), and missing such rare marks in the next event prediction is risky. To address this issue, we tackle a novel Rare-mark-aware Next Event Prediction problem (RM-NEP), answering two questions for each mark m: “what is the probability that the mark of the next event is m?, and if m, when will the next event happen?”. Solving RM-NEP gives rare marks equal opportunity as frequent marks in the next event prediction. This guarantees that rare marks are always included in the predicted results. Moreover, RM-NEP allows arbitrary number of rare marks samples for time prediction without interference from frequent marks, ensuring the time prediction is accurate. To solve RM-NEP effectively, we first unify the improper integration of two different functions into one and then develop a novel Integral-free Neural Marked Temporal Point Process (IFNMTPP) to approximate the target integral directly. Extensive experiments on real-world and synthetic datasets demonstrate the superior performance of our solution for RM-NEP against various baselines.

1017FROM LOW TO HIGH-VALUE DESIGNS: OFFLINE OPTIMIZATION VIA GENERALIZED DIFFUSION

[openreview] [pdf]

Abstract This paper presents a new perspective on offline optimization. Instead of viewing it as a surrogate or inverse modeling task -- mapping either from an input design to its corresponding performance or from a desired performance to potential input candidates -- we approach offline optimization as a distributional translation task that transforms an implicit distribution of low-value inputs (i.e., the offline data) into a (better) distribution of high-value inputs (i.e., the solution candidates). This avoids explicitly modeling the target function, which is ultimately constrained by the limited amount of offline data. In contrast, our view of offline optimization as a distributional translation task is substantiated through a generalized Brownian bridge diffusion process mapping between two implicit data distributions, which can be more reliably learned using additional low- and high-value inputs drawn from synthetic functions similar to the target function. This is enabled by fitting multiple Gaussian processes with different parameterizations to the offline data and using them as functional posterior to generate artificial functions similar to the target function. Our experiments show that this approach is consistently more effective than previous methods, establishing a new state-of-the-art performance.

1018Regret Bounds for Episodic Risk-Sensitive Linear Quadratic Regulator

[openreview] [pdf]

Abstract Risk-sensitive linear quadratic regulator is one of the most fundamental problems in risk-sensitive optimal control. In this paper, we study online adaptive control of risk-sensitive linear quadratic regulator in the finite horizon episodic setting. We propose a simple least-squares greedy algorithm and show that it achieves O~(logN)\widetilde{\mathcal{O}}(\log N) regret under a specific identifiability assumption, where NN is the total number of episodes. If the identifiability assumption is not satisfied, we propose incorporating exploration noise into the least-squares-based algorithm, resulting in an algorithm with O~(N)\widetilde{\mathcal{O}}(\sqrt{N}) regret. To our best knowledge, this is the first set of regret bounds for episodic risk-sensitive linear quadratic regulator. Our proof relies on perturbation analysis of less-standard Riccati equations for risk-sensitive linear quadratic control, and a delicate analysis of the loss in the risk-sensitive performance criterion due to applying the suboptimal controller in the online learning process.

1019Learning Under Multi-dimensional Domain Shifts: A Ensemble of Mixtures of Experts Approach

[openreview] [pdf]

Abstract Domain shifts pose a significant challenge in deep learning applications. Existing methods typically address domain shifts by treating each domain in isolation, overlooking the underlying factors driving the shifts, or focus on only \emph{one} factor. However, domain shifts in the real world often occur across \emph{multiple} dimensions simultaneously. For example, medical datasets from different hospitals can exhibit variations in factors including demographics, equipment manufacturers, and imaging protocols, demonstrating a three-dimensional shifts. In this paper, we introduce a novel approach to address the complexity of multi-dimensional domain shifts. Our method leverages an ensemble of mixtures of experts (EMoE), with each MoE specialized in different dimensions. Crucially, we innovate a domain estimator to address a particularly challenging issue frequently encountered in practice: domain labels may be missing or unreliable. A significant advantage of our method is its generalizability and adaptability to both centralized and federated learning settings, as well as its versatility across various tasks. Extensive experiments on six datasets demonstrate the superiority of our method over state-of-the-art domain generalization and personalized federated learning approaches.

1020Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship

[openreview] [pdf]

Abstract This paper introduces a new framework for recovering causal graphs from observational data, leveraging the fact that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes. This insight enables a direct test for potential causal relationships by checking the variance of their corresponding effect-cause conditional distributions across multiple downsampled subsets of the data. These subsets are selected to reflect different prior cause distributions, while preserving the effect-cause conditional relationships. Using this invariance test and exploiting an (empirical) sparsity of most causal graphs, we develop an algorithm that efficiently uncovers causal relationships with quadratic complexity in the number of observational features/variables, reducing the processing time by up to 25x compared to state-of-the-art methods. Our empirical studies on a diverse benchmark of large-scale datasets demonstrate that the developed algorithm consistently performs better or comparable to existing works while generally achieving better scalability.

1021Neural Deconstruction Search for Vehicle Routing Problems

[openreview] [pdf]

Abstract Autoregressive construction approaches generate solutions to vehicle routing problems in a step-by-step fashion, leading to high-quality solutions that are nearing the performance achieved by handcrafted, operations research techniques. In this work, we challenge the conventional paradigm of sequential solution construction and introduce an iterative search framework where solutions are instead deconstructed by a neural policy. Throughout the search, the neural policy collaborates with a simple greedy insertion algorithm to rebuild the deconstructed solutions. Our approach surpasses the performance of state-of-the-art operations research methods across three challenging vehicle routing problems of various problem sizes.

1022Certifiedℓ2Attribution Robustness via Uniformly Smoothed Attributions

[openreview] [pdf]

Abstract Model attribution is a popular tool to explain the rationales behind model predictions. However, recent work suggests that the attributions are vulnerable to minute perturbations, which can be added to input samples to fool the attributions while maintaining the prediction outputs. Although empirical studies have shown positive performance via adversarial training, an effective certified defense method is eminently needed to understand the robustness of attributions. In this work, we propose to use uniform smoothing technique that augments the vanilla attributions by noises uniformly sampled from a certain space. It is proved that, for all perturbations within the attack region, the cosine similarity between uniformly smoothed attribution of perturbed sample and the unperturbed sample is guaranteed to be lower bounded. We also derive alternative formulations of the certification that is equivalent to the original one and provides the maximum size of perturbation or the minimum smoothing radius such that the attribution can not be perturbed. We evaluate the proposed method on three datasets and show that the proposed method can effectively protect the attributions from attacks, regardless of the architecture of networks, training schemes and the size of the datasets.

1023Precise Parameter Localization for Textual Generation in Diffusion Models

[openreview] [pdf]

Abstract Novel diffusion models (DMs) can synthesize photo-realistic images with integrated high-quality text. Surprisingly, we demonstrate through attention activation patching that only less than 1% of DMs’ parameters contained in attention layers influence the generation of textual content within the images. Building on this observation, by precisely targeting cross and joint attention layers of DMs, we improve the efficiency and performance of textual generation. We introduce several applications that benefit from localizing the layers responsible for textual content generation. We first show that a LoRA-based fine-tuning solely of the localized layers enhances, even more, the general text-generation capabilities of large DMs while preserving the quality and diversity of the DMs’ generations. Then, we demonstrate how we can use the localized layers to edit textual content in generated images. Finally, we extend this idea to the practical use case of preventing the generation of toxic text in a cost-free manner. In contrast to prior work, our localization approach is broadly applicable across various diffusion model architectures, including U-Net (e.g., LDM and SDXL) and transformer-based (e.g., DeepFloyd IF and Stable Diffusion 3), utilizing diverse text encoders (e.g., from CLIP and the large language models like T5).

1024Learning from Demonstration with Implicit Nonlinear Dynamics Models

[openreview] [pdf]

Abstract Learning from Demonstration (LfD) is a useful paradigm for training policies that solve tasks involving complex motions, such as those encountered in robotic manipulation. In practice, the successful application of LfD requires overcoming error accumulation during policy execution, i.e. the problem of drift due to errors compounding over time and the consequent out-of-distribution behaviours. Existing works seek to address this problem through scaling data collection, correcting policy errors with a human-in-the-loop, temporally ensembling policy predictions or through learning a dynamical system model with convergence guarantees. In this work, we propose and validate an alternative approach to overcoming this issue. Inspired by reservoir computing, we develop a recurrent neural network layer that includes a fixed nonlinear dynamical system with tunable dynamical properties for modelling temporal dynamics. We validate the efficacy of our neural network layer on the task of reproducing human handwriting motions using the LASA Human Handwriting Dataset. Through empirical experiments we demonstrate that incorporating our layer into existing neural network architectures addresses the issue of compounding errors in LfD. Furthermore, we perform a comparative evaluation against existing approaches including a temporal ensemble of policy predictions and an Echo State Network (ESN) implementation. We find that our approach yields greater policy precision and robustness on the handwriting task while also generalising to multiple dynamics regimes and maintaining competitive latency scores.

1025TimeInf: Time Series Data Contribution via Influence Functions

[openreview] [pdf]

Abstract Evaluating the contribution of individual data points to a model’s prediction is critical for interpreting model predictions and improving model performance. Existing data contribution methods have been applied to various data types, including tabular data, images, and text; however, their primary focus has been on i.i.d. settings. Despite the pressing need for principled approaches tailored to time series datasets, the problem of estimating data contribution in such settings remains under-explored, possibly due to challenges associated with handling inherent temporal dependencies. This paper introduces TimeInf, a model-agnostic data contribution estimation method for time-series datasets. By leveraging influence scores, TimeInf attributes model predictions to individual time points while preserving temporal structures between the time points. Our empirical results show that TimeInf effectively detects time series anomalies and outperforms existing data attribution techniques as well as state-of-the-art anomaly detection methods. Moreover, TimeInf offers interpretable attributions of data values, allowing us to distinguish diverse anomalous patterns through visualizations. We also showcase a potential application of TimeInf in identifying mislabeled anomalies in the ground truth annotations.

1026Generalized Behavior Learning from Diverse Demonstrations

[openreview] [pdf]

Abstract Diverse behavior policies are especially valuable in domains requiring quick test-time adaptation or personalized human-robot interaction. Human demonstrations provide rich information regarding task objectives and individual preferences, which can be used to characterize useful diversity and learn diverse performant policies. However, we show that prior work that builds naive representations of demonstration heterogeneity fails in generating successful novel behaviors that generalize over preferences. We propose Guided Strategy Discovery (GSD), which introduces a novel diversity formulation based on a learned task-relevance measure that prioritizes behaviors exploring modeled latent factors. We empirically validate across three continuous control benchmarks for generalizing to in-distribution (interpolation) and out-of-distribution (extrapolation) preferences that GSD outperforms baselines in novel behavior discovery by \sim21%. Finally, we demonstrate that GSD can generalize striking behaviors for table tennis in a virtual testbed while leveraging human demonstrations collected in the real world.

1027EIA: ENVIRONMENTAL INJECTION ATTACK ON GENERALIST WEB AGENTS FOR PRIVACY LEAKAGE

[openreview] [pdf]

Abstract Recently, generalist web agents have demonstrated remarkable potential in autonomously completing a wide range of tasks on real websites, significantly boosting human productivity. However, web tasks, such as booking flights, usually involve users’ personally identifiable information (PII), which may be exposed to potential privacy risks if web agents accidentally interact with compromised websites—a scenario that remains largely unexplored in the literature. In this work, we narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a realistic threat model for attacks on the website, where we consider two adversarial targets: stealing users’ specific PII or the entire user request. Then, we propose a novel attack method, termed Environmental Injection Attack (EIA). EIA injects malicious content designed to adapt well to environments where the agents operate and our work instantiates EIA specifically for privacy scenarios in web environments. We collect 177 action steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct experiments using one of the most capable generalist web agent frameworks to date. The results demonstrate that EIA achieves up to 70% attack success rate (ASR) in stealing users’ specific PII and 16% ASR in stealing a full user request at an action step. Additionally, by accessing the stealthiness and experimenting with a defensive system prompt, we indicate that EIA is hard to detect and mitigate. Notably, attacks that are not well adapted for a webpage can be detected through careful human inspection, leading to our discussion about the trade-off between security and autonomy. However, extra attackers’ efforts can make EIA seamlessly adapted, rendering such human supervision ineffective. Thus, we further discuss the implications on defenses at the pre- and post-deployment stages of the websites without relying on human supervision and call for more advanced defense strategies.

1028Breaking Free from MMI: A New Frontier in Rationalization by Probing Input Utilization

[openreview] [pdf]

Abstract Extracting a small subset of crucial rationales from the full input is a key problem in explainable natural language processing research. The most widely used fundamental criterion for rationale extraction is the maximum mutual information (MMI) criterion. In this paper, we first demonstrate that MMI suffers from diminishing marginal returns. Once part of the rationale has been identified, finding the remaining portions contributes only marginally to increasing the mutual information, making it difficult to use MMI to locate the rest. In contrast to MMI that aims to reproduce the prediction, we seek to identify the parts of the input that the network can actually utilize. This is achieved by comparing how different rationale candidates match the non-zero rank components of the weight matrix. The weight matrix of a neural network is typically low-rank. If an input is fully utilized by the network, it generally occupies the non-zero rank subspaces of the weight matrix, resulting in a representation with a high norm. Conversely, if an input primarily occupies the zero-rank subspaces of the weight matrix, its representation norm will approach zero, behaving like noise that the network cannot effectively utilize. Building on this, we propose using the norms of rationale candidates as an alternative objective to MMI. Through experiments on four text classification datasets and one graph classification dataset using three network architectures (GRUs, BERT, and GCN), we show that our method outperforms MMI and its improved variants in identifying better rationales. We also compare our method with a representative LLM (llama-3.1-8b-instruct) and find that our simple method gets comparable results to it and can sometimes even outperform it. Our work represents a pioneering attempt to abandon MMI in the XAI field.

1029Simple and Controllable Uniform Discrete Diffusion Language Models

[openreview] [pdf]

Abstract Diffusion models for continuous data gained widespread adoption owing to their high quality generation and control mechanisms. However, controllable diffusion on discrete data faces challenges: continuous diffusion guidance methods are not applicable and recent discrete diffusion models are not well-suited to control or exhibit a quality gap. Here, we provide a straightforward derivation of classifier-free and classifier-based guidance for discrete diffusion, as well as a new class of diffusion models that leverage uniform noise and thus can continuously edit their outputs. We improve the quality of these models with a novel continuous-time variational lower bound that yields state-of-the-art performance, in settings with small vocabularies. Empirically, we demonstrate the effectiveness of our guidance mechanisms relative to autoregressive and diffusion baselines, especially in conjunction with uniform noise diffusion, on several discrete data domains, including genomic sequences, small molecule design, and discretized image generation.

1030When to retrain a machine learning model

[openreview] [pdf]

Abstract A significant challenge in maintaining real-world machine learning models is responding to the continuous and unpredictable evolution of data. Most practitioners are faced with the difficult question: when should I retrain or update my machine learning model? This seemingly straightforward problem is particularly challenging for three reasons: 1) decisions must be made based on very limited information - we usually have access to only a few examples, 2) the nature, extent, and impact of the distribution shift are unknown, and 3) it involves specifying a cost ratio between retraining and poor performance, which can be hard to characterize. Existing works address certain aspects of this problem, but none offer a comprehensive solution. Distribution shift detection falls short as it cannot account for the cost trade-off; the scarcity of the data, paired with its unusual structure, makes it a poor fit for existing offline reinforcement learning methods, and the online learning formulation overlooks key practical considerations. To address this, we present a principled formulation of the retraining problem and propose an uncertainty-based method that makes decisions by continually forecasting the evolution of model performance. Our experiments show that the method consistently outperforms existing baselines on 6 datasets. We thoroughly assess its robustness to mis-specified cost trade-off.

1031Enhancing Federated Domain Adaptation with Multi-Domain Prototype-Based Federated Fine-Tuning

[openreview] [pdf]

Abstract Federated Domain Adaptation (FDA) is a Federated Learning (FL) scenario where models are trained across multiple clients with unique data domains but a shared category space, without transmitting private data. The primary challenge in FDA is data heterogeneity, which causes significant divergences in gradient updates when using conventional averaging-based aggregation methods, reducing the efficacy of the global model. This further undermines both in-domain and out-of-domain performance (within the same federated system but outside the local client), which is critical in certain business applications. To address this, we propose a novel framework called \textbf{M}ulti-domain \textbf{P}rototype-based \textbf{F}ederated Fine-\textbf{T}uning (MPFT). MPFT fine-tunes a pre-trained model using multi-domain prototypes, i.e., several pretrained representations enriched with domain-specific information from category-specific local data. This enables supervised learning on the server to create a globally optimized adapter that is subsequently distributed to local clients, without the intrusion of data privacy. Empirical results show that MPFT significantly improves both in-domain and out-of-domain accuracy over conventional methods, enhancing knowledge preservation and adaptation in FDA. Notably, MPFT achieves convergence within a single communication round, greatly reducing computation and communication costs. To ensure privacy, MPFT applies differential privacy to protect the prototypes. Additionally, we develop a prototype-based feature space hijacking attack to evaluate robustness, confirming that raw data samples remain unrecoverable even after extensive training epochs. The complete implementation of MPFL is available at \url{https://anonymous.4open.science/r/DomainFL/}.

1032Adversarial Inception for Bounded Backdoor Poisoning in Deep Reinforcement Learning

[openreview] [pdf]

Abstract Recent works have demonstrated the vulnerability of Deep Reinforcement Learning (DRL) algorithms against training-time, backdoor poisoning attacks. These attacks induce pre-determined, adversarial behavior in the agent upon observing a fixed trigger during deployment while allowing the agent to solve its intended task during training. Prior attacks rely on arbitrarily large perturbations to the agent’s rewards to achieve both of these objectives - leaving them open to detection. Thus, in this work, we propose a new class of backdoor attacks against DRL which achieve state of the art performance while minimally altering the agent’s rewards. These ``inception’’ attacks train the agent to associate the targeted adversarial behavior with high returns by inducing a disjunction between the agent’s chosen action and the true action executed in the environment during training. We formally define these attacks and prove they can achieve both adversarial objectives. We then devise an online inception attack which significantly out-performs prior attacks under bounded reward constraints.

1033Guiding Skill Discovery with Foundation Models

[openreview] [pdf]

Abstract Learning diverse skills without hand-crafted reward functions could potentially accelerate reinforcement learning in downstream tasks. However, existing skill discovery methods focus solely on maximizing the diversity of skills without considering human preferences, which leads to undesirable behaviors and possibly dangerous skills. For instance, a cheetah robot trained using previous methods learns to roll in all directions to maximize skill diversity, whereas we would prefer it to run without flipping or entering hazardous areas. In this work, we propose aFoundation modelGuided (FoG) skill discovery method, which incorporates human intentions into skill discovery through foundation models. Specifically, FoG extracts a score function from foundation models to evaluate states based on human intentions, assigning higher values to desirable states and lower to undesirable ones. These scores are then used to re-weight the rewards of skill discovery algorithms. By optimizing the re-weighted skill discovery rewards, FoG successfully learns to eliminate undesirable behaviors, such as flipping or rolling, and to avoid hazardous areas in both state-based and pixel-based tasks. Interestingly, we show that FoG can discover skills involving behaviors that are difficult to define. Interactive visualisations are available from:https://sites.google.com/view/iclr-fog

1034Skip the Steps: Data-Free Consistency Distillation for Diffusion-based Samplers

[openreview] [pdf]

Abstract Sampling from probability distributions is a fundamental task in machine learning and statistics. However, most existing algorithms require numerous iterative steps to transform a prior distribution into high-quality samples, resulting in high computational costs and limiting their practicality in time-constrained and resource-limited environments. In this work, we propose consistency samplers, a novel class of samplers capable of generating high-quality samples in a single step. Our method introduces a new consistency distillation algorithm for diffusion-based samplers, which eliminates the need for data or full trajectory integration. By utilizing incomplete sampling trajectories and noisy intermediate representations along the diffusion process, we efficiently learn a direct one-step mapping from any state to its corresponding terminal state in the target distribution. Moreover, our approach enables few-step sampling, allowing users to flexibly balance compute costs and sample quality. We demonstrate the effectiveness of consistency samplers across multiple benchmark tasks, achieving high-quality results with one-step or few-step sampling while significantly reducing the sampling time compared to existing samplers. For instance, our method is 100-200x faster than prior diffusion-based samplers while having comparable sample quality.

1035Conformal Prediction Sets Can Cause Disparate Impact

[openreview] [pdf]

Abstract Although conformal prediction is a promising method for quantifying the uncertainty of machine learning models, the prediction sets it outputs are not inherently actionable. Many applications require a single output to act on, not several. To overcome this, prediction sets can be provided to a human who then makes an informed decision. In any such system it is crucial to ensure the fairness of outcomes across protected groups, and researchers have proposed that Equalized Coverage be used as the standard for fairness. By conducting experiments with human participants, we demonstrate that providing prediction sets can increase the unfairness of their decisions. Disquietingly, we find that providing sets that satisfy Equalized Coverage actually increases unfairness compared to marginal coverage. Instead of equalizing coverage, we propose to equalize set sizes across groups which empirically leads to more fair outcomes.

1036Gradient correlation is needed to accelerate SGD with momentum

[openreview] [pdf]

Abstract Empirically, it has been observed that adding momentum to Stochastic Gradient Descent (SGD) accelerates the convergence of the algorithm. However, the literature has been rather pessimistic, even in the case of convex functions, about the possibility of theoretically proving this observation. We investigate the possibility of obtaining accelerated convergence of the Stochastic Nesterov Accelerated Gradient (SNAG), a momentum-based version of SGD, when minimizing a sum of functions in a convex setting. We demonstrate that the average correlation between gradients allows to verify the strong growth condition, which is the key ingredient to obtain acceleration with SNAG. Numerical experiments, both in linear regression and deep neural network optimization, confirm in practice our theoretical results.

1037When Will It Fail?: Anomaly to Prompt for Forecasting Future Anomalies in Time Series

[openreview] [pdf]

Abstract Recently, time series forecasting, which predicts future signals, and time series anomaly detection, which identifies abnormal signals in given data, have achieved impressive success. However, in the real world, merely forecasting future signals or detecting anomalies in existing signals is not sufficiently informative to prevent potential system breakdowns, which lead to huge costs and require intensive human labor. In this work, we tackle a challenging and under-explored problem of time series anomaly prediction. In this scenario, the models are required to forecast the upcoming signals while considering anomaly points to detect them. To resolve this challenging task, we propose a simple yet effective framework, Anomaly to Prompt (A2P), which is jointly trained via the forecasting and anomaly detection objectives while sharing the feature extractor for better representation. On top of that, A2P leverages Anomaly-Aware Forecasting (AAF), which derives the anomaly probability by random anomaly injection to forecast abnormal time points. Furthermore, we propose Synthetic Anomaly Prompting (SAP) for more robust anomaly detection by enhancing the diversity of abnormal input signals for training anomaly detection model. As a result, our model achieves state-of-the-art performances on seven real-world datasets, proving the effectiveness of our proposed framework A2P for a new time series anomaly prediction task.

1038Haland: Human-AI Coordination via Policy Generation from Language-guided Diffusion

[openreview] [pdf]

Abstract Developing intelligent agents that can effectively coordinate with diverse human partners is a fundamental goal of artificial general intelligence. Previous approaches typically generate a variety of partners to cover human policies, and then either train a single universal agent or maintain multiple best-response (BR) policies for different partners. However, the first direction struggles with the stochastic and multimodal nature of human behaviors, and the second relies on costly few-shot adaptations during policy deployment, which is unbearable in real-world applications such as healthcare and autonomous driving. Recognizing that human partners can easily articulate their preferences or behavioral styles through natural languages and make conventions beforehand, we propose a framework for Human-AI Coordination via Policy Generation from Language-guided Diffusion, referred to as Haland. Haland first trains BR policies for various partners using reinforcement learning, and then compresses policy parameters into a single latent diffusion model, conditioned on task-relevant language derived from their behaviors. Finally, the alignment between task-relevant and natural languages is achieved to facilitate efficient human-AI coordination. Empirical evaluations across diverse cooperative environments demonstrate that Haland generates agents with significantly enhanced zero-shot coordination performance, utilizing only natural language instructions from various partners, and outperforms existing methods by approximately 89.64%.

1039Correcting the Bias of Normalizing Flows by Synthetic Outliers for Improving Out-of-Distribution Detection

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection is critical for ensuring the reliability and robustness of deep learning models in real-world applications. While normalizing flows have demonstrated impressive performance for various task of image OOD detection, recent findings suggest that they still encounter limitations and severe biases when applied to datasets with different statistics. Specifically, it has been observed that normalizing flow models tend to assign higher likelihoods to OOD samples with low complexity, which undermines the effectiveness of likelihood based OOD detection methods. In this paper, we explore the bias related to data complexity linked to normalizing flow models in OOD detection. We propose a novel method for bias correction by incorporating synthetic outliers during training, guiding the model to assign lower likelihoods to OOD samples. Additionally, we introduce a specialized training objective that leverages the softplus function for OOD data, ensuring a smooth and effective training process. Extensive experiments on benchmark and high-dimensional real-world datasets, including both images and texts, confirm that our proposed approach significantly enhances OOD detection accuracy, achieving performance comparable to models trained with a limited number of real outliers. Moreover, our method increases the Lipschitz constant, supporting the hypothesis presented in related literature.

1040Counterfactual fairness prediction: Consistent estimation with generative models and theoretical guarantees

[openreview] [pdf]

Abstract Fairness in predictions is of direct importance in practice due to legal, ethical, and societal reasons. This is often accomplished through counterfactual fairness, which ensures that the prediction for an individual is the same as that in a counterfactual world under a different sensitive attribute. However, achieving counterfactual fairness is challenging as counterfactuals are unobservable, and, because of that, existing baselines for counterfactual fairness do not have theoretical guarantees. In this paper, we propose a novel counterfactual fairness predictor for making predictions under counterfactual fairness. Here, we follow the standard counterfactual fairness setting and directly learn the counterfactual distribution of the descendants of the sensitive attribute via tailored neural networks, which we then use to enforce fair predictions through a novel counterfactual mediator regularization. Unique to our work is that we provide theoretical guarantees that our method is effective in ensuring the notion of counterfactual fairness. We further compare the performance across various datasets, where our method achieves state-of-the-art performance.

1041Model-Enhanced Adversarial Inverse Reinforcement Learning with Model Estimation Reward Shaping in Stochastic Environments

[openreview] [pdf]

Abstract In this paper, we aim to tackle the limitation of the Adversarial Inverse Reinforcement Learning (AIRL) method in stochastic environments where theoretical results cannot hold and performance is degraded. To address this issue, we propose a novel method which infuses the dynamics information into the reward shaping with the theoretical guarantee for the induced optimal policy in the stochastic environments. Incorporating our novel model-enhanced rewards, we present a novel Model-Enhanced AIRL framework, which integrates transition model estimation directly into reward shaping. Furthermore, we provide a comprehensive theoretical analysis of the reward error bound and performance difference bound for our method. The experimental results in MuJoCo benchmarks show that our method can achieve superior performance in stochastic environments and competitive performance in deterministic environments, with significant improvement in sample efficiency, compared to existing baselines.

1042ClassDiffusion: More Aligned Personalization Tuning with Explicit Class Guidance

[openreview] [pdf]

Abstract Recent text-to-image customization works have proven successful in generating images of given concepts by fine-tuning diffusion models on a few examples. However, tuning-based methods inherently tend to overfit the concepts, resulting in failure to create the concept under multiple conditions (e.g., headphone is missing when generating “a `dog wearing a headphone”). Interestingly, we notice that the base model before fine-tuning exhibits the capability to compose the base concept with other elements (e.g., “a dog wearing a headphone”), implying that the compositional ability only disappears after personalization tuning. We observe a semantic shift in the customized concept after fine-tuning, indicating that the personalized concept is not aligned with the original concept, and further show through theoretical analyses that this semantic shift leads to increased difficulty in sampling the joint conditional probability distribution, resulting in the loss of the compositional ability. Inspired by this finding, we presentClassDiffusion, a technique that leverages asemantic preservation lossto explicitly regulate the concept space when learning a new concept. Although simple, this approach effectively prevents semantic drift during the fine-tuning process of the target concepts. Extensive qualitative and quantitative experiments demonstrate that the use of semantic preservation loss effectively improves the compositional abilities of fine-tuning models. Lastly, we also extend our ClassDiffusion to personalized video generation, demonstrating its flexibility.

1043Identifying Drivers of Predictive Aleatoric Uncertainty

[openreview] [pdf]

Abstract Explainability and uncertainty quantification are two pillars of trustable artificial intelligence. However, the reasoning behind uncertainty estimates is generally left unexplained. Identifying the drivers of uncertainty complements explanations of point predictions in recognizing model limitations and enhances trust in decisions and their communication. So far, explanations of uncertainties have been rarely studied. The few exceptions rely on Bayesian neural networks or technically intricate approaches, such as auxiliary generative models, thereby hindering their broad adoption. We propose a straightforward approach to explain predictive aleatoric uncertainties. We estimate uncertainty in regression as predictive variance by adapting a neural network with a Gaussian output distribution. Subsequently, we apply out-of-the-box explainers to the model’s variance output. This approach can explain uncertainty influences more reliably than more complex published approaches, which we demonstrate in a synthetic setting with a known data-generating process. We further adapt multiple metrics from conventional XAI research to uncertainty explanations. We quantify our findings with a nuanced benchmark analysis that includes real-world datasets. Finally, we apply our approach to an age regression model and discover reasonable drivers of uncertainty. Overall, the proposed straightforward method explains uncertainty estimates with little modifications to the model architecture and decisively outperforms more intricate methods.

1044Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers

[openreview] [pdf]

Abstract The denoising diffusion model has recently emerged as a powerful generative technique, capable of transforming noise into meaningful data. While theoretical convergence guarantees for diffusion models are well established when the target distribution aligns with the training distribution, practical scenarios often present mismatches. One common case is in zero-shot conditional diffusion sampling, where the target conditional distribution is different from the (unconditional) training distribution. These score-mismatched diffusion models remain largely unexplored from a theoretical perspective. In this paper, we present the first performance guarantee with explicit dimensional dependencies for general score-mismatched diffusion samplers, focusing on target distributions with finite second moments. We show that score mismatches result in an asymptotic distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions. This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise. Interestingly, the derived convergence upper bound offers useful guidance for designing a novel bias-optimal zero-shot sampler in linear conditional models that minimizes the asymptotic bias. For such bias-optimal samplers, we further establish convergence guarantees with explicit dependencies on dimension and conditioning, applied to several interesting target distributions, including those with bounded support and Gaussian mixtures. Our findings are supported by numerical studies.

1045Outlier Synthesis via Hamiltonian Monte Carlo for Out-of-Distribution Detection

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection is crucial for developing trustworthy and reliable machine learning systems. Recent advances in training with auxiliary OOD data demonstrate efficacy in enhancing detection capabilities. Nonetheless, these methods heavily rely on acquiring a large pool of high-quality natural outliers. Some prior methods try to alleviate this problem by synthesizing virtual outliers but suffer from either poor quality or high cost due to the monotonous sampling strategy and the heavy-parameterized generative models. In this paper, we overcome all these problems by proposing the Hamiltonian Monte Carlo Outlier Synthesis (HamOS) framework, which views the synthesis process as sampling from Markov chains. Based solely on the in-distribution data, the Markov chains can extensively traverse the feature space and generate diverse and representative outliers, hence exposing the model to miscellaneous potential OOD scenarios. The Hamiltonian Monte Carlo with sampling acceptance rate almost close to 1 also makes our framework enjoy great efficiency. By empirically competing with SOTA baselines on both standard and large-scale benchmarks, we verify the efficacy and efficiency of our proposed HamOS.

1046Value from Observations: Towards Large-Scale Imitation Learning via Self-Improvement

[openreview] [pdf]

Abstract Imitation Learning from Observation (IfO) offers a powerful way to learn behaviors from large-scale, mixed-quality data. Unlike prevalent methods, IfO does not require large numbers of expert demonstrations with actions or carefully crafted reward functions. However, current research dominantly focuses on idealized scenarios with specially tailored data distributions. This paper introduces a novel algorithm to learn from datasets with varying quality, moving closer to a paradigm in which the imitation learning can be performed iteratively in a self-improvement setting. Our method extends RL-based imitation learning to action-free demonstrations, using a value function to transfer information between expert and non-expert data. Through comprehensive evaluation, we delineate the relation between different data distributions and algorithms and highlight the limitations of established methods. Our findings provide valuable insights for developing more robust and practical IfO techniques and on a path to scalable behaviour learning.

1047Generating Likely Counterfactuals Using Sum-Product Networks

[openreview] [pdf]

Abstract The need to explain decisions made by AI systems is driven by both recent regulation and user demand. The decisions are often explainable only post hoc. In counterfactual explanations, one may ask what constitutes the best counterfactual explanation. Clearly, multiple criteria must be taken into account, although “distance from the sample” is a key criterion. Recent methods that consider the plausibility of a counterfactual seem to sacrifice this original objective. Here, we present a system that provides high-likelihood explanations that are, at the same time, close and sparse. We show that the search for the most likely explanations satisfying many common desiderata for counterfactual explanations can be modeled using Mixed-Integer Optimization (MIO). We use a Sum-Product Network (SPN) to estimate the likelihood of a counterfactual. To achieve that, we propose an MIO formulation of an SPN, which can be of independent interest.

1048Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

[openreview] [pdf]

Abstract The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-kk), resulting in significant computational overhead due to the extensive model training by searching over various hyper-parameter configurations. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate. (2) An adaptive process automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate the effectiveness of our approach to achieve competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code will be made publicly available.

1049Multidimensional Trajectory Optimization for Flow and Diffusion

[openreview] [pdf]

Abstract In flow and diffusion-based generative modeling, conventional methods rely on unidimensional coefficients for the trajectory of differential equations. In this work, we first introduce a multidimensional coefficient that generalizes the conventional unidimensional coefficient into multiple dimensions. We also propose a new multidimensional trajectory optimization, which suggests a novel trajectory optimality determined by the final transportation quality rather than predefined properties like straightness. Our approach employs simulation dynamics and adversarial training to optimize these inference trajectories. To empirically validate our method, we conduct experiments on various generative models, including EDM and Stochastic Interpolant, across multiple datasets such as 2D synthetic datasets, CIFAR-10, FFHQ, and AFHQv2. Remarkably, inference using our optimized multidimensional trajectory achieves significant performance improvements with low NFE (e.g., 5), achieving state-of-the-art results in CIFAR-10 conditional generation. The introduction of multidimensional trajectory optimization enhances model efficiency and opens new avenues for exploration in flow and diffusion-based generative modeling.

1050Asynchronous Federated Reinforcement Learning with Policy Gradient Updates: Algorithm Design and Convergence Analysis

[openreview] [pdf]

Abstract To improve the efficiency of reinforcement learning (RL), we propose a novel asynchronous federated reinforcement learning (FedRL) framework termed AFedPG, which constructs a global model through collaboration among NN agents using policy gradient (PG) updates. To address the challenge of lagged policies in asynchronous settings, we design a delay-adaptive lookahead technique \textit{specifically for FedRL} that can effectively handle heterogeneous arrival times of policy gradients. We analyze the theoretical global convergence bound of AFedPG, and characterize the advantage of the proposed algorithm in terms of both the sample complexity and time complexity. Specifically, our AFedPG method achieves O(ϵ2.5N)\mathcal{O}(\frac{{\epsilon}^{-2.5}}{N}) sample complexity for global convergence at each agent on average. Compared to the single agent setting with O(ϵ2.5)\mathcal{O}(\epsilon^{-2.5}) sample complexity, it enjoys a linear speedup with respect to the number of agents. Moreover, compared to synchronous FedPG, AFedPG improves the time complexity from O(tmaxN)\mathcal{O}(\frac{t_{\max}}{N}) to O(i=1N1ti)1\mathcal{O}({\sum_{i=1}^{N} \frac{1}{t_{i}}})^{-1}, where tit_{i} denotes the time consumption in each iteration at agent ii, and tmaxt_{\max} is the largest one. The latter complexity O(i=1N1ti)1\mathcal{O}({\sum_{i=1}^{N} \frac{1}{t_{i}}})^{-1} is always smaller than the former one, and this improvement becomes significant in large-scale federated settings with heterogeneous computing powers (tmaxtmint_{\max}\gg t_{\min}). Finally, we empirically verify the improved performance of AFedPG in four widely-used MuJoCo environments with varying numbers of agents. We also demonstrate the advantages of AFedPG in various computing heterogeneity scenarios.

1051PEER Pressure: Model-to-Model Regularization for Single Source Domain Generalization

[openreview] [pdf]

Abstract Neural networks are frequently deployed on multiple unseen target domains, which are distributionally different from the source domain on which the model is trained. Data augmentation is the most popular tool for single source domain generalization, which expands the source domain by generating simulated ones, commonly adopted by existing approaches. In this work, we observe that the performance of such augmentation-based methods in the target domains frequently fluctuates during training, posing challenges in model selection under realistic scenarios. We argue that the fluctuation stems from the inability of the model to accumulate the knowledge learned from diverse augmentations, exacerbating feature distortion during training. Based on this observation, we propose a novel generalization method, coined Parameter-Space Ensemble with Entropy Regularization (PEER), that uses a proxy model to learn the augmented data on behalf of the main model. The main model is updated by averaging its parameters with the proxy model, progressively accumulating knowledge over the training steps. Maximizing the mutual information between the output representations of the two models guides the learning process of the proxy model, mitigating feature distortion during training. Extensive experimental results demonstrate the effectiveness of PEER in reducing the OOD performance fluctuation and enhancing generalization across various datasets, including PACS, Digits, Office-Home, and VLCS. Notably, our method with simple random augmentation achieves state-of-the-art performance, surpassing prior approaches on sDG that utilize complex data augmentation strategies.

1052Memory-Efficient Algorithm Distillation for In-context Reinforcement Learning

[openreview] [pdf]

Abstract It’s recently reported that by employing the superior In-context Learning (ICL) ability of autoregressive Transformer, a method named Algorithm Distillation\textit{Algorithm Distillation} (AD) could distill the whole Reinforcement Learning process into neural network then generalize to unseen\textit{unseen} scenarios with performance comparable to the distilled algorithm. However, to enable ICL, it’s vital for self-attention module to have a context that spans cross-episodes histories and contains thousands of tokens. Such a long-range context and the quadratic memory complexity of self-attention pose difficulty on applying AD into many common RL tasks. On the other hand, designing memory efficient Transformers for long-range document modeling\textit{long-range document modeling} is itself a fast-developing and fruitful field, which leads to a natural question: Could Efficient Transformers exhibit similar in-context learning ability and be used for Memory-Efficient Algorithm Distillation?\textit{Could Efficient Transformers exhibit similar in-context learning ability and be used for Memory-Efficient Algorithm Distillation?} In this paper, we firstly build a benchmark suite that is thorough, efficient and flexible. Thanks to it, we perform extensive experiments and verify an existing method named ERNIE-Docs\textit{ERNIE-Docs} (ED) could offer competitive performance with significantly reduced memory footprint. With systematic ablation studies, we further investigate various facets influencing the ICL ability of ED and provide our own insights into its hyperparameter tuning.

1053Transformers Learn Temporal Difference Methods for In-Context Reinforcement Learning

[openreview] [pdf]

Abstract Traditionally, reinforcement learning (RL) agents learn to solve new tasks by updating their parameters through interactions with the task environment. However, recent works have demonstrated that transformer-based RL agents, after certain pretraining procedures, can learn to solve new out-of-distribution tasks without parameter updates, a phenomenon known as in-context reinforcement learning (ICRL). The empirical success of ICRL is widely attributed to the hypothesis that the forward pass of these models implements an RL algorithm. However, no prior works have demonstrated a precise equivalence between a forward pass and any specific RL algorithm, even in simplified settings like transformers with linear attention. In this paper, we present the first proof by construction demonstrating that transformers with linear attention can implement temporal difference (TD) learning in the forward pass — referred to as in-context TD. We also provide theoretical analysis and empirical evidence demonstrating the emergence of in-context TD after training the transformer with a multi-task TD algorithm, offering the first constructive explanation for transformers’ ability to perform in-context reinforcement learning.

1054Online Auction for Ads and Organics

[openreview] [pdf]

Abstract This paper introduces the first online blending auction mechanism design for sponsored items (ads) alongside organic items (organics), ensuring guaranteed Pareto optimality for platform revenue, advertiser utilities, and user interest (measured through clicks). We innovatively define an umbrella term, “traffic item,” to encompass both organics and auctionable ad items, where an organic represents a unit of traffic to be auctioned, valued positively by attracting user interest with a fixed zero bid and payment. The online blending traffic distribution problem is thus transformed into an auction problem with unified valuation metric for the traffic item, which is subsequently formulated as an online multi-objective constrained optimization problem. We derive a Pareto equation for this optimization problem, characterizing the optimal auction mechanism set by its solution set. This solution is implemented through a novel two-stage Adaptive Modeled Mechanism Design (AMMD), which (1) trains a hypernetwork to learn a family of parameterized mechanisms, each corresponding to a specific solution of the Pareto equation, and (2) employs feedback-based online control to adaptively adjust the mechanism parameters, ensuring real-time optimality in a dynamic environment. Extensive experiments demonstrate that AMMD outperforms existing methods in both click-through rates and revenue across multiple auction scenarios, particularly highlighting its adaptability to online environments. The code has been submitted and will be released publicly.

1055ToMA: Token Merging with Attention For Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have emerged as leading models for image generation. Plug-and-play token merging techniques have recently been introduced to mitigate the heavy computation cost of transformer blocks in diffusion models. However, existing methods overlook two key factors: 1. they fail to incorporate modern efficient implementation of attention, so that, the overhead backfires the achieved algorithmic efficiency 2. the selection of token to merge ignores the relation among tokens, limiting the image quality. In this paper, we propose Token Merging with Attention(ToMA) with three major improvements. Firstly, we utilize submodular-based token selection method to identify diverse tokens as merge destination, representative of the entire token set. Secondly, we propose attention merge, utilizing the efficient attention implementation, to perform the merge with negligible overhead. Also we abstract the (un-)merging as (inverse-)linear transformations which also allows shareable transformation across layers/iterations. Finally, we utilize the image locality to further accelerate the computation by performing all the operations on tokens in local tiles.

1056Exploring Data Distillation for efficient generation of Tabular Data

[openreview] [pdf]

Abstract Tabular data generation methods have emerged to address growing concerns about the use of sensitive tabular data for training machine learning models. Many methods focus on creating high-quality tabular data that can be used in place of the original dataset while retaining generalization performance on downstream tasks and protecting sensitive data in an era where privacy is paramount. Despite their avid success, many of the methods face implacable challenges and obstacles to wide-scale applications primarily due to the significant computational costs associated with data synthesis. In this paper, we propose a flexible data distillation pipeline as an alternative to conventional synthetic data generators that obtain competitive privacy metrics while achieving significantly higher downstream performance at a fraction of the compute costs. In particular, our method has accelerated data synthesis by 5×5\times on average when compared to synthetic generators while also achieving superior performance.

1057Temporal Adaptive Convolutional Intervention Network for Counterfactual Estimation: A Domain Generalization Perspective

[openreview] [pdf]

Abstract Accurate estimation of time-varying treatment effects is crucial for optimizing interventions in personalized medicine. However, observational data often contains complex confounding bias and temporal complexities, making counterfactual estimation challenging. We propose Temporal Adaptive Convolutional Intervention Network (TACIN), a novel model that introduces an Intervention-aware Functional Convolution kernel to emphasize the role of treatments and capture complex temporal treatment interactions. TACIN addresses confounding bias from a domain generalization perspective, approximating the unknown target domain using adversarial examples and incorporating Sharpness-Aware Minimization to derive a generalization bound. This approach is more suitable for longitudinal settings compared to existing methods inspired by domain adaptation techniques due to inherent differences between static and longitudinal contexts. Experiments on simulated datasets demonstrate TACIN’s superior performance compared to state-of-the-art models for counterfactual estimation over time.

1058MaxInfoRL: Boosting exploration in reinforcement learning through information gain maximization

[openreview] [pdf]

Abstract Reinforcement learning (RL) algorithms aim to balance exploiting the current best strategy with exploring new options that could lead to higher rewards. Most common RL algorithms use undirected exploration, i.e., select random sequences of actions. Exploration can also be directed using intrinsic rewards, such as curiosity or model epistemic uncertainty. However, effectively balancing task and intrinsic rewards is challenging and often task-dependent. In this work, we introduce a framework, MaxInfoRL, for balancing intrinsic and extrinsic exploration. MaxInfoRL steers exploration towards informative transitions, by maximizing intrinsic rewards such as the information gain about the underlying task. When combined with Boltzmann exploration, this approach naturally trades off maximization of the value function with that of the entropy over states, rewards, and actions. We show that our approach achieves sublinear regret in the simplified setting of multi-armed bandits. We then apply this general formulation to a variety of off-policy model-free RL methods for continuous state-action spaces, yielding novel algorithms that achieve superior performance across hard exploration problems and complex scenarios such as visual control tasks.

1059Specialized Foundation Models struggle to beat Supervised Baselines

[openreview] [pdf]

Abstract Following its success for vision and text, the “foundation model” (FM) paradigm—pretraining large models on massive data, then fine-tuning on target tasks—has rapidly expanded to domains in the sciences, engineering, healthcare, and beyond. Has this achieved what the original FMs accomplished, i.e. the supplanting of traditional supervised learning in their domains? To answer we look at three modalities—genomics, satellite data, and time series—with multiple recent FMs and compare them to a standard supervised learning workflow: model development, hyperparameter tuning, and training, all using only data from the target task. Across those three specialized domains, we find that it is consistently possible to train simple supervised models—no more complicated than a lightly modified wide ResNet or UNet—that match or even outperform the latest foundation models. Our work demonstrates that the benefits of large-scale pretraining have yet to be realized in many specialized areas, reinforces the need to compare new FMs to strong, well-tuned baselines, and introduces two new, easy-to-use, open-source, and automated workflows for doing so.

1060Exploiting Open-World Data for Adaptive Continual Learning

[openreview] [pdf]

Abstract Continual learning (CL), which involves learning from sequential tasks without forgetting, is mainly explored in supervised learning settings where all data are labeled. However, high-quality labeled data may not be readily available at a large scale due to high labeling costs, making the application of existing CL methods in real-world scenarios challenging. In this paper, we study a more practical facet of CL: open-world continual learning, where the training data comes from the open-world dataset and is partially labeled and non-i.i.d. Building on the insight that task shifts in CL can be viewed as distribution transitions from known classes to novel classes, we propose OpenACL, a method that explicitly leverages novel classes in unlabeled data to enhance continual learning. Specifically, OpenACL considers novel classes within open-world data as potential classes for upcoming tasks and mines the underlying pattern from them to empower the model’s adaptability to upcoming tasks. Furthermore, learning from extensive unlabeled data also helps to tackle the issue of catastrophic forgetting. Extensive experiments validate the effectiveness of OpenACL and show the benefit of learning from open-world data.

1061Learning to Search from Demonstration Sequences

[openreview] [pdf]

Abstract Search and planning are essential for solving many real-world problems. However, in numerous learning scenarios, only action-observation sequences, such as demonstrations or instructions sequences, are available for learning. Relying solely on supervised learning with these sequences can lead to sub-optimal performance due to the vast, unseen search space encountered during training. In this paper, we introduce Differentiable Tree Search Network (D-TSN), a novel neural network architecture that learns to construct search trees from just sequences of demonstrations by performing gradient descent on a best-first search tree construction algorithm. D-TSN enables the joint learning of submodules, including an encoder, value function, and world model, which are essential for planning. To construct the search tree, we employ a stochastic tree expansion policy and formulate it as another decision-making task. Then, we optimize the tree expansion policy via REINFORCE, and introduce an effective variance reduction technique for the gradient computation. D-TSN can be applied to problems with a known world model or to scenarios where it needs to jointly learn a world model with a latent state space. We study problems from these two scenarios, including Game of 24, 2D grid navigation, and Procgen games, to understand when is D-TSN more helpful. Through our experiments, we show that D-TSN is effective, especially when the world model with a latent state space is jointly learned.

[openreview] [pdf]

Abstract Resolving conflicts is essential to make the decisions of multi-view classification more reliable. Much research has been conducted on learning consistent and informative representations among different views, often assuming that all views are equally important and perfectly aligned. However, real-world multi-view data may not always conform to these assumptions, as some views may express distinct information. To address this issue, we develop a computational trust-based discounting method to enhance the existing Evidential Multi-view framework in scenarios where conflicts between different views may arise. Its belief fusion process considers the reliability of predictions made by individual views via an instance-wise probability-sensitive trust discounting mechanism. We evaluate our method on six real-world datasets, using Top-1 Accuracy, Fleiss’ Kappa, and a new metric called Multi-View Agreement with Ground Truth that takes into consideration the ground truth labels, to measure the reliability of the prediction. We also evaluate whether uncertainty measures can effectively indicate prediction correctness by calculating the AUROC. The experimental results show that computational trust can effectively resolve conflicts, paving the way for more reliable multi-view classification models in real-world applications.

1063Bootstrap Sampling Rate Greater than 1.0 May Improve Random Forest Performance

[openreview] [pdf]

Abstract Random forests utilize bootstrap sampling to create an individual training set for each component tree. This involves sampling with replacement, with the number of instances equal to the size of the original training set (NN). Research literature indicates that drawing fewer than NN observations can also yield satisfactory results. The ratio of the number of observations in each bootstrap sample to the total number of training instances is called the bootstrap rate (BR). Sampling more than NN observations (BR >> 1) has been explored in the literature only to a limited extent and has generally proven ineffective. In this paper, we re-examine this approach using 36 diverse datasets and consider BR values ranging from 1.2 to 5.0. Contrary to previous findings, we show that such parameterization can result in statistically significant improvements in classification accuracy compared to standard settings (BR \leq 1). Furthermore, we investigate what the optimal BR depends on and conclude that it is more a property of the dataset than a dependence on the random forest hyperparameters. Finally, we develop a binary classifier to predict whether the optimal BR is \leq 1 or >> 1 for a given dataset, achieving between 81.88% and 88.81% accuracy, depending on the experiment configuration. The code is available at: <<placeholder>>.

1064Forget but Recall: Incremental Latent Rectification in Continual Learning

[openreview] [pdf]

Abstract Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, LRB learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.

1065The Perils of Optimizing Learned Reward Functions: Low Training Error Does Not Guarantee Low Regret

[openreview] [pdf]

Abstract In reinforcement learning, specifying reward functions that capture the intended task can be very challenging. Reward learning aims to address this issue by learning the reward function. However, a learned reward model may have a low loss on the training distribution, and yet subsequently produce a policy with large regret. We say that such a reward model has an error-regret mismatch. The main source of an error-regret mismatch is the distribution shift that commonly occurs during policy optimization. In this paper, we mathematically show that a sufficiently low expected test error of the reward model guarantees low worst-case regret, but that for any fixed expected test error, there exist realistic data distributions that allow for error-regret mismatch to occur. We then show that similar problems persist even when using policy regularization techniques, commonly employed in methods such as RLHF. Our theoretical results highlight the importance of developing new ways to measure the quality of learned reward models.

1066Revisiting Convergence: A Study on Shuffling-Type Gradient Methods

[openreview] [pdf]

Abstract Shuffling-type gradient methods are favored in practice for their simplicity and rapid empirical performance. Despite extensive development of convergence guarantees under various assumptions in recent years, most require the Lipschitz smoothness condition, which is often not met in common machine learning models. We highlight this issue with specific counterexamples. To address this gap, we revisit the convergence rates of shuffling-type gradient methods without assuming Lipschitz smoothness. Using our stepsize strategy, the shuffling-type gradient algorithm not only converges under weaker assumptions but also match the current best-known convergence rates, thereby broadening its applicability. We prove the convergence rates for nonconvex, strongly convex, and non-strongly convex cases, each under both random reshuffling and arbitrary shuffling schemes, and under bounded or sub-Gaussian gradient noise. Numerical experiments further validate the performance of our shuffling-type gradient algorithm, underscoring its practical efficacy.

1067Hybrid Regularization Improves Diffusion-based Inverse Problem Solving

[openreview] [pdf]

Abstract Diffusion models, recognized for their effectiveness as generative priors, have become essential tools for addressing a wide range of visual challenges. Recently, there has been a surge of interest in leveraging Denoising processes for Regularization (DR) to solve inverse problems. However, existing methods often face issues such as mode collapse, which results in excessive smoothing and diminished diversity. In this study, we perform a comprehensive analysis to pinpoint the root causes of gradient inaccuracies inherent in DR. Drawing on insights from diffusion model distillation, we propose a novel approach called Consistency Regularization (CR), which provides stabilized gradients without the need for ODE simulations. Building on this, we introduce Hybrid Regularization (HR), a unified framework that combines the strengths of both DR and CR, harnessing their synergistic potential. Our approach proves to be effective across a broad spectrum of inverse problems, encompassing both linear and nonlinear scenarios, as well as various measurement noise statistics. Experimental evaluations on benchmark datasets, including FFHQ and ImageNet, demonstrate that our proposed framework not only achieves highly competitive results compared to state-of-the-art methods but also offers significant reductions in wall-clock time and memory consumption.

1068Learn hybrid prototypes for multivariate time series anomaly detection

[openreview] [pdf]

Abstract In multivariate time series anomaly detection (MTSAD), reconstruction-based models reconstruct testing series with learned knowledge of only normal series and identify anomalies with higher reconstruction errors. In practice, over-generalization often occurs with unexpectedly well reconstruction of anomalies. Although memory banks are employed by reconstruction-based models to fight against over-generalization, these models are only efficient to detect point anomalies since they learn normal prototypes from time points, leaving contextual anomalies and periodical anomalies to be discovered. To settle this problem, this paper propose a hybrid prototypes learning model for MTSAD based on reconstruction, named as H-PAD. First, normal prototypes are learned from different sizes of patches for time series to discover short-term anomalies. These prototypes in different sizes are integrated together to reconstruct query series so that any anomalies would be smoothed off and high reconstruction errors are produced. Furthermore, period prototypes are learned to discover periodical anomalies. One period prototype is memorized for one variable of query series. Finally, extensive experiments on five benchmark datasets show the effectiveness of H-PAD with state-of-the-art performance.

1069Bellman Unbiasedness: Toward Provably Efficient Distributional Reinforcement Learning with General Value Function Approximation

[openreview] [pdf]

Abstract Distributional reinforcement learning improves performance by effectively capturing environmental stochasticity. However, existing research on its regret analysis has relied heavily on structural assumptions that are difficult to implement in practice. In particular, there has been little attention to the infeasibility issue of dealing with the infinite-dimensionality of a distribution. To overcome this infeasibility, we present a regret analysis of distributional reinforcement learning with general value function approximation in a finite episodic Markov decision process setting throughstatistical functional dynamic programming. We first introduce a key notion ofBellman unbiasednesswhich is essential for exactly learnable and provably efficient updates. Our theoretical results demonstrate that the only way to exactly capture statistical information, including nonlinear statistical functionals, is by representing the infinite-dimensional return distribution with a finite number of moment functionals. Secondly, we propose a provably efficient algorithm,SF-LSVI, that achieves a tight regret bound of O~(dEH32K)\tilde{O}(d_E H^{\frac{3}{2}}\sqrt{K}) where HH is the horizon, KK is the number of episodes, and dEd_E is the eluder dimension of a function class.

1070Mixture of Parrots: Experts improve memorization more than reasoning

[openreview] [pdf]

Abstract The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate.We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be easily solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.

1071Motion Inversion for Video Customization

[openreview] [pdf]

Abstract In this work, we present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video generative models. Recognizing the unique challenges posed by the spatiotemporal nature of video, our method introduces Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video. These embeddings are designed to integrate seamlessly with the temporal transformer modules of video diffusion models, modulating self-attention computations across frames without compromising spatial integrity. Our approach provides a compact and efficient solution to motion representation, utilizing two types of embeddings: a Motion Query-Key Embedding to modulate the temporal attention map and a Motion Value Embedding to modulate the attention values. Additionally, we introduce an inference strategy that excludes spatial dimensions from the Motion Query-Key Embedding and applies a differential operation to the Motion Value Embedding, both designed to debias appearance and ensure the embeddings focus solely on motion. Our contributions include the introduction of a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method through extensive experiments.

1072Mixture of insighTful Experts (MoTE): The Synergy of Thought Chains and Expert Mixtures in Safety Self-Alignment

[openreview] [pdf]

Abstract As the capabilities of large language models (LLMs) have expanded dramatically, aligning these models with human values presents a significant challenge. Recent studies demonstrate that powerful LLMs can achieve self-alignment by either correcting their initial unsafe responses or autonomously ranking answers without human intervention. In this work, we identify two key limitations: first, they rely on the assumed emergent capabilities of LLMs, and second, they discard all intermediate reasoning steps when aligning the model with updated answers. To address these challenges, we propose a novel self-alignment method that utilizes a Chain of Thought (CoT) approach, termed AlignCoT. This method encompasses stages of Question Analysis, Answer Guidance, and Safe Answer production. It is designed to enable LLMs, even smaller and weaker models like 7B LLMs, to produce high-quality, safe responses. Furthermore, we introduce the Mixture of insighTful Experts (MoTE) architecture, which applies mixture of experts to enhance each component of the AlignCoT process, markedly increasing alignment efficiency. The MoTE approach not only outperforms existing methods in aligning LLMs with human values but also highlights the benefits of using self-generated data, revealing the dual benefits of improved alignment and training efficiency.

1073Dynamic Influence Tracker: Estimating Sample Influence in SGD-Trained Models across Arbitrary Time Windows

[openreview] [pdf]

Abstract Understanding how training samples affect models improves model interpretability, optimization strategies, and anomaly detection. However, existing methods for estimating sample influence provide only static assessments, rely on restrictive assumptions, and require high computational costs. We propose Dynamic Influence Tracker (DIT), a novel method to estimate time-varying sample influence in models trained with Stochastic Gradient Descent (SGD). DIT enables fine-grained analysis of sample influence within arbitrary time windows during training through a two-phase algorithm. The training phase efficiently captures and stores necessary information about the SGD trajectory, while the inference phase computes the influence of samples on the model within a specified time window. We provide a theoretical error bound for our estimator without assuming convexity, showing its reliability across various learning scenarios. Our experimental results reveal the evolution of sample influence throughout the training process, enhancing understanding of learning dynamics. We show DIT’s effectiveness in improving model performance through anomalous sample detection and its potential for advancing curriculum learning.

1074Kick Bad Guys Out! Conditionally Activated Anomaly Detection in Federated Learning with Zero-Knowledge Proof Verification

[openreview] [pdf]

Abstract Federated Learning (FL) systems are susceptible to adversarial attacks, where malicious clients submit poisoned models to disrupt the convergence or plant backdoors that cause the global model to misclassify some samples. Current defense methods are often impractical for real-world FL systems, as they either rely on unrealistic prior knowledge or cause accuracy loss even in the absence of attacks. Furthermore, these methods lack a protocol for verifying execution, leaving participants uncertain about the correct execution of the mechanism. To address these challenges, we propose a novel anomaly detection strategy that is designed for real-world FL systems. Our approach activates the defense only when potential attacks are detected, and enables the removal of malicious models without affecting the benign ones. Additionally, we incorporate zero-knowledge proofs to ensure the integrity of the proposed defense mechanism. Experimental results demonstrate the effectiveness of our approach in enhancing FL system security against a comprehensive set of adversarial attacks in various ML tasks.

1075Neighbor-aware Geodesic Transportation for Neighborhood Refinery

[openreview] [pdf]

Abstract Neighborhood refinery aims to enhance the neighbor relationships by refining the original distance matrix to ensure pairwise consistency. Traditional context-based methods, which encode instances alongside their local neighbors in a contextual affinity space, are limited in capturing global relationships and are vulnerable to the negative impacts of outliers in the neighborhood. To overcome these limitations, we propose a novel Neighbor-aware Geodesic Transportation (NGT) for the neighborhood refinery. NGT first constructs a global-aware distribution for each instance, capturing the intrinsic manifold relationships among all instances. This is followed by an optimization transportation process that utilizes the global-aware distribution within the underlying manifold, incorporating global geometric spatial information to generate a refined distance. NGT first involves Manifold-aware Neighbor Encoding (MNE) to project each instance into a global-aware distribution by constraining pairwise similarity with the corresponding affinity graph to capture global relationships. Subsequently, a Regularized Barycenter Refinery (RBR) module is proposed to integrate local neighbors into a barycenter, employing a Wasserstein term to reduce the influence of outliers. Lastly, Geodesic Transportation (GT) leverages geometric and global context information to transport the barycenter distribution along the geodesic paths within the affinity graph. Extensive evaluations on several tasks, such as re-ranking and deep clustering, demonstrate the superiority of our proposed NGT.

1076Distributed Quasi-Newton Method for Fair and Fast Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) is a promising technology that enables edge devices/clients to collaboratively and iteratively train a machine learning model under the coordination of a central server. The most common approach to FL is first-order methods, where clients send their local gradients to the server in each iteration. However, these methods often suffer from slow convergence rates. As a remedy, second-order methods, such as quasi-Newton, can be employed to accelerate convergence. Unfortunately, similarly to the first-order FL methods, the application of second-order methods in FL can lead to unfair models, achieving high average accuracy while performing poorly on certain clients’ local datasets. To tackle this issue, in this paper we introduce a novel second-order FL framework, dubbed distributed quasi-Newton federated learning (DQN-Fed). This approach seeks to ensure fairness while leveraging the fast convergence properties of quasi-Newton methods in the FL context. Specifically, DQN-Fed helps the server update the global model in such a way that (i) all local loss functions decrease to promote fairness, and (ii) the rate of change in local loss functions aligns with that of the quasi-Newton method. We prove the convergence of DQN-Fed and demonstrate its linear-quadratic convergence rate. Moreover, we validate the efficacy of DQN-Fed across a range of federated datasets, showing that it surpasses state-of-the-art fair FL methods in both fairness and convergence speed. The Code for paper is publicly available athttps://github.com/ICMLDQNFed/ICMLDQN.

1077CoS: Enhancing Personalization and Mitigating Bias with Context Steering

[openreview] [pdf]

Abstract To deliver high-quality, personalized responses, large language models (LLMs) must effectively incorporate \textit{context} — personal, demographic, and cultural information specific to an end-user. For example, asking the model to explain Newton’s second law with the context \textit{I am a toddler’‘} should produce a response different from when the context is \textit{I am a physics professor’'}. However, leveraging the context in practice is a nuanced and challenging task, and is often dependent on the specific situation or user base. The model must strike a balance between providing specific, personalized responses and maintaining general applicability. Current solutions, such as prompt-engineering and fine-tuning require collection of contextually appropriate responses as examples, making them time-consuming and less flexible to use across different contexts. In this work, we introduce \textbf{Context Steering (CoS)} —a simple, training-free decoding approach that amplifies the influence of the \textit{context} in next token predictions. CoS computes contextual influence by comparing the output probabilities from two LLM forward passes: one that includes the context and one that does not. By linearly scaling the contextual influence, CoS allows practitioners to flexibly control the degree of personalization for different use cases. We show that CoS can be applied to autoregressive LLMs, and demonstrates strong performance in personalized recommendations. Additionally, we show that CoS can function as a Bayesian Generative model to infer and quantify correlations between open-ended texts, broadening its potential applications.

1078Towards Distributed Backdoor Attacks with Network Detection in Decentralized Federated Learning

[openreview] [pdf]

Abstract Distributed backdoor attacks (DBA) have shown a higher attack success rate than centralized attacks in centralized federated learning (FL). However, it has not been investigated in the decentralized FL. In this paper, we experimentally demonstrate that, while directly applying DBA to decentralized FL, the attack success rate depends on the distribution of attackers in the network architecture. Considering that the attackers can not decide their location, this paper aims to achieve a high attack success rate regardless of the attackers’ location distribution. Specifically, we first design a method to detect the network by predicting the distance between any two attackers on the network. Then, based on the distance, we organize the attackers in different clusters. Lastly, we propose an algorithm to \textit{dynamically} embed local patterns decomposed from a global pattern into the different attackers in each cluster. We conduct a thorough empirical investigation and find that our method can, in benchmark datasets, outperform both centralized attacks and naive DBA in different decentralized frameworks.

1079How to Correctly Do Semantic Backpropagation on Language-based Agentic Systems

[openreview] [pdf]

Abstract Language-based agentic systems have shown great promise in recent years, transitioning from solving small-scale research problems to being deployed in challenging real-world tasks. However, optimizing these systems often requires substantial manual labor. Recent studies have demonstrated that these systems can be represented as computational graphs, enabling automatic optimization. Despite these advancements, most current efforts in Graph-based Agentic System Optimization (GASO) fail to properly assign feedback to the system’s components given feedback on the system’s output. To address this challenge, we formalize the concept of semantic backpropagation with semantic gradients—a generalization that aligns several key optimization techniques, including reverse-mode automatic differentiation and the more recent TextGrad by exploiting the relationship among nodes with a common successor. This serves as a method for computing directional information about how changes to each component of an agentic system might improve the system’s output. To use these gradients, we propose a method called semantic gradient descent which enables us to solve GASO effectively. Our results on both BIG-Bench Hard and GSM8K show that our approach outperforms existing state-of-the-art methods for solving GASO problems. A detailed ablation study on the LIAR dataset demonstrates the parsimonious nature of our method.

1080Adversarial Suffixes May Be Features Too!

[openreview] [pdf]

Abstract Despite significant ongoing efforts in safety alignment, large language models (LLMs) such as GPT-4 and LLaMA 3 remain vulnerable to jailbreak attacks that can induce harmful behaviors, including those triggered by adversarial suffixes. Building on prior research, we hypothesize that these adversarial suffixes are not mere bugs but may represent features that can dominate the LLM’s behavior. To evaluate this hypothesis, we conduct several experiments. First, we demonstrate that benign features can be effectively made to function as adversarial suffixes, i.e., we develop a feature extraction method to extract sample-agnostic features from benign dataset in the form of suffixes and show that these suffixes may effectively compromise safety alignment. Second, we show that adversarial suffixes generated from jailbreak attacks may contain meaningful features, i.e., appending the same suffix to different prompts results in responses exhibiting specific characteristics. Third, we show that such benign-yet-safety-compromising features can be easily introduced through fine-tuning using only benign datasets, i.e., even in the absence of harmful content. This highlights the critical risk posed by dominating benign features in the training data and calls for further research to reinforce LLM safety alignment. Our code and data is available at \url{https://github.com/anonymous}.

1081Robust Domain Generalisation with Causal Invariant Bayesian Neural Networks

[openreview] [pdf]

Abstract Deep neural networks can obtain impressive performance on various tasks under the assumption that their training domain is identical to their target domain. Performance can drop dramatically when this assumption does not hold. One explanation for this discrepancy is the presence of spurious domain-specific correlations in the training data that the network exploits. Causal mechanisms, in the other hand, can be made invariant under distribution changes as they allow disentangling the factors of distribution underlying the data generation. Yet, learning causal mechanisms to improve out-of-distribution generalisation remains an under-explored area. We propose a Bayesian neural architecture that disentangles the learning of the the data distribution from the inference process mechanisms. We show theoretically and experimentally that our model approximates reasoning under causal interventions. We demonstrate the performance of our method, outperforming point estimate-counterparts, on out-of-distribution image recognition tasks where the data distribution acts as strong adversarial confounders.

1082Focus On This, Not That! Steering LLMs With Adaptive Feature Specification

[openreview] [pdf]

Abstract Despite the success of Instruction Tuning (IT) in training large language models (LLMs) to perform arbitrary user-specified tasks, these models often still leverage spurious or biased features learned from their training data, leading to undesired behaviours when deploying them in new contexts. In this work, we introduceFocus Instruction Tuning(FIT), which trains LLMs to condition their responses by ‘‘focusing on’’ specific features whilst ignoring others, leading to different behaviours based on which features are specified. Across several experimental settings, we show that focus-tuned models can be adaptively steered by focusing on different features at inference-time, such as (a) improving robustness by focusing on task-causal features and ignoring spurious features, and (b) mitigating bias by ignoring demographic categories. Furthermore, FIT can steer behaviour in new contexts, generalising under distribution shift and to new unseen features at inference time, thereby facilitating more robust, fair, and explainable LLM applications in real-world environments.

1083Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data

[openreview] [pdf]

Abstract Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D data with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering high-quality data and rewriting inaccurate captions. Leveraging this pipeline, we have generated 1 million high-quality synthetic multi-view images with dense descriptive captions to address the shortage of high-quality 3D data. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and maintained view consistency.

1084IMPLICIT VARIATIONAL REJECTION SAMPLING

[openreview] [pdf]

Abstract Variational Inference (VI) is a cornerstone technique in Bayesian machine learning, employed to approximate complex posterior distributions. However, traditional VI methods often rely on mean-field assumptions, which may inadequately capture the true posterior’s complexity. To address this limitation, recent advancements have utilized neural networks to model implicit distributions, thereby offering increased flexibility. Despite this, the practical constraints of neural network architectures can still result in inaccuracies in posterior approximations. In this work, we introduce a novel method called Implicit Variational Rejection Sampling (IVRS), which integrates implicit distributions with rejection sampling to enhance the approximation of the posterior distribution. Our method employs neural networks to construct implicit proposal distributions and utilizes rejection sampling with a meticulously designed acceptance probability function. A discriminator network is employed to estimate the density ratio between the implicit proposal and the true posterior, thereby refining the approximation. We propose the Implicit Resampling Evidence Lower Bound (IR-ELBO) as a metric to characterize the quality of the resampled distribution, enabling the derivation of a tighter variational lower bound. Experimental results demonstrate that our method outperforms traditional variational inference techniques in terms of both accuracy and efficiency, leading to significant improvements in inference performance. This work not only showcases the effective combination of implicit distributions and rejection sampling but also offers a novel perspective and methodology for advancing variational inference.

1085Progress or Regress? Self-Improvement Reversal in Post-training

[openreview] [pdf]

Abstract Self-improvement through post-training methods such as iterative preference learning has been acclaimed for enhancing the problem-solving capabilities (e.g., mathematical reasoning) of Large Language Models (LLMs) without human intervention. However, as our exploration deepens, it is crucial to critically assess whether these enhancements indeed signify comprehensive progress or if they could lead to unintended regressions. Through rigorous experimentation and analysis across diverse problem-solving tasks, we uncover nuances in the self-improvement trajectories of LLMs. Our study introduces the concept of \emph{self-improvement reversal}, where models showing improved overall accuracy metrics might paradoxically exhibit declines in broader, essential capabilities. We propose a comprehensive evaluative framework to scrutinize the underlying mechanisms and outcomes of post-training self-improvement, aiming to discern between superficial metric improvements and genuine enhancements in model functionality. The findings emphasize the complexity of technological advancements in LLMs, underscoring the need for a nuanced understanding of the \textit{progress or regress} dichotomy in their development.

1086Bundle Neural Network for message diffusion on graphs

[openreview] [pdf]

Abstract The dominant paradigm for learning on graphs is message passing. Despite being a strong inductive bias, the local message passing mechanism faces challenges such as over-smoothing, over-squashing, and limited expressivity. To address these issues, we introduce Bundle Neural Networks (BuNNs), a novel graph neural network architecture that operates viamessage diffusiononflat vector bundles— geometrically inspired structures that assign to each node a vector space and an orthogonal map. A BuNN layer evolves node features through a diffusion-type partial differential equation, where its discrete form acts as a special case of the recently introduced Sheaf Neural Network (SNN), effectively alleviating over-smoothing. The continuous nature of message diffusion enables BuNNs to operate at larger scales, reducing over-squashing. We establish the universality of BuNNs in approximating feature transformations on infinite families of graphs with injective positional encodings, marking the first positive uniform expressivity result of its kind. We support our claims with formal analysis and synthetic experiments. Empirically, BuNNs perform strongly on heterophilic and long-range tasks, which demonstrates their robustness on a diverse range of challenging real-world tasks.

1087Demystifying amortized causal discovery with transformers

[openreview] [pdf]

Abstract Supervised learning for causal discovery from observational data often achieves competitive performance despite seemingly avoiding the explicit assumptions that traditional methods require for identifiability. In this work, we analyze CSIvA (Ke et al., 2023b) on bivariate causal models, a transformer architecture for amortized inference promising to train on synthetic data and transfer to real ones. First, we bridge the gap with identifiability theory, showing that the training distribution implicitly defines a prior on the causal model of the test observations: consistent with classical approaches, good performance is achieved when we have a good prior on the test data, and the underlying model is identifiable. Second, we find that CSIvA can not generalize to classes of causal models unseen during training: to overcome this limitation, we show that learning on datasets generated from different types of causal models, unambiguously identifiable in isolation, improves the test generalization. We analyze this empirical evidence with theory, illustrating that the ambiguous cases resulting from the mixture of identifiable causal models are unlikely to occur. Overall, we find that amortized causal discovery still adheres to identifiability theory, violating the previous hypothesis from Lopez-Paz et al. (2015) that supervised learning methods could overcome its restrictions.

1088Exchange of Perspective Prompting Enhances Reasoning in Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) have made significant advancements in addressing diverse natural language processing (NLP) tasks. However, their performance is often limited by inherent comprehension of problems. To address this limitation, we propose Exchange-of-Perspective (EoP), a novel framework designed to incorporate external perspectives by swapping answers for the same question presented with different definitions. We conducted extensive and comprehensive experiments on seven benchmarks. The results show that EoP can significantly improve performance. For instance, compared to the non-commutative baseline PHP, with GPT-3.5-Turbo and EoP, we observe a 3.6% improvement on AQuA (60.6% → 64.2%), while GPT-4-powered EoP achieves a 7.7% overall accuracy improvement on Math (53.9% → 61.6%).

1089TGTOD: A Global Temporal Graph Transformer for Outlier Detection at Scale

[openreview] [pdf]

Abstract Graph outlier detection aims to identify anomalous substructures in graphs that deviate significantly from normal patterns. Traditional methods primarily focus on static graphs, overlooking the dynamic nature of real-world networks and ignoring valuable temporal signals crucial for outlier detection. While Transformers have revolutionized machine learning on time-series data, existing Transformers for temporal graphs face limitations in (1) restricted receptive fields, (2) overhead of subgraph extraction, and (3) suboptimal generalization capability beyond link prediction. In this paper, we propose TGTOD, a novel end-to-end Temporal Graph Transformer for Outlier Detection. TGTOD employs global attention to model both structural and temporal dependencies within temporal graphs. To tackle scalability, our approach divides large temporal graphs into spatiotemporal patches, which are then processed by a hierarchical Transformer architecture comprising Patch Transformer, Cluster Transformer, and Temporal Transformer. We evaluate TGTOD on three public datasets under two settings, comparing with a wide range of baselines. Our experimental results demonstrate the effectiveness of TGTOD, achieving AP improvement of 61% on Elliptic dataset. Furthermore, our efficiency evaluation shows that TGTOD reduces training time by 44×compared to existing Transformers for temporal graphs. To foster reproducibility, we make our implementation publicly available athttps://anonymous.4open.science/r/tgtod.

1090Towards Full Delegation: Designing Ideal Agentic Behaviors for Travel Planning

[openreview] [pdf]

Abstract How are LLM-based agents used in the future? While many of the existing work on agents has focused on improving the performance of a specific family of objective and challenging tasks, in this work, we take a different perspective by thinking about full delegation: agents take over humans’ routine decision-making processes and are trusted by humans to find solutions that fit people’s personalized needs and are adaptive to ever-changing context. In order to achieve such a goal, the behavior of the agents, i.e., agentic behaviors, should be evaluated not only on their achievements (i.e., outcome evaluation), but also how they achieved that (i.e., procedure evaluation). For this, we propose APEC Agent Constitution, a list of criteria that an agent should follow for good agentic behaviors, including Accuracy, Proactivity, Efficiency and Credibility. To verify whether APEC aligns with human preferences, we develop APEC-Travel, a travel planning agent that proactively extracts hidden personalized needs via multi-round dialog with travelers. APEC-Travel is constructed purely from synthetic data generated by Llama3.1-405B-Instruct with a diverse set of travelers’ persona to simulate rich distribution of dialogs. Iteratively fine-tuned to follow APEC Agent Constitution, APEC-Travel surpasses baselines by 20.7% on rule based metrics and 9.1% on LLM-as-a-Judge scores across the constitution axes.

1091Outcome-Driven Action Flexibility for Robust Offline Reinforcement Learning

[openreview] [pdf]

Abstract We address the challenge of offline reinforcement learning using realistic data, specifically non-expert data collected through sub-optimal behavior policies. A primary concern is that the learned policy must be conservative enough to manage \textit{distribution shift} while maintaining sufficient flexibility for generalization. To tackle this issue, we introduce a novel method called Outcome-Driven Action Flexibility (ODAF), which seeks to reduce reliance on the empirical action distribution of the behavior policy. Specifically, we develop a new reward mechanism that evaluates whether the subsequent states, following the current policy, meet specified performance requirements (e.g., safety—remaining within the state support area), rather than solely depending on the characteristics of the actions taken (e.g., whether the action imitates the behavior policy). Besides theoretical justification, we provide empirical evidence on widely used D4RL benchmarks, demonstrating that our ODAF method, implemented using uncertainty quantification techniques, effectively tolerates unseen transitions for improved “trajectory stitching,” while enhancing the agent’s ability to learn from realistic non-expert data.

1092Goal-Conditioned Reinforcement Learning with Virtual Experiences

[openreview] [pdf]

Abstract Goal-conditioned reinforcement learning often employs a technique known as Hindsight Experience Replay (HER) for data augmentation by relabeling goals. However, HER limits goal relabeling to a single trajectory, which hinders the utilization of experiences from diverse trajectories. To address this issue, we present a curriculum learning method to construct virtual experiences, incorporating actual state transitions and virtual goals selected from the replay buffer. Considering that virtual experiences may contain a lot of noise, we also propose a self-supervised subgoal planning method that guides the learning of virtual experiences by imitating the subgoal-conditioned policy. Our intuition is that achieving a virtual goal may be challenging for the goal-conditioned policy, whereas simplified subgoals can provide effective guidance. We empirically show that the virtual experiences from diverse historical trajectories significantly boost the sample-efficiency compared to the existing goal-conditioned reinforcement learning and hierarchical reinforcement learning methods, even enabling the agent to learn tasks it has never experienced.

1093Type-II Saddles and Probabilistic Stability of Stochastic Gradient Descent

[openreview] [pdf]

Abstract Characterizing and understanding the dynamics of stochastic gradient descent (SGD) around saddle points remains an open problem in neural network optimization. We identify two distinct types of saddle points, demonstrating that Type-II saddles pose a significant challenge due to vanishing gradient noise, which makes them particularly difficult for SGD to escape. We show that the dynamics around these saddles can be effectively modeled by a random matrix product process, allowing us to apply concepts from probabilistic stability and Lyapunov exponents. By leveraging ergodic theory, we establish that saddle points can be either attractive or repulsive for SGD, leading to a classification of four distinct dynamic phases based on the gradient’s signal-to-noise ratio near the saddle. We apply the theory to the training at the initial stage of neural networks, explaining an intriguing phenomenon that neural networks are prone to be stuck at the initialization point at a larger learning rate. Our results offer a novel theoretical framework for understanding the intricate behavior of SGD around saddle points, with implications for improving optimization strategies in deep learning.

1094Pairwise Elimination with Instance-Dependent Guarantees for Bandits with Cost Subsidy

[openreview] [pdf]

Abstract Multi-armed bandits (MAB) are commonly used in sequential online decision-making when the reward of each decision is an unknown random variable. In practice, however, the typical goal of maximizing total reward may be less important than minimizing the total cost of the decisions taken, subject to a reward constraint. For example, we may seek to make decisions that have at least the reward of a reference ``default’’ decision. This problem was recently introduced in the Multi-Armed Bandits with Cost Subsidy (MAB-CS) framework. MAB-CS is broadly applicable to problem domains where a primary metric (cost) is constrained by a secondary metric (reward), and there is an inability to explicitly determine the trade-off between these metrics. In our work, we first introduce the Pairwise-Elimination algorithm for a simplified variant of the cost subsidy problem with a known reference arm. We then generalize PE to PE-CS to solve the MAB-CS problem in the setting where the reference arm is the un-identified optimal arm. Next, we analyze the performance of both PE and PE-CS on the dual metrics of Cost and Quality Regret. Our instance-dependent analysis of PE and PE-CS reveals that both algorithms have an order-wise logarithmic upper bound on Cost and Quality Regret, making our policy the first with such a guarantee. Finally, experiments are conducted using the MovieLens 25M dataset for both PE and PE-CS and using a synthetic toy experiment for PE-CS revealing that our method invariably outperforms the ETC-CS baseline from the literature.

1095One Communication Round is All It Needs for Federated Fine-Tuning Foundation Models

[openreview] [pdf]

Abstract The recent advancement of large foundation models (FMs) has increased the demand for fine-tuning these models on large-scale and cross-domain datasets. To address this, federated fine-tuning has emerged as a solution, allowing models to be fine-tuned on distributed datasets across multiple devices while ensuring data privacy. However, the substantial parameter size of FMs and the multi-round communication required by traditional federated fine-tuning algorithms result in prohibitively high communication costs, challenging the practicality of federated fine-tuning. In this paper, we are the first to reveal, both theoretically and empirically, that the traditional multi-round aggregation algorithms may not be necessary for federated fine-tuning large FMs. Our experiments reveal that a single round of communication (i.e., one-shot federated fine-tuning) yields a global model performance comparable to that achieved through multiple rounds of communication. Through rigorous mathematical and empirical analyses, we demonstrate that large FMs, due to their extensive parameter sizes and pre-training on general tasks, achieve significantly lower training loss in one-shot federated fine-tuning compared to smaller models. Our extensive experiments show that one-shot federated fine-tuning not only reduces communication costs but also enables asynchronous aggregation, enhances privacy, and maintains performance consistency with multi-round federated fine-tuning for models larger than 1 billion parameters, on text generation and text-to-image generation tasks. Our findings have the potential to revolutionize federated fine-tuning in practice, enhancing efficiency, reducing costs, and expanding accessibility for large-scale models. This breakthrough paves the way for broader adoption and application of federated fine-tuning across various domains.

1096A Temporally Correlated Latent Exploration for Reinforcement Learning

[openreview] [pdf]

Abstract Efficient exploration remains one of the longstanding problems of deep reinforcement learning. Instead of depending solely on extrinsic rewards from the environments, existing methods use intrinsic rewards to enhance exploration. However, we demonstrate that these methods are vulnerable to Noisy TV and stochasticity. To tackle this problem, we propose Temporally Correlated Latent Exploration (TeCLE), which is a novel intrinsic reward formulation that employs an action-conditioned latent space and temporal correlation. The action-conditioned latent space models the probability distribution of states, thereby avoiding the assignment of excessive intrinsic rewards to unpredictable states and effectively addressing both problems. Whereas previous works inject temporal correlation for action selection, the proposed method injects it for intrinsic reward computation. We find that the injected temporal correlation determines the exploratory behaviors of agents. Various experiments show that the environment where the agent performs well depends on the amount of temporal correlation. To the best of our knowledge, the proposed TeCLE is the first approach to consider the action-conditioned latent space and temporal correlation for curiosity-driven exploration. We prove that the proposed TeCLE can be robust to the Noisy TV and stochasticity in benchmark environments, including Minigrid and Stochastic Atari.

1097EXPLORING THE IMPACT OF DATA AUGMENTATION ON LOCALIZED PERSONALIZED AI TRAINING WITH LLAMA3 AND LORA

[openreview] [pdf]

Abstract With the development of personalized AI models, particularly those emulating characters from novels, games, anime, and films, a significant challenge is the scarcity of suitable dialogue data. These works often feature distinctive styles and character dialogues that may not generalize well to everyday conversations. Data augmentation is crucial for enriching these limited datasets, ensuring sufficient data for learning the target character’s tone and linguistic habits. This paper investigates the impact of various data augmentation techniques on personalized AI models in NLP, specifically focusing on models trained using LLaMA3 through Low-Rank Adaptation (LoRA). We employ different data augmentation strategies, including random deletion, synonym replacement, swapping, random insertion, back translation, and paraphrasing. To provide a comprehensive analysis, we apply these techniques across three distinct datasets, each representing different dialogue styles and contexts. By systematically comparing these methods, we demonstrate their influence on model performance and robustness. This study provides valuable insights into the effectiveness of different data augmentation strategies for enhancing the versatility and robustness of personalized AI systems trained with LLaMA3 using LoRA.

1098Diffusion On Syntax Trees For Program Synthesis

[openreview] [pdf]

Abstract Large language models generate code one token at a time. Their autoregressive generation process lacks the feedback of observing the program’s output. Training LLMs to suggest edits directly can be challenging due to the scarcity of rich edit data. To address these problems, we propose neural diffusion models that operate on syntax trees of any context-free grammar. Similar to image diffusion models, our method also inverts “noise” applied to syntax trees. Rather than generating code sequentially, we iteratively edit it while preserving syntactic validity, which makes it easy to combine this neural model with search. We apply our approach to inverse graphics tasks, where our model learns to convert images into programs that produce those images. Combined with search, our model is able to write graphics programs, see the execution result, and debug them to meet the required specifications. We additionally show how our system can write graphics programs for hand-drawn sketches. Video results can be found athttps://td-anon.github.io.

1099PARSE-Ego4D: Personal Action Recommendation Suggestions for Ego-Centric Videos

[openreview] [pdf]

Abstract Intelligent assistance involves not only understanding but also action. Existing ego-centric video datasets contain rich annotations of the videos, but not of actions that an intelligent assistant could perform in the moment. To address this gap, we release PARSE-Ego4D, a new set of personal action recommendation annotations for the Ego4D dataset. We take a multi-stage approach to generating and evaluating these annotations. First, we used a prompt-engineered large language model (LLM) to generate context-aware action suggestions and identified over 18,000 action suggestions. While these synthetic action suggestions are valuable, the inherent limitations of LLMs necessitate human evaluation. To ensure high-quality and user-centered recommendations, we conducted a large-scale human annotation study that provides grounding in human preferences for all of PARSE-Ego4D. We analyze the inter-rater agreement and evaluate subjective preferences of participants. Based on our synthetic dataset and complete human annotations, we propose several new tasks for action suggestions based on ego-centric videos. We encourage novel solutions that improve latency and energy requirements. The annotations in PARSE-Ego4D will support researchers and developers who are working on building action recommendation systems for augmented and virtual reality systems.

1100Reward-free Policy Optimization with World Models

[openreview] [pdf]

Abstract As AI capabilities advance, their rapid progress is not keeping pace with the need for safe and value-aligned algorithms, raising concerns about autonomous systems. E.g., maximizing expected return in reinforcement learning can lead to unintended and potentially harmful consequences. This work introduces Reward-free Policy Optimization (RFPO), a method that prioritizes goal-oriented policy learning over reward maximization by eliminating rewards as the agent’s learning signal. Our approach learns a world model that simulates backward in time, and then uses it to construct a directed graph for planning, and finally learning a goal-conditioned policy from the graph. The algorithm has two requirements: (1) the goal has to be defined, and (2) the agent needs sufficient world knowledge, enabling it to plan. This method removes the risks associated with reward hacking and discourages unintended behaviors by allowing for human oversight. Additionally, it provides a framework for humans to build transparent and high-level algorithms by using the (low-level) learned policies. We demonstrate the effectiveness of RFPO on maze environments with pixel observations, where the agent successfully reaches arbitrarily selected goals and follows human-designed algorithms. In conclusion, RFPO enables agents to learn policies without rewards and provides a framework for creating high-level behaviors.

1101FlightBench: Benchmarking Learning-based Methods for Ego-vision-based Quadrotors Navigation

[openreview] [pdf]

Abstract Ego-vision-based navigation in cluttered environments is crucial for mobile systems, particularly agile quadrotors. While learning-based methods have shown promise recently, head-to-head comparisons with cutting-edge optimization-based approaches are scarce, leaving open the question of where and to what extent they truly excel. In this paper, we introduce FlightBench, the first comprehensive benchmark that implements various learning-based methods for ego-vision-based navigation and evaluates them against mainstream optimization-based baselines using a broad set of performance metrics. Additionally, we develop a suite of criteria to assess scenario difficulty and design test cases that span different levels of difficulty based on these criteria. Our results show that while learning-based methods excel in high-speed flight and faster inference, they struggle with challenging scenarios like sharp corners or view occlusion. Analytical experiments validate the correlation between our difficulty criteria and flight performance. We hope this benchmark and these criteria will drive future advancements in learning-based navigation for ego-vision quadrotors. The source code and documentation is available athttps://github.com/Anonymous314159265358/FlightBench.

1102Quantile Regression for Distributional Reward Models in RLHF

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) has become a key method for aligning large language models (LLMs) with human preferences through the use of reward models. However, traditional reward models typically generate point estimates, which oversimplify the diversity and complexity of human values and preferences. In this paper, we introduce Quantile Reward Models (QRMs), a novel approach to reward modeling that learns a distribution over rewards instead of a single scalar value. Our method uses quantile regression to estimate a full, potentially multimodal distribution over preferences, providing a more powerful and nuanced representation of preferences. This distributional approach can better capture the diversity of human values, addresses label noise, and accommodates conflicting preferences by modeling them as distinct modes in the distribution. Our experimental results show that QRM outperforms comparable traditional point-estimate models on RewardBench. Furthermore, we demonstrate that the additional information provided by the distributional estimates can be utilized in downstream applications, such as risk-aware reinforcement learning, resulting in LLM policies that generate fewer extremely negative responses. Our code and model will be released.

1103Generalization through variance: how noise shapes inductive biases in diffusion models

[openreview] [pdf]

Abstract How diffusion models generalize beyond their training set is not known, and is somewhat mysterious given two facts: the optimum of the denoising score matching (DSM) objective usually used to train diffusion models is the score function of the training distribution; and the networks usually used to learn the score function are expressive enough to learn this score to high accuracy. We claim that a certain feature of the DSM objective—the fact that its target is not the training distribution’s score, but a noisy quantity only equal to it in expectation—strongly impacts whether and to what extent diffusion models generalize. In this paper, we develop a mathematical theory that partly explains this ‘generalization through variance’ phenomenon. Our theoretical analysis exploits a physics-inspired path integral approach to compute the distributions typically learned by a few paradigmatic under- and overparameterized diffusion models. We find that the distributions diffusion models effectively learn to sample from resemble their training distributions, but with `gaps’ filled in, and that this inductive bias is due to the covariance structure of the noisy target used during training. We also characterize how this inductive bias interacts with feature-related inductive biases.

1104Distill Visual Chart Reasoning Ability from LLMs to MLLMs

[openreview] [pdf]

Abstract Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs). Recent studies highlight that these abilities consist of two main parts: recognizing key information from visual inputs and conducting reasoning over it. Thus, a promising approach to enhance MLLMs is to construct relevant training data focusing on the two aspects. However, collecting and annotating complex charts and questions is costly and time-consuming, and ensuring the quality of annotated answers remains a challenge. In this paper, we propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs. The code serves as an intermediary that translates visual chart representations into textual representations, enabling LLMs to understand cross-modal information. Specifically, we employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs to enhance both recognition and reasoning abilities. Experiments show that when fine-tuned with our data, models not only perform well on chart-related benchmarks, but also demonstrate improved multimodal reasoning abilities on general mathematical benchmarks such as MathVista.

1105Flat Posterior Does Matter For Bayesian Model Averaging

[openreview] [pdf]

Abstract Bayesian neural network (BNN) approximates the posterior distribution of model parameters and utilizes the posterior for prediction via Bayesian Model Averaging (BMA). The quality of the posterior approximation is critical for achieving accurate and robust predictions. It is known that flatness in the loss landscape is strongly associated with generalization performance, and it necessitates consideration to improve the quality of the posterior approximation. In this work, we empirically demonstrate that BNNs often struggle to capture the flatness. Moreover, we provide both experimental and theoretical evidence showing that BMA can be ineffective without ensuring flatness. To address this, we propose Sharpness-Aware Bayesian Model Averaging (SA-BMA), a novel optimizer that seeks flat posteriors by calculating divergence in the parameter space. SA-BMA aligns with the intrinsic nature of BNN and the generalized version of existing sharpness-aware optimizers for DNN. In addition, we suggest a Bayesian Transfer Learning scheme to efficiently leverage pre-trained DNN. We validate the efficacy of SA-BMA in enhancing generalization performance in few-shot classification and distribution shift by ensuring flat posterior.

1106Offline RL with Smooth OOD Generalization in Convex Hull and its Neighborhood

[openreview] [pdf]

Abstract Offline Reinforcement Learning (RL) struggles with distributional shifts, leading to the QQ-value overestimation for out-of-distribution (OOD) actions. Existing methods address this issue by imposing constraints; however, they often become overly conservative when evaluating OOD regions, which constrains the QQ-function generalization. This over-constraint issue results in poor QQ-value estimation and hinders policy improvement. In this paper, we introduce a novel approach to achieve better QQ-value estimation by enhancing QQ-function generalization in OOD regions within Convex Hull and its Neighborhood (CHN). Under the safety generalization guarantees of the CHN, we propose the Smooth Bellman Operator (SBO), which updates OOD QQ-values by smoothing them with neighboring in-sample QQ-values. We theoretically show that SBO approximates true QQ-values for both in-sample and OOD actions within the CHN. Our practical algorithm, Smooth Q-function OOD Generalization (SQOG), empirically alleviates the over-constraint issue, achieving near-accurate QQ-value estimation. On the D4RL benchmarks, SQOG outperforms existing state-of-the-art methods in both performance and computational efficiency.

1107Reality Only Happens Once: Single-path Generalization Bounds for Transformers

[openreview] [pdf]

Abstract One of the inherent challenges in deploying transformers on time series is that \emph{reality only happens once}; namely, one typically only has access to a single trajectory of the data-generating process comprised of non-i.i.d.\ observations. We derive non-asymptotic statistical guarantees in this setting through bounds on the \textit{generalization} of a transformer network at a future-time tt, given that it has been trained using NtN\le t observations from a single perturbed trajectory of a {bounded and exponentially ergodic} Markov process. We obtain a generalization bound which effectively converges at the rate of O(1/N)\mathcal{O}(1/\sqrt{N}). Our bound depends explicitly on the activation function (Swish\operatorname{Swish}, GeLU\operatorname{GeLU}, or tanh\tanh are considered), the number of self-attention heads, depth, width, and norm-bounds defining the transformer architecture. Our bound consists of three components: (I) The first quantifies the gap between the stationary distribution of the data-generating Markov process and its distribution at time tt, this term converges exponentially to 0. (II) The next term encodes the complexity of the transformer model and, given enough time, eventually converges to 0 at the rate O(log(N)r/N)\mathcal{O}(\log(N)^r/\sqrt{N}) for any r>0r>0. (III) The third term guarantees that the bound holds with probability at least 1-δ\delta, and converges at a rate of O(log(1/δ)/N)\mathcal{O}(\sqrt{\log(1/\delta)}/\sqrt{N}).Example of (non i.i.d.) data-generating processes which we can treat are the projection of several SDEs onto a compact convex set CC, and bounded Markov processes satisfying a log-Sobolev inequality.

1108Training Nonlinear Transformers for Chain-of-Thought Inference: A Theoretical Generalization Analysis

[openreview] [pdf]

Abstract Chain-of-Thought (CoT) is an efficient prompting method that enables the reasoning ability of large language models by augmenting the query using multiple examples with multiple intermediate steps. Despite the empirical success, the theoretical understanding of how to train a Transformer to achieve the CoT ability remains less explored. This is primarily due to the technical challenges involved in analyzing the nonconvex optimization on nonlinear attention models. To the best of our knowledge, this work provides the first theoretical study of training Transformers with nonlinear attention to obtain the CoT generalization capability so that the resulting model can inference on unseen tasks when the input is augmented by examples of the new task. We first quantify the required training samples and iterations to train a Transformer model towards CoT ability. We then prove the success of its CoT generalization on unseen tasks with distribution-shifted testing data. Moreover, we theoretically characterize the conditions for an accurate reasoning output by CoT even when the provided reasoning examples contain noises and are not always accurate. In contrast, in-context learning (ICL), which can be viewed as one-step CoT without intermediate steps, may fail to provide an accurate output when CoT does. These theoretical findings are justified through experiments.

1109Lookers-On See Most of the Game: An External Insight-Guided Method for Enhancing Uncertainty Estimation

[openreview] [pdf]

Abstract Large Language Models (LLMs) have gained increasing attention for their impressive capabilities, alongside concerns about the reliability arising from their potential to generate hallucinations and factual inaccuracies. Uncertainty estimation for LLMs aims to quantify the uncertainty of model outputs, where high uncertainty scores indicate potential errors, signaling the need for rejection or further evaluation. However, existing methods often limited by inherent biases of LLMs like over-confidence and under-confidence. In this paper, we propose an external insight-driven correction method for refining uncertainty estimation. This method integrates uncertainty scores derived from a lightweight model trained on global information with those from existing uncertainty estimation approaches, providing a more robust solution. We present comprehensive experimental results that demonstrate the effectiveness and generalizability of our method across various models, datasets, and consistently surpassing all baselines.

1110Debiasing Online Preference Learning via Preference Feature Preservation

[openreview] [pdf]

Abstract While various preferred features determine human preferences, current preference learning frameworks for large language models (LLMs) simplify them with binary pairwise comparisons and scalar rewards. This simplification could make LLMs’ responses biased to mostly preferred features such as longer responses which would be exacerbated in online learning scenarios as the biases can be accumulate continuously throughout the iterations. To address these challenges, we propose a novel framework called PFP (Preference Feature Preservation). The key idea of PFP is maintaining the distribution of human preference features throughout the online preference learning process. Specifically, PFP first trains a feature classifier using the existing offline pairwise human preference data. Then, using this classifier and the distribution preserving optimization, PFP maps appropriate preference features for each input instruction during online learning. Lastly, PFP trains LLM using the existing preference learning framework, by incorporating the preference feature of each data into system prompts and enabling LLM to explicitly handle various human preferences. Our experiments demonstrate that PFP successfully mitigates the bias in preference features that arise during online learning, and achieves superior performance compared to previous preference learning methods on general benchmarks including AlpacaEval 2.0 and MT-Bench. We also observe that PFP almost resolves a length bias issue, a long-standing problem of online preference learning, even though it was not specifically designed to tackle this.

1111Enabling Pareto-Stationarity Exploration in Multi-Objective Reinforcement Learning: A Weighted-Chebyshev Multi-Objective Actor-Critic Approach

[openreview] [pdf]

Abstract In many multi-objective reinforcement learning (MORL) applications, being able to systematically explore the Pareto-stationary solutions under multiple non-convex reward objectives with theoretical finite-time sample complexity guarantee is an important and yet under-explored problem. This motivates us to take the first step and fill the important gap in MORL. Specifically, in this paper, we propose a weighted-Chebyshev multi-objective actor-critic (\policyns) algorithm for MORL, which uses multi-temporal-difference (TD) learning in the critic step and judiciously integrates the weighted-Chebychev (WC) and multi-gradient descent techniques in the actor step to enable systematic Pareto-stationarity exploration with finite-time sample complexity guarantee. Our proposed \policy algorithm achieves a sample complexity of O~(ϵ2pmin2)\tilde{\mathcal{O}}(\epsilon^{-2}p^{-2}_{\min}) in finding an ϵ\epsilon-Pareto-stationary solution, where pminp_{\min} denotes the minimum entry of a given weight vector pp in the WC-scarlarization. This result not only implies a state-of-the-art sample complexity that is independent of objective number MM, but also brand-new dependence result in terms of the preference vector pp. Furthermore, simulation studies on a large KuaiRand offline dataset, show that the performance of our \policy algorithm significantly outperforms other baseline MORL approaches.

1112PTE4TS: One Pre-Training Encoder is All Time Series Need

[openreview] [pdf]

Abstract In Natural Language Processing (NLP) and Computer Vision (CV) as well as myriad other domains, Large Models, especially pre-training models, have achieved significant breakthroughs. However, their advancements in the sphere of general Time Series Analysis (TSA) has been comparatively limited. The principal challenge lies in the dearth of extensive training data that is endemic to the field of TSA. This scarcity hampers the direct application of such pre-training models to time series data, resulting in unsatisfactory performance. Despite numerous attempts to adapt NLP or CV models, which have been pre-training on billions of tokens, to TSA to address this challenge, these pre-training models are not directly suitable for time series data. In this work, we introduce a new general Pre-Training Encoder specifically designed for Time Series analysis, called PTE4TS. It’s designed to be easily adaptable to a variety of downstream tasks, such as classification, anomaly detection, and forecasting. First, we revisited the masking methods in time series and found that patch masking, which was widely adopted previously, is not necessary. Therefore, we developed an improved masking model tailored to the characteristics of time series. Additionally, to address the issue of the Low-Rank structure in conventional bidirectional attention mechanisms, which may diminish the model’s expressiveness, we have developed a straightforward yet efficacious hybrid attention encoder. The combination of this encoder with our masking methods can improve the representation ability of the model. Finally, PTE4TS achieved state-of-the-art performance on several real-world datasets, further validating the potential of Large Model for general time series analysis. We hope that PTE4TS will not only open new perspectives in the field of TSA, enhancing feature representation and inferencing capabilities across various domains, but also lay the foundation for a general artificial intelligence that is capable of understanding and processing common time series data.

1113Distinct and Shared Concept Discovery for Fine-grained Concept Inversion

[openreview] [pdf]

Abstract A real-world object is expressed by composing distinctive characteristics that distinguish it from others and some common properties shared with different objects. Recent advances in generative modeling focus on identifying the shared concepts within images of individual identities. However, it remains unclear how to identify shared concepts beyond multiple identities while preserving the unique concepts inherent to each. In this work, we address this new problem of simultaneously discovering similarities and differences between two sets of images and propose a two-stage framework coined DISCOD (DIstinct and Shared COncept Discovery). In the first stage of DISCOD, we introduce information-regularized textual inversion, focusing on separating representative concepts distinctive from others while capturing the shared concepts among different objects. In the next stage, we further optimize them to align composited concepts of those with the corresponding objects, respectively. We demonstrate the effectiveness of DISCOD by showing that DISCOD discovers the concepts better than baselines, as measured by CLIPScore and success rate. The human study also validates the reasonable discovery capability of DISCOD. Furthermore, we show the practical applicability of our approach by applying to various applications: image editing, few-shot personalization of diffusion models, and group bias mitigation in recognition.

1114Towards Undistillable Models by Minimizing Conditional Mutual Information

[openreview] [pdf]

Abstract A deep neural network (DNN) is said to be undistillable if used as a black-box input-output teacher, it can not be distilled by knowledge distillation (KD) to train a student model so that the distilled student (called knockoff student) outperforms the student trained alone with label smoothing (LS student) in terms of prediction accuracy. To protect intellectual property of DNNs, it is desirable to build undistillable DNNs. To this end, it is first observed that an undistillable DNN may have the trait that each cluster of its output probability distributions in response to all sample instances with the same label should be highly concentrated to the extent that each cluster corresponding to each label should ideally collapse into one probability distribution. Based on this observation and by measuring the concentration of each cluster in terms of conditional mutual information (CMI), a new training method called CMI minimized (CMIM) method is proposed, which trains a DNN by jointly minimizing the conventional cross entropy (CE) loss and the CMI values of all temperature scaled clusters across the entire temperature spectrum. The resulting CMIM model is shown, by extensive experiments, to be undistillable by all tested KD methods existing in the literature. That is, the knockoff students distilled by these KD methods from the CMIM model underperform the respective LS students. In addition, the CMIM model is also shown to performs better than the model trained with the CE loss alone in terms of their own prediction accuracy.

1115Role of Momentum in Smoothing Objective Function and Generalizability of Deep Neural Networks

[openreview] [pdf]

Abstract For nonconvex objective functions, including deep neural networks, stochastic gradient descent (SGD) with momentum has faster convergence and better generalizability than SGD without momentum, but a theoretical explanation for this is lacking. Adding momentum is thought to reduce stochastic noise, but several studies have argued that stochastic noise actually contributes to the generalizability of the model, which raises a contradiction. We show that the stochastic noise in SGD with momentum smoothes the objective function, the degree of which is determined by the learning rate, the batch size, the momentum factor, the variance of the stochastic gradient, and the upper bound of the gradient norm. By numerically deriving the stochastic noise level in SGD with and without momentum, we provide theoretical findings that help explain the training dynamics of SGD with momentum, which were not explained by previous studies on convergence and stability, and that resolve the contradiction. We also provide experimental results for an image classification task using ResNets that support our assertion that model generalizability depends on the stochastic noise level.

1116Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is further pronounced in case of stochastic transitions. To improve the sample efficiency, reward shaping is a well-studied approach to introduce intrinsic rewards that can help the RL agent converge to an optimal policy faster. However, designing a useful reward shaping function for all desirable states in the Markov Decision Process (MDP) is challenging, even for domain experts. Given that Large Language Models (LLMs) have demonstrated impressive performance across a magnitude of natural language tasks, we aim to answer the following question: Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent’s sample efficiency?\textit{Can we obtain heuristics using LLMs for constructing a reward shaping function that can boost an RL agent's sample efficiency?} To this end, we aim to leverage off-the-shelf LLMs to generate a plan for an abstraction of the underlying MDP. We further use this LLM-generated plan as a heuristic to construct the reward shaping signal for the downstream RL agent. By characterizing the type of abstraction based on the MDP horizon length, we analyze the quality of heuristics when generated using an LLM, with and without a verifier in the loop. Our experiments across multiple domains with varying horizon length and number of sub-goals from the BabyAI environment suite, Household, Mario, and, Minecraft domain, show 1) the advantages and limitations of querying LLMs with and without a verifier to generate a reward shaping heuristic, and, 2) a significant improvement in the sample efficiency of PPO, A2C, and Q-learning when guided by the LLM-generated heuristics.

1117Scale-Aware Contrastive Reverse Distillation for Unsupervised Anomaly Detection

[openreview] [pdf]

Abstract Unsupervised anomaly detection using deep learning has garnered significant research attention due to its broad applicability, particularly in medical imaging where labeled anomalous data are scarce. While earlier approaches leverage generative models like autoencoders and generative adversarial networks (GANs), they often fall short due to overgeneralization. Recent methods explore various strategies, including memory banks, normalizing flows, self-supervised learning, and knowledge distillation, to enhance discrimination. Among these, knowledge distillation, particularly reverse distillation, has shown promise. Following this paradigm, we propose a novel scale-aware contrastive reverse distillation model that addresses two key limitations of existing reverse distillation methods: insufficient feature discriminability and inability to handle anomaly scale variations. Specifically, we introduce a contrastive student-teacher learning approach to derive more discriminative representations by generating and exploring out-of-normal distributions. Further, we design a scale adaptation mechanism to softly weight contrastive distillation losses at different scales to account for the scale variation issue. Extensive experiments on benchmark datasets demonstrate state-of-the-art performance, validating the efficacy of the proposed method. The code will be made publicly available.

1118Enhancing Diffusion Posterior Sampling for Inverse Problems by Integrating Crafted Measurements

[openreview] [pdf]

Abstract Diffusion models have emerged as a powerful foundation model for visual generation. With an appropriate sampling process, it can effectively serve as a generative prior to solve general inverse problems. Current posterior sampling based methods take the measurement (i.e., degraded image sample) into the posterior sampling to infer the distribution of the target data (i.e., clean image sample). However, in this manner, we show that high-frequency information can be prematurely introduced during the early stages, which could induce larger posterior estimate errors during the restoration sampling. To address this issue, we first reveal that forming the log posterior gradient with the noisy measurement ( i.e., samples from a diffusion forward process) instead of the clean one can benefit the reverse process. Consequently, we propose a novel diffusion posterior sampling method DPS-CM, which incorporates a Crafted Measurement (i.e., samples generated by a reverse denoising process, compared to random sampling with noise in standard methods) to form the posterior estimate. This integration aims to mitigate the misalignment with the diffusion prior caused by cumulative posterior estimate errors. Experimental results demonstrate that our approach significantly improves the overall capacity to solve general and noisy inverse problems, such as Gaussian deblurring, super-resolution, inpainting, nonlinear deblurring, and tasks with Poisson noise, relative to existing approaches.

1119Impact of Data Distribution on Fairness Guarantees in Equitable Deep Learning

[openreview] [pdf]

Abstract Fairness in machine learning is paramount to human society because machine learning systems increasingly influence various aspects of our daily lives, particularly in consequence-critical tasks such as medical diagnosis. Deep learning models for medical diagnosis often exhibit biased performance across diverse demographic groups. Theoretical analyses to understand unfairness in AI-based medical diagnosis systems are still lacking. This work presents a comprehensive theoretical analysis of the impact of disease prevalence and data distributions on the fairness guarantees of deep learning models for medical diagnosis. We formalize the fairness problem, introduce assumptions, and derive fairness error bounds, algorithmic complexity, generalization bounds, convergence rates, and group-specific risk bounds. Our analysis reveals that fairness guarantees are significantly influenced by the differences in disease prevalence rates and data distributions across demographic groups. We prove that considering fairness criteria can lead to better performance than standard supervised learning. Empirical results on diverse datasets, including FairVision, CheXpert, HAM10000 and FairFace, corroborate our theoretical findings, demonstrating the impact of disease prevalence and feature distribution disparities on the equitable performance of deep learning models for tasks such as glaucoma, diabetic retinopathy, age-related macular degeneration, and pleural effusion detection. The code for analysis is publicly available via \url{https://github.com/anonymous2research/fairness_guarantees}.

1120Physics-Informed Diffusion Models

[openreview] [pdf]

Abstract Generative models such as denoising diffusion models are quickly advancing their ability to approximate highly complex data distributions. They are also increasingly leveraged in scientific machine learning, where samples from the implied data distribution are expected to adhere to specific governing equations. We present a framework that unifies generative modeling and PDE fulfillment and meaningfully informs denoising diffusion models of such underlying constraints on generated samples during training. Our approach improves the alignment of the generated samples with the imposed constraints and significantly outperforms existing methods without affecting inference speed. Additionally, incorporating these constraints during training acts as a natural regularization mechanism against overfitting. Our framework is easy to implement and versatile in its applicability for imposing equality and inequality constraints as well as auxiliary optimization objectives.

1121Test-Time Adversarial Defense with Opposite Adversarial Path and high Attack time cost

[openreview] [pdf]

Abstract Deep learning models are known to be vulnerable to adversarial attacks by injecting sophisticated designed perturbations to input data. Training-time defenses still exhibit a significant performance gap between natural accuracy and robust accuracy. In this paper, we investigate a new test-time adversarial defense method via diffusion-based recovery along opposite adversarial paths (OAPs). We present a purifier that can be plugged into a pre-trained model to resist adversarial attacks. Different from prior arts, the key idea is excessive denoising or purification by integrating the opposite adversarial direction with reverse diffusion to push the input image further toward the opposite adversarial direction. For the first time, we also exemplify the pitfall of conducting AutoAttack (Rand) for diffusion-based defense methods. Through the lens of time complexity, we examine the trade-off between the effectiveness of adaptive attack and its computation complexity against our defense. Experimental evaluation along with time cost analysis verifies the effectiveness of the proposed method.

1122Multi-Agent Path Finding via Decision Transformer and LLM Collaboration

[openreview] [pdf]

Abstract Multi-Agent Path Finding (MAPF) is a significant problem with pivotal applications in robotics and logistics. The problem involves determining collision-free paths for multiple agents with specific goals in a 2D grid-world environment. Unfortunately, finding optimal solutions for MAPF is an NP-hard problem. Traditional centralized planning approaches are intractable for large numbers of agents and inflexible when adapting to dynamic changes in the environment. On the other hand, existing decentralized methods utilizing learning-based strategies suffer from two main drawbacks: (1) training takes times ranging from days to weeks, and (2) they often tend to exhibit self-centered agent behaviors leading to increased collisions. We introduce a novel approach leveraging the Decision Transformer (DT) architecture that enables agents to learn individual policies efficiently. We capitalize on the transformer’s capability for long-horizon planning and the advantages of offline reinforcement learning to drastically reduce training times to a few hours. We further show that integrating an LLM (GPT-4o), enhances the performance of DT policies in mitigating undesirable behaviors such as prolonged idling at specific positions and undesired deviations from goal positions. We focus our empirical evaluation on both scenarios with static environments and in dynamically changing environments where agents’ goals are altered during inference. Results demonstrate that incorporating an LLM for dynamic scenario adaptation in MAPF significantly enhances the agents’ performance and paves the way for more adaptable multi-agent systems.

1123Out-of-Distribution Detection using Synthetic Data Generation

[openreview] [pdf]

Abstract Distinguishing in- and out-of-distribution (OOD) inputs is crucial for reliable deployment of classification systems. However, OOD data is typically unavailable or difficult to collect, posing a significant challenge for accurate OOD detection. In this work, we present a method that harnesses the generative capabilities of Large Language Models (LLMs) to create high-quality synthetic OOD proxies, eliminating the dependency on any external OOD data source. We study the efficacy of our method on classical text classification tasks such as toxicity detection and sentiment classification as well as classification tasks arising in LLM development and deployment, such as training a reward model for RLHF and detecting misaligned generations. Extensive experiments on nine InD-OOD dataset pairs and various model sizes show that our approach dramatically lowers false positive rates (achieving a perfect zero in some cases) while maintaining high accuracy on in-distribution tasks, outperforming baseline methods by a significant margin.

1124Shallow diffusion networks provably learn hidden low-dimensional structure

[openreview] [pdf]

Abstract Diffusion models provide a powerful, general purpose framework for learning to sample from a target distribution. The remarkable empirical success of these models applied to high dimensional signals, including images and video frames, stands in stark contrast to the classical curse of dimensionality which arises in the general problem of learning distributions. In this work, we take a step towards understanding this gap. We show that learning diffusion models in Barron spaces---the function space of single-layer neural networks---provably adapts to simple forms of low dimensional structure. We combine our results with recent progress in analyzing the diffusion sampling process to provide end-to-end sample complexity bounds for learning to sample from structured distributions. Our results avoid exponential dependencies on the ambient dimension of the data, and instead reflect the intrinsic latent dimensionality of the underlying target distribution. Importantly, our results do not require specialized architectures which are specifically tailored for particular latent structures, and instead rely on the low-index structure of Barron classes to adapt to the underlying distribution.

1125Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

[openreview] [pdf]

Abstract Recent advances in text-to-image (T2I) diffusion models have significantly improved the quality of generated images. However, providing efficient control over individual subjects, particularly the attributes characterizing them, remains a key challenge. While existing methods have introduced mechanisms to modulate attribute expression, they typically provide either detailed, object-specific localization of such a modification or fine-grained, nuanced control of attributes. No current approach offers both simultaneously, resulting in a gap when trying to achieve precise continuous and subject-specific attribute modulation in image generation. In this work, we demonstrate that token-level directions exist within commonly used CLIP text embeddings that enable fine-grained, subject-specific control of high-level attributes in T2I models. We introduce two methods to identify these directions: a simple, optimization-free technique and a learning-based approach that utilizes the T2I model to characterize semantic concepts more specifically. Our methods allow the augmentation of the prompt text input, enabling fine-grained control over multiple attributes of individual subjects simultaneously, without requiring any modifications to the diffusion model itself. This approach offers a unified solution that fills the gap between global and localized control, providing competitive flexibility and precision in text-guided image generation.

1126Rethinking the Influence of Distribution Adjustment in Incremental Segmentation

[openreview] [pdf]

Abstract In an ever-changing world, incremental segmentation learning faces challenges due to the need for pixel-level accuracy and the practical application of gradually obtained samples. While most existing methods excel in stability by freezing model parameters or employing other regularization techniques to preserve the distribution of old knowledge, these approaches often fall short of achieving satisfactory plasticity. This phenomenon arises from the limited allocation of parameters for learning new knowledge. Meanwhile, in such a learning manner, the distribution of old knowledge cannot be optimized as new knowledge accumulates. As a result, the feature distribution of newly learned knowledge overlaps with old knowledge, leading to inaccurate segmentation performance on new classes and insufficient plasticity. This issue prompts us to explore how both old and new knowledge representations can be dynamically and simultaneously adjusted in the feature space during incremental learning. To address this, we conduct a mathematical structural analysis, which indicates that compressing the feature subspace and promoting sparse distribution is beneficial in allocating more space for new knowledge in incremental segmentation learning. Following compression principles, high-dimensional knowledge is projected into a lower-dimensional space in a contracted and dimensionally reduced manner. Regarding sparsity, the exclusivity of multiple peaks in Gaussian mixture distributions across different classes is preserved. Through effective knowledge transfer, both up-to-date and long-standing knowledge can dynamically adapt within a unified space, facilitating efficient adaptation to continuously incoming and evolving data. Extensive experiments across various incremental settings consistently demonstrate the significant improvements provided by our proposed method. In particular, regarding the plasticity of in the incremental stage, our approach outperforms the state-of-the-art method by 11.7% in MIoU scores for the challenging 10-1 setting. Source code is available in the supplementary materials.

1127Faster Rates for Private Adversarial Bandits

[openreview] [pdf]

Abstract We design new differentially private algorithms for the problems of adversarial bandits and bandits with expert advice. For adversarial bandits, we give a simple and efficient conversion of any non-private bandit algorithm to a private bandit algorithm. Instantiating our conversion with existing non-private bandit algorithms gives a regret upper bound of O(KTϵ)O\left(\frac{\sqrt{KT}}{\sqrt{\epsilon}}\right), improving upon the existing upper bound O(KTlog(KT)ϵ)O\left(\frac{\sqrt{KT \log(KT)}}{\epsilon}\right) for all ϵ1\epsilon \leq 1. In particular, our algorithms allow for sublinear expected regret even when ϵ1T\epsilon \leq \frac{1}{\sqrt{T}}, establishing the first known separation between central and local differential privacy for this problem. For bandits with expert advice, we give the first differentially private algorithms, with expected regret O(NTϵ),O(KTlog(N)log(KT)ϵ)O\left(\frac{\sqrt{NT}}{\sqrt{\epsilon}}\right), O\left(\frac{\sqrt{KT\log(N)}\log(KT)}{\epsilon}\right), and O~(N1/6K1/2T2/3log(NT)ϵ1/3+N1/2log(NT)ϵ)\tilde{O}\left(\frac{N^{1/6}K^{1/2}T^{2/3}\log(NT)}{\epsilon^{1/3}} + \frac{N^{1/2}\log(NT)}{\epsilon}\right), where KK and NN are the number of actions and experts respectively. These rates allow us to get sublinear regret for different combinations of small and large K,NK, N and ϵ.\epsilon.

1128Neural Exploratory Landscape Analysis

[openreview] [pdf]

Abstract Recent research in Meta-Black-Box Optimization (MetaBBO) have shown that meta-trained neural networks can effectively guide the design of black-box optimizers, significantly reducing the need for expert tuning and delivering robust performance across complex problem distributions. Despite their success, a paradox remains: MetaBBO still rely on human-crafted Exploratory Landscape Analysis features to inform the meta-level agent about the low-level optimization progress. To address the gap, this paper proposes Neural Exploratory Landscape Analysis (NeurELA), a novel framework that dynamically profiles landscape features through a two-stage, attention-based neural network, executed in an entirely end-to-end fashion. NeurELA is pre-trained over a variety of MetaBBO algorithms using a multi-task neuroevolution strategy. Extensive experiments show that NeurELA achieves consistently superior performance when integrated into different and even unseen MetaBBO tasks and can be efficiently fine-tuned for further performance boost. This advancement marks a pivotal step in making MetaBBO algorithms more autonomous and broadly applicable. The source code of NeurELA can be accessed athttps://anonymous.4open.science/r/Neur-ELA-303C.

1129Enhancing Software Agents with Monte Carlo Tree Search and Hindsight Feedback

[openreview] [pdf]

Abstract In complex and dynamic environments like software development, effective decision-making requires continuous adaptation, iterative learning, and strategic reconsideration. Current large language model (LLM)-based software agents often rely on rigid processes, limiting their ability to handle intricate, long-horizon tasks. These agents frequently fall into repetitive patterns, unable to assess the efficacy of their actions over time. To address these challenges, we propose SWE-search, a multi-agent framework that integrates Monte Carlo Tree Search (MCTS) with self-improvement mechanisms to enhance software agents’ performance in dynamic, repository-level tasks. SWE-search extends traditional MCTS by incorporating a hybrid value function that leverages LLMs for both numerical value estimation and qualitative evaluation. This combination enables self-feedback loops where agents iteratively refine their strategies based on both quantitative outcomes and the qualitative assessment of the paths taken. The framework includes a SWE-Agent for adaptive exploration, a Value Agent for iterative feedback, and a Discriminator Agent that facilitates multi-agent debate for collaborative decision-making. Applied to the SWE-Bench benchmark, our approach demonstrates a 23% relative improvement in performance across five models compared to standard open-source agents without MCTS. Our analysis reveals how performance scales with increased search breadth and identifies key factors that facilitate effective self-evaluation in software agents. This work highlights the potential of self-evaluation driven search techniques to enhance agent reasoning and planning in complex, dynamic software engineering environments.

1130Selective Concept Bottleneck Models Without Predefined Concepts

[openreview] [pdf]

Abstract Concept-based models like Concept Bottleneck Models (CBMs) have garnered significant interest for improving model interpretability by first predicting human-understandable concepts before mapping them to the output classes. Early approaches required costly concept annotations. To alleviate such, recent methods utilized large language models to automatically generate class-specific concept descriptions and learned mappings from a pretrained black-box model’s raw features to these concepts using vision-language models. However, these approaches assume prior knowledge of which concepts the black-box model has learned. In this work, we discover the concepts encoded by the model through unsupervised concept discovery techniques instead. We further propose an input-dependent concept selection mechanism that dynamically retains a sparse set of relevant concepts for each input, enhancing both sparsity and interpretability. Our approach not only improves downstream performance but also needs significantly fewer concepts for accurate classification. Lastly, we show how large vision-language models can guide the editing of our models’ weights to correct errors.

1131Robust Inverse Reinforcement Learning under State Adversarial Perturbations

[openreview] [pdf]

Abstract State adversarial perturbations –such as sensor noise, environmental interference, or targeted attacks– are common in real-world systems, often leading to compromised state observations. Despite this, Inverse Reinforcement Learning (IRL) in the context of State-Adversarial Markov Decision Processes (SA-MDPs) has received limited attention, primarily because conventional notions of optimality do not apply. In this paper, we introduce a novel definition of optimality that ensures the existence of an optimal policy within SA-MDPs. Building on this foundation, we propose the State-Adversarial Max-Margin IRL (SAMM-IRL) algorithm, designed for robustness against state adversarial perturbations. Our theoretical analysis, supported by empirical validation, demonstrates that SAMM-IRL significantly enhances IRL performance in adversarial environments, providing a robust framework for real-world applications that demand resilience.

1132Forewarned is Forearmed: Harnessing LLMs for Data Synthesis via Failure-induced Exploration

[openreview] [pdf]

Abstract Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data, leading to impressive performance across a range of downstream applications. Current methods often rely on human-annotated data or predefined task templates to direct powerful LLMs in synthesizing task-relevant data for effective model training. However, this dependence on manually designed components may constrain the scope of generated data, potentially overlooking critical edge cases or novel scenarios that could challenge the model. In this paper, we present a novel approach, \name, designed to automatically generate effective training samples that expose the weaknesses of LLMs. Specifically, we introduce a dedicated proposer trained to produce queries that lead target models to generate unsatisfactory responses. These failure-inducing queries are then used to construct training data, helping to address the models’ shortcomings and improve overall performance. Our approach is flexible and can be applied to models of various scales (3B, 7B, and 8B). We evaluate \name on three key applications—safety, honesty, and math—demonstrating that our generated data is both highly effective and diverse. Models fine-tuned with \name-generated data consistently outperform those trained on human-annotated or general model-generated data, offering a new perspective on data synthesis for task-specific LLM enhancement.

1133TIPS: Two-Level Prompt for Rehearsal-free Continual Learning

[openreview] [pdf]

Abstract Continual learning based on prompt tuning creates a key-value pool, where these key-value pairs are called prompts. Prompts are retrieved using input images as queries and input into a frozen backbone network. It requires training only a few parameters to quickly adapt to downstream tasks. Compared to other traditional Continual learning methods, it is more effective in resisting catastrophic forgetting. However, the effectiveness of these methods heavily depends on the selection strategy. Most existing methods overlook the model plasticity since they focus on solving the model’s stability issues, leading to a sharp decline in performance for new tasks in long task sequences of incremental learning. To address these limitations, we propose a novel prompt-based continual learning method called TIPS, which mainly consists of two modules: (1) design a novel two-level prompt selection strategy combined with a set of adaptive weights for sparse joint tuning, aiming to improve the accuracy of prompt selection; (2) design a semantic distillation module that enhances the generalization ability to unknown new classes by creating a language token and utilizing the encapsulated semantic information of class names. We validated TIPS on four datasets across three incremental scenarios. Our method outperformed the current state of the art (SOTA) by 2.03%, 4.78%, 1.18%, and 5.59% on CIFAR (10 tasks), ImageNet-R (20 tasks), CUB (10 tasks), and DomainNet (20 tasks). Notably, our approach consistently surpasses or matches SOTA in all settings, maintaining stable prompt selection accuracy throughout multiple incremental learning sessions.

1134Stick-breaking Attention

[openreview] [pdf]

Abstract The Transformer architecture’s self-attention mechanism traditionally relies on the softmax operator, necessitating positional embeddings like RoPE, or position biases to account for token order. But current methods still face length generalisation challenges. We propose an alternative attention mechanism based on the stick-breaking process: For each token before the current, we determine a break point βi,j\beta_{i,j}, which represents the proportion of the remaining stick to allocate to the current token. We repeat the process until the stick is fully allocated, resulting in a sequence of attention weights. This process naturally incorporates recency bias, which has linguistic motivations for grammar parsing (Shen et. al. 2017). We study the implications of replacing the conventional softmax-based attention mechanism with stick-breaking attention. We then discuss implementation of numerically stable stick-breaking attention and adapt Flash Attention to accommodate this mechanism. When used as a drop-in replacement for current softmax+RoPE attention systems, we find that stick-breaking attention performs competitively with current methods on length generalisation and downstream tasks. Stick-breaking also performs well at length generalisation, allowing a model trained with 211 context window to perform well at 214 with perplexity improvements.

1135Improving the Efficiency of Test-Time Search in LLMs with Backtracking

[openreview] [pdf]

Abstract Solving reasoning problems is an iterative multi-step computation, where a reasoning agent progresses through a sequence of steps, with each step logically building upon the previous one to reach a desired conclusion. If the desired solution is not attained, the agent must backtrack and try reasoning chains that are quite different from previous attempts. Though prior work such as test-time search against an outcome verifier can improve performance, most search is done in parallel via Best-of-N reranking, and independently for each attempt at a problem, thus wasting a significant amount of computation in sampling multiple full solutions even beyond the point that is needed. Can we reduce the total amount of computation by sharing information and computation across multiple attempts to a given problem? In this paper, we build a novel approach combining process verifiers that predict likelihoods of success \emph{per step} with preemptive backtracking to maximize performance per generated token. To do this, the PRM can be used to identify where a problematic step in a solution trace is by using the sensitivity of the predictions of the learned verifier and allowing the model to do focused resampling of the problematic portion of a solution. This approach can significantly reduce the amount of computation by leveraging partial computation from previous revisions. To further enhance the computational efficiency of inference, we introduce in-context process supervision, where the verifier is conditioned on the history of revisions that are attempted, reducing uncertainty in the verification decisions and improving the verifier’s confidence with each round of backtracking. This framework for iterative backtracking, leveraging in-context process supervision, enables an effective tradeoff between inference and model performance.

1136Plan B: Training LLMs to fail less severely

[openreview] [pdf]

Abstract Safety-trained LLMs can produce harmful responses across various input types, as shown by research on jailbreaks, data poisoning, and misalignment. Despite ongoing efforts, fully preventing such failures remains difficult. In this work, we propose a second line of defense: instead of solely focusing on eliminating harmful responses, we also aim to reduce their severity when they occur. As a case study, we experiment with an LLM trained to respond to a backdoor-trigger by complying with harmful requests. We fine-tune the model, without using the trigger in the training data, on the following pairwise preferences: (1) refusal is preferred over any harmful response, (2) less harmful responses are preferred over more harmful ones. We find that training on this preference ordering significantly reduces the harmfulness of backdoor-triggered responses. Finally, we demonstrate that our approach generalizes to several state-of-the-art jailbreak techniques.

1137Approximated Behavioral Metric-based State Projection for Federated Reinforcement Learning

[openreview] [pdf]

Abstract Federated reinforcement learning (FRL) methods usually share the encrypted local state or policy information and help each client to learn from others while preserving everyone’s privacy. In this work, we propose that sharing the approximated behavior metric-based state projection function is a promising way to enhance the performance of FRL and concurrently provides an effective protection of sensitive information. We introduce FedRAG, a FRL framework to learn a computationally practical projection function of states for each client and aggregating the parameters of projection functions at a central server. The FedRAG approach shares no sensitive task-specific information, yet provides information gain for each client. We conduct extensive experiments on the DeepMind Control Suite to demonstrate insightful results.

1138Learning Multiple Initial Solutions to Optimization Problems

[openreview] [pdf]

Abstract Sequentially solving similar optimization problems under strict runtime constraints is essential for many applications, such as robot control, autonomous driving, and portfolio management. The performance of local optimization methods in these settings is sensitive to the initial solution: poor initialization can lead to slow convergence or suboptimal solutions. To address this challenge, we propose learning to predict multiple diverse initial solutions given parameters that define the problem instance. We introduce two strategies for utilizing multiple initial solutions: (i) a single-optimizer approach, where the most promising initial solution is chosen using a selection function, and (ii) a multiple-optimizers approach, where several optimizers, potentially run in parallel, are each initialized with a different solution, with the best solution chosen afterward. We validate our method on three optimal control benchmark tasks: cart-pole, reacher, and autonomous driving, using different optimizers: DDP, MPPI, and iLQR. We find significant and consistent improvement with our method across all evaluation settings and demonstrate that it efficiently scales with the number of initial solutions required.

1139Learning Disentangled Representations for Fairness with Limited Demographics

[openreview] [pdf]

Abstract Fair representation learning is a promising way to mitigate discrimination in downstream tasks. Many existing fair representation learning methods require access to sensitive information, but the collection of sensitive information is often difficult and even involves privacy issues. Additionally, a model trained to be fair with respect to one sensitive attribute may not ensure fairness for other sensitive groups. Thus, how to flexibly address fairness issues when we have limited access to sensitive information is a challenging problem. In this work, we answer this question: ``given limited sensitive information, can we learn a representation to be fair w.r.t. varying sensitive groups?‘’ To achieve this, we propose a novel two-step framework. We first learn a disentangled representation by employing Non-linear Independent Component Analysis (Nonlinear ICA). Second, we remove sensitive information in the latent space to obtain fair representation. The learned representation can be easily adapted to be fair w.r.t different sensitive groups and to be used for different downstream tasks without re-training. Among the entire process, only a small portion of sensitive information is required in the second step to learn a fair representation. We compare with methods that require different amounts of sensitive information on real-world images and tabular datasets. We empirically demonstrate the utility and flexibility of our approach, and our method is capable of achieving improved fairness results in various tasks.

1140HELMET: How to Evaluate Long-context Models Effectively and Thoroughly

[openreview] [pdf]

Abstract Many benchmarks exist for evaluating long-context language models (LCLMs), but developers often rely on synthetic tasks like needle-in-a-haystack (NIAH) or arbitrarily selected subsets of datasets. It remains unclear whether these evaluations translate to the diverse downstream applications of LCLMs, and the inconsistency further complicates model comparison. We investigate the underlying reasons behind current practices and find that existing benchmarks often provide noisy signals due to low coverage of long-context applications, insufficient dataset lengths, unreliable metrics, and incompatibility with base models. In this work, we present HELMET (How to Evaluate Long-context Models Effectively and Thoroughly), a comprehensive benchmark encompassing seven diverse, application-centric categories. We also address many issues in previous benchmarks by adding controllable lengths up to 128k tokens, model-based evaluation for reliable metrics, and few-shot prompting in all tasks for evaluating base models. Consequently, we demonstrate that HELMET offers more reliable and distinct rankings of frontier LCLMs. Through a comprehensive study of 51 LCLMs, we find that (1) synthetic tasks like NIAH are not good predictors of downstream performance; (2) the diverse categories in HELMET exhibit distinct trends that do not correlate well with each other; and (3) while most LCLMs achieve perfect NIAH scores, open-source models significantly lag behind closed ones when the task requires full-context reasoning or following complex instructions---the gap widens with increased lengths. Finally, we recommend using our RAG tasks for fast model developments, as they are easy to run and more predictive of downstream applications than existing synthetic tasks; but ultimately, we advocate for a holistic evaluation across diverse tasks. We hope HELMET serves as a valuable resource for future long-context model development.

1141Reliability-Aware Preference Learning for LLM Reward Models

[openreview] [pdf]

Abstract Reward functions learned from human feedback serve as the training objective for RLHF, the current state-of-the-art approach for aligning large language models to our values. However, in practice, these reward models fail to robustly capture our desiderata, often attributing more value to features such as output length or agreement with the user and less value to important features like factual correctness. A major reason is that human annotators provide feedback that is an unreliable reflection of their true preferences because of knowledge gaps, limited resources, cognitive biases, or other factors. We focus on making preference learning robust to unreliable feedback by explicitly modeling the knowledge and judgment of annotators. In particular, we estimate reliablity scores for each provided pairwise comparison and incoporate them into the implicit human model used in RLHF, DPO, and other alignment techniques, a technique we call Reliability Aware Preference Learning (RAPL). To test our approach, we introduce the Length Incentivized Evaluations dataset as a setting in which annotators are particularly likely to provide unreliable feedback. Then, we curate the Testing Reasoning and Understanding Errors dataset for training models to predict reliability scores. We find that traditional preference learning on the LIE dataset and other commonly used RLHF datasets leads to models that place far more weight on output length than accuracy. In contrast, RAPL results in models that better capture the true values of annotators.

1142Is Forward Gradient an Effective Tool for Explaining Black-box Models?

[openreview] [pdf]

Abstract Gradients are widely used to explain the decisions of deep neural networks. However, as models become deeper and more complex, computing gradients becomes challenging and sometimes infeasible, hindering traditional explanation methods. Recently, the forward gradient method has garnered attention for training structure-agnostic models with discontinuous objective functions. This method perturbs only the parameters of interest for gradient computation and optimization. Inspired by this, we investigate whether the forward gradient can be employed to explain black-box models. In this work, we use the likelihood ratio method to estimate output-to-input gradients and utilize them for the explanation of model decision. Additionally, we propose block-wise computation techniques to enhance estimation accuracy. Extensive experiments in black-box settings validate the effectiveness of our method, demonstrating accurate gradient estimation and improved explainability under the black-box setting.

1143Global Optimality of In-context Markovian Dynamics Learning

[openreview] [pdf]

Abstract Transformers have demonstrated impressive capability of in-context learning (ICL): given a sequence of input-output pairs of an unseen task, a trained transformer can make reasonable predictions on query inputs, without fine-tuning its parameters.However, existing studies on ICL have mainly focused on linear regression tasks, often with i.i.d. inputs within a prompt. This paper seeks to unveil the mechanism of ICL for next-token prediction for Markov chains, focusing on the transformer architecture with linear self-attention (LSA). More specifically, we derive and interpret the global optimum of the ICL loss landscape: (1) We provide the closed-form expression of the global minimizer for single-layer LSA trained over random instances of length-2 in-context Markov chains, showing the Markovian data distribution necessitates a denser global minimum structure compared to ICL for linear tasks. (2) We establish tight bounds for the global minimum of single-layer LSA trained on arbitrary-length Markov chains. (3) Finally, we prove that multilayer LSA, with parameterization mirroring the global minimizer’s structure, performs preconditioned gradient descent for a multi-objective optimization problem over the in-context samples, balancing a squared loss with multiple linear objectives. We numerically explore ICL for Markov chains using both simplified transformers and GPT-2-based multilayer nonlinear transformers.

1144Online Reinforcement Learning in Non-Stationary Context-Driven Environments

[openreview] [pdf]

Abstract We study online reinforcement learning (RL) in non-stationary environments, where a time-varying exogenous context process affects the environment dynamics. Online RL is challenging in such environments due to catastrophic forgetting'' (CF). The agent tends to forget prior knowledge as it trains on new experiences. Prior approaches to mitigate this issue assume task labels (which are often not available in practice), employ brittle regularization heuristics or use off-policy methods that suffer from instability and poor performance.We present Locally Constrained Policy Optimization (LCPO), an online RL approach that combats CF by anchoring policy outputs on old experiences while optimizing the return on current experiences. To perform this anchoring, LCPO locally constrains policy optimization using samples from experiences that lie outside of the current context distribution. We evaluate LCPO in Mujoco, classic control and computer systems environments with a variety of synthetic and real context traces, and find that it outperforms a variety of baselines in the non-stationary setting, while achieving results on-par with a prescient’’ agent trained offline across all context traces.

1145Interference Among First-Price Pacing Equilibria: A Bias and Variance Analysis

[openreview] [pdf]

Abstract A/B testing is widely used in the internet industry. For online marketplaces (such as advertising markets), standard approaches to A/B testing may lead to biased results when buyers have budget constraints, as budget consumption in one arm of the experiment impacts performance of the other arm. This is often addressed using a budget-split design. Yet such splitting may degrade statistical performance as budgets become too small in each arm. We propose a parallel budget-controlled A/B testing design where we use market segmentation to identify submarkets in the larger market, and we run parallel budget-split experiments in each submarket. We demonstrate the effectiveness of this approach on real experiments on advertising markets at Meta. Then, we formally study interference that derives from such experimental designs, using the first-price pacing equilibrium framework as our model of market equilibration. We propose a debiased surrogate that eliminates the first-order bias of FPPE, and derive a plug-in estimator for the surrogate and establish its asymptotic normality. We then provide an estimation procedure for submarket parallel budget-controlled A/B tests. Finally, we present numerical examples on semi-synthetic data, confirming that the debiasing technique achieves the desired coverage properties.

1146End-to-End Conformal Prediction for Trajectory Optimization

[openreview] [pdf]

Abstract Conformal Prediction (CP) is a powerful tool to construct uncertainty sets with coverage guarantees, which has fueled its extensive adoption in generating prediction regions for decision-making tasks, e.g., Trajectory Optimization (TO) in uncertain environments. However, existing methods predominantly employ a sequential scheme, where decisions rely unidirectionally on the prediction regions, and consequently the information from the decision-making end fails to be transmitted back to instruct the CP end. In this paper, we propose a novel End-to-End CP (E2E-CP) framework for shrinking-horizon TO with a joint risk constraint over the entire mission time. Specifically, a CP-based posterior risk calculation method is developed by fully leveraging the realized trajectories to adjust the posterior allowable risk, which is then allocated to future times to update prediction regions. In this way, the information in the realized trajectories is continuously fed back to the CP end, enabling attractive end-to-end adjustments of the prediction regions and a provable online improvement in trajectory performance. Furthermore, we theoretically prove that such end-to-end adjustments consistently maintain the coverage guarantees of the prediction regions, thereby ensuring provable safety. Additionally, we develop a decision-focused iterative risk allocation algorithm with theoretical convergence analysis for allocating the posterior allowable risk which closely aligns with E2E-CP. The effectiveness and superiority of the proposed method are demonstrated through benchmark experiments.

1147Fewer Questions, Better Answers: Efficient Offline Preference-based Reinforcement Learning via In-Dataset Exploration

[openreview] [pdf]

Abstract Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm’s efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.

1148Revisiting inverse Hessian vector products for calculating influence functions

[openreview] [pdf]

Abstract Influence functions are a popular tool for attributing models’ outputs to training data. The traditional approach relies on the calculation of inverse Hessian-vector products (iHVP), but the classical solver ``Linear time Stochastic Second-order Algorithm’’ (LiSSA, Agarwal et al. (2017)) is often deemed impractical for large models due to expensive computation and hyperparameter tuning. We show that the three hyperparameters --- the scaling factor, the batch size, and the number of steps --- can be chosen depending on two specific spectral properties of the Hessian: its trace and largest eigenvalue. By evaluating them with random sketching (Swartworth and Woodruff, 2023), we find that the batch size has to be sufficiently large for the LiSSA to converge; however, for all of the models we consider, the requirement is mild. We confirm our findings empirically by comparing to the Proximal Bregman Retraining Functions (PBRF, Bae et al. (2022)).

1149STDM: Spatio-Temporal Diffusion Models for Time Series Analysis

[openreview] [pdf]

Abstract Denoising diffusion models have emerged as a formidable method, consistently surpassing previous state-of-the-art benchmarks. However, a notable challenge in time series-related tasks like anomaly detection and forecasting is the conditioning for models to reconstruct inputs accurately or generate samples based on past time steps rather than producing entirely new samples. To address this, we introduce a novel technique that enhances the sampling capabilities of denoising diffusion models for time series analysis, namely Spatio-Temporal Diffusion Models (STDM). While recent methods fall short of mapping contextual neighborhood dependencies directly into the sampling of a noisy sample, we focus on guiding the forward process of the diffusion model. The degeneration of a sample is based on the idea that values of neighboring time steps are highly correlated. We benefit from this assumption by presenting a diffusion step-dependent convolutional kernel to capture spatial relations and a combined, correlated noise to degenerate the input. Our method can be integrated seamlessly into various existing time series diffusion models. We compare the results of anomaly detection and forecasting when using the traditional and our novel forward process. In our experiments on synthetic and real-world datasets, we show that an adaption of the forward process can be beneficial, as our approach outperforms diffusion models with the ordinary forward process in task-specific metrics, underscoring the efficacy of our strategy in enhancing time series analysis through advanced diffusion techniques.

1150Rethinking LLM Unlearning Objectives: A Gradient Perspective and Go Beyond

[openreview] [pdf]

Abstract Large language models (LLMs) should undergo rigorous audits to identify potential risks, such as copyright and privacy infringements. Once these risks emerge, timely updates are crucial to remove undesirable responses, ensuring legal and safe model usage. It has spurred recent research into LLM unlearning, focusing on erasing targeted undesirable knowledge without compromising the integrity of other, non-targeted responses. Existing studies have introduced various unlearning objectives to pursue LLM unlearning without necessitating complete retraining. However, each of these objectives has unique properties, and no unified framework is currently available to comprehend them thoroughly. To fill the gap, we propose the metric of the G-effect, quantifying the impacts of unlearning objectives on model performance from a gradient lens. A significant advantage of our metric is its broad ability to detail the unlearning impacts from various aspects across instances, updating steps, and LLM layers. Accordingly, the G-effect offers new insights into identifying drawbacks of existing unlearning objectives, further motivating us to explore a series of candidate solutions for their mitigation and improvements. Finally, we outline promising directions that merit further studies, aiming at contributing to the community to advance this critical field.

1151Long-context Extrapolation via Periodic Extension

[openreview] [pdf]

Abstract Long-context extrapolation aims to extend the contextual window of large language models to process more contextual information, which is widely adopted in industrial applications. Current mainstream solutions involve increasing the rotation base of RoPE to varying degrees or introducing optimization strategies such as ``low-frequency extrapolation and high-frequency interpolation’‘, in order to enhance the model’s extrapolation capabilities for long context. Actually, these methods alter the representation distribution of positional information by adjusting the rotation frequency of positional encoding, resulting in inevitably disrupt the attention distribution within the original training length range. In this paper, we analyze this phenomenon from a theoretical perspective and propose a long-context extrapolation strategy that preserves the known distribution via periodic extension of high-dimensional positional encoding. Based on this strategy, we design two methods, namely Extra-PE and Extra-MPE, to significantly enhance the models’ long-context extrapolation capabilities without disrupting the positional encoding distribution within the original training length. Through extensive experimental results, it is found that the long-context extrapolation method based on periodic extension can enhance the model’s capability in extrapolating long-contexts. Specifically, a model fine-tuned on 32k tokens can extrapolate beyond 80k tokens, surpassing the performance of the NTK-32k model and approaching that of the YaRN-64k model. Furthermore, this method demonstrates significantly superior performance in extrapolating extremely long-contexts compared to other methods. Notably, a model fine-tuned on 8k tokens still does not exhibit perplexity explosion when extrapolating to 80k tokens. Additionally, during the fine-tuning process, our approach achieves optimal performance using only one-fourth of the fine-tuning steps (100 steps) compared to the YaRNmethod. Secondly, in our comparative experiments, we found that the period in which the model learns a sufficient number of positional encoding has a significant impact on long-context extrapolation capability. Finally, through attention analysis, we discovered that our method can still maintain a stable level of attention at ultra-long distances, with the mean attention value exceeding 0 at these distances.

1152Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback

[openreview] [pdf]

Abstract We consider offline reinforcement learning (RL) with preference feedback in which the implicit reward is a linear function of an unknown parameter. Given an offline dataset, our objective consists in ascertaining the optimal action for each state, with the ultimate goal of minimizing the {\em simple regret}. We propose an algorithm, \underline{RL} with \underline{L}ocally \underline{O}ptimal \underline{W}eights or {\sc RL-LOW}, which yields a simple regret of exp(Ω(n/H))\exp ( - \Omega(n/H) ) where nn is the number of data samples and HH denotes an instance-dependent hardness quantity that depends explicitly on the suboptimality gap of each action. Furthermore, we derive a first-of-its-kind instance-dependent lower bound in offline RL with preference feedback. Interestingly, we observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating order-wise optimality of {\sc RL-LOW}. In view of privacy considerations in practical applications, we also extend {\sc RL-LOW} to the setting of (ε,δ)(\varepsilon,\delta)-differential privacy and show, somewhat surprisingly, that the hardness parameter HH is unchanged in the asymptotic regime as nn tends to infinity; this underscores the inherent efficiency of {\sc RL-LOW} in terms of preserving the privacy of the observed rewards. Given our focus on establishing instance-dependent bounds, our work stands in stark contrast to previous works that focus on establishing worst-case regrets for offline RL with preference feedback.

1153Test-Time Graph Rebirth: Serving GNN Generalization Under Distribution Shifts

[openreview] [pdf]

Abstract Distribution shifts between training and test graphs typically lead to the decreased performance of graph neural networks (GNNs) with suboptimal generalization in real-world applications. Despite advances in graph learning under distribution shifts through designing various model architecture development with customized training strategies, existing solutions can be challenging in practical GNN deployment because they often require significant modifications or retraining of the GNNs. To address such challenges, in this work, we propose a novel method, i.e., Test-Time Graph REBirth, dubbed TT-GREB, to effectively generalize the well-trained GNN models to the test-time graphs under distribution shifts by directly manipulating the test graph data. Concretely, we develop an overall framework designed by two principles, corresponding to two submodules: (1) prototype extractor for re-extracting the environment-invariant features of the test-time graph; and (2) environment refiner for re-fining the environment-varying features to explore the potential shifts. Furthermore, we propose a dual test-time graph contrastive learning objective with an effective iterative optimization strategy to obtain optimal prototype components and environmental components of the test graph. By reassembling these two components, we could obtain a newly reborn test graph, which is better suited for generalization on the well-trained GNN model with shifts in graph distribution. Extensive experiments on real-world graphs under diverse test-time distribution shifts could verify the effectiveness of the proposed method, showcasing its superior ability to manipulate test-time graphs for better GNN generalization ability.

1154Perplexed by Perplexity: Perplexity-Based Data Pruning With Small Reference Models

[openreview] [pdf]

Abstract In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a larger model can yield high-quality data, we investigate whether smaller models can be used for perplexity-based pruning and how pruning is affected by the domain composition of the data being pruned. We demonstrate that for multiple dataset compositions, perplexity-based pruning of pretraining data can significantly improve downstream task performance: pruning based on perplexities computed with a 125 million parameter model improves the average performance on downstream tasks of a 3 billion parameter model by up to 2.04 and achieves up to a 1.45× reduction in pretraining steps to reach commensurate baseline performance. Furthermore, we demonstrate that such perplexity-based data pruning also yields downstream performance gains in the over-trained and data-constrained regimes.

1155FlickerFusion: Intra-trajectory Domain Generalizing Multi-agent Reinforcement Learning

[openreview] [pdf]

Abstract Multi-agent reinforcement learning has demonstrated significant potential in addressing complex cooperative tasks across various real-world applications. However, existing MARL approaches often rely on the restrictive assumption that the number of entities (e.g., agents, obstacles) remains constant between training and inference. This overlooks scenarios where entities are dynamically added or removed during\textit{during} the inference trajectory---a common occurrence in real-world environments like search and rescue missions and dynamic combat situations. In this paper, we tackle the challenge of intra-trajectory dynamic entity composition\textbf{intra-trajectory dynamic entity composition} under zero-shot out-of-domain (OOD) generalization\textbf{out-of-domain (OOD) generalization}, where such dynamic changes cannot be anticipated beforehand. Our empirical studies reveal that existing MARL methods suffer significant\textit{significant} performance degradation and increased uncertainty in these scenarios. In response, we propose FlickerFusion, a novel OOD generalization method that acts as a universally\textit{universally} applicable augmentation technique for MARL backbone methods. Our results show that FlickerFusion not only achieves superior inference rewards but also uniquely\textit{uniquely} reduces uncertainty vis-à-vis the backbone, compared to existing methods. For standardized evaluation, we introduce MPEv2, an enhanced version of Multi Particle Environments (MPE), consisting of 12 benchmarks. Benchmarks, implementations, and trained models are organized and open-sourced at \href\texttt{\href{flickerfusion305.github.io}{flickerfusion305.github.io}}, accompanied by ample demo video renderings.

1156Learning In-Distribution Representations for Anomaly Detection

[openreview] [pdf]

Abstract Anomaly detection involves identifying data patterns that deviate from the anticipated norm. Traditional methods struggle in high-dimensional spaces due to the curse of dimensionality. In recent years, self-supervised learning, particularly through contrastive objectives, has driven advances in anomaly detection by generating compact and discriminative feature spaces. However, vanilla contrastive learning faces challenges like class collision, especially when the In-Distribution (ID) consists primarily of normal, homogeneous data, where the lack of semantic diversity leads to increased overlap between positive and negative pairs. Existing methods attempt to address these issues by introducing hard negatives through synthetic outliers, Outlier Exposure (OE), or supervised objectives, though these approaches can introduce additional challenges. In this work, we propose the Focused In-distribution Representation Modeling (FIRM) loss, a novel multi-positive contrastive objective for anomaly detection. FIRM addresses class-collision by explicitly encouraging ID representations to be compact while promoting separation among synthetic outliers. We show that FIRM surpasses other contrastive methods in standard benchmarks, significantly enhancing anomaly detection compared to both traditional and supervised contrastive learning objectives. Our ablation studies confirm that FIRM consistently improves the quality of representations and shows robustness across a range of scoring methods. It performs particularly well in ensemble settings and benefits substantially from using OE. The code is available at \url{https://anonymous.4open.science/r/firm-8472/}.

1157Transformers Provably Solve Parity Efficiently with Chain of Thought

[openreview] [pdf]

Abstract This work provides the first theoretical analysis of training transformers to solve complex problems by recursively generating intermediate states, analogous to fine-tuning for chain-of-thought (CoT) reasoning. We consider training a one-layer transformer to solve the fundamental kk-parity problem, extending the work on RNNs by \citet{Wies23}. We establish three key results: (1) any finite-precision gradient-based algorithm, without intermediate supervision, requires substantial iterations to solve parity with finite samples. (2) In contrast, when intermediate parities are incorporated into the loss function, our model can learn parity in one gradient update when aided by \emph{teacher forcing}, where ground-truth labels of the reasoning chain are provided at each generation step. (3) Even without teacher forcing, where the model must generate CoT chains end-to-end, parity can be learned efficiently if augmented data is employed to internally verify the soundness of intermediate steps. These results rigorously show that task decomposition and stepwise reasoning naturally arise from optimizing transformers with CoT; moreover, self-consistency checking can improve reasoning ability, aligning with empirical studies of CoT.

1158Rethinking the Stability-Plasticity Trade-off in Continual Learning from an Architectural Perspective

[openreview] [pdf]

Abstract The quest for Continual Learning (CL) seeks to empower neural networks with the ability to learn and adapt incrementally. Central to this pursuit is addressing the stability-plasticity dilemma, which involves striking a balance between two conflicting objectives: preserving previously learned knowledge and acquiring new knowledge. Existing studies have proposed numerous CL methods to achieve this trade-off. However, these methods often overlook the impact of basic architecture on stability and plasticity, thus the trade-off is limited to the parameter level. In this paper, we delve into the conflict between stability and plasticity at the architectural level. We reveal that under an equal parameter constraint, deeper networks exhibit better plasticity, while wider networks are characterized by superior stability. To address this architectural-level dilemma, we introduce a novel framework denoted Dual-Architecture (Dual-Arch), which serves as a plug-in component for CL. This framework leverages the complementary strengths of two distinct and independent networks: one dedicated to plasticity and the other to stability. Each network is designed with a specialized and lightweight architecture, tailored to its respective objective. Extensive experiments across datasets and CL methods demonstrate that Dual-Arch can enhance the performance of existing CL methods while being up to 87% more compact in terms of parameters than the baselines.

1159Generalization of Transformers with In-Context Learning: An Empirical Study

[openreview] [pdf]

Abstract Large language models (LLMs) like GPT-4 and LLaMA-3 utilize the powerful in-context learning (ICL) capability of Transformer architecture to learn on the fly from limited examples. While ICL underpins many LLM applications, its full potential remains hindered by a limited understanding of its generalization boundaries and vulnerabilities. We present a systematic investigation of transformers’ generalization capability with ICL relative to training data coverage by defining a task-centric framework along three dimensions: inter-problem, intra-problem, and intra-task generalization. Through extensive simulation and real-world experiments, encompassing tasks such as function fitting, API calling, and translation, we find that transformers lack inter-problem generalization with ICL, but excel in intra-task and intra-problem generalization. Furthermore, when the training data includes a greater variety of mixed tasks, it significantly enhances the generalization ability of ICL on unseen tasks and even on known simple tasks. This guides us in designing training data to maximize the diversity of tasks covered and to combine different tasks whenever possible, rather than solely focusing on the target task for testing.

1160Promptus: Representing Real-World Video as Stable Diffusion Prompts for Video Streaming

[openreview] [pdf]

Abstract With the exponential growth of video traffic, traditional video streaming systems are approaching their limits in compression efficiency and communication capacity. To further reduce bitrate while maintaining quality, we propose Promptus, a disruptive novel system that streaming prompts instead of video content, which represents real-world video frames with a series of “prompts” for delivery and employs Stable Diffusion to generate videos at the receiver. To ensure that the prompt representation is pixel-aligned with the original video, a gradient descent-based prompt fitting framework is proposed. Further, a low-rank decomposition-based bitrate control algorithm is introduced to achieve adaptive bitrate. For inter-frame compression, a temporal smoothing-based prompt interpolation algorithm is proposed. Evaluations across various video genres demonstrate that, compared to H.265, Promptus can achieve more than a 4x bandwidth reduction while preserving the same perceptual quality. On the other hand, at extremely low bitrates, Promptus can enhance the perceptual quality by 0.139 and 0.118 (in LPIPS) compared to VAE and H.265, respectively, and decreases the ratio of severely distorted frames by 89.3% and 91.7%. Our work opens up a new paradigm for efficient video communication. Promptus will be open-sourced after publication.

1161Watch Less, Do More: Implicit Skill Discovery for Video-Conditioned Policy

[openreview] [pdf]

Abstract In this paper, we study the problem of video-conditioned policy learning. While previous works mostly focus on learning policies that perform a single skill specified by the given video, we take a step further and aim to learn a policy that can perform multiple skills according to the given video, and generalize to unseen videos by recombining these skills. To solve this problem, we propose our algorithm, Watch-Less-Do-More, an information bottleneck-based imitation learning framework for implicit skill discovery and video-conditioned policy learning. In our method, an information bottleneck objective is employed to control the information contained in the video representation, ensuring that it only encodes information relevant to the current skill (Watch-Less). By discovering potential skills from training videos, the learned policy is able to recombine them and generalize to unseen videos to achieve compositional generalization (Do-More). To evaluate our method, we perform extensive experiments in various environments and show that our algorithm substantially outperforms baselines (up to 2x) in terms of compositional generalization ability.

1162Entropic Distribution Matching for Supervised Fine-tuning of LLMs: Less Overfitting and Better Diversity

[openreview] [pdf]

Abstract Large language models rely on Supervised Fine-Tuning (SFT) to specialize in downstream tasks. Cross Entropy (CE) loss is the de facto choice in SFT. However, CE often results in overfitting and limited output diversity due to its aggressive distribution matching strategy, which forces the model’s generative distribution to closely mimic the empirical data distribution. This paper aims to address these issues by introducing the maximum entropy principle, encouraging models to resist overfitting while preserving output diversity. Specifically, we develop a new distribution matching method called GEM, which solves reverse Kullback-Leibler divergence minimization with an entropy regularizer.For the SFT of Llama-3-8B models, GEM outperforms CE in several aspects. First, when applied to acquire general instruction-following abilities, GEM exhibits reduced overfitting, as evidenced by lower perplexity and better performance on the IFEval benchmark. Second, this advantage is also observed in domain-specific fine-tuning, where GEM continues to outperform CE in specialized math reasoning and code generation tasks. Last, we show that GEM-tuned models offer better output diversity, which helps scale up test-time compute: with the same sampling budget, they achieve performance gains of up to 10 points in math reasoning and code generation tasks, compared with CE-tuned models.

1163Towards Efficient and No Forgetting Domain Continual Pretraining by Mitigating the Stability Gap

[openreview] [pdf]

Abstract Adapting Large Language Models (LLMs) to specialized domains like medicine and law through domain continual pre-training has become the cutting-edge method. However, contrary to our expectations of immediate gains, we’ve uncovered a surprising phenomenon: a temporary performance drop at the start of the process, followed by a performance recovery phrase. This drop is not only unexpected but remarkably consistent across different model sizes and domains, such as medical and law. To gain a deeper understanding of this issue, we introduce the concept of stability gap—borrowed from visual models dealing with new class classifications—to explain this initial drop in LLM performance. Based on this concept, we hypothesize that the initial performance drop arises from instability in the model’s general abilities, which we further validated through our experiments. We further reveal that this initial instability is intricately tied to training settings that involve distribution shifts. To address this initial instability and enhance LLM performance within a fixed compute budget, we propose one training strategy that reduces the instability by increasing the epoch number, along with two data sampling strategies focused on data quality and corpus distribution. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical and legal continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to continually pre-train and instruction-tune the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models and performs comparably to or even better than GPT-4 on several medical benchmarks.

1164Advancing Prompt-Based Methods for Replay-Independent General Continual Learning

[openreview] [pdf]

Abstract General continual learning (GCL) is a broad concept to describe real-world continual learning (CL) problems, which are often characterized by online data streams without distinct transitions between tasks, i.e., blurry task boundaries. These requirements result in poor initial performance, limited generalizability, and severe catastrophic forgetting, heavily impacting the effectiveness of mainstream GCL models trained from scratch. While the use of a frozen pretrained backbone with appropriate prompt tuning can partially address these challenges, such prompt-based methods remain sub-optimal for CL of remaining tunable parameters on the fly. In this regard, we propose an innovative approach named MISA(Mask and Initial Session Adaption) to advance prompt-based methods in GCL. It includes a forgetting-aware initial session adaption that employs pretraining data to initialize prompt parameters and improve generalizability, as well as a non-parametric logit mask of the output layers to mitigate catastrophic forgetting. Empirical results demonstrate substantial performance gains of our approach compared to recent competitors, especially without a replay buffer (e.g., up to 18.39%, 22.06%, and 11.96% performance lead on CIFAR-100, Tiny-ImageNet, and ImageNet-R, respectively). Moreover, our approach features the plug-in nature for prompt-based methods, independence of replay, ease of implementation, and avoidance of CL-relevant hyperparameters, serving as a strong baseline for GCL research.

1165PFT: Enhancing Prompt Injection Robustness via Position-Enhanced Finetuning

[openreview] [pdf]

Abstract Large Language Models (LLMs) are widely adopted in closed-domain applications, where differentiating between system instructions and user input is crucial to prevent unintended malicious actions. However, instruction-following LLMs often blindly follow instructions in user inputs, opening up the risk of prompt injection attacks. This paper investigates whether Supervised Fine-Tuning (SFT) can teach LLMs to strictly distinguish system instructions from user input. Our study reveals a key weakness: SFT-tuned models follow system instructions reliably only when the key instruction is placed immediately after the initial tokens. We find that the proximity of the key instruction to the initial tokens significantly influences the model’s ability to execute the intended task, and consequently, its susceptibility to prompt injection attacks.To address this issue, we propose PFT, a novel position-enhanced fine-tuning approach that leverages position IDs to more effectively distinguish between system and user tokens. The experimental results demonstrate that PFT improves the robustness of SFT-tuned models against prompt injection attacks, even when the key instruction is placed arbitrarily in the system prompt, without compromising performance. Our work sheds light on the importance of prompt format in enhancing the security of LLMs and offers a practical solution to improve their robustness.

1166Boosting Multiple Views for pretrained-based Continual Learning

[openreview] [pdf]

Abstract Recent research has shown that Random Projection (RP) can effectively improve the performance of pre-trained models in Continual learning (CL). The authors hypothesized that using RP to map features onto a higher-dimensional space can make them more linearly separable. In this work, we theoretically analyze the role of RP and present its benefits for improving the model’s generalization ability in each task and facilitating CL overall. Additionally, we take this result to the next level by proposing a Multi-View Random Projection scheme for a stronger ensemble classifier. In particular, we train a set of linear experts, among which diversity is encouraged based on the principle of AdaBoost, which was initially very challenging to apply to CL. Moreover, we employ a task-based adaptive backbone with distinct prompts dedicated to each task for better representation learning. To properly select these task-specific components and mitigate potential feature shifts caused by misprediction, we introduce a simple yet effective technique called the self-improvement process. Experimentally, our method consistently outperforms state-of-the-art baselines across a wide range of datasets.

1167Locally Connected Echo State Networks for Time Series Forecasting

[openreview] [pdf]

Abstract Echo State Networks (ESNs) are a class of recurrent neural networks in which only a small readout regression layer is trained, while the weights of the recurrent network, termed the reservoir, are randomly assigned and remain fixed. Our work introduces the Locally Connected ESN (LCESN), a novel ESN variant with a locally connected reservoir, forced memory, and a weight adaptation strategy. LCESN significantly reduces the asymptotic time and space complexities compared to the conventional ESN, enabling substantially larger networks. LCESN also improves the memory properties of ESNs without affecting network stability. We evaluate LCESN’s performance on the NARMA10 benchmark task and compare it to state-of-the-art models on nine real-world datasets. Despite the simplicity of our model and its one-shot training approach, LCESN achieves competitive results, even surpassing several state-of-the-art models. LCESN introduces a fresh approach to real-world time series forecasting and demonstrates that large, well-tuned random networks can rival complex gradient-trained models. Additionally, we provide a GPU-based implementation of LCESN as an open-source library.

[openreview] [pdf]

Abstract Offline Reinforcement Learning (RL) has emerged as a powerful alternative to imitation learning for behavior modeling in various domains, particularly in complex navigation tasks. An existing challenge with Offline RL is the signal-to-noise ratio, i.e. how to mitigate incorrect policy updates due to errors in value estimates. Towards this, multiple works have demonstrated the advantage of hierarchical offline RL methods, which decouples high-level path planning from low-level path following. In this work, we present a novel hierarchical transformer-based approach leveraging a learned quantizer of space. This quantization enables the training of a zone-conditioned low-level policy and simplifies planning, which is reduced to discrete autoregressive prediction. Among other benefits, zone-level reasoning in planning enables explicit trajectory stitching rather than implicit stitching based on noisy value function estimates. By combining this transformer-based planner with recent advancements in offline RL, our approach achieves state-of-the-art results in complex long-distance navigation environments.

1169Personalized Federated Learning via Tailored Lorentz Space

[openreview] [pdf]

Abstract Personalized Federated Learning (PFL) has gained attention for privacy-preserving training on heterogeneous data. However, existing methods fail to capture the unique inherent geometric properties across diverse datasets by assuming a unified Euclidean space for all data distributions. Drawing on hyperbolic geometry’s ability to fit complex data properties, we present FlatLand, a novel personalized federated learning method that embeds different clients’ data in tailored Lorentz space. FlatLand can directly tackle the challenge of heterogeneity through the personalized curvatures of their respective Lorentz model of hyperbolic geometry, which is manifested by the time-like dimension. Leveraging the Lorentz model properties, we further design a parameter decoupling strategy that enables direct server aggregation of common client information, with reduced heterogeneity interference and without the need for client-wise similarity estimation. To the best of our knowledge, this is the first attempt to incorporate Lorentz geometry into personalized federated learning. Empirical results on various federated graph learning tasks demonstrate that FlatLand achieves superior performance, particularly in low-dimensional settings.

1170SymDiff: Equivariant Diffusion via Stochastic Symmetrisation

[openreview] [pdf]

Abstract We propose SYMDIFF, a novel method for constructing equivariant diffusion models using the recently introduced framework of stochastic symmetrisation. SYMDIFF resembles a learned data augmentation that is deployed at sampling time, and is lightweight, computationally efficient, and easy to implement on top of arbitrary off-the-shelf models. Notably, in contrast to previous work, SYMDIFF typically does not require any neural network components that are intrinsically equivariant, avoiding the need for complex parameterizations and the use of higher-order geometric features. Instead, our method can leverage highly scalable modern architectures as drop-in replacements for these more constrained alternatives. We show that this additional flexibility yields significant empirical benefit on E(3)-equivariant molecular generation. To the best of our knowledge, this is the first application of symmetrisation to generative modelling, suggesting its potential in this domain more generally.

1171Competitive Fair Scheduling with Predictions

[openreview] [pdf]

Abstract We consider online non-clairvoyant scheduling to minimize the max-stretch under the learning-augmented framework, where the scheduler has access to job size predictions. We present a family of algorithms:Relaxed-Greedy (RG)with an O(η3P)O(\eta^3 \cdot \sqrt{P}) competitive ratio, where η\eta denotes the prediction error for job sizes and PP is the maximum job size ratio;Adaptive Relaxed-Greedywith an O(λ0.5η2.5P)O(\lambda^{0.5} \cdot \eta^{2.5} \cdot \sqrt{P}) competitive ratio, where λ\lambda denotes the prediction error for the minimum job size;Predictive Relaxed-Greedywith an O(λ0.5φ0.5ηmax{η,φ}P)O(\lambda^{0.5} \cdot \varphi^{0.5} \cdot \eta \cdot \max \{ \eta, \varphi \} \cdot \sqrt{P}) competitive ratio, where φ\varphi denotes the prediction error for the maximum job size. We also presentRGx{RG}^x, an algorithm that represents a trade-off between consistency and smoothness, with an O(η2+2xP1x)O(\eta^{2+2x} \cdot P^{1-x}) competitive ratio. We introduce a general method using resource augmentation to bound robustness, resulting inRR-augmentedRG, which achieves a (1+ϵ)(1 + \epsilon)-speed O(min{η3P,nϵ})O(\min \{ \eta^3 \sqrt{P}, \frac{n}{\epsilon} \}) competitive ratio. Finally, we conduct simulations on synthetic and real-world datasets to evaluate the practical performance of these algorithms.

1172Scalable Exploration via Ensemble++

[openreview] [pdf]

Abstract Scalable exploration is a persistent challenge in sequential decision-making, especially in high-dimensional environments with neural networks. Ensemble sampling, a computationally efficient approximation of Thompson sampling, is widely used but suffers from performance degradation in shared-layer ensemble networks due to ensemble coupling. To overcome this limitation, we propose the Ensemble++ architecture, which introduces decoupled optimization and lifted index sampling for efficient exploration and uncertainty estimation. Empirical results show that Ensemble++ outperforms existing methods in regret minimization while maintaining bounded per-step computation costs across a variety of tasks, including nonlinear bandits and language-based contextual bandits using a GPT backbone. Theoretically, we prove that Ensemble++ achieves the same regret bounds as exact Thompson sampling in linear contextual bandits, with O~(logT)\tilde{O}(\log T) per-step computation complexity. This provides the first rigorous analysis demonstrating ensemble sampling as a scalable and effective approximation to Thompson sampling, closing a key theoretical gap in exploration efficiency.

1173Domain Indexing Collaborative Filtering for Recommender System

[openreview] [pdf]

Abstract In cross-domain recommendation systems, addressing cold-start items remains a significant challenge. Previous methods typically focus on maximizing performance using cross-domain knowledge, often treating the knowledge transfer process as a black box. However, the recent development of domain indexing introduces a new approach to better address such challenges. We have developed an adversarial Bayesian framework, Domain Indexing Collaborative Filtering (DICF), that infers domain indices during cross-domain recommendation. This framework not only significantly improves the recommendation performance but also provides interpretability for cross-domain knowledge transfer. This is verified by our empirical results on both synthetic and real-world datasets.

1174Federated Granger Causality Learning For Interdependent Clients With State Space Representation

[openreview] [pdf]

Abstract Advanced sensors and IoT devices have improved the monitoring and control of complex industrial enterprises. They have also created an interdependent fabric of geographically distributed process operations (clients) across these enterprises. Granger causality is an effective approach to detect and quantify interdependencies by examining how the state of one client affects the states of others over time. Understanding these interdependencies helps capture how localized events, such as faults and disruptions, can propagate throughout the system, potentially leading to widespread operational impacts. However, the large volume and complexity of industrial data present significant challenges in effectively modeling these interdependencies. This paper develops a federated approach to learning Granger causality. We utilize a linear state space system framework that leverages low-dimensional state estimates to analyze interdependencies. This helps address bandwidth limitations and the computational burden commonly associated with centralized data processing. We propose augmenting the client models with the Granger causality information learned by the server through a Machine Learning (ML) function. We examine the co-dependence between the augmented client and server models and reformulate the framework as a standalone ML algorithm providing conditions for its sublinear and linear convergence rates. We also study the convergence of the framework to a centralized oracle model. Using synthetic data, we conduct comprehensive experiments to demonstrate the robustness of our approach to perturbations in causality, the scalability to the size of communication, number of clients, and the dimensions of raw data. We also evaluate the performance on two real-world industrial control system datasets by reporting the volume of data saved by decentralization.

1175Bellman Diffusion: Generative Modeling as Learning a Linear Operator in the Distribution Space

[openreview] [pdf]

Abstract Deep Generative Models (DGMs), including Energy-Based Models (EBMs) and Score-based Generative Models (SGMs), have advanced high-fidelity data generation and complex continuous distribution approximation. However, their application in Markov Decision Processes (MDPs), particularly in distributional Reinforcement Learning (RL), remains underexplored, with conventional histogram-based methods dominating the field. This paper rigorously highlights that this application gap is caused by the nonlinearity of modern DGMs, which conflicts with the linearity required by the Bellman equation in MDPs. For instance, EBMs involve nonlinear operations such as exponentiating energy functions and normalizing constants. To address this, we introduce Bellman Diffusion, a novel DGM framework that maintains linearity in MDPs through gradient and scalar field modeling. With divergence-based training techniques to optimize neural network proxies and a new type of stochastic differential equation (SDE) for sampling, Bellman Diffusion is guaranteed to converge to the target distribution. Our empirical results show that Bellman Diffusion achieves accurate field estimations and is a capable image generator, converging 1.5x faster than the traditional histogram-based baseline in distributional RL tasks. This work enables the effective integration of DGMs into MDP applications, unlocking new avenues for advanced decision-making frameworks.

1176On Bits and Bandits: Quantifying the Regret-Information Trade-off

[openreview] [pdf]

Abstract In many sequential decision problems, an agent performs a repeated task. He then suffers regret and obtains information that he may use in the following rounds. However, sometimes the agent may also obtain information and avoid suffering regret by querying external sources. We study the trade-off between the information an agent accumulates and the regret it suffers. We invoke information-theoretic methods for obtaining regret lower bounds, that also allow us to easily re-derive several known lower bounds. We introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates. We also prove regret upper bounds using the amount of information the agent accumulates. These bounds show that information measured in bits, can be traded off for regret, measured in reward. Finally, we demonstrate the utility of these bounds in improving the performance of a question-answering task with large language models, allowing us to obtain valuable insights.

1177No-regret Learning with Revealed Transitions in Adversarial Markov Decision Processes

[openreview] [pdf]

Abstract When learning in Adversarial Markov Decision Processes (MDPs), agents must deal with a sequence of arbitrarily chosen transition models and losses. In this paper, we consider the setting in which the transition model chosen by the adversary is revealed at the end of each episode. We propose the notion of smoothed MDP whose transition model aggregates with a generic function ftf_t the ones experienced so far. Coherently, we define the concept of smoothed regret, and we devise Smoothed Online Mirror Descent (SOMD), an enhanced version of OMD that leverages a novel regularization term to effectively learn in this setting. For specific choices of the aggregation function ftf_t defining the smoothed MDPs we retrieve, under full-feedback, a regret bound of order O~(L3/2TL+LCfP)\widetilde{\mathcal O}(L^{3/2}\sqrt{TL}+L\overline{C}_f^{\mathsf{P}}) where TT is the number of episodes, LL is the horizon of the episode, and CfP\overline{C}_f^{\mathsf{P}} is a novel index of the degree of maliciousness of the adversarially chosen transitions. Under bandit feedback on the losses, we obtain a bound of order O~(L3/2XAT+LCfP)\widetilde{\mathcal O}(L^{3/2}\sqrt{XAT}+L\overline{C}_f^{\mathsf{P}}) using a simple importance weighted estimator on the losses.

1178Nonmyopic Bayesian Optimization in Dynamic Cost Settings

[openreview] [pdf]

Abstract Bayesian optimization (BO) is a popular framework for optimizing black-box functions, leveraging probabilistic models such as Gaussian processes. However, conventional BO assumes static query costs, which limits its applicability to real-world problems with dynamic cost structures, such as geological surveys or biological sequence design, where query costs vary based on previous actions. To address this, we propose a cost-constrained nonmyopic BO algorithm that incorporates dynamic cost models. Our method employs a neural network policy for variational optimization over multi-step lookahead horizons to plan ahead in dynamic cost environments. Empirically, we benchmark our method on synthetic functions exhibiting a variety of dynamic cost structures. Furthermore, we apply our method to a real-world application in protein sequence design using a large language model-based policy, demonstrating its scalability and effectiveness in handling multi-step planning in a large and complex query space. Our nonmyopic BO algorithm consistently outperforms its myopic counterparts in both synthetic and real-world settings, achieving significant improvements in both efficiency and solution quality.

1179Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

[openreview] [pdf]

Abstract The Value Iteration Network (VIN) is an end-to-end differentiable architecture that performs value iteration on a latent Markov Decision Process (MDP) for planning in reinforcement learning (RL). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a 100×100100\times 100 maze---a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module’s depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, introduce an “adaptive highway loss” that constructs skip connections to improve gradient flow. We evaluate our method on 2D maze navigation environments, the ViZDoom 3D navigation benchmark, and the real-world Lunar rover navigation task. We find that our new method, named \textit{Dynamic Transition VIN (DT-VIN)}, scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in RL environments.

[openreview] [pdf]

Abstract In response to critiques of existing evaluation methods for temporal link prediction (TLP) models, we propose a novel approach to verify if these models truly capture temporal patterns in the data. Our method involves a sanity check formulated as a counterfactual question: “What if a TLP model is tested on a temporally distorted version of the data instead of the real data?” Ideally, a TLP model that effectively learns temporal patterns should perform worse on temporally distorted data compared to real data. We provide an in-depth analysis of this hypothesis and introduce two data distortion techniques to assess well-known TLP models. Our contributions are threefold: (1) We introduce two simple techniques to distort temporal patterns within a graph, generating temporally distorted test splits of well-known datasets for sanity checks. These distortion methods are applicable to any temporal graph dataset. (2) We perform counterfactual analysis on six TLP models JODIE, TGAT, TGN, CAWN, GraphMixer, and DyGFormer to evaluate their capability in capturing temporal patterns across different datasets. (3) We introduce two metrics -- average time difference (ATD) and average count difference (ACD) -- to provide a comprehensive measure of a model’s predictive performance.

1181A robust federated learning client selection with combinatorial data class representations and data augmentation

[openreview] [pdf]

Abstract The federated learning (FL) client selection scheme can effectively mitigate global model performance degradation caused by the random aggregation of clients with heterogeneous data. Simultaneously, research has exposed FL’s susceptibility to backdoor attacks. However herein lies the dilemma, traditional client selection methods and backdoor defenses stand at odds, so their integration is an elusive goal. To resolve this, we introduce Grace, a resilient client selection framework blending combinational class sampling with data augmentation. On the client side, Grace first proposes a local model purification method, fortifying the model’s defenses by bolstering its innate robustness. After, local class representations are extracted for server-side client selection. This approach not only shields benign models from backdoor tampering but also allows the server to glean insights into local class representations without infringing upon the client’s privacy. On the server side, Grace introduces a novel representation combination sampling method. Clients are selected based on the interplay of their class representations, a strategy that simultaneously weeds out malicious actors and draws in clients whose data holds unique value. Our extensive experiments highlight Grace’s capabilities. The results are compelling: Grace enhances defense performance by over 50% compared to state-of-the-art (SOTA) backdoor defenses, and, in the best case, improves accuracy by 3.19% compared to SOTA client selection schemes. Consequently, Grace achieves substantial advancements in both security and accuracy.

1182Reducing Hallucinations in Large Vision-Language Models via Latent Space Steering

[openreview] [pdf]

Abstract Hallucination poses a challenge to the deployment of large vision-language models (LVLMs) in applications. Unlike in large language models (LLMs), hallucination in LVLMs often arises from misalignments between visual inputs and textual outputs. This paper investigates the underlying mechanisms of hallucination, focusing on the unique structure of LVLMs that distinguishes them from large language models (LLMs). We identify that hallucinations often arise from the sensitivity of text decoders to vision inputs, a natural phenomenon when image encoders and text decoders are pre-trained separately. Inspired by this, we introduce Visual and Textual Intervention (VTI), a novel technique designed to reduce hallucinations by steering latent space representations during inference to enhance the stability of vision features. As a task-agnostic test-time intervention, VTI can be easily applied to any problem without additional cost. Extensive experiments demonstrate that it can effectively reduce hallucinations and outperform baseline methods across multiple metrics, highlighting the critical role of vision feature stability in LVLMs.

1183On the Role of Depth and Looping for In-Context Learning with Task Diversity

[openreview] [pdf]

Abstract The intriguing in-context learning (ICL) abilities of \emph{deep Transformer models} have lately garnered significant attention. By studying in-context linear regression on unimodal Gaussian data, recent empirical and theoretical works have argued that ICL emerges from Transformers’ abilities to simulate learning algorithms like gradient descent. However, these works fail to capture the remarkable ability of Transformers to learn \emph{multiple tasks} in context. To this end, we study in-context learning for linear regression with diverse tasks, characterized by data covariance matrices with condition numbers ranging from [1,κ][1, \kappa], and highlight the importance of depth in this setting. More specifically, (1) (1) theoretical lower bounds of log(κ)\log(\kappa) (or κ\sqrt{\kappa}) linear attention layers in the unrestricted (or restricted) attention and (2) we show that the class of {\em multilayer Transformers} can indeed solve such tasks with a number of layers that matches the lower bounds. Furthermore, we show that this expressivity of multilayer Transformer comes at the price of robustness; in particular, multilayer Transformers are not robust to even distributional shifts as small as O(eL)O(e^{-L}) in Wasserstein distance, where LL is the depth of the network. We then demonstrate that Looped Transformers ---a special class of multilayer Transformers with weight-sharing--- not only exhibit similar expressive power but are also provably robust under mild assumptions. Besides out-of-distribution generalization, we also show that Looped transformers are the only models that exhibit a monotonic behavior of loss with respect to depth (or number of loops).

1184Group Diffusion Transformers are Unsupervised Multitask Learners

[openreview] [pdf]

Abstract While large language models (LLMs) have revolutionized natural language processing with their task-agnostic capabilities, visual generation tasks such as image translation, style transfer, and character customization still rely heavily on supervised, task-specific datasets. In this work, we introduce \textbf{Group Diffusion Transformers (GDTs)}, a novel framework that unifies diverse visual generation tasks by redefining them as a \textbf{group generation} problem. In this approach, a set of related images is generated simultaneously, optionally conditioned on a subset of the group. GDTs build upon diffusion transformers with minimal architectural modifications by concatenating self-attention tokens across images. This allows the model to implicitly capture cross-image relationships (\textit{e.g.}, identities, styles, layouts, surroundings, textures, and color schemes) through caption-based correlations. Our design enables scalable, unsupervised, and task-agnostic pretraining using extensive collections of image groups sourced from multimodal internet articles, image galleries, and video frames. We evaluate GDTs on a comprehensive benchmark featuring over 200 instructions across 30 distinct visual generation tasks, including picture book creation, font design, style transfer, sketching, colorization, drawing sequence generation, and character customization. Our models achieve competitive \textbf{zero-shot} performance without any additional fine-tuning or gradient updates. Furthermore, ablation studies confirm the effectiveness of key components such as data scaling, group size, and model design. These results demonstrate the potential of GDTs as scalable, general-purpose visual generation systems. We will release the code and models to support further research.

1185Slot-Guided Adaptation of Pre-trained Diffusion Models for Object-Centric Learning and Compositional Generation

[openreview] [pdf]

Abstract We present SlotAdapt, an object-centric learning method that combines slot attention with pretrained diffusion models by introducing adapters for slot-based conditioning. Our method preserves the generative power of pretrained diffusion models, while avoiding their text-centric conditioning bias. We also incorporate an additional guidance loss into our architecture to align cross-attention from adapter layers with slot attention. This enhances the alignment of our model with the objects in the input image without using external supervision. Experimental results show that our method outperforms state-of-the-art techniques in object discovery and image generation tasks across multiple datasets, including those with real images. Furthermore, we demonstrate through experiments that our method performs remarkably well on complex real-world images for compositional generation, in contrast to other slot-based generative methods in the literature.

1186Data-Centric Human Preference Optimization with Rationales

[openreview] [pdf]

Abstract Reinforcement learning from human feedback plays a crucial role in aligning language models towards human preferences, traditionally represented through comparisons between pairs or sets of responses within a given context. While many studies have enhanced algorithmic techniques to optimize learning from such data, this work shifts focus to improving preference learning through a data-centric approach. Specifically, we propose enriching existing preference datasets with machine-generated rationales that explain the reasons behind choices. We develop a simple and principled framework to augment current preference learning methods with rationale information. Our comprehensive analysis highlights how rationales enhance learning efficiency. Extensive experiments reveal that rationale-enriched preference learning offers multiple advantages: it improves annotation efficiency, accelerates convergence to higher-performing models, and reduces verbosity bias and hallucination. Furthermore, this framework is versatile enough to integrate with various preference optimization algorithms. Overall, our findings highlight the potential of re-imagining data design for preference learning, demonstrating that even freely available machine-generated rationales can significantly boost performance across multiple dimensions.

1187Super Robot View Transformer

[openreview] [pdf]

Abstract Learning a single model for multiple robotic manipulation tasks, particularly high-precision tasks, has been a long-standing challenge in robotics research due to uncertainties inherent in both the model and the data. These uncertainties, namely epistemic uncertainty arising from model limitations and aleatoric uncertainty stemming from data variability, hinder precise control. While the Robot View Transformer (RVT) improves performance by re-rendering point clouds from fixed viewpoints and processing structured 2D virtual images, it still suffers from occlusion artifacts in rendering and limited action precision due to resolution constraints. To address these limitations, we propose the Super Robot View Transformer (S-RVT) framework, which integrates three novel components: the Super Point Renderer (S-PR), the Super-resolution Multi-View Transformer (S-MVT), and the Hierarchical Sampling Policy (HSP). The S-PR enhances the rendering process to mitigate occlusion artifacts, while the S-MVT integrates super-resolution to the output heatmaps, enabling finer-grained manipulation. The HSP efficiently samples multi-view heatmaps in 3D space to obtain accurate 3D poses. These innovations collaboratively mitigate the challenges of occlusion and precision in manipulation tasks. Our experimental results demonstrate that S-RVT achieves a success rate of 87.8 % across 18 manipulation tasks, surpassing the state-of-the-art of 81.4 %. Notably, for high-precision manipulation tasks, S-RVT exhibits nearly a two-fold improvement over existing methods, underscoring its effectiveness in precise control scenarios. Our code and trained models will be released to support further research.

1188DoF: A Diffusion Factorization Framework for Offline Multi-Agent Decision Making

[openreview] [pdf]

Abstract Diffusion models have been widely adopted in image and language generation and are now being applied to decision-making. However, the application of diffusion models in offline cooperative Multi-Agent decision making (MADM) remains limited. Although some researches exist, they suffer from scalability or poor cooperation issues due to the lack of design principles for diffusion-based MADM. The Individual-Global-Max (IGM) principle is a popular design principle for cooperative MADM. Through satisfying such principles, MADM algorithms achieve remarkable performance with good scalability. In this work, we extend the IGM principle as the Individual-Global-identically-Distributed (IGD) principle. This principle stipulates that the generated outcome of a multi-agent diffusion model should be identically distributed as the collective outcomes from multiple individual-agent diffusion models. We propose DoF, a diffusion factorization framework for MADM. It uses noise factorization function to factorize a centralized diffusion model into multiple diffusion models. We theoretically show that the noise factorization functions satisfy the IGD principle. Further, DoF uses data factorization function to model the complex relationship among data generated by multiple diffusion models. Through extensive experiments, we demonstrate the effectiveness of DoF.

1189SigDiffusions: Score-Based Diffusion Models for Time Series via Log-Signature Embeddings

[openreview] [pdf]

Abstract Score-based diffusion models have recently emerged as state-of-the-art generative models for a variety of data modalities. Nonetheless, it remains unclear how to adapt these models to generate long multivariate time series. Viewing a time series as the discretization of an underlying continuous process, we introduce SigDiffusion, a novel diffusion model operating on log-signature embeddings of the data. The forward and backward processes gradually perturb and denoise log-signatures preserving their algebraic structure. To recover a signal from its log-signature, we provide new closed-form inversion formulae expressing the coefficients obtained by expanding the signal in a given basis (e.g. Fourier or orthogonal polynomials) as explicit polynomial functions of the log-signature. Finally, we show that combining \texttt{SigDiffusion} with these inversion formulae results in highly realistic time series generation, competitive with the current state-of-the-art on various datasets of synthetic and real-world examples.

1190Dual-level Bias Mitigation via Fairness-guided Distribution Discrepancy

[openreview] [pdf]

Abstract Modern artificial intelligence predominantly relies on pre-trained models, which are fine-tuned for specific downstream tasks rather than built from scratch. However, a key challenge persists: the fairness of learned representations in pre-trained models is not guaranteed when transferred to new tasks, potentially leading to biased outcomes, even if fairness constraints were applied during the original training. To address this issue, we propose Dual-level Bias Mitigation (DBM), which measures the fairness-guided distribution discrepancy between representations of different demographic groups. By optimizing both the fairness-guided distribution discrepancy and the task-specific objective, DBM ensures fairness at both the representation and task levels. Theoretically, we provide the generalization error bound of the fairness-guided distribution discrepancy to support the efficacy of our approach. Experimental results on multiple benchmark datasets demonstrate that DBM effectively mitigates bias in fine-tuned models on downstream tasks across a range of fairness metrics.

1191Provably safe Reinforcement Learning using Bender’s Decomposition Oracles

[openreview] [pdf]

Abstract One of the core challenges when applying reinforcement learning to solve real world problems is the violation of numerous safety, feasibility or physical constraints during training and deployment. We propose Bender’s Oracle Optimization (BOO) that manages to achieve provable safety during both training and deployment, under the assumption that one has access to a representation of the feasible set, e.g., through a (possibly inaccurate) simulator or encoded rules. This method is particularly useful for cases where a simple (deterministic) model of the problem is available, but said model is too inaccurate or incomplete to solve the problem directly. We showcase our method by applying it to a challenging reward-maximizing stochastic job-shop scheduling problem, where we demonstrate a 17% improvement, and a nonlinear, nonconvex packing problem where we achieve close to globally optimal performance while improving the convergence speed by a factor of 800.

1192Decentralized Training of Transformer Models in Heterogeneous Network

[openreview] [pdf]

Abstract Training large transformer-based models like GPT-4 and Llama3 is prohibitively expensive, often requiring vast resources, such as tens of thousands of GPUs running simultaneously for months. Traditionally, these models are trained in specialized clusters with high-speed, uniform interconnections and computational capabilities, enabling efficient data and pipeline parallelism. However, these clusters are costly, while more affordable GPUs are widely distributed across the globe. Existing approaches, such as Swarm and Dapple, primarily focus on distributed learning across data centers. In this paper, we introduce a novel framework designed to handle heterogeneous devices and unstable communication environments. Our framework employs a hybrid approach, combining parameter server architectures, pipeline parallelism, and task pool strategies to effectively manage device disconnections. Through comprehensive time-cost analysis and graph clustering techniques, we derive a near-optimal resource allocation scheme. We compare our method with existing large-scale training approaches and demonstrate its effectiveness by training a large language model using gaming GPUs in real-world internet conditions.

1193Controlling Language and Diffusion Models by Transporting Activations

[openreview] [pdf]

Abstract The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviours in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behaviour with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.

1194A Large-scale Dataset and Benchmark for Commuting Origin-Destination Flow Generation

[openreview] [pdf]

Abstract Commuting Origin-Destination~(OD) flows are critical inputs for urban planning and transportation, providing crucial information about the population residing in one region and working in another within an interested area. Due to the high cost of data collection, researchers have developed physical and computational models to generate commuting OD flows using readily available urban attributes, such as sociodemographics and points of interest, for cities lacking historical OD flows \textemdash commuting OD flow generation. Existing works developed models based on different techniques and achieved improvement on different datasets with different evaluation metrics, which hinderes establishing a unified standard for comparing model performance. To bridge this gap, we introduce a large-scale dataset containing commuting OD flows for 3,333 areas including a wide range of urban environments around the United States. Based on that, we benchmark widely used models for commuting OD flow generation. We surprisingly find that the network-based generative models achieve the optimal performance in terms of both precision and generalization ability, which may inspire new research directions of graph generative modeling in this field. The dataset and benchmark are available athttps://anonymous.4open.science/r/CommutingODGen-Dataset-0D4C/.

1195Understanding Domain Generalization: A View of Necessity and Sufficiency

[openreview] [pdf]

Abstract Despite the rapid advancements in domain generalization (DG), the majority of DG studies center on establishing theoretical guarantee for generalization under the assumption of sufficient, diverse or even infinite domains. This assumption however is unrealistic, thus there remains no conclusive evidence as to whether the existing DG algorithms can truly generalize in practical settings where domains are limited. This paper aims to elucidate this matter. We first study the conditions for the existence and learnability of an optimal hypothesis. As the sufficient conditions are non-verifiable, our identified two necessary conditions become critical to guaranteeing the chance of finding the global optimal hypothesis in finite domain settings. In light of the theoretical insights, we provide a comprehensive review of DG algorithms explaining to what extent they can generalize effectively. We finally introduce a practical approach that leverages the joint effect of the two sets of conditions to boost generalization. Our proposed method demonstrates superior performance on well-established DG benchmarks.

1196Structured Joint Aleatoric and Epistemic Uncertainty for High Dimensional Output Spaces

[openreview] [pdf]

Abstract Uncertainty estimation plays a vital role in enhancing the reliability of deep learning model predictions, especially in scenarios with high-dimensional output spaces. This paper addresses the dual nature of uncertainty — aleatoric and epistemic — focusing on their joint integration in high-dimensional regression tasks. We introduce an approach to approximate joint uncertainty using a low-rank plus diagonal covariance matrix, which preserves essential output correlations while mitigating the computational complexity associated with full covariance matrices. Specifically, our method reduces memory usage and enhances sampling efficiency and log-likelihood calculations. Simultaneously, our representation matches the true posterior better than factorized joint distributions, offering a clear advancement in reliability and explainability for deep learning model predictions. Furthermore, we empirically show that our method can efficiently enhance out of distribution detection in specific applications.

1197KV Prediction for Improved Time to First Token

[openreview] [pdf]

Abstract Inference with transformer-based language models begins with a prompt processing step. In this step, the model generates the first output token and stores the KV cache needed for future generation steps. This prompt processing step can be computationally expensive, taking 10s of seconds or more for billion-parameter models on edge devices when prompt lengths or batch sizes rise. This degrades user experience by introducing significant latency into the model’s outputs. To reduce the time spent producing the first output (known as the ``time to first token’', or TTFT of a pretrained model, we introduce a novel method called KV Prediction. In our method, a small auxiliary model is used to process the prompt and produce an approximation of the KV cache used by a base model. This approximated KV cache is then used with the base model for autoregressive generation without the need to query the auxiliary model again. We demonstrate that our method produces a pareto-optimal efficiency-accuracy trade-off when compared to baselines. On TriviaQA, we demonstrate relative accuracy improvements in the range of 15%-50% across a range of TTFT FLOPs budgets. We also demonstrate accuracy improvements of up to 30% on HumanEval python code completion at fixed TTFT FLOPs budgets. Additionally, we benchmark models on an Apple M2 Pro CPU and demonstrate that our improvement in FLOPs translates to a TTFT speedup on hardware. We will release our code for reproducibility.

1198Self-Improving Robust Preference Optimization

[openreview] [pdf]

Abstract Both online and offline RLHF methods such as PPO and DPO have been extremely successful in aligning AI with human preferences. Despite their success, the existing methods suffer from some fundamental limitations: prominent among those limitations are (a) models trained with RLHF can learn from mistakes or negative examples through RL mechanism or contrastive loss at the time of training. However at the time of inference they are not equipped with an innate mechanism to correct mistakes by self-improvement. (b) The optimal solution of existing methods is highly task-dependent and thus it is difficult for them to generalize to new tasks. Here we propose Self-Improving Robust Preference Optimization (SRPO), a practical and mathematically principled offline RLHF framework that address both these challenges. The key idea of SRPO is to cast the problem of learning from human preferences as a self-improvement process, which can be mathematically expressed in terms of a min-max objective that aims at joint optimization of self-improvement policy and the generative policy in an adversarial fashion. The solution for this optimization problem is independent of the training task and thus it is robust to its changes. We then show that this objective can be re-expressed in the form of a non-adversarial offline loss which can be optimized using standard supervised optimization techniques at scale without any need for reward model and online inference. We show the effectiveness of SRPO in terms of AI Win-Rate (WR) against human (GOLD) completions. In particular, when \srpo is evaluated on the XSUM dataset, it outperforms the celebrated DPO by a clear margin of \mathbf{15%} after 5 self-revisions, achieving WR of 90\mathbf{90}%.

1199Directional Gradient Projection for Robust Fine-tuning of Foundation Models

[openreview] [pdf]

Abstract Robust fine-tuning aims to adapt large foundation models to downstream tasks while preserving their robustness to distribution shifts. Existing methods primarily focus on constraining and projecting current model towards the pre-trained initialization based on the magnitudes between fine-tuned and pre-trained weights, which often require extensive hyper-parameter tuning and can sometimes result in underfitting. In this work, we propose Di\textbf{Di}rectional Gra\textbf{Gra}dient P\textbf{P}rojection (DiGraP), a novel layer-wise trainable method that incorporates directional information from gradients to bridge regularization and multi-objective optimization. Besides demonstrating our method on image classification, as another contribution we generalize this area to the multi-modal evaluation settings for robust fine-tuning. Specifically, we first bridge the uni-modal and multi-modal gap by performing analysis on Image Classification reformulated Visual Question Answering (VQA) benchmarks and further categorize ten out-of-distribution (OOD) VQA datasets by distribution shift types and degree (i.e. near versus far OOD). Experimental results show that DiGraP consistently outperforms existing baselines across Image Classfication and VQA tasks with discriminative and generative backbones, improving both in-distribution (ID) generalization and OOD robustness.

1200SeaDAG: Semi-autoregressive Diffusion for Conditional Directed Acyclic Graph Generation

[openreview] [pdf]

Abstract We introduce SeaDAG, a semi-autoregressive diffusion model for conditional generation of Directed Acyclic Graphs~(DAGs). Considering their inherent layer-wise structure, we simulate layer-wise autoregressive generation by designing different denoising speed for different layers. Unlike conventional autoregressive generation that lacks a global graph structure view, our method maintains a complete graph structure at each diffusion step, enabling operations such as property control that require the full graph structure. Leveraging this capability, we evaluate the DAG properties during training by employing a graph property decoder. We explicitly train the model to learn graph conditioning with a condition loss, which enhances the diffusion model’s capacity to generate graphs that are both realistic and aligned with specified properties. We evaluate our method on two representative conditional DAG generation tasks: (1) circuit generation from truth tables, where precise DAG structures are crucial for realizing circuit functionality, and (2) molecule generation based on quantum properties. Our approach demonstrates promising results, generating high-quality and realistic DAGs that closely align with given conditions.

1201VVC-Gym: A Fixed-Wing UAV Reinforcement Learning Environment for Multi-Goal Long-Horizon Problems

[openreview] [pdf]

Abstract Multi-goal long-horizon problems are prevalent in real-world applications. The additional goal space introduced by multi-goal problems intensifies the spatial complexity of exploration; meanwhile, the long interaction sequences in long-horizon problems exacerbate the temporal complexity of exploration. Addressing the great exploration challenge posed by multi-goal long-horizon problems depends not only on the design of algorithms but also on the design of environments and the availability of demonstrations to assist in training. To facilitate the above research, we propose a multi-goal long-horizon Reinforcement Learning (RL) environment based on realistic fixed-wing UAV’s velocity vector control, named VVC-Gym, and generate multiple demonstration sets of various quality. Through experimentation, we analyze the impact of different environment designs on training, assess the quantity and quality of demonstrations and their influence on training, and assess the effectiveness of various RL algorithms, providing baselines on VVC-Gym and its corresponding demonstrations. The results suggest that VVC-Gym is suitable for studying: (1) the influence of environment designs on addressing multi-goal long-horizon problems with RL. (2) the assistance that demonstrations can provide in overcoming the exploration challenges of multi-goal long-horizon problems. (3) the RL algorithm designs with the least possible impact from environment designs on the efficiency and effectiveness of training.

1202Entropy-driven Data Knowledge Distillation in Digraph Representation Learning

[openreview] [pdf]

Abstract The directed graph (digraph), as a generalization of undirected graphs, exhibits superior representation capability in modeling complex topology systems and has garnered considerable attention in recent years. Despite the notable efforts made by existing DiGraph Neural Networks (DiGNNs) to leverage directed edges, they still fail to comprehensively delve into the abundant data knowledge concealed in the digraphs. This limitation results in sub-optimal performance and underscores the necessity of further exploring the potential correlations between the directed topology and node profiles from a data-centric perspective, thereby empowering model-centric neural networks with stronger encoding capabilities. In this paper, we propose \textbf{E}ntropy-driven \textbf{D}igraph knowl\textbf{E}dge distillatio\textbf{N} (EDEN), which can serve as a new data-centric digraph learning paradigm or a model-agnostic hot-and-plug data online knowledge distillation module for most existing DiGNNs to fully leverage informative digraphs. Specifically, EDEN first utilizes directed structural measurements from a topological perspective to construct a knowledge tree, guided by the hierarchical encoding theory. Subsequently, EDEN quantifies the mutual information of nodes from a feature perspective to further refine the knowledge flow, facilitating tree layer-wise knowledge distillation. As a general framework, EDEN also can naturally extend to undirected scenarios and demonstrate satisfactory performance. In our experiments, EDEN has been widely evaluated on 14 (di)graph datasets and across 4 downstream tasks. The results demonstrate that EDEN attains SOTA performance and exhibits strong improvement for prevalent (Di)GNNs.

1203Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits

[openreview] [pdf]

Abstract We study the optimal batch-regret tradeoff for batch linear contextual bandits. For this problem, we design batch learning algorithms and prove that they achieve the optimal regret bounds (up to logarithmic factors) for any batch number MM, number of actions KK, time horizon TT, and dimension dd. Therefore, we establish the \emph{full-parameter-range} (almost) optimal batch-regret tradeoff for the batch linear contextual bandit problem.Along our analysis, we also prove a new matrix concentration inequality with dependence on their dynamic upper bounds, which, to the best of our knowledge, is the first of its kind in literature and maybe of independent interest.

1204Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement

[openreview] [pdf]

Abstract What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples.Our analysis reveals that \emph{stochastic policies can be substantially better} than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples.Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.

1205Accelerating Goal-Conditioned Reinforcement Learning Algorithms and Research

[openreview] [pdf]

Abstract Self-supervision has the potential to transform reinforcement learning (RL), paralleling the breakthroughs it has enabled in other areas of machine learning. While self-supervised learning in other domains aims to find patterns in a fixed dataset, self-supervised goal-conditioned reinforcement learning (GCRL) agents discovernewbehaviors by learning from the goals achieved during unstructured interaction with the environment. However, these methods have failed to see similar success, both due to a lack of data from slow environment simulations as well as a lack of stable algorithms. We take a step toward addressing both of these issues by releasing a high-performance codebase and benchmark (JaxGCRL) for self-supervised GCRL, enabling researchers to train agents for millions of environment steps in minutes on a single GPU. By utilizing GPU-accelerated replay buffers, environments, and a stable contrastive RL algorithm, we reduce training time by up to 22×22\times. Additionally, we assess key design choices in contrastive RL, identifying those that most effectively stabilize and enhance training performance. With this approach, we provide a foundation for future research in self-supervised GCRL, enabling researchers to quickly iterate on new ideas and evaluate them in diverse and challenging environments. Code:https://anonymous.4open.science/r/JaxGCRL-2316/README.md

1206Relation-Aware Diffusion for Heterogeneous Graphs with Partially Observed Features

[openreview] [pdf]

Abstract Diffusion-based imputation methods, which impute missing features through the iterative propagation of observed features, have shown impressive performance in homogeneous graphs. However, these methods are not directly applicable to heterogeneous graphs, which have multiple types of nodes and edges, due to two key issues: (1) the presence of nodes with undefined features hinders diffusion-based imputation; (2) treating various edge types equally during diffusion does not fully utilize information contained in heterogeneous graphs. To address these challenges, this paper presents a novel imputation scheme that enables diffusion-based imputation in heterogeneous graphs. Our key idea involves (1) assigning a {\it virtual feature} to an undefined node feature and (2) determining the importance of each edge type during diffusion according to a new criterion. Through experiments, we demonstrate that our virtual feature scheme effectively serves as a bridge between existing diffusion-based methods and heterogeneous graphs, maintaining the advantages of these methods. Furthermore, we confirm that adjusting the importance of each edge type leads to significant performance gains on heterogeneous graphs. Extensive experimental results demonstrate the superiority of our scheme in both semi-supervised node classification and link prediction tasks on heterogeneous graphs with missing rates ranging from low to exceedingly high.

1207Bilevel Reinforcement Learning for Stock Data with A Conservative TD Ensemble

[openreview] [pdf]

Abstract Reinforcement learning (RL) has shown significant promise in stock trading. A typical solution involves optimizing cumulative returns using historical offline data. However, it may produce less generalizable policies that merely “memorize” optimal buying and selling actions from the offline data while neglecting the non-stationary nature of the financial market. We frame stock trading as a specific type of offline RL problem. Our method, MetaTrader, presents two key contributions. First, it introduces a novel bilevel actor-critic method that spans both the original stock data and its transformations. The fundamental idea is that an effective policy should be generalizable across out-of-distribution data. Second, we propose a novel variant of conservative TD learning, utilizing an ensemble-based TD target to mitigate value overestimation, particularly in scenarios with limited offline data. Our empirical findings across two publicly available datasets demonstrate the superior performance of MetaTrader over existing methods, including both RL-based approaches and stock prediction models.

1208Robust Video Moment Retrieval with Introspective Knowledge Distillation

[openreview] [pdf]

Abstract With the huge requirement of video content understanding and editing, Video moment retrieval (VMR) is becoming more and more critical, necessitating models that are adept at correlating video contents with textual queries. The effectiveness of prevailing VMR models, however, is often compromised by their reliance on training data biases, which significantly hampers their generalization capabilities when faced with out-of-distribution (OOD) content. This challenge underscores the need for innovative approaches that can adeptly navigate the intricate balance between leveraging in-distribution (ID) data for learning and maintaining robustness against OOD variations. Addressing this critical need, we introduce Reflective Knowledge Distillation (RefKD), a novel and comprehensive training methodology that integrates the dual processes of Introspective Learning and Extrospective Adjustment. This methodology is designed to refine the model’s ability to internalize and apply learned correlations in a manner that is both contextually relevant and resilient to bias-induced distortions. By employing a dual-teacher framework, RefKD encapsulates and contrasts the distinct bias perspectives prevalent in VMR datasets, facilitating a dynamic and reflective learning dialogue with the student model. This interaction is meticulously structured to encourage the student model to engage in a deeper introspection of learned biases and to adaptively recalibrate its learning focus in response to evolving content landscapes. Through this reflective learning process, the model develops a more nuanced and comprehensive understanding of content-query correlations, significantly enhancing its performance across both ID and OOD scenarios. Our extensive evaluations, conducted across several standard VMR benchmarks, demonstrate the unparalleled efficacy of RefKD. The methodology not only aligns with the OOD performance benchmarks set by existing debiasing methods but also, in many instances, significantly surpasses their ID performance metrics. By effectively bridging the gap between ID and OOD learning, RefKD sets a new standard for building VMR systems that are not only more adept at understanding and interpreting video content in a variety of contexts but also more equitable and reliable across diverse operational scenarios. This work not only contributes to the advancement of VMR technology but also paves the way for future research in the domain of bias-aware and robust multimedia content analysis.

1209Multi-Task Dense Predictions via Unleashing the Power of Diffusion

[openreview] [pdf]

Abstract Diffusion models have exhibited extraordinary performance in dense prediction tasks. However, there are few works exploring the diffusion pipeline for multi-task dense predictions. In this paper, we unlock the potential of diffusion models in solving multi-task dense predictions and propose a novel diffusion-based method, called TaskDiffusion, which leverages the conditional diffusion process in the decoder. Instead of denoising the noisy labels for different tasks separately, we propose a novel joint denoising diffusion process to capture the task relations during denoising. To be specific, our method first encodes the task-specific labels into a task-integration feature space to unify the encoding strategy. This allows us to get rid of the cumbersome task-specific encoding process. In addition, we also propose a cross-task diffusion decoder conditioned on task-specific multi-level features, which can model the interactions among different tasks and levels explicitly while preserving efficiency. Experiments show that our TaskDiffusion outperforms previous state-of-the-art methods for all dense prediction tasks on the widely-used PASCAL-Context and NYUD-v2 datasets. Our code will be made publicly available.

1210Informed Exploration via Generative Modeling

[openreview] [pdf]

Abstract Conventionally trained neural networks excel at prediction but often struggle to model uncertainty in their own predictions. We explore this challenge in a meta-learning bandit decision-making problem for news recommendations; this setting require decision-making algorithms to incorporate pretrained language models to process text data for the best performance. We present a scalable approach to Bayesian uncertainty quantification by posing it as a problem of autoregressive generative modeling of future rewards. First, we use historical data on previously released news articles to pre-train a generative model to predict sequences of future potential rewards. At inference time, our algorithm makes decisions based on limited previous rewards and autoregressively generated future rewards. Far from a heuristic, we synthesize insights from the literature to show our method is a novel implementation of Thompson (posterior) sampling, a prominent bandit algorithm. We prove our pretraining loss directly controls online decision-making performance, and we demonstrate our framework on a news recommendation task where we integrate end-to-end fine-tuning of a pretrained language model to process news article headline text to improve performance.

1211FairMT-Bench: Benchmarking Fairness for Multi-turn Dialogue in Conversational LLMs

[openreview] [pdf]

Abstract The growing use of large language model (LLM)-based chatbots has raised concerns about fairness. Fairness issues in LLMs can lead to severe consequences, such as bias amplification, discrimination, and harm to marginalized communities. While existing fairness benchmarks mainly focus on single-turn dialogues, multi-turn scenarios, which in fact better reflect real-world conversations, present greater challenges due to conversational complexity and potential bias accumulation. In this paper, we propose a comprehensive fairness benchmark for LLMs in multi-turn dialogue scenarios, FairMT-Bench. Specifically, we formulate a task taxonomy targeting LLM fairness capabilities across three stages: context understanding, user interaction, and instruction trade-offs, with each stage comprising two tasks. To ensure coverage of diverse bias types and attributes, we draw from existing fairness datasets and employ our template to construct a multi-turn dialogue dataset, FairMT 10K. For evaluation, GPT-4 is applied, alongside bias classifiers including Llama-Guard-3 and human validation to ensure robustness. Experiments and analyses on FairMT 10K reveal that in multi-turn dialogue scenarios, current LLMs are more likely to generate biased responses, and there is significant variation in performance across different tasks and models. Based on this, we curate a challenging dataset, FairMT 1K, and test 15 current state-of-the-art (SOTA) LLMs on this dataset. The results show the current state of fairness in LLMs and showcase the utility of this novel approach for assessing fairness in more realistic multi-turn dialogue contexts, calling for future work to focus on LLM fairness improvement and the adoption of FairMT 1K in such efforts.

1212Larger Language Models Provably Generalize Better

[openreview] [pdf]

Abstract Why do larger language models generalize better? To explore this question, we develop generalization bounds on the pretraining objective of large language models (LLMs) in the compute-optimal regime, as described by the Chinchilla scaling laws. We introduce a novel, fully empirical Freedman-type martingale concentration inequality that tightens existing bounds by accounting for the variance of the loss function. The generalization bound can be broken into three contributions: the number of parameters per token, the loss variance, and the quantization error at a fixed bitrate. As language models are scaled up, the number of parameters per data point stays constant; however, both the loss variance and the quantization error decrease, implying that larger models should have \emph{smaller} generalization gaps. We examine why larger models tend to be more quantizable from an information theoretic perspective, showing that the rate at which they can integrate new information grows slower than their capacity on the compute optimal frontier. From these findings we produce a scaling law for the generalization gap, showing that our bounds decrease in a predictable way.

1213Knowledge Retention in Continual Model-Based Reinforcement Learning

[openreview] [pdf]

Abstract We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal\textit{Synthetic Experience Rehearsal}, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration\textit{Regaining Memories Through Exploration}, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.

1214Residual-MPPI: Online Policy Customization for Continuous Control

[openreview] [pdf]

Abstract Policies learned through Reinforcement Learning (RL) and Imitation Learning (IL) have demonstrated significant potential in achieving advanced performance in continuous control tasks. However, in real-world environments, it is often necessary to further customize a trained policy when there are additional requirements that were unforeseen during the original training phase. It is possible to fine-tune the policy to meet the new requirements, but this often requires collecting new data with the added requirements and access to the original training metric and policy parameters. In contrast, an online planning algorithm, if capable of meeting the additional requirements, can eliminate the necessity for extensive training phases and customize the policy without knowledge of the original training scheme or task. In this work, we propose a generic online planning algorithm for customizing continuous-control policies at the execution time which we call Residual-MPPI. It is able to customize a given prior policy on new performance metrics in few-shot and even zero-shot online settings. Also, Residual-MPPI only requires access to the action distribution produced by the prior policy, without additional knowledge regarding the original task. Through our experiments, we demonstrate that the proposed Residual-MPPI algorithm can accomplish the few-shot/zero-shot online policy customization task effectively, including customizing the champion-level racing agent, Gran Turismo Sophy (GT Sophy) 1.0, in the challenging car racing scenario, Gran Turismo Sport (GTS) environment. Code for MuJoCo experiments is included in the supplmentary and will be open-sourced upon acceptance. Demo videos are available on our website:https://sites.google.com/view/residual-mppi

1215Disentangling Latent Shifts of In-Context Learning Through Self-Training

[openreview] [pdf]

Abstract In-context learning (ICL) has become essential in natural language processing, particularly with autoregressive large language models capable of learning from demonstrations provided within the prompt. However, ICL faces challenges with stability and long contexts, especially as the number of demonstrations grows, leading to poor generalization and inefficient inference. To address these issues, we introduce STICL (Self-Training ICL), an approach that disentangles the latent shifts of demonstrations from the latent shift of the query through self-training. STICL employs a teacher model to generate pseudo-labels and trains a student model using these labels, encoded in an adapter module. The student model exhibits weak-to-strong generalization, progressively refining its predictions over time. Our empirical results show that STICL improves generalization and stability, consistently outperforming traditional ICL methods and other disentangling strategies across both in-domain and out-of-domain data.

1216Variable Forward Regularization to Replace Ridge in Online Linear Regression

[openreview] [pdf]

Abstract Forward regularization (-F) with unsupervised knowledge was proposed to replace canonical Ridge regularization (-R) in online linear learners, which achieves lower relative regret bounds. However, we observe that -F cannot perform as expected in practice, even possibly losing to -R for online learning tasks. We identify two main causes for this: (1) inappropriate intervention penalty; (2) potential non-i.i.d nature in online learning, both of which result in unstable posterior distribution and optima offset of the learner. To improve these, we propose Variable Forward regularization (-kkF), a more general style with -F intensity modulated by a variable kk. We further derive -kkF algorithm to online learning tasks, which shows holistic recursive closed-form updates and superior performance compared to both -R and -F. Moreover, we theoretically establish the relative regrets of -kkF in online learning, showing that it has a tighter upper bound than -F in adversarial settings. We also introduce an adaptive -kkF, termed -kkF-Bayes, to curb unstable penalties caused by non-i.i.d and mitigate intractable tuning of hard kk based on Bayesian learning for online learning. In experiments, we adapted -kkF and -kkF-Bayes into class incremental scenario, where it realized less forgetting and non-replay. Results distinctly demonstrate the efficacy of using -kkF and -kkF-Bayes.

1217NEXTLOCLLM: NEXT LOCATION PREDICTION USING LLMS

[openreview] [pdf]

Abstract Next location prediction is a critical task in human mobility analysis and serves as a foundation for various downstream applications. Existing methods typically rely on discrete IDs to represent locations, which inherently overlook spatial relationships and cannot generalize across cities. In this paper, we propose NextLocLLM, which leverages the advantages of large language models (LLMs) in processing natural language descriptions and their strong generalization capabilities for next location prediction. Specifically, instead of using IDs, NextLocLLM encodes locations based on continuous spatial coordinates to better model spatial relationships. These coordinates are further normalized to enable robust cross-city generalization. Another highlight of NextlocLLM is its LLM-enhanced POI embeddings. It utilizes LLMs’ ability to encode each POI category’s natural language description into embeddings. These embeddings are then integrated via nonlinear projections to form this LLM-enhanced POI embeddings, effectively capturing locations’ functional attributes. Furthermore, task and data prompt prefix, together with trajectory embeddings, are incorporated as input for partly-frozen LLM backbone. NextLocLLM further introduces prediction retrieval module to ensure structural consistency in prediction. Experiments show that NextLocLLM outperforms existing models in next location prediction, excelling in both supervised and zero-shot settings.

1218Hallucinating LLM Could Be Creative

[openreview] [pdf]

Abstract Large Language Models (LLMs), such as GPT-4o, frequently produce hallucinations—factually incorrect or nonsensical outputs generally regarded as undesirable. This study, however, explores the notion of “good” hallucinations that may contribute to creativity and innovation. We propose metrics to assess hallucination quality, focusing on correctness, consistency, and reasoning diversity, which are evaluated using sample responses and semantic clustering. Our experiments explore different prompting techniques and hyperparameter configurations to provide comprehensive results based on these metrics. Furthermore, we investigate the distinction between process and outcome supervision, using multiple reasoning paths to enhance both creativity and accuracy. Preliminary results indicate that LLMs can generate creative hallucinations with minimal factual inaccuracies. This research provides a refined perspective on hallucinations in LLMs and suggests strategies to harness their creative potential, improving the reliability and flexibility of AI systems.

1219Injective flows for star-like manifolds

[openreview] [pdf]

Abstract Normalizing Flows (NFs) are powerful and efficient models for density estimation. When modeling densities on manifolds, NFs can be generalized to injective flows but the Jacobian determinant becomes computationally prohibitive. Current approaches either consider bounds on the log-likelihood or rely on some approximations of the Jacobian determinant. In contrast, we propose injective flows for star-like manifolds and show that for such manifolds we can compute the Jacobian determinant exactly and efficiently, with the same cost as NFs. This aspect is particularly relevant for variational inference settings, where no samples are available and only some unnormalized target is known. Among many, we showcase the relevance of modeling densities on star-like manifolds in two settings. Firstly, we introduce a novel Objective Bayesian approach for penalized likelihood models by interpreting level-sets of the penalty as star-like manifolds. Secondly, we consider probabilistic mixing models and introduce a general method for variational inference by defining the posterior of mixture weights on the probability simplex.

1220Student-Informed Teacher Training

[openreview] [pdf]

Abstract Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limited observations, e.g., in a robot navigation task, the teacher might have access to distances to nearby obstacles, while the student only receives visual observations of the scene. However, privileged imitation learning faces a key challenge: the student might be unable to imitate the teacher’s behavior due to partial observability. This problem arises because the teacher is trained without considering if the student is capable of imitating the learned behavior. To address this teacher-student asymmetry, we propose a framework for joint training of the teacher and student policies, encouraging the teacher to learn behaviors that can be imitated by the student despite the latters’ limited access to information and its partial observability. Based on the performance bound in imitation learning, we add (i) the approximated action difference between teacher and student as a penalty term to the reward function of the teacher, and (ii) a supervised teacher-student alignment step. We motivate our method with a maze navigation task and demonstrate its effectiveness on complex vision-based quadrotor flight and manipulation tasks.

1221Multi-Granularity Semantic Revision for Large Language Model Distillation

[openreview] [pdf]

Abstract Knowledge distillation plays a key role in compressing the Large Language Models (LLMs), which boosts a small-size student model under large teacher models’ guidance. However, existing LLM distillation methods overly rely on student-generated outputs, which may introduce generation errors and misguide the distillation process. Moreover, the distillation loss functions introduced in previous works struggle to align the most informative part due to the complex distribution of LLMs’ outputs. To address these problems, we propose a multi-granularity semantic revision method for LLM distillation. At the sequence level, we propose a sequence correction and re-generation (SCRG) strategy. SCRG first calculates the semantic cognitive difference between the teacher and student to detect the error token, then corrects it with the teacher-generated one, and re-generates the sequence to reduce generation errors and enhance generation diversity. At the token level, we design a distribution adaptive clipping Kullback-Leibler (DAC-KL) loss as the distillation objective function. DAC-KL loss exploits a learnable sub-network to adaptively extract semantically dense areas from the teacher’s output, avoiding the interference of redundant information in the distillation process. Finally, at the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student’s probability correlations to be consistent, further enhancing the transfer of semantic information. Extensive experiments across different model families with parameters ranging from 0.1B to 13B demonstrate the superiority of our method compared to existing methods.

1222Herald: A Natural Language Annotated Lean 4 Dataset

[openreview] [pdf]

Abstract Verifiable formal languages like Lean have profoundly impacted mathematical reasoning, particularly through the use of large language models (LLMs) for automated reasoning. A significant challenge in training LLMs for these formal languages is the lack of parallel datasets that align natural language with formal language proofs. To address this challenge, this paper introduces a novel framework for translating the Mathlib4 corpus (a unified library of mathematics in formal language Lean 4) into natural language. Building upon this, we employ a dual augmentation strategy that combines tactic-based and informal-based approaches, leveraging the Lean-jixia system, a Lean 4 analyzer. We present the results of this pipeline on Mathlib4 as Herald (Hierarchy and Retrieval-based Translated Lean Dataset). We also propose the Herald Translator, which is fine-tuned on Herald. Herald translator achieves a 93.2% accuracy (Pass@128) on formalizing statements in the miniF2F-test and a 22.5% accuracy on our internal graduate-level textbook dataset, outperforming InternLM2-Math-Plus-7B (74.0% and 7.5%) and TheoremLlama (50.1% and 4.0%). Furthermore, we propose a section-level translation framework for real-world applications. As a direct application of Herald translator, we have successfully translated a template section in the Stack project, marking a notable progress in the automatic formalization of graduate-level mathematical literature. Our model, along with the datasets, will be open-sourced to the public soon.

1223Unleashing the Power of Task-Specific Directions in Parameter Efficient Fine-tuning

[openreview] [pdf]

Abstract Large language models demonstrate impressive performance on downstream tasks, yet requiring extensive resource consumption when fully fine-tuning all parameters. To mitigate this, Parameter Efficient Fine-Tuning (PEFT) strategies, such as LoRA, have been developed. In this paper, we delve into the concept of task-specific directions (TSDs)—critical for transitioning large models from pretrained states to task-specific enhancements in PEFT. We propose a framework to clearly define these directions and explore their properties, and practical utilization challenges. We then introduce a novel approach, LoRA-Dash, which aims to maximize the impact of TSDs during the fine-tuning process, thereby enhancing model performance on targeted tasks. Extensive experiments have conclusively demonstrated the effectiveness of LoRA-Dash, and in-depth analyses further reveal the underlying mechanisms of LoRA-Dash.

1224Large-Scale Training Data Attribution with Efficient Influence Functions

[openreview] [pdf]

Abstract Training data attribution (TDA) quantifies the contribution of individual training examples to model predictions, enabling a range of applications such as data curation, data citation, and model debugging. However, applying existing TDA methods to recent large models and training datasets has been largely limited by prohibitive compute and memory costs. In this work, we focus on influence functions, a popular gradient-based TDA method, and significantly improve its scalability with an efficient gradient projection strategy called LoGra that leverages the gradient structure in backpropagation. We then provide a theoretical motivation of gradient projection approaches to influence functions to promote trust in the TDA process. Lastly, we lower the barrier to implementing TDA systems by introducing LogIX, a software package that can transform existing training code into TDA code with minimal effort. In our TDA experiments, LoGra achieves competitive accuracy against more expensive baselines while showing up to 6,500x improvement in throughput and 5x reduction in GPU memory usage when applied to Llama3-8B-Instruct and the 1B-token dataset.

1225Second-Order Forward-Mode Automatic Differentiation for Optimization

[openreview] [pdf]

Abstract Forward gradient methods offer a promising alternative to backpropagation. Optimization that only requires forward passes could simplify hardware implementation, improve parallelism, lower memory cost, and allow for more biologically plausible learning models. This has motivated recent forward-mode automated differentiation (AD) methods. This paper presents a novel second-order forward-mode AD method for optimization that generalizes a second-order line search to a KK-dimensional hyperplane. Unlike recent work that relies on directional derivatives (or Jacobian–Vector Products, JVPs), we use hyper-dual numbers to jointly evaluate both directional derivatives and their second-order quadratic terms. As a result, we introduce forward-mode weight perturbation with Hessian information for K-dimensional hyper-plane search (FoMoH-KKD). We derive the convergence properties of FoMoH-KKD and show how it generalizes to Newton’s method for K=DK = D. We demonstrate this generalization empirically, and compare the performance of FoMoH-KKD to forward gradient descent (FGD) on three case studies: Rosenbrock function used widely for evaluating optimization methods, logistic regression with 7,850 parameters, and learning a CNN classifier with 431,080 parameters. Our experiments show that FoMoH-KKD not only achieves better performance and accuracy, but also converges faster, thus, empirically verifying our theoretical results.

1226Progressive Autoregressive Video Diffusion Models

[openreview] [pdf]

Abstract Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 5 seconds or 120 frames, due to computation limitations during training. In this work, we show that existing models can be naturally adapted to autoregressive video diffusion models without changing the architectures. Our key idea is to assign the latent frames with progressively increasing noise levels rather than a single noise level. Thus, each latent can condition on all the less noisy latents before it and provide condition for all the more noisy latents after it. Such progressive video denoising allows our models to autoregressively generate frames without quality degradation. We present state-of-the-art results on long video generation at 1 minute (1440 frames at 24 FPS). Our results are available at this anonymous url:https://progressive-autoregressive-vdm.github.io/.

1227Low-Dimension-to-High-Dimension Generalization and Its Implications for Length Generalization

[openreview] [pdf]

Abstract Low-Dimension-to-High-Dimension (LDHD) generalization is a special case of Out-of-Distribution (OOD) generalization, where the training data are restricted to a low-dimensional subspace of the high-dimensional testing space. Assuming that each instance is generated from a latent variable and the dimension of the latent variable reflects the problem scale, the inherent scaling challenge in length generalization can be captured by the LDHD generalization in the latent space. We theoretically demonstrate that LDHD generalization is generally unattainable without exploiting prior knowledge to provide appropriate inductive bias. Specifically, we explore LDHD generalization in Boolean functions. We verify that different architectures trained with (S)GD converge to \emph{min-degree interpolators w.r.t. different independent sets}. LDHD generalization is achievable if and only if the target function coincides with this inductive bias. Applying the insights from LDHD generalization to length generalization, we explain the effectiveness of CoT as changing the structure latent space to enable better LDHD generalization. We also propose a principle for position embedding design to handle both the inherent LDHD generalization and the nuisances such as the data format. Following the principle, we propose a novel position embedding called RPE-Square that remedies the RPE for dealing with the data format nuisance.

1228The Computational Complexity of Positive Non-Clashing Teaching in Graphs

[openreview] [pdf]

Abstract We study the classical and parameterized complexity of computing the positive non-clashing teaching dimension of a set of concepts, that is, the smallest number of examples per concept required to successfully teach an intelligent learner under the considered, previously established model. For any class of concepts, it is known that this problem can be effortlessly transferred to the setting of balls in a graph GG. We establish (1) the NP-hardness of the problem even when restricted to instances with positive non-clashing teaching dimension k=2k=2 and where all balls in the graph are present, (2) near-tight running time upper and lower bounds for the problem on general graphs, (3) fixed-parameter tractability when parameterized by the vertex integrity of GG, and (4) a lower bound excluding fixed-parameter tractability when parameterized by the feedback vertex number and pathwidth of GG, even when combined with kk. Our results provide a nearly complete understanding of the complexity landscape of computing the positive non-clashing teaching dimension and answer open questions from the literature.

1229Positive-Unlabeled Diffusion Models for Preventing Sensitive Data Generation

[openreview] [pdf]

Abstract Diffusion models are powerful generative models but often generate sensitive data that are unwanted by users, mainly because the unlabeled training data frequently contain such sensitive data. Since labeling all sensitive data in the large-scale unlabeled training data is impractical, we address this problem by using a small amount of labeled sensitive data. In this paper, we propose positive-unlabeled diffusion models, which prevent the generation of sensitive data using unlabeled and sensitive data. Our approach can approximate the evidence lower bound (ELBO) for normal (negative) data using only unlabeled and sensitive (positive) data. Therefore, even without labeled normal data, we can maximize the ELBO for normal data and minimize it for labeled sensitive data, ensuring the generation of only normal data. Through experiments across various datasets and settings, we demonstrated that our approach can prevent the generation of sensitive images without compromising image quality.

1230Mutual-Inform SMoE: Improving Routing Stability via Probabilistic Graphical Model

[openreview] [pdf]

Abstract Sparse Mixture of Experts (SMoE) has emerged as a breakthrough approach for achieving unprecedented scalability in deep learning. By enabling models to expand their parameter count exponentially while selectively activating only a small subset of parameters per sample, SMoEs maintain high efficiency. However, SMoE models are susceptible to routing fluctuations, leading to instability and non-robustness. In this work, we unveils SMoE-based attention as a point estimate of a regression function of a 3-layer hierarchical mixture of experts regression. Through this probabilistic graphical model (PGM) framework, we highlight the conditional independence in expert-selection process of tokens, which exposes the model to routing fluctuation and non-robustness. Motivating by this PGM framework, we propose Mutual-Inform SMoEs, including Similarity and Attention-Inform SMoE, which eliminate the assumption of conditional independence by allowing tokens to directly influence each other on expert-decisions. We theoretically demonstrate that our methods lower the entropy in decision-making, enabling more confident and consistent expert assignments. Finally, we empirically validate our models on ImageNet classification and Wikitext-103 language modeling, showing significant improvements in reducing routing fluctuations, enhancing performance, and increasing model robustness compared to baseline Transformer-SMoE models.

1231MGDA Converges under Generalized Smoothness, Provably

[openreview] [pdf]

Abstract Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning. Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard LL-smooth or bounded-gradient assumptions, which typically do not hold for neural networks, such as Long short-term memory (LSTM) models and Transformers. In this paper, we study a more general and realistic class of generalized \ell-smooth loss functions, where \ell is a general non-decreasing function of gradient norm. We revisit and analyze the fundamental multiple gradient descent algorithm (MGDA) and its stochastic version with double sampling for solving the generalized \ell-smooth MOO problems, which approximate the conflict-avoidant (CA) direction that maximizes the minimum improvement among objectives. We provide a comprehensive convergence analysis of these algorithms and show that they converge to an ϵ\epsilon-accurate Pareto stationary point with a guaranteed ϵ\epsilon-level average CA distance (i.e., the gap between the updating direction and the CA direction) over all iterations, where totally O(ϵ2)\mathcal{O}(\epsilon^{-2}) and O(ϵ4)\mathcal{O}(\epsilon^{-4}) samples are needed for deterministic and stochastic settings, respectively. We prove that they can also guarantee a tighter ϵ\epsilon-level CA distance in each iteration using more samples. Moreover, we analyze an efficient variant of MGDA named MGDA-FA using only O(1)\mathcal{O}(1) time and space, while achieving the same performance guarantee as MGDA.

1232Do Symbolic or Black-Box Representations Generalise Better In Learned Optimisation?

[openreview] [pdf]

Abstract Until recently, behind every algorithmic advance in machine learning was a human researcher. Now, however, algorithms can be meta-learned automatically, with little human input. However, to be truly useful, such algorithms must generalise beyond their training distribution. This is especially challenging in reinforcement learning (RL), where transferring algorithms between environments with vastly different dynamics is difficult and training on diverse environments often requires prohibitively expensive large-scale data collection.Learned optimisation is a branch of algorithmic discovery that meta-learns optimiser update rules. Learned optimisers can be classified into two groups: black-box algorithms, where the optimiser is a neural network; or symbolic algorithms, where the optimiser is represented using mathematical functions or code. While some claim that symbolic algorithms generalise better than black-box ones, testing such assertions is complicated by the fact that symbolic algorithms typically include additional hyperparameters, and thus their evaluation is done many-shot. This is an unfair comparison with the zero-shot evaluation of black-box optimisers. In this work, we build a pipeline to discover symbolic optimisers which are hyperparameter-free, enabling a fair comparison of the generalisation of symbolic optimisers with that of an open-source state-of-the-art black-box optimiser trained for RL. Based on our analysis, we propose suggestions to improve the symbolic optimiser discovery pipeline for RL, with an overall objective of reducing the need for hyperparameter tuning to train an agent.

1233DuaRot: Dual Rotation for Advanced Outlier Mitigation in Rotated LLMs

[openreview] [pdf]

Abstract By employing rotation, outliers in activations can be effectively mitigated without altering the output, thereby facilitating the quantization of large language models (LLMs). However, existing rotation-based methods only consider global activation distributions, leaving the finer-grained distributions underexplored. Additionally, these methods predominantly rely on the Walsh–Hadamard transform (WHT) to accelerate online rotation operations, while not fully considering performance between matrix multiplication~(Matmul) and WHT in actual runtime. These limitations hinder the rotation’s ability to effectively reduce quantization errors and decrease inference speed. Therefore, improvements are needed in their performance regarding both accuracy and speed. In this paper, we propose a dual rotation method for rotation matrices, dubbed DuaRot, based on reparameterization. During training, DuaRot sequentially refines global and local features to achieve effective outlier mitigation. During inference, global and local rotations can be merged, which maintains rotational invariance without introducing additional computational overhead. Meanwhile, we propose a hardware-aware matrix configuration strategy, which determines whether the online Hadamard matrix should be expanded into a trainable parameter space by taking the runtime of the WHT and Matmul into account. This approach further enhances the reduction of quantization errors in online rotation operations without compromising inference speed. Extensive experiments demonstrate that DuaRot outperforms existing methods across various models and quantization configurations. For instance, when applied to LLaMA3-8B, DuaRot achieves WikiText-2 perplexities of 7.49 and 7.41 under W4A4KV4 and W4A4KV16 configurations with Round-to-Nearest (RTN), improving by 0.51 and 0.41 over the state-of-the-art, respectively. The code will be publicly available soon.

[openreview] [pdf]

Abstract Link prediction is a widely studied task in Graph Representation Learning (GRL) for modeling relational data. Early theories in GRL were based on the assumption of a symmetric adjacency matrix, reflecting an undirected setting. As a result, much of the following state-of-the-art research has continued to operate under this symmetry assumption, even though real-world data often involves crucial information conveyed through the direction of relationships. This oversight limits the ability of these models to fully capture the complexity of directed interactions. In this paper, we focus on the challenge of directed link prediction by evaluating key heuristics that have been successful in the undirected settings. We propose simple but effective adaptations of these heuristics to the directed link prediction task and demonstrate that these modifications yield competitive performance compared to leading Graph Neural Networks (GNNs) originally designed for undirected graphs. Through an extensive set of experiments, we derive insights that inform the development of a novel framework for directed link prediction, which not only surpasses baseline methods but also outperforms state-of-the-art GNNs on multiple benchmarks.

1235Concept Bottleneck Models under Label Noise

[openreview] [pdf]

Abstract Concept bottleneck models (CBMs) are a class of interpretable neural network models that make the final predictions based on intermediate representations known as concepts. With these concepts being human-interpretable, CBMs enable one to better understand the decisions made by neural networks. Despite this advantage, we find that CBMs face a critical limitation: they require additional labeling efforts for concept annotation, which can easily increase the risk of mislabeling, i.e., CBMs need to be trained with noisy labels. In this work, we systematically investigate the impact of label noise on CBMs, demonstrating that it can significantly compromise both model performance and interpretability. Specifically, we measure the impact of varying levels of label noise across different training schemes, through diverse lenses including extensive numerical evaluations, feature visualizations, and in-depth analysis of individual concepts, identifying key factors contributing to the breakdowns and establishing a better understanding of underlying challenges. To mitigate these issues, we propose leveraging a robust optimization technique called sharpness-aware minimization (SAM). By improving the quality of intermediate concept predictions, SAM enhances both the subsequent concept-level interpretability and final target prediction performance.

1236Latent Bayesian Optimization via Autoregressive Normalizing Flows

[openreview] [pdf]

Abstract Bayesian Optimization (BO) has been recognized for its effectiveness in optimizing expensive and complex objective functions. Recent advancements in Latent Bayesian Optimization (LBO) have shown promise by integrating generative models such as variational autoencoders (VAEs) to manage the complexity of high-dimensional and structured data spaces. However, existing LBO approaches often suffer from the value discrepancy problem, which arises from the reconstruction gap between latent and input spaces. This value discrepancy problem propagates errors throughout the optimization process, which induces suboptimal optimization outcomes. To address this issue, we propose a Normalizing Flow-based Bayesian Optimization (NF-BO), which utilizes normalizing flow as a generative model to establish accurate and one-to-one mappings between latent and input spaces. To deal with sequence-based inputs, we introduce SeqFlow, an autoregressive sequence-specialized normalizing flow model designed to maintain one-to-one mappings between the input and latent spaces. Moreover, we develop a token-level adaptive candidate sampling strategy that dynamically adjusts the exploration probability of each token based on the token-level importance in the optimization process. Through extensive experiments, our NF-BO method demonstrates superior performance in molecule generation tasks, significantly outperforming traditional optimization methods and existing LBO approaches.

1237Coordinate In and Value Out: Training Flow Transformers in Ambient Space

[openreview] [pdf]

Abstract Flow matching models have emerged as a powerful method for generative modeling on domains like images or videos, and even on unstructured data like 3D point clouds. These models are commonly trained in two stages: first, a data compressor (\ie a variational auto-encoder) is trained, and in a subsequent training stage a flow matching generative model is trained in the low-dimensional latent space of the data compressor. This two stage paradigm adds complexity to the overall training recipe and sets obstacles for unifying models across data domains, as specific data compressors are used for different data modalities. To this end, we introduce Ambient Space Flow Transformers (ASFT), a domain-agnostic approach to learn flow matching transformers in ambient space, sidestepping the requirement of training compressors and simplifying the training process. We introduce a conditionally independent point-wise training objective that enables ASFT to make predictions continuously in coordinate space. Our empirical results demonstrate that using general purpose transformer blocks, ASFT effectively handles different data modalities such as images and 3D point clouds, achieving strong performance in both domains and outperforming comparable approaches. ASFT is a promising step towards domain-agnostic flow matching generative models that can be trivially adopted in different data domains.

1238Personalized Federated Learning on Flowing Data Heterogeneity under Restricted Storage

[openreview] [pdf]

Abstract Recent years, researchers focused on personalized federated learning (pFL) to address the inconsistent requirements of clients causing by data heterogeneity in federated learning (FL). However, existing pFL methods typically assume that local data distribution remains unchanged during FL training, the changing data distribution in actual heterogeneous data scenarios can affect model convergence rate and reduce model performance. In this paper, we focus on solving the pFL problem under the situation where data flows through each client like a flowing stream which called Flowing Data Heterogeneity under Restricted Storage, and shift the training goal to the comprehensive performance of the model throughout the FL training process. Therefore, based on the idea of category decoupling, we design a local data distribution reconstruction scheme and a related generator architecture to reduce the error of the controllable replayed data distribution, then propose our pFL framework, pFedGRP, to achieve knowledge transfer and personalized aggregation. Comprehensive experiments on five datasets with multiple settings show the superiority of pFedGRP over eight baseline methods.

1239To Tackle Adversarial Transferability: A Novel Ensemble Training Method with Fourier Transformation

[openreview] [pdf]

Abstract Ensemble methods are commonly used for enhancing robustness in machine learning. However, due to the ‘‘transferability’’ of adversarial examples, the performance of an ensemble model can be seriously affected even it contains a set of independently trained sub-models. To address this issue, we propose an efficient data transformation method based on a cute ‘‘weakness allocation’’ strategy, to diversify non-robust features. Our approach relies on a fine-grained analysis on the relation between non-robust features and adversarial attack directions. Moreover, our approach enjoys several other advantages, e.g., it does not require any communication between sub-models and the construction complexity is also quite low. We conduct a set of experiments to evaluate the performance of our proposed method and compare it with several popular baselines. The results suggest that our approach can achieve significantly improved robust accuracy over most existing ensemble methods, and meanwhile preserve high clean accuracy.

1240Improving Source Extraction with Diffusion and Consistency Models

[openreview] [pdf]

Abstract In this work, we demonstrate the integration of a score-matching diffusion model into a deterministic architecture for time-domain musical source extraction, resulting in enhanced audio quality. To address the typically slow iterative sampling process of diffusion models, we apply consistency distillation and reduce the sampling process to a single step, achieving performance comparable to that of diffusion models, and with two or more steps, even surpassing them. Trained on the Slakh2100 dataset for four instruments (bass, drums, guitar, and piano), our model shows significant improvements across objective metrics compared to baseline methods. Sound examples are available athttps://consistency-separation.github.io/.

1241In Praise of Stubbornness: The Case for Cognitive-Dissonance Aware Continual Update of Knowledge in LLMs

[openreview] [pdf]

Abstract Despite remarkable capabilities, large language models (LLMs) struggle to continually update their knowledge without catastrophic forgetting. In contrast, humans effortlessly integrate new information, detect conflicts with existing beliefs, and selectively update their mental models. This paper introduces a novel incremental update paradigm inspired by human cognition. We implement and evaluate two key components within existing LLM architectures: (1) Dissonance and Familiarity Awareness, enabling LLMs to classify new information as novel, familiar, or dissonant; and (2) Targeted Network Updates, which involve continuously tracking past gradient usage to distinguish between frequently used (stubborn) and rarely used (plastic) neurons.Through a series of carefully designed experiments, we uncover a number of empirical findings and demonstrate the potential of this approach. First, dissonance awareness is feasible even using simple features like activations and gradients. Second, unlike non-dissonant updates which largely preserve prior knowledge even with naive fine-tuning, dissonant updates prove catastrophically destructive to the model’s knowledge base, indiscriminately affecting even information unrelated to the current updates. Finally, our history-aware targeted updates, which continuously monitor and leverage past gradient information, alleviate the negative impact of dissonant updates significantly better than state-of-the-art editing methods. We plan to develop dedicated conflict resolution methods in future work.

1242Learning Long Range Dependencies on Graphs via Random Walks

[openreview] [pdf]

Abstract Message-passing graph neural networks (GNNs) excel at capturing local relationships but struggle with long-range dependencies in graphs. In contrast, graph transformers (GTs) enable global information exchange but often oversimplify the graph structure by representing graphs as sets of fixed-length vectors. This work introduces a novel architecture that overcomes the shortcomings of both approaches by combining the long-range information of random walks with local message passing. By treating random walks as sequences, our architecture leverages recent advances in sequence models to effectively capture long-range dependencies within these walks. Based on this concept, we propose a framework that offers (1) more expressive graph representations through random walk sequences, (2) the ability to utilize any sequence model for capturing long-range dependencies, and (3) the flexibility by integrating various GNN and GT architectures. Our experimental evaluations demonstrate that our approach achieves significant performance improvements on 19 graph and node benchmark datasets, notably outperforming existing methods by up to 13% on the PascalVoc-SP and COCO-SP datasets.

1243Learning Long Range Dependencies on Graphs via Random Walks

[openreview] [pdf]

Abstract No absctract

1244Maximum Next-State Entropy for Efficient Reinforcement Learning

[openreview] [pdf]

Abstract Maximum entropy algorithms have demonstrated significant progress in Reinforcement Learning~(RL), which offers an additional guidance in the form of entropy, particularly beneficial in tasks with sparse rewards. Nevertheless, current approaches grounded in policy entropy encourage the agent to explore diverse actions, yet they do not directly help agent explore diverse states. In this study, we theoretically reveal the challenge for optimizing the next-state entropy of agent. To address this limitation, we introduce Maximum Next-State Entropy (MNSE), a novel method which maximizes next-state entropy through an action mapping layer following the inner policy. We provide a theoretical analysis demonstrating that MNSE can maximize next-state entropy by optimizing the action entropy of the inner policy. We conduct extensive experiments on various continuous control tasks and show that MNSE can significantly improve the exploration capability of RL algorithms.

1245Adversarial Search Engine Optimization for Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly used in applications where the model selects from competing third-party content, such as in LLM-powered search engines or chatbot plugins. In this paper, we introducePreference Manipulation Attacks, a new class of attacks that manipulate an LLM’s selections to favor the attacker. We demonstrate that carefully crafted website content or plugin documentations can trick an LLM to promote the attacker products and discredit competitors, thereby increasing user traffic and monetization. We show this leads to aprisoner’s dilemma, where all parties are incentivized to launch attacks, but the collective effect degrades the LLM’s outputs for everyone. We demonstrate our attacks on production LLM search engines (Bing and Perplexity) and plugin APIs (for GPT-4 and Claude). As LLMs are increasingly used to rank third-party content, we expect Preference Manipulation Attacks to emerge as a significant threat.

[openreview] [pdf]

Abstract With a growing interest in understanding neural network prediction strategies, Concept Activation Vectors (CAVs) have emerged as a popular tool for modeling human-understandable concepts in the latent space. Commonly, CAVs are computed by leveraging linear classifiers optimizing theseparabilityof latent representations of samples with and without a given concept. However, in this paper we show that such a separability-oriented computation leads to solutions, which may diverge from the actual goal of precisely modeling the concept direction. This discrepancy can be attributed to the significant influence of distractor directions, i.e., signals unrelated to the concept, which are picked up by filters (i.e., weights) of linear models to optimize class-separability. To address this, we introducepattern-based CAVs, solely focussing on concept signals, thereby providing more accurate concept directions. We evaluate various CAV methods in terms of their alignment with the true concept direction and their impact on CAV applications, including concept sensitivity testing and model correction for shortcut behavior caused by data artifacts. We demonstrate the benefits of pattern-based CAVs using the Pediatric Bone Age, ISIC2019, and FunnyBirds datasets with VGG, ResNet, ReXNet, EfficientNet, and Vision Transformer as model architectures.

1247Bootstrapping Language Models with DPO Implicit Rewards

[openreview] [pdf]

Abstract Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM model to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate refinements that debias the length of the responses and enhance the quality of the preference dataset to further improve our approach. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8%\% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback.

1248MoIN: Mixture of Introvert Experts to Upcycle an LLM

[openreview] [pdf]

Abstract The goal of this paper is to improve (upcycle) an existing large language model without the prohibitive requirements of continued pre-training of the full-model. The idea is to split the pre-training data into semantically relevant groups and train an expert on each subset. An expert takes the form of a lightweight adapter added on the top of a frozen base model. During inference, an incoming query is first routed to the most relevant expert which is then loaded onto the base model for the forward pass. Unlike typical Mixture of Experts (MoE) models, the experts in our method do not work with other experts for a single query. Hence, we dub them ``introvert’’ experts. Freezing the base model and keeping the experts as lightweight adapters allows extreme parallelism during training and inference. Training of all experts can be done in parallel without any communication channels between them. Similarly, the inference can also be heavily parallelized by distributing experts on different GPUs and routing each request to the GPU containing its relevant expert. We implement a proof-of-concept version of this method and show the validity of our approach.

1249Learned Reference-based Diffusion Sampler for multi-modal distributions

[openreview] [pdf]

Abstract Over the past few years, several approaches utilizing score-based diffusion have been proposed to sample from probability distributions, that is without having access to exact samples and relying solely on evaluations of unnormalized densities. The resulting samplers approximate the time-reversal of a noising diffusion process, bridging the target distribution to an easy-to-sample base distribution. In practice, the performance of these methods heavily depends on key hyperparameters that require ground truth samples to be accurately tuned. Our work aims to highlight and address this fundamental issue, focusing in particular on multi-modal distributions, which pose significant challenges for existing sampling methods. Building on existing approaches, we introduceLearned Reference-based Diffusion Sampler(LRDS), a methodology specifically designed to leverage prior knowledge on the location of the target modes in order to bypass the obstacle of hyperparameter tuning. LRDS proceeds in two steps by (i) learning areferencediffusion model on samples located in high-density space regions and tailored for multimodality, and (ii) using this reference model to foster the training of a diffusion-based sampler. We experimentally demonstrate that LRDS best exploits prior knowledge on the target distribution compared to competing algorithms on a variety of challenging distributions.

1250Proto Successor Measure: Representing the space of all possible solutions of Reinforcement Learning

[openreview] [pdf]

Abstract Having explored an environment, intelligent agents should be able to transfer their knowledge to most downstream tasks within that environment. Referred to as ``zero-shot learning," this ability remains elusive for general-purpose reinforcement learning algorithms. While recent works have attempted to produce zero-shot RL agents, they make assumptions about the nature of the tasks or the structure of the MDP. We present \emph{Proto Successor Measure}: the basis set for all possible solutions of Reinforcement Learning in a dynamical system. We provably show that any possible policy can be represented using an affine combination of these policy independent basis functions. Given a reward function at test time, we simply need to find the right set of linear weights to combine these basis corresponding to the optimal policy. We derive a practical algorithm to learn these basis functions using only interaction data from the environment and show that our approach can produce the optimal policy at test time for any given reward function without additional environmental interactions.

1251Is In-Context Learning Sufficient for Instruction Following in LLMs?

[openreview] [pdf]

Abstract In-context learning (ICL) allows LLMs to learn from examples without changing their weights: this is a particularly promising capability for long-context LLMs that can potentially learn from many examples. Recently, Lin et al. (2024) proposed URIAL, a method using only three in-context examples to align base LLMs, achieving non-trivial instruction following performance. In this work, we show that, while effective, ICL alignment with URIAL still underperforms compared to instruction fine-tuning on established benchmarks such as MT-Bench and AlpacaEval 2.0 (LC), especially with more capable base LLMs. We then uncover the most relevant elements for successful in-context alignment, finding the crucial role of the decoding parameters. Based on these insights, we show that the approach of URIAL can indeed be improved by adding more, potentially carefully selected, high-quality demonstrations in context, getting closer to the performance of instruct models. Finally, we provide the first, to our knowledge, systematic comparison of ICL and instruction fine-tuning (IFT) for instruction following in the low data regime. Overall, our work advances the understanding of ICL as an alignment technique and its relationship to IFT.

1252S4M: S4 for multivariate time series forecasting with Missing values

[openreview] [pdf]

Abstract Multivariate time series data are integral to numerous real-world applications, including finance, healthcare, and meteorology, where accurate forecasting is paramount for informed decision-making and proactive measures. However, the presence of missing data poses significant challenges, often undermining the performance of predictive models. Traditional two-step approaches that first impute missing values and then perform forecasting tend to accumulate errors, particularly in complex multivariate settings with high missing ratios and intricate dependency structures. In this work, we present S4M, an end-to-end time series forecasting framework that seamlessly integrates missing data handling within the Structured State Space Sequence (S4) model architecture. Unlike conventional methods that treat imputation as a separate preprocessing step, S4M leverages the latent space of S4 models to recognize and represent missing data patterns directly, thereby capturing the underlying temporal and multivariate dependencies more effectively. Our approach comprises two key modules: the Adaptive Temporal Prototype Mapper (ATPM) and the Missing-Aware Dual Stream S4 (MDS-S4). The ATPM utilizes a prototype bank to derive robust and informative representations from historical data patterns, while MDS-S4 processes these representations alongside missingness masks as dual input streams to perform accurate forecasting. Extensive empirical evaluations on diverse real-world datasets demonstrate that S4M consistently achieves state-of-the-art performance, validating the efficacy of our integrated approach in handling missing data, highlighting its robustness and superiority over traditional imputation-based methods. These results highlight the potential of our method for advancing reliable time series forecasting in practical applications.

1253Concept forgetting via label annealing

[openreview] [pdf]

Abstract The effectiveness of current machine learning models relies on their ability to grasp diverse concepts present in datasets. However, biased and noisy data can inadvertently cause these models to be biased toward certain concepts, undermining their ability to generalize and provide utility. Consequently, modifying a trained model to forget these concepts becomes imperative for their responsible deployment. We refer to this problem asconcept forgetting. Our goal is to develop techniques for forgetting specific undesired concepts from a pre-trained classification model’s prediction. To achieve this goal, we present an algorithm calledLabelANnealing (LAN). This iterative algorithm employs a two-stage method for each iteration. In the first stage, pseudo-labels are assigned to the samples by annealing or redistributing the original labels based on the current iteration’s model predictions of all samples in the dataset. During the second stage, the model is fine-tuned on the dataset with pseudo-labels. We illustrate the effectiveness of the proposed algorithms across various models and datasets. Our method reducesconcept violation, a metric that measures how much the model forgets specific concepts, by about 85.35% on the MNIST dataset, 73.25% on the CIFAR-10 dataset, and 69.46% on the CelebA dataset while maintaining high model accuracy. Our implementation can be found at this following link: \url{https://anonymous.4open.science/r/LAN-141B/}

1254Interplay Between Task Learning and Skill Discovery for Agile Locomotion

[openreview] [pdf]

Abstract Agile locomotion of legged robots, characterized by high momentum and frequent contact changes, is a challenging task that demands precise motor control. Therefore, the training process for such skills often relies on additional techniques, such as reward engineering, expert demonstrations, and curriculum learning. However, these requirements hinder the generalizability of methods because we may lack sufficient prior knowledge or demonstration datasets for some tasks. In this work, we consider the problem of automated learning agile motions using its intrinsic motivation, which can greatly reduce the effort of a human engineer. Inspired by unsupervised skill discovery, our learning framework encourages the agent to explore various skills to maximize the given task reward. Finally, we train a parameter to balance the two distinct rewards through a bi-level optimization process. We demonstrate that our method can train quadrupeds to perform highly agile motions, ranging from crawling, jumping, and leaping to complex maneuvers such as jumping off a perpendicular wall.

1255Context-aware Prompt Tuning: Advancing In-Context Learning with Adversarial Methods

[openreview] [pdf]

Abstract Fine-tuning Large Language Models (LLMs) typically involves updating at least a few billions of parameters. A more parameter-efficient approach is Prompt Tuning (PT), which updates only a few learnable tokens, and differently, In-Context Learning (ICL) adapts the model to a new task by simply including examples in the input without any training. When applying optimization-based methods, such as fine-tuning and PT for few-shot learning, the model is specifically adapted to the small set of training examples, whereas ICL leaves the model unchanged. This distinction makes traditional learning methods more prone to overfitting; in contrast, ICL is less sensitive to the few-shot scenario. While ICL is not prone to overfitting, it does not fully extract the information that exists in the training examples. This work introduces Context-aware Prompt Tuning (CPT), a method inspired by ICL, PT, and adversarial attacks. We build on the ICL strategy of concatenating examples before the input, but we extend this by PT-like learning, refining the context embedding through iterative optimization to extract deeper insights from the training examples. We carefully modify specific context tokens, considering the unique structure of input and output formats. Inspired by adversarial attacks, we adjust the input based on the labels present in the context, focusing on minimizing, rather than maximizing, the loss. Moreover, we apply a projected gradient descent algorithm to keep token embeddings close to their original values, under the assumption that the user-provided data is inherently valuable. Our method has been shown to achieve superior accuracy across multiple classification tasks using various LLM models.

1256Neighborhood and Global Perturbations Supported SAM in Federated Learning: From Local Tweaks To Global Awareness

[openreview] [pdf]

Abstract Federated Learning (FL) can be coordinated under the orchestration of a central server to build a privacy-preserving model without collaborative data exchange. However, participant data heterogeneity leads to local optima divergence, affecting convergence outcomes. Recent research focused on global sharpness-aware minimization (SAM) and dynamic regularization to enhance consistency between global and local generalization and optimization objectives. Nonetheless, the estimation of global SAM introduces additional computational and memory overhead. At the same time, the local dynamic regularizer cannot capture the global update state due to training isolation. This paper proposes a novel FL algorithm, FedTOGA, designed to consider optimization and generalization objectives while maintaining minimal uplink communication overhead. By linking local perturbations to global updates, global generalization consistency is improved. Additionally, linking the local dynamic regularizer to global updates increases the perception of the global gradient and enhances optimization consistency. Global updates are passively received by clients, reducing overhead. We also propose neighborhood perturbation to approximate local perturbation, analyzing its strengths and working principle. Theoretical analysis shows FedTOGA achieves faster convergence O(1/T)O(1/T) under non-convex functions. Empirical studies demonstrate that FedTOGA outperforms state-of-the-art algorithms, with a 1% accuracy increase and 30% faster convergence, achieving state-of-the-art.

1257Personalized Federated Learning With Similarity Information Supervisor

[openreview] [pdf]

Abstract A crucial issue in federated learning is the heterogeneity of data between clients, which can lead to model weight divergence, eventually deteriorating the model performance. Personalized federated learning (pFL) has been proven to be an effective approach to addressing data heterogeneity in federated learning. However, existing pFL studies seldom verify whether the broadcast global model is beneficial for the local model performance. To address this, we propose a novel pFL method, called federated learning with similarity information supervision (FedSimSup). Specifically, FedSimSup incorporates a local supervisor to assist the model training and a personalized model for global information aggregation. The role of the supervisor is to refine the personalized model when it is not beneficial for the local model performance, ensuring the effective global information aggregation while aligning with the local heterogeneous data. Additionally, the similarity relationships between the clients are measured using label distribution differences of the local raw data to weight the personalized models, promoting information usage among similar clients. Experimental results demonstrate three advantages of FedSimSup: (1) It shows better performance over heterogeneous data compared with seven state-of-the-art federated learning methods; (2) It can allow for different model architectures across different clients; (3) It offers a certain degree of interpretability.

1258Task-Adaptation Curriculum Learning

[openreview] [pdf]

Abstract A large distribution gap between a target task and pre-training tasks could undermine the task adaptation performance of pretrained models. When the target-task data are scarce, naive finetuning results in overfitting and forgetting. In various domains, skills can be transferred across semantically related tasks, among which the general-purposed ones often have more training data. Can we bridge the gap between a pre-trained model and a low-resource target task by leveraging data from other tasks? In this paper, we address the low-resource task adaptation challenge by a transfer learning curriculum, which finetunes a model on a curated sequence of intermediate tasks, thereby progressively bridging the gap between the pre-trained model and the target task. To this end, we formulate the task curriculum as a graph search problem and improve the efficiency of estimating transferability between tasks. Two search algorithms are studied, i.e., greedy best-first search and Monte Carlo tree search. We evaluate our approach, i.e., ``task-adaptation curriculum learning (TaCL)‘’ on two benchmark settings. Extensive evaluations on different target tasks demonstrate the effectiveness and advantages of TaCL on highly specific and low-resource downstream tasks.

1259Provable In-context Learning for Mixture of Linear Regressions using Transformers

[openreview] [pdf]

Abstract We theoretically investigate the in-context learning capabilities of transformers in the context of learning mixtures of linear regression models. For the case of two mixtures, we demonstrate the existence of transformers that can achieve an accuracy, relative to the oracle predictor, of order O~((d/n)1/4)\mathcal{\tilde{O}}((d/n)^{1/4}) in the low signal-to-noise ratio (SNR) regime and O~(d/n)\mathcal{\tilde{O}}(\sqrt{d/n}) in the high SNR regime, where nn is the length of the prompt, and dd is the dimension of the problem. Additionally, we derive in-context excess risk bounds of order O(L/B)\mathcal{O}(L/\sqrt{B}), where BB denotes the number of (training) prompts, and LL represents the number of attention layers. The order of LL depends on whether the SNR is low or high. In the high SNR regime, we extend the results to KK-component mixture models for finite KK. Extensive simulations also highlight the advantages of transformers for this task, outperforming other baselines such as the EM algorithm.

1260Your Weak LLM is Secretly a Strong Teacher for Alignment

[openreview] [pdf]

Abstract The burgeoning capabilities of large language models (LLMs) have underscored the need for alignment to ensure these models act in accordance with human values and intentions. Existing alignment frameworks present constraints either in the form of expensive human effort or high computational costs. This paper explores a promising middle ground, where we employ a weak LLM that is significantly less resource-intensive than top-tier models, yet offers more automation than purely human feedback. We present a systematic study to evaluate and understand weak LLM’s ability to generate feedback for alignment. Our empirical findings demonstrate that weak LLMs can provide feedback that rivals or even exceeds that of fully human-annotated data. Our study indicates a minimized impact of model size on feedback efficacy, shedding light on a scalable and sustainable alignment strategy. To deepen our understanding of alignment under weak LLM feedback, we conduct a series of qualitative and quantitative analyses, offering novel insights into the quality discrepancies between human feedback vs. weak LLM feedback.

1261Tractable Multi-Agent Reinforcement Learning through Behavioral Economics

[openreview] [pdf]

Abstract A significant roadblock to the development of principled multi-agent reinforcement learning is the fact that desired solution concepts like Nash equilibria may be intractable to compute. To overcome this obstacle, we take inspiration from behavioral economics and show that---by imbuing agents with important features of human decision-making like risk aversion and bounded rationality---a class of risk-averse quantal response equilibria (RQE) become tractable to compute in all nn-player matrix and finite-horizon Markov games. In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents’ degree of risk-aversion and bounded rationality. To validate the richness of this class of solution concepts we show that it captures peoples’ patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model and validate our findings on a simple multi-agent reinforcement learning benchmark.

1262Reasoning Elicitation in Language Models via Counterfactual Feedback

[openreview] [pdf]

Abstract Despite the increasing effectiveness of language models, their reasoning capabilities remain underdeveloped. In particular, causal reasoning through counterfactual question answering is lacking. This work aims to bridge this gap. We first derive novel metrics that balance accuracy in factual and counterfactual questions, capturing a more complete view of the reasoning abilities of language models than traditional factual-only based metrics. Second, we propose several fine-tuning approaches that aim to elicit better reasoning mechanisms, in the sense of the proposed metrics. Finally, we evaluate the performance of the fine-tuned language models in a variety of realistic scenarios. In particular, we investigate to what extent our fine-tuning approaches systemically achieve better generalization with respect to the base models in several problems that require, among others, inductive and deductive reasoning capabilities.

1263Adversarial Robustness Overestimation and Instability in TRADES

[openreview] [pdf]

Abstract This paper examines the phenomenon of probabilistic robustness overestimation in TRADES, a prominent adversarial training method. Our study reveals that TRADES sometimes yields disproportionately high PGD validation accuracy compared to the AutoAttack testing accuracy in the multiclass classification task. This discrepancy highlights a significant overestimation of robustness for these instances, potentially linked to gradient masking. We further analyze the parameters contributing to unstable models that lead to overestimation. Our findings indicate that smaller batch sizes, lower beta values (which control the weight of the robust loss term in TRADES), larger learning rates, and higher class complexity (e.g., CIFAR-100 versus CIFAR-10) are associated with an increased likelihood of robustness overestimation. By examining metrics such as the First-Order Stationary Condition (FOSC), inner-maximization, and gradient information, we identify the underlying cause of this phenomenon as gradient masking and provide insights into it. Furthermore, our experiments show that certain unstable training instances may return to a state without robust overestimation, inspiring our attempts at a solution. In addition to adjusting parameter settings to reduce instability or retraining when overestimation occurs, we recommend incorporating Gaussian noise in inputs when the FOSC score exceed the threshold. This method aims to mitigate robustness overestimation of TRADES and other similar methods at its source, ensuring more reliable representation of adversarial robustness during evaluation.

1264Optimizing Preference Alignment with Differentiable NDCG Ranking

[openreview] [pdf]

Abstract Aligning large language models with human preferences improves interaction quality and safety by ensuring outputs better reflect human values. A promising strategy involves Reinforcement Learning from Human Feedback (RLHF), starting with collecting and ranking responses generated by a supervised fine-tuning model to refine alignment. Current methods (DPO) focus on learning from pairwise preference data, categorizing responses into preferred and less preferred pairs, and optimizing by maximizing pairwise margins. Recent studies have uncovered a substantial discrepancy between the theoretical aspirations of preference learning and its real-world results. Current preference alignment techniques underperform expectations, with ranking accuracies below 6060% on standard datasets. This suggests existing methods inadequately capture ideal preference relationships within sequences. To address this challenge, this paper introduces \underline{D}irect \underline{R}anking \underline{P}reference \underline{O}ptimization (DRPO), a novel method that views human preference alignment as a Learning-to-Rank (LTR) task. DRPO leverages NDCG, a widely used LTR metric, to optimize the ranking of responses within lists based on preference data, thereby enhancing ranking accuracies. Due to the nondifferentiability of NDCG, we propose diffNDCG loss, a differentiable approximation facilitated by a sorting network to simulate NDCG. Furthermore, to improve the quality of generated response, we propose a novel margin-based Adaptive Rank Policy Score. Extensive experiments have shown that DRPO outperforms existing baseline methods, enhancing the quality of the generated responses.

1265LoRA Unleashed: Effortlessly Advancing from Low to Arbitrary Rank

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) has emerged as a prominent technique for fine-tuning large foundation models, facilitating a reduction in trainable parameters through the utilization of low-rank matrices to represent weight changes A\mathbf{A} and B\mathbf{B} (\textit{i.e.,} ΔW=BA\Delta \mathbf{W} = \mathbf{B} \mathbf{A}). Although LoRA has demonstrated considerable success, its expressiveness is inherently limited by the constrained capacity of its low-rank structure. To ameliorate this limitation, we introduce \underline{Fo}urier-based Flexible \underline{R}ank \underline{A}daptation (FoRA), which harnesses the robust expressiveness of the Fourier basis to re-parameterize A\mathbf{A} and B\mathbf{B} from a sparse spectral subspace. Utilizing FoRA, adaptation matrices can overcome conventional rank limitations, achieving up to a 15x reduction in the parameter budget. We illustrate that FoRA achieves an optimal balance of efficiency and performance across various tasks, including natural language understanding, mathematical reasoning, commonsense reasoning, and image classification. Our codes are available athttps://anonymous.4open.science/r/FoRA-0E9C.

1266DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors

[openreview] [pdf]

Abstract Large-scale latent diffusion models (LDMs) excel in content generation across various modalities, but their reliance on phonemes and durations in text-to-speech (TTS) limits scalability and access from other fields. While recent studies show potential in removing these domain-specific factors, performance remains suboptimal. In this work, we introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. Through rigorous analysis and empirical exploration, we find that (1) DiT with minimal modifications outperforms U-Net, (2) variable-length modeling with a speech length predictor significantly improves results over fixed-length approaches, and (3) conditions like semantic alignment in speech latent representations are key to further enhancement. By scaling our training data to 82K hours and the model size to 790M parameters, we achieve superior or comparable zero-shot performance to state-of-the-art TTS models in naturalness, intelligibility, and speaker similarity, all without relying on domain-specific factors. Speech samples are available athttps://lactojoy.github.io.

1267A Scalable Temporal-Spatial Framework for Transaction Anomaly Detection in Ethereum Networks

[openreview] [pdf]

Abstract The rapid evolution of the Ethereum network necessitates sophisticated techniques to ensure its robustness against potential threats and to maintain transparency. While Graph Neural Networks (GNNs) have pioneered anomaly detection in such platforms, capturing the intricacies of both spatial and temporal transactional patterns has remained a challenge. This study presents a fusion of Graph Convolutional Networks (GCNs) with Temporal Random Walks (TRW) enhanced by probabilistic sampling to bridge this gap. Our approach, unlike traditional GCNs, leverages the strengths of TRW to discern complex temporal sequences in Ethereum transactions, thereby providing a more nuanced transaction anomaly detection mechanism. Extensive evaluations demonstrate that our TRW-GCN framework substantially advances the performance metrics over conventional GCNs in detecting irregularities such as suspiciously timed transactions, patterns indicative of token pump and dump schemes, or anomalous behavior in smart contract executions over time. As baseline algorithms for comparison, common unsupervised methods such as Isolation Forest, One-Class SVM, and DBSCAN (as classifier for TRW-GCN embedding) are employed; finally our novel TRW-GCN plus scoring method is compared with the state-of-the-art temporal graph attention algorithm.

1268One Wave to Explain Them All: A Unifying Perspective on Post-hoc Explainability

[openreview] [pdf]

Abstract Despite the growing use of deep neural networks in safety-critical decision-making, their inherent black-box nature hinders transparency and interpretability. Explainable AI (XAI) methods have thus emerged to understand a model’s internal workings, and notably attribution methods also called Saliency maps. Conventional attribution methods typically identify the locations - the where - of significant regions within an input. However, because they overlook the inherent structure of the input data, these methods often fail to interpret what these regions represent in terms of structural components (e.g., textures in images or transients in sounds). Furthermore, existing methods are usually tailored to a single data modality, limiting their generalizability. In this paper, we propose leveraging the wavelet domain as a robust mathematical foundation for attribution. Our approach, the Wavelet Attribution Method (WAM) extends the existing gradient-based feature attributions into the wavelet domain, providing a unified framework for explaining classifiers across images, audio, and 3D shapes. Empirical evaluations demonstrate that WAM matches or surpasses state-of-the-art methods across faithfulness metrics and models in image, audio, and 3D explainability. Finally, we show how our method explains not only the where - the important parts of the input - but also the what - the relevant patterns in terms of structural components.

1269LLMs for Generalizable Language-Conditioned Policy Learning under Minimal Data Requirements

[openreview] [pdf]

Abstract To develop autonomous agents capable of executing complex, multi-step decision-making tasks as specified by humans in natural language, existing reinforcement learning approaches typically require expensive labeled datasets or access to real-time experimentation. Moreover, conventional methods often face difficulties in generalizing to unseen goals and states, thereby limiting their practical applicability. This paper presents TEDUO, a novel training pipeline for offline language-conditioned policy learning. TEDUO operates on easy-to-obtain, unlabeled datasets and is suited for the so-called in-the-wild evaluation, wherein the agent encounters previously unseen goals and states. To address the challenges posed by such data and evaluation settings, our method leverages the prior knowledge and instruction-following capabilities of large language models (LLMs) to enhance the fidelity of pre-collected offline data and enable flexible generalization to new goals and states. Empirical results demonstrate that the dual role of LLMs in our framework—as data enhancers and generalizers—facilitates both effective and data-efficient learning of generalizable language-conditioned policies.

1270Preference Optimization as Probabilistic Inference

[openreview] [pdf]

Abstract Existing preference optimization methods are mainly designed for directly learning from human feedback with the assumption that paired examples (preferred vs. dis-preferred) are available. In contrast, we propose a method that can leverage unpaired preferred or dis-preferred examples, and works even when only one type of feedback (positive or negative) is available. This flexibility allows us to apply it in scenarios with varying forms of feedback and models, including training generative language models based on human feedback as well as training policies for sequential decision-making problems, where learned (value) functions are available. Our approach builds upon the probabilistic framework introduced in (Dayan & Hinton, 1997), which proposes to use expectation-maximization (EM) to directly optimize the probability of preferred outcomes (as opposed to classic expected reward maximization). To obtain a practical algorithm, we identify and address a key limitation in current EM-based methods: when applied to preference optimization, they solely maximize the likelihood of preferred examples, while neglecting dis-preferred samples. We show how one can extend EM algorithms to explicitly incorporate dis-preferred outcomes, leading to a novel, theoretically grounded, preference optimization algorithm that offers an intuitive and versatile way to learn from both positive and negative feedback.

1271Can a Large Language Model be a Gaslighter?

[openreview] [pdf]

Abstract Large language models (LLMs) have gained human trust due to their capabilities and helpfulness. However, this in turn may allow LLMs to affect users’ mindsets by manipulating language. It is termed as gaslighting, a psychological effect. In this work, we aim to investigate the vulnerability of LLMs under prompt-based and fine-tuning-based gaslighting attacks. Therefore, we propose a two-stage framework DeepCoG designed to: 1) elicit gaslighting plans from LLMs with the proposed DeepGaslighting prompting template, and 2) acquire gaslighting conversations from LLMs through our Chain-of-Gaslighting method. The gaslighting conversation dataset along with a corresponding safe dataset is applied to fine-tuning-based jailbreak on open-source LLMs and anti-gaslighting safety alignment on these LLMs. Experiments demonstrate that both prompt-based and fine-tuning-based attacks transform three open-source LLMs into gaslighters. In contrast, we advanced three safety alignment strategies to strengthen (by 12.05%) the safety guardrail of LLMs. Our safety alignment strategies have minimal impacts on the utility of LLMs. Empirical studies indicate that an LLM may be a potential gaslighter, even if it passed the harmfulness test on general dangerous queries.

1272Algorithmic Stability Based Generalization Bounds for Adversarial Training

[openreview] [pdf]

Abstract In this paper, we present a novel stability analysis of adversarial training and prove generalization upper bounds in terms of an expansiveness property of adversarial perturbations used during training and used for evaluation. These expansiveness parameters appear not only govern the vanishing rate of the generalization error but also govern its scaling constant. Our proof techniques do not rely on artificial assumptions of the adversarial loss, as are typically used in previous works. Our bound attributes the robust overfitting in PGD-based adversarial training to the sign function used in the PGD attack, resulting in a bad expansiveness parameter. The peculiar choice of sign function in the PGD attack appears to impact adversarial training both in terms of (inner) optimization and in terms of generalization, as shown in this work. This aspect has been largely overlooked to date. Going beyond the sign-function based PGD attacks, we further show that poor expansiveness properties exist in a wide family of PGD-like iterative attack algorithms, which may highlight an intrinsic difficulty in adversarial training.

[openreview] [pdf]

Abstract Recently, Large Language Models (LLMs) attained impressive performance in math and reasoning benchmarks. However, they still often struggle with multi-step reasoning which is relatively easy for humans. To further investigate this, we introduce a new benchmark, SearchBench, containing 11 unique combinatorial problems that avoid training contamination (each equipped with automated pipelines to generate an arbitrary number of instances) and analyze the feasibility, correctness, and optimality of LLM-generated solutions. We show that even the most advanced LLMs fail to solve these problems end-to-end in text, e.g., GPT4 and o1-preview respectively solve only 1.4% and 18.6% correctly. SearchBench problems require considering multiple pathways to the solution and backtracking, posing a significant challenge to auto-regressive models. Instructing LLMs to generate code that solves the problem helps only slightly. We next introduce an in-context learning approach that prompts the model to implement A*, an informed search algorithm, to comprehensively traverse the problem state space, improving the performance of models. We further extend this approach and propose the Multi-Stage-Multi-Try inference method which breaks down the A* algorithm implementation into two stages and auto-verifies the first stage against unit tests, raising GPT-4’s performance above 57%.

1274Length-Induced Embedding Collapse in Transformer-based Models

[openreview] [pdf]

Abstract Text embeddings enable various applications, but their performance deteriorates on longer texts. In this paper, we find that the performance degradation is due to a phenomenon called \textbf{Length Collapse}, where longer text embeddings collapse into a narrow space. This collapse results in a distributional inconsistency between embeddings of different text lengths, ultimately hurting the performance of downstream tasks. Theoretically, by considering the self-attention mechanism inherently functions as a low-pass filter, we prove that long sequences increase the attenuation rate of the low-pass filter effect of the self-attention mechanism. With layers going deeper, excessive low-pass filtering causes the token signals to retain only their Direct-Current (DC) component, which means the input token feature maps will collapse into a narrow space, especially in long texts. Based on the above analysis, we propose to mitigate the undesirable length collapse limitation by introducing a temperature in \softmax(\cdot), which achieves a higher low-filter attenuation rate. The tuning-free method, called \textbf{TempScale}, can be plugged into multiple transformer-based embedding models. Empirically, we demonstrate that TempScale can improve existing embedding models especially on long text inputs, bringing up to \textbf{0.53%} performance gains on 40 datasets from Massive Text Embedding Benchmark (MTEB) and \textbf{0.82%} performance gains on 4 datasets from LongEmbed, which specifically focuses on long context retrieval. The source code is available at \textcolor{blue}{\url{https://anonymous.4open.science/r/Length_Collapse-22D2}}.

1275MallowsPO: Fine-Tune Your LLM with Preference Dispersions

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has recently emerged as a popular approach to improve reinforcement learning with human feedback (RLHF), leading to better techniques to fine-tune large language models (LLM). A weakness of DPO, however, lies in its lack of capability to characterize the diversity of human preferences. Inspired by Mallows’ theory of preference ranking, we develop in this paper a new approach, theMallowsPO. A distinct feature of this approach is adispersion index, which reflects the dispersion of human preference to prompts. We show that existing DPO models can be reduced to special cases of this dispersion index, thus unified with MallowsPO. More importantly, we demonstrate empirically how to use this dispersion index to enhance the performance of DPO in a broad array of benchmark tasks, from synthetic bandit selection to controllable generation and dialogues, while maintaining great generalization capabilities. MallowsPO is also compatible with other SOTA offline preference optimization methods, boosting nearly 2% extra LC win rate when used as a plugin for fine-tuning Llama3-Instruct.

1276Connecting Federated ADMM to Bayes

[openreview] [pdf]

Abstract We provide new connections between two distinct federated learning approaches based on (i) ADMM and (ii) Variational Bayes (VB), and propose new variants by combining their complementary strengths. Specifically, we show that the dual variables in ADMM naturally emerge through the “site” parameters used in VB with isotropic Gaussian covariances. Using this, we derive two versions of ADMM from VB that use flexible covariances and functional regularisation, respectively. Through numerical experiments, we validate the improvements obtained in performance. The work shows connection between two fields that are believed to be fundamentally different and combines them to improve federated learning.

1277Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) has demonstrated great promise in aligning large language models (LLMs) with human preference. Depending on the availability of preference data, both online and offline RLHF are active areas of investigation. A key bottleneck is understanding how to incorporate uncertainty estimation in the reward function learned from the preference data for RLHF, regardless of how the preference data is collected. While the principles of optimism or pessimism under uncertainty are well-established in standard reinforcement learning (RL), a practically-implementable and theoretically-grounded form amenable to large language models is not yet available, as standard techniques for constructing confidence intervals become intractable under arbitrary policy parameterizations.In this paper, we introduce a unified approach to online and offline RLHF --- value-incentivized preference optimization (VPO) --- which regularizes the maximum-likelihood estimate of the reward function with the corresponding value function, modulated by a sign to indicate whether the optimism or pessimism is chosen. VPO also directly optimizes the policy with implicit reward modeling, and therefore shares a simpler RLHF pipeline similar to direct preference optimization. Theoretical guarantees of VPO are provided for both online and offline settings, matching the rates of their standard RL counterparts. Moreover, experiments on text summarization, dialogue, and standard benchmarks verify the practicality and effectiveness of VPO.

1278Decentralized Optimization with Coupled Constraints

[openreview] [pdf]

Abstract We consider the decentralized minimization of a separable objective i=1nfi(xi)\sum_{i=1}^{n} f_i(x_i), where the variables are coupled through an affine constraint i=1n(Aixibi)=0\sum_{i=1}^n\left(\mathbf{A}_i x_i - b_i\right) = 0. We assume that the functions fif_i, matrices Ai\mathbf{A}_i, and vectors bib_i are stored locally by the nodes of a computational network, and that the functions fif_i are smooth and strongly convex.This problem has significant applications in resource allocation and systems control and can also arise in distributed machine learning. We propose lower complexity bounds for decentralized optimization problems with coupled constraints and a first-order algorithm achieving the lower bounds. To the best of our knowledge, our method is also the first linearly convergent first-order decentralized algorithm for problems with general affine coupled constraints.

1279Does equivariance matter at scale?

[openreview] [pdf]

Abstract Given large data sets and sufficient compute, is it beneficial to design neural architectures for the structure and symmetries of each problem? Or is it more efficient to learn them from data? We study empirically how equivariant and non-equivariant networks scale with compute and training samples. Focusing on a benchmark problem of rigid-body interactions and on general-purpose transformer architectures, we perform a series of experiments, varying the model size, training steps, and dataset size. We find evidence for three conclusions. First, equivariance improves data efficiency, but training non-equivariant models with data augmentation can close this gap given sufficient epochs. Second, scaling with compute follows a power law, with equivariant models outperforming non-equivariant ones at each tested compute budget. Finally, the optimal allocation of a compute budget onto model size and training duration differs between equivariant and non-equivariant models.

1280Understanding Adversarially Robust Generalization via Weight-Curvature Index

[openreview] [pdf]

Abstract Despite extensive research on adversarial examples, the underlying mechanisms of adversarially robust generalization, a critical yet challenging task for deep learning, remain largely unknown. In this work, we propose a novel perspective to decipher adversarially robust generalization through the lens of the Weight-Curvature Index (WCI). The proposed WCI quantifies the vulnerability of models to adversarial perturbations using the Frobenius norm of weight matrices and the trace of Hessian matrices. We prove generalization bounds based on PAC-Bayesian theory and second-order loss function approximations to elucidate the interplay between robust generalization gap, model parameters, and loss landscape curvature. Our theory and experiments show that WCI effectively captures the robust generalization performance of adversarially trained models. By offering a nuanced understanding of adversarial robustness based on the scale of model parameters and the curvature of the loss landscape, our work provides crucial insights for designing more resilient deep learning models, enhancing their reliability and security.

1281DeMo: Decoupled Momentum Optimization

[openreview] [pdf]

Abstract Training large scale neural networks typically involves sharing the gradients between all accelerators, which necessitates specialized high-speed interconnects. Taking cues from signal processing, we show that it is not necessary to share or synchronize the full optimizer states and model parameters during training. By decoupling the momentum and allowing divergence in the optimizer states across accelerators, it is possible to even improve convergence compared to previous state of the art optimizers. From this, we introduce a Decoupled Momentum optimization algorithm (DeMo) that reduces the communication requirements by several orders of magnitude, potentially enabling future training of large neural networks on slow internet bandwidths with heterogeneous networking hardware. Furthermore, our method is agnostic to the network topology and neural network architecture, and supports scalable clock-synchronous distributed training with negligible compute and memory overhead. Empirically, we show that models trained with DeMo match or surpass the performance of equal models trained with AdamW, entirely bypassing the need for high-speed interconnects for pre-training large scale foundation models.

1282Risk Quadrangle and Robust Optimization Based onφ-Divergence

[openreview] [pdf]

Abstract The Fundamental Risk Quadrangle (FRQ) is a unified framework linking risk management, statistical estimation, and optimization. Distributionally robust optimization (DRO) based on φ\varphi-divergence minimizes the maximal expected loss, where the maximum is over a φ\varphi-divergence uncertainty set. This paper introduces the \emph{extended} φ\varphi-divergence and the extended φ\varphi-divergence quadrangle, which integrates DRO into the FRQ framework. We derive the primal and dual representations of the quadrangle elements (risk, deviation, regret, error, and statistic). The dual representation provides an interpretation of classification, portfolio optimization, and regression as robust optimization based on the extended φ\varphi-divergence. The primal representation offers tractable formulations of these robust optimizations as convex optimization. We provide illustrative examples showing that many common problems, such as least-squares regression, quantile regression, support vector machines, and CVaR optimization, fall within this framework. Additionally, we conduct a case study to visualize the optimal solution of the inner maximization in robust optimization.

1283The adaptive complexity of log-concave sampling

[openreview] [pdf]

Abstract In large-data applications, such as the inference process of diffusion models, it is desirable to design sampling algorithms with a high degree of parallelization. In this work, we study the adaptive complexity of sampling, which is the minimal number of sequential rounds required to achieve sampling given polynomially many queries executed in parallel at each round. For unconstrained sampling, we examine distributions that are log-smooth or log-Lipschitz and log strongly or non-strongly concave. We show that an almost linear iteration algorithm cannot return a sample with a specific exponentially small accuracy under total variation distance. For box-constrained sampling, we show that an almost linear iteration algorithm cannot return a sample with sup-polynomially small accuracy under total variation distance for log-concave distributions. Our proof relies upon novel analysis with the characterization of the output for the hardness potentials based on the chain-like structure with random partition and classical smoothing techniques.

1284Discovering Long-Term Effects on Parameter Efficient Fine-tuning

[openreview] [pdf]

Abstract Pre-trained Artificial Neural Networks (ANNs) demonstrate robust pattern recognition abilities, closely mirroring the functionality of Biological Neural Networks (BNNs). We are particularly intrigued by these models’ capacity for acquiring new knowledge through fine-tuning, such, Parameter-efficient Fine-tuning (PEFT). Given that both ANNs and BNNs propagate information layer-by-layer, a useful analogy can be drawn: ANN weights correspond to synapses in BNNs, while features (latent variables or activations) parallel the neurotransmitters released by neurons. Building upon this clue, we delve deeper into exploring the connections between feature adjustment and weight adjustment, resulting in our proposed method Synapses & Neurons (SAN) that learns scaling matrices for features and propagates their effects towards posterior weight matrices. Our approach draws strong inspiration from well-known neuroscience phenomena - Long-term Potentiation (LTP) and Long-term Depression (LTD), which also reveal the relationship between synapse development and neurotransmitter release levels. We conducted extensive comparisons of PEFT on 26 datasets using attention-based networks as well as convolution-based networks, leading to significant improvements compared to other tuning methods, +8.5% over fully-finetune, +7% over Visual Prompt Tuning, and +3.2% over Low-Rank Adapter.

[openreview] [pdf]

Abstract It is commonly believed that Message Passing Neural Networks (MPNNs) struggle in link prediction settings due to limitations in their expressive power. Recent work has focused on developing more expressive model classes, which are capable of learning link representations through techniques such as labeling tricks, the inclusion of structural features, or the use of subgraph methods. These approaches have yielded significant performance improvements across a range of benchmark datasets. However, an interesting question remains: have we fully wrung out the performance by optimizing the other aspects of the training process? In this work, we present results that indicate that significant amounts of model performance have been left on the table by the use of easy negative-samples during training. We theoretically explore the generalization gap and excess risk to quantify the performance loss caused by easy negatives. Motivated by this analysis, we introduce Risk Aware Negative Sampling in Link Prediction (RANS), which efficiently performs dynamic hard-negative-mining. Empirical results show that a simple GCN augmented by RANS realizes between 20% and 50% improvements in predictive accuracy when compared with the same model trained with standard negative samples.

1286Generalizing Reasoning Problems to Longer Lengths

[openreview] [pdf]

Abstract Length generalization (LG) (or length extrapolation) is a challenging problem in learning to reason. It refers to the phenomenon that when trained on reasoning problems of smaller lengths/sizes, the model struggles with problems of larger sizes or lengths. Although researchers have proven that reasoning can be learned if the intermediate reasoning steps (also known as chain-of-thought (CoT)) are given in the training data, their studies only apply to within a given length (interpolation), while LG is about extrapolation beyond the given length. This paper proposes an LG theory. It first introduces a theorem to show the LG problem’s root cause, highlighting what is necessary to resolve it. It then proposes and proves a sufficient condition, called (n, r)-consistency, under which LG can be achieved. Specifically, the theory says that if the CoT representation of a class of reasoning problems can satisfy the condition, LG is achievable for the class of problems. In the experimental evaluation, we present CoT representations based on the proposed theory to learn to solve challenging reasoning problems like arithmetic, parity, addition, multiplication, and division using a Transformer to achieve perfect LG.

1287Private Learning Fast and Slow: Two Algorithms for Prediction with Expert Advice Under Local Differential Privacy

[openreview] [pdf]

Abstract We study the classic problem of prediction with expert advice under the constraint of differential privacy (DP). In contrast to earlier work in this area, we are interested in distributed settings with no trusted central curator. In this context, we first show that a classical online learning algorithm naturally satisfies DP and then design two new algorithms that extend and improve it: (1) RW-AdaBatch, which provides a novel form of privacy amplification at negligible utility cost, and (2) RW-Meta, which improves utility on non-adversarial data with zero privacy cost. Our theoretical analysis is supported by an empirical evaluation using real-world data reported by hospitals during the COVID-19 pandemic. RW-Meta outperforms the classical baseline at predicting which hospitals will report a high density of COVID-19 cases by a factor of more than 2×\times at realistic privacy levels.

1288Communication-Efficient Federated Learning via Model Update Distillation

[openreview] [pdf]

Abstract Federated learning (FL) is a popular distributed machine learning framework for edge computing. However, it faces a significant challenge: the communication overhead caused by frequent model updates between clients and the central server. Previous studies have overlooked a crucial piece of information: the central server already knows the initial model on each client before local training begins in every round. This oversight leads to significant redundancy in communication, as full model information are transmitted unnecessarily. To address this, we propose a novel framework called \textit{model update distillation} (MUD), which leverages this prior knowledge to decouple model parameters from the network architecture. Instead of transmitting raw parameter updates, our method synthesizes and transmits compact tensor sequences that encode only the essential information for synchronization. This dramatically reduces communication overhead while still allowing recipients to accurately reconstruct the intended model updates. Extensive experimental results demonstrate that FedMUD achieves substantial improvements in communication efficiency, making it a highly effective solution for federated learning in bandwidth-constrained environments. The PyTorch-like core code can be found in \ref{alg: pytorch}.

1289Policy optimization emerges from noisy representation learning

[openreview] [pdf]

Abstract Nervous systems learn representations of the world and policies to act within it. We present a framework that uses reward-dependent noise to facilitate policy optimization in representation learning networks. These networks balance extracting normative features and task-relevant information to solve tasks. Moreover, their representation changes reproduce several experimentally observed shifts in the neural code during task learning. Our framework presents a biologically plausible mechanism for emergent policy optimization amid evidence that representation learning plays a vital role in governing neural dynamics.

1290Enhancing Optimizer Stability: Momentum Adaptation of NGN Step-size

[openreview] [pdf]

Abstract Modern optimization algorithms that incorporate momentum and adaptive step-size offer improved performance in various challenging Deep Learning tasks. However, their effectiveness is often highly sensitive to the choice of hyper-parameters, especially the learning rate. Tuning these parameters is often difficult, resource-intensive, and time-consuming. State-of-the-art optimization algorithms incorporating momentum and adaptive step size are the algorithms of choice in several challenging Deep Learning domains. However, their effectiveness is frequently dependent on selecting the right hyper-parameters, especially the learning rate. Therefore, recent efforts have been directed toward enhancing the stability of optimizers across a wide range of hyper-parameter choices (Schaipp et al., 2024). In this paper, we introduce an algorithm that matches the performance of state-of-the-art optimizers while improving stability through a novel adaptation of the NGN step-size method (Orvieto & Xiao, 2024). Specifically, we propose a momentum-based version (NGN-M) that attains the standard convergence rate of O(1/K)\mathcal{O}(1/\sqrt{K}) under common assumptions, without the need for interpolation condition or assumptions of bounded stochastic gradients or iterates, in contrast to previous approaches. Additionally, we empirically demonstrate that the combination of the NGN step-size with momentum results in high robustness while delivering performance that is comparable to or surpasses other state-of-the-art optimizers.

1291A Benchmark Study For Limit Order Book (LOB) Models and Time Series Forecasting Models on LOB Data

[openreview] [pdf]

Abstract We present a comprehensive benchmark to evaluate the performance of deep learning models on limit order book (LOB) data. Our work makes four significant contributions: (i) We evaluate existing LOB models on a proprietary futures LOB dataset to examine the transferability of LOB model performance between various assets; (ii) We are the first to benchmark existing LOB models on the mid-price return forecasting (MPRF) task. (iii) We present the first benchmark study to evaluate SOTA time series forecasting models on the MPRF task to bridge the two fields of general-purpose time series forecasting and LOB time series forecasting; and (iv) we propose an architecture of convolutional cross-variate mixing layers (CVML) as an add-on to any deep learning multivariate time series model to significantly enhance MPRF performance on LOB data. Our empirical results highlight the value of our benchmark results on our proprietary futures LOB dataset, demonstrating a performance gap between the commonly used open-source stock LOB dataset and our futures dataset. Furthermore, the results demonstrate that LOB-aware model design is essential for achieving optimal prediction performance on LOB datasets. Most importantly, our results show that our proposed CVML architecture brings about an average improvement of 244.9% to various time series models’ mid-price return forecasting performance.

1292Your Agent Can Defend Itself against Backdoor Attacks

[openreview] [pdf]

Abstract Intelligent agents powered by large language models (LLMs) have gained surging popularity due to their versatile and customizable capabilities across diverse environments. However, recent studies also reveal their critical vulnerability: LLM agents are highly susceptible to backdoor attacks during training or fine-tuning. Such compromised agents can subsequently be manipulated to execute malicious operations when presented with specific triggers in their inputs or environments. To address this pressing risk, we present ReAgent, a novel defense against a range of backdoor attacks on LLM-based agents. Intuitively, backdoor attacks often result in inconsistencies among the user’s instruction, the agent’s planning, and its execution. Drawing on this insight, ReAgent employs a two-level approach to detect potential backdoors. At the execution level, ReAgent verifies consistency between the agent’s thoughts and actions; at the planning level, ReAgent leverages the agent’s capability to reconstruct the instruction based on its thought trajectory, checking for consistency between the reconstructed instruction and the user’s instruction. Extensive evaluation demonstrates ReAgent’s effectiveness against various backdoor attacks across diverse tasks. For instance, ReAgent reduces the attack success rate by up to 90% in database operation tasks, outperforming existing defenses by large margins. This work reveals the potential of utilizing compromised agents themselves to mitigate backdoor risks.

1293Insufficient Task Description can Impair In-context Learning: A Study from Information Perspective

[openreview] [pdf]

Abstract Transformers have demonstrated remarkable performance in a wide range of applications, making in-context learning an essential technique. In-context learning primarily relies on two types of information: in-context examples and task description. While previous research has extensively investigated the influence of in-context examples on learning behavior, the role of task description has not been adequately explored, despite their practical significance. In this paper, we present a study examining the impact of task description on the in-context learning performance of transformers. We devise a synthetic experiment setting, making the information of task description controllable. Through a series of well-designed experiments, we systematically vary task description information and assess the resulting effects on model performance across multiple tasks. Our findings reveal the double-side roles of task description: insufficient task description will lead the model to ignore in-context examples, resulting a poor in-context performance; once the information in task description surpasses a certain threshold, the impact of task description transfers from negative to positive, and a performance emergence can be observed. We further conduct the tasks on GPT-4 and observe a similar double-side impact. In conclusion, this study contributes to a deeper understanding of the in-context learning from a task description perspective.

1294ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities

[openreview] [pdf]

Abstract Forecasts of future events are essential inputs into informed decision-making. Machine learning (ML) systems have the potential to deliver forecasts at scale, but there is no framework for evaluating the accuracy of ML systems on a standardized set of forecasting questions. To address this gap, we introduce ForecastBench: a dynamic benchmark that evaluates the accuracy of ML systems on an automatically generated and regularly updated set of 1,000 forecasting questions. To avoid any possibility of data leakage, ForecastBench is comprised solely of questions about future events that have no known answer at the time of submission. We quantify the capabilities of current ML systems by collecting forecasts from expert (human) forecasters, the general public, and LLMs on a random subset of questions from the benchmark (N = 200). While LLMs have achieved super-human performance on many benchmarks, they perform less well here: expert forecasters outperform the top-performing LLM (p-value = 0.01). We display system and human scores in a public leaderboard atwww.anonymousurl.org.

1295Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

[openreview] [pdf]

Abstract In this paper, we study the problem of procedure planning in instructional videos, which involves making goal-directed plans based on current visual observations in unstructured, real-life videos. Prior research leverages different forms of supervision to bridge the gap between observed states and unobserved actions. Building on this foundation, we propose an innovative approach by introducing a latent space temporal logical interpolation module within the diffusion model framework. This module enables the intermediate supervision of temporal logical relationships that were previously nonexistent. In terms of details, we employ an interpolator to guide the intermediate process within the diffusion model, using the start and end observation features as inputs. This involves extracting latent features through an encoder and applying an interpolation strategy with transformer encoder blocks to derive the latent features. Furthermore, to ensure the accuracy of actions in the outputs, we implement a masking strategy to constrain the scope of predictions and a task-adaptive masked proximity loss for the training process. Results across these three datasets of varying scales demonstrate that our MTID model achieves state-of-the-art performance on the overwhelming majority of key metrics. The code is available athttps://anonymous.4open.science/r/MTID-E2E3/README.md.

1296Cross-Domain Reinforcement Learning via Preference Consistency

[openreview] [pdf]

Abstract Cross-domain reinforcement learning (CDRL) aims to utilize the knowledge acquired from a source domain to efficiently learn tasks in a target domain. Unsupervised CDRL assumes no access to any signal (e.g., rewards) from the target domain, and most methods utilize state-action correspondence or cycle consistency. In this work, we identify the critical correspondence identifiability issue (CII) that arises in existing unsupervised CDRL methods. To address this identifiability issue, we propose leveraging pairwise trajectory preferences in the target domain as weak supervision. Specifically, we introduce the principle of cross-domain preference consistency (CDPC)–a policy is more transferable across the domains if the source and target domains have similar preferences over trajectories–to provide additional guidance for establishing proper correspondence between the source and target domains. To substantiate the principle of CDPC, we present an algorithm that integrates a state decoder learned through preference consistency loss during training with a cross-domain MPC method for action selection during inference. Through extensive experiments in both MuJoCo and Robosuite, we demonstrate that CDPC enables effective and data-efficient knowledge transfer across domains, outperforming state-of-the-art CDRL benchmark methods.

1297Simultaneous Reward Distillation and Preference Learning: Get You a Language Model Who Can Do Both

[openreview] [pdf]

Abstract Reward modeling of human preferences is one of the cornerstones of building usable generative large language models (LLMs). While traditional RLHF-based alignment methods explicitly maximize the expected rewards from a separate reward model, more recent supervised alignment methods like Direct Preference Optimization (DPO) circumvent this phase to avoid problems including model drift and reward overfitting. Although popular due to its simplicity, DPO and similar direct alignment methods can still lead to degenerate policies, and rely heavily on the Bradley-Terry-based preference formulation to model reward differences between pairs of candidate outputs. This formulation is challenged by non-deterministic or noisy preference labels, for example human scoring of two candidate outputs is of low confidence. In this paper, we introduce DRDO (Direct Reward Distillation and policy-Optimization), a supervised knowledge distillation-based preference alignment method that simultaneously models rewards and preferences to avoid such degeneracy. DRDO directly mimics rewards assigned by an oracle while learning human preferences from a novel preference likelihood formulation. Our experimental results on the Ultrafeedback and TL;DR datasets demonstrate that policies trained using DRDO surpass previous methods such as DPO and e-DPO in terms of expected rewards and are more robust, on average, to noisy preference signals as well as out-of-distribution (OOD) settings.

1298Multi-Objective Alignment of LLMs with ORPO using Self-Judgement

[openreview] [pdf]

Abstract The alignment of Large Language Models (LLMs) is achieved through fine-tuning with human preference data, where preference optimization has become a critical part of the process. Many methods have scaled LLM performance by incorporating self-judgement, highlighting the importance of unifying LLM-as-a-judge with the alignment process. One such method, called Self-rewarding LLMs, iteratively samples new data from the model to improve alignment using self-judgement. Since this additional data is generated by the LLM, we argue that similar improvements can be achieved without new data. We propose a method that reuses alignment data in the form of a self-judgement classification task and defines a multi-objective optimization problem. Our self-judgement task is derived from a simple transformation of the primary alignment data, asking the LLM to select the superior response. It introduces no new data beyond the existing alignment data. Thus, we claim the improvements are due to positive interference between the two tasks. We focus on a direct preference optimization method called Odds-Ratio Preference Optimization (ORPO). We conduct a thorough study of linear scalarization on two objectives and introduce two alternative approaches that vary the emphasis on alignment versus self-judgement objectives. Our results on Mistral 7B indicate a promising direction for fine-tuning LLMs on multiple objectives, particularly for improving performance on related tasks without additional natural language data.

1299Decentralizing Test-time Adaptation under Heterogeneous Data Streams

[openreview] [pdf]

Abstract While Test-Time Adaptation (TTA) has shown promise in addressing distribution shifts between training and testing data, its effectiveness diminishes with heterogenous data streams due to uniform target estimation. As previous attempts merely stabilize model fine-tuning over time to handle continually changing environments, they fundamentally assume a homogeneous target domain at any moment, leaving the intrinsic real-world data heterogeneity unresolved. This paper delves into TTA under heterogeneous data streams, moving beyond current model-centric limitations. By revisiting TTA from a data-centric perspective, we discover that decomposing samples into Fourier space facilitates an accurate data separation across different frequency levels. Drawing from this insight, we propose a novel Frequency-based Decentralized Adaptation framework, which transitions data from globally heterogeneous to locally homogeneous in Fourier space and employs decentralized adaptation to manage diverse distribution shifts. Particularly, multiple local models are allowed to independently adjust to their specific data segments while periodically exchanging knowledge to form a cohesive global model. As such, not only can data diversity be captured, but also the overall model generalization can be enhanced across multiple distribution shifts. Importantly, we devise a novel Fourier-based augmentation strategy to assist in decentralizing adaptation, which selectively augments samples for each type of distribution shift and further enhances model robustness in complex real-world environments. Extensive experiments across various settings (corrupted, natural, and medical) demonstrate the superiority of our proposed framework over the state-of-the-arts.

1300Diffusion Models as Cartoonists! The Curious Case of High Density Regions

[openreview] [pdf]

Abstract We investigate what kind of images lie in the high-density regions of diffusion models. We introduce a theoretical mode-tracking process capable of pinpointing the exact mode of the denoising distribution, and we propose a practical high-probability sampler that consistently generates images of higher likelihood than usual samplers. Our empirical findings reveal the existence of significantly higher likelihood samples that typical samplers do not produce, often manifesting as cartoon-like drawings or blurry images depending on the noise level. Curiously, these patterns emerge in datasets devoid of such examples. We also present a novel approach to track sample likelihoods in diffusion SDEs, which remarkably incurs no additional computational cost.

1301SparseDM: Toward Sparse Efficient Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have been extensively used in data generation tasks and are recognized as one of the best generative models. However, their time-consuming deployment, long inference time, and requirements on large memory limit their application. In this paper, we propose a method based on the improved Straight-Through Estimator to improve the deployment efficiency of diffusion models. Specifically, we add sparse masks to the Convolution and Linear layers in a pre-trained diffusion model, then transfer learn the sparse model during the fine-tuning stage and turn on the sparse masks during inference. Experimental results on a Transformer and UNet-based diffusion models demonstrate that our method reduces MACs by 50% while increasing FID by only 0.44 on average. Sparse models are accelerated by approximately 1.2x on the GPU. Under other MACs conditions, the FID is also lower than 1 compared to other methods.

1302Preference-based Credit Assignment for Reinforcement Learning with Delayed and Noised Rewards

[openreview] [pdf]

Abstract Credit assignment has been utilized as a common technique for determining the past key state-action pairs and assigning corresponding rewards that have strong relevance with the final outputs in reinforcement learning, especially for environments with delayed rewards. However, current reward function design methods rely heavily on domain knowledge and may not accurately reflect the actual reward that should be received, which will lead to noised reward assignments during the credit assignment process and deteriorate the performance of the agent. To address this issue, in this paper, by leveraging the benefits of Preference-based Reinforcement Learning (PbRL), we propose a novel trajectory preference-based credit assignment method, where each trajectory is assigned to one of three different preferences according to its related delayed reward and the entire trajectory space. Then, a causal Transformer framework is introduced to predict the relevance between the decisions at each timestep and the different trajectory preferences to guide the credit assignment. Despite the unavoidable noised reward related to each trajectory, we demonstrate that our method can still effectively guide agents to learn superior strategies. Experiments on the Mujoco task and the treatment of sepsis under extremely delayed reward setting show that our method can mitigate the adverse effects resulting from the delayed noised rewards and provide effective guidelines for agents.

1303Instance-dependent Early Stopping

[openreview] [pdf]

Abstract In machine learning practice, early stopping has been widely used to regularize models and can save computational costs by halting the training process when the model’s performance on a validation set stops improving. However, conventional early stopping applies the same stopping criterion to all instances without considering their individual learning statuses, which leads to redundant computations on instances that are already well-learned. To further improve the efficiency, we propose an Instance-dependent Early Stopping (IES) method that adapts the early stopping mechanism from the entire training set to the instance level, based on the core principle that once the model has mastered an instance, the training on it should stop. IES considers an instance as mastered if the second-order differences of its loss value remain within a small range around zero. This offers a more consistent measure of an instance’s learning status compared with directly using the loss value, and thus allows for a unified threshold to determine when an instance can be excluded from further backpropagation. We show that excluding mastered instances from backpropagation can increase the gradient norms, thereby accelerating the decrease of the training loss and speeding up the training process. Extensive experiments on benchmarks demonstrate that IES method can reduce backpropagation instances by 10%-50% while maintaining or even slightly improving the test accuracy and transfer learning performance of a model.

1304LaMPlace: Learning to Optimize Cross-Stage Metrics in Macro Placement

[openreview] [pdf]

Abstract Machine learning techniques have shown great potential in enhancing macro placement, a critical stage in modern chip design. However, existing methods primarily focus ononlineoptimization ofintermediate surrogate metricsthat are available at the current placement stage, rather than directly targeting thecross-stage metrics---such as the timing performance---that measure the final chip quality. This is mainly because of the high computational costs associated with performing post-placement stages for evaluating such metrics, making theonlineoptimization impractical. Consequently, these optimizations struggle to align with actual performance improvements and can even lead to severe manufacturing issues. To bridge this gap, we proposeLaMPlace, whichLearnsaMask for optimizing cross-stage metrics in macro placement. Specifically, LaMPlace trains a predictor onofflinedata to estimate thesecross-stage metricsand then leverages the predictor to quickly generate a mask, i.e., a pixel-level feature map that quantifies the impact of placing a macro in each chip grid location on the design metrics. This mask essentially acts as a fast evaluator, enabling placement decisions based oncross-stage metricsrather thanintermediate surrogate metrics. Experiments on commonly used benchmarks demonstrate that LaMPlace significantly improves the chip quality across several key design metrics, achieving an average improvement of 9.6%, notably 43.0% and 30.4% in terms of WNS and TNS, respectively, which are two crucial cross-stage metrics that reflect the final chip quality in terms of the timing performance.

1305Gaussian Mixture Counterfactual Generator

[openreview] [pdf]

Abstract Generating synthetic control arms is a key challenge in clinical research, particularly in crossover trials where placebo data becomes unavailable after patients switch to active treatment. The absence of placebo data complicates estimating long-term efficacy and safety. To solve this, we propose a Gaussian mixture model that generates counterfactual data without needing control data for training. This method handles time-varying, continuous doses and estimates effects between treatment switchers and an extended placebo group, providing valuable insights for treatment effects, evidence generation, and decision-making.

1306Towards Domain Adaptive Neural Contextual Bandits

[openreview] [pdf]

Abstract Contextual bandit algorithms are essential for solving real-world decision making problems. In practice, collecting a contextual bandit’s feedback from different domains may involve different costs. For example, measuring drug reaction from mice (as a source domain) and humans (as a target domain). Unfortunately, adapting a contextual bandit algorithm from a source domain to a target domain with distribution shift still remains a major challenge and largely unexplored. In this paper, we introduce the first general domain adaptation method for contextual bandits. Our approach learns a bandit model for the target domain by collecting feedback from the source domain. Our theoretical analysis shows that our algorithm maintains a sub-linear regret bound even adapting across domains. Empirical results show that our approach outperforms the state-of-the-art contextual bandit algorithms on real-world datasets.

1307Does SGD really happen in tiny subspaces?

[openreview] [pdf]

Abstract Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. We observe similar behavior across practical setups, including the large learning rate regime (also known as Edge of Stability), Sharpness-Aware Minimization, momentum, and adaptive optimizers. We discuss the main causes and implications of this spurious alignment, shedding light on the dynamics of neural network training.

1308DrivingRecon: Large 4D Gaussian Reconstruction Model For Autonomous Driving

[openreview] [pdf]

Abstract Photorealistic 4D reconstruction of street scenes is essential for developing real-world simulators in autonomous driving. However, most existing methods perform this task offline and rely on time-consuming iterative processes, limiting their practical applications. To this end, we introduce the Large 4D Gaussian Reconstruction Model (DrivingRecon), a generalizable driving scene reconstruction model, which directly predicts 4D Gaussian from surround-view videos. To better integrate the surround-view images, the Prune and Dilate Block (PD-Block) is proposed to eliminate overlapping Gaussian points between adjacent views and remove redundant background points. To enhance cross-temporal information, dynamic and static decoupling is tailored to learn geometry and motion features better. Experimental results demonstrate that DrivingRecon significantly improves scene reconstruction quality and novel view synthesis compared to existing methods. Furthermore, we explore applications of DrivingRecon in model pre-training, vehicle adaptation, and scene editing. Our code will be made publicly available.

1309Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

[openreview] [pdf]

Abstract Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we showcase how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies. The key insight is to use unlabelled trajectories twice, 1) to extract a set of low-level skills offline, and 2) as additional data for a high-level policy that composes these skills to explore. We utilize a simple strategy of learning an optimistic reward model from online samples, and relabeling past trajectories into high-level, task-relevant examples. We instantiate these insights as SUPE (Skills from Unlabeled Prior data for Exploration), and empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.

1310Online Pre-Training for Offline-to-Online Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement Learning (RL) has achieved notable success in tasks requiring complex decision making, with offline RL offering the ability to train agents using fixed datasets, thereby avoiding the risks and costs associated with online interactions. However, offline RL is inherently limited by the quality of the dataset, which can restrict an agent’s performance. Offline-to-online RL aims to bridge the gap between the cost-efficiency of offline RL and the performance potential of online RL by pre-training an agent offline before fine-tuning it through online interactions. Despite its promise, recent studies show that offline pre-trained agents often underperform during online fine-tuning due to inaccurate value function, with random initialization proving more effective in certain cases. In this work, we propose a novel method, Online Pre-Training for Offline-to-Online RL (OPT), to address the issue of inaccurate value estimation in offline pre-trained agents. OPT introduces a new learning phase, Online Pre-Training, which allows the training of a new value function that enhances the subsequent fine-tuning process. Implementation of OPT on TD3 and SPOT demonstrates an average 30% improvement in performance across D4RL environments, such as MuJoCo, Antmaze, and Adroit.

1311Hierarchical Object-Oriented POMDP Planning for Object Rearrangement

[openreview] [pdf]

Abstract We present an online planning framework for solving multi-object rearrangement problems in partially observable, multi-room environments. Current object rearrangement solutions, primarily based on Reinforcement Learning or hand-coded planning methods, often lack adaptability to diverse challenges. To address this limitation, we introduce a novel Hierarchical Object-Oriented Partially Observed Markov Decision Process (HOO-POMDP) planning approach. This approach comprises of (a) an object-oriented POMDP planner generating sub-goals, (b) a set of low-level policies for sub-goal achievement, and (c) an abstraction system converting the continuous low-level world into a representation suitable for abstract planning. We evaluate our system on varying numbers of objects, rooms, and problem types in AI2-THOR simulated environments with promising results.

1312The Complexity Dynamics of Grokking

[openreview] [pdf]

Abstract We investigate the phenomenon of generalization through the lens of compression. In particular, we study the complexity dynamics of neural networks to explain \emph{grokking}, where networks suddenly transition from memorizing to generalizing solutions long after over-fitting the training data. To this end we introduce a new measure of intrinsic complexity for neural networks based on the theory of Kolmogorov complexity. Tracking this metric throughout network training, we find a consistent pattern in training dynamics, consisting of a rise and fall in complexity. We demonstrate that this corresponds to memorization followed by generalization. Based on insights from rate--distortion theory and the minimum description length principle, we lay out a principled approach to lossy compression of neural networks, and connect our complexity measure to explicit generalization bounds. Based on a careful analysis of information capacity in neural networks, we propose a new regularization method which encourages networks towards low-rank representations by penalizing their spectral entropy, and find that our regularizer outperforms baselines in total compression of the dataset.

1313On the Convergence of Adam under Non-uniform Smoothness: Separability from SGDM and Beyond

[openreview] [pdf]

Abstract This paper aims to clearly distinguish between Stochastic Gradient Descent with Momentum (SGDM) and Adam in terms of their convergence rates. We demonstrate that Adam achieves a faster convergence compared to SGDM under the condition of non-uniformly bounded smoothness. Our findings reveal that: (1) in deterministic environments, Adam can attain the known lower bound for the convergence rate of deterministic first-order optimizers, whereas the convergence rate of Gradient Descent with Momentum (GDM) has higher order dependence on the initial function value; (2) in stochastic setting, Adam’s convergence rate upper bound matches the lower bounds of stochastic first-order optimizers, considering both the initial function value and the final error, whereas there are instances where SGDM fails to converge with any learning rate. These insights distinctly differentiate Adam and SGDM regarding their convergence rates. Additionally, by introducing a novel stopping-time based technique, we further prove that if we consider the minimum gradient norm during iterations, the corresponding convergence rate can match the lower bounds across all problem hyperparameters. The technique can also help proving that Adam with a specific hyperparameter scheduler is parameter-agnostic, which hence can be of independent interest.

1314Are Large Language Models Truly Democratizing Financial Knowledge? Identifying Knowledge Gaps

[openreview] [pdf]

Abstract Large Language Models (LLMs) are frequently utilized as sources of knowledge for question-answering. While it is known that LLMs may lack access to real-time data or newer data produced after the model’s cutoff date, it is less clear how their knowledge spans acrosshistoricalinformation. In this study, we assess the breadth of LLMs’ knowledge using financial data of U.S. publicly traded companies by evaluating more than 190k questions and comparing model responses to factual data. We further explore the impact of company characteristics, such as size, retail investment, institutional attention, and readability of financial filings, on the accuracy of knowledge represented in LLMs. Our results reveal that LLMs are less informed about past financial performance, but they display a stronger awareness of larger companies and more recent information. Interestingly, at the same time, our analysis also reveals that LLMs are more likely to hallucinate for larger companies, especially for data from more recent years. We will make the code, prompts, and model outputs public upon the publication of the work.

1315Data-Driven Discovery of PDEs via the Adjoint Method

[openreview] [pdf]

Abstract In this work, we present an adjoint-based method for discovering the underlying governing partial differential equations (PDEs) given data. The idea is to consider a parameterized PDE in a general form and formulate a PDE-constrained optimization problem aimed at minimizing the error of the PDE solution from data. Using variational calculus, we obtain an evolution equation for the Lagrange multipliers (adjoint equations) allowing us to compute the gradient of the objective function with respect to the parameters of PDEs given data in a straightforward manner. In particular, we consider a family of parameterized PDEs encompassing linear, nonlinear, and spatial derivative candidate terms, and elegantly derive the corresponding adjoint equations. We show the efficacy of the proposed approach in identifying the form of the PDE up to machine accuracy, enabling the accurate discovery of PDEs from data. We also compare its performance with the famous PDE Functional Identification of Nonlinear Dynamics method known as PDE-FIND [Rudy, Samuel H., et al. Science advances 3.4 (2017): e1602614.], on both smooth and noisy data sets. Even though the proposed adjoint method relies on forward/backward solvers, it outperforms PDE-FIND for large data sets thanks to the analytic expressions for gradients of the cost function with respect to each PDE parameter.

1316Second Order Bounds for Contextual Bandits with Function Approximation

[openreview] [pdf]

Abstract Many works have developed algorithms no-regret algorithms for contextual bandits with function approximation, where the mean rewards over context-action pairs belongs to a function class F\mathcal{F}. Although there are many approaches to this problem, one that has gained in importance is the use of algorithms based on the optimism principle such as optimistic least squares. It can be shown the regret of this algorithm scales as O~(deluder(F)log(F)T)\widetilde{\mathcal{O}}\left(\sqrt{d_{\mathrm{eluder}}(\mathcal{F}) \log(\mathcal{F}) T }\right) where deluder(F)d_{\mathrm{eluder}}(\mathcal{F}) is a statistical measure of the complexity of the function class F\mathcal{F} known as eluder dimension. Unfortunately, even if the variance of the measurement noise of the rewards at time tt equals σt2\sigma_t^2 and these are close to zero, the optimistic least squares algorithm’s regret scales with T\sqrt{T}. In this work we are the first to develop algorithms that satisfy regret bounds for contextual bandits with function approximation of the form O~(σlog(F)deluder(F)T+deluder(F)log(F))\widetilde{\mathcal{O}}\left( \sigma \sqrt{\log(\mathcal{F})d_{\mathrm{eluder}}(\mathcal{F}) T } + d_{\mathrm{eluder}}(\mathcal{F}) \cdot \log(|\mathcal{F}|)\right) when the variances are unknown and satisfy σt2=σ\sigma_t^2 = \sigma for all tt and O~(deluder(F)log(F)t=1Tσt2+deluder(F)log(F))\widetilde{\mathcal{O}}\left( d_{\mathrm{eluder}}(\mathcal{F})\sqrt{\log(\mathcal{F})\sum_{t=1}^T \sigma_t^2 } + d_{\mathrm{eluder}}(\mathcal{F}) \cdot \log(|\mathcal{F}|)\right) when the variances change every time-step. These bounds generalize existing techniques for deriving second order bounds in contextual linear problems.

1317Privacy as a Free Lunch: Crafting Initial Distilled Datasets through the Kaleidoscope

[openreview] [pdf]

Abstract The advancement of deep learning necessitates stringent data privacy guarantees. Dataset distillation has shown potential in preserving differential privacy while maintaining training efficiency. This study first identifies that data generated by state-of-the-art dataset distillation methods strongly resembles to real data, indicating severe privacy leakage. We define this phenomenon as explicit privacy leakage. We theoretically analyze that although distilled datasets can ensure differential privacy to some extent, a high \IPC can weaken both differential privacy and explicit privacy. Furthermore, we reveal that the primary source of privacy leakage in distilled data stems from the common approach of initializing distilled images as real data. To address this, we propose a plug-and-play module, Kaleidoscopic Transformation (KT), designed to introduce enhanced strong perturbations to the selected real data during the initialization phase. Extensive experiments demonstrate that our method ensures both differential privacy and explicit privacy, while preserving the generalization performance of the distilled data. Our code will be publicly available.

1318ICAM: Rethinking Instance-Conditioned Adaptation in Neural Vehicle Routing Solver

[openreview] [pdf]

Abstract The neural combinatorial optimization (NCO) has shown great potential for solving routing problems without requiring expert knowledge. However, existing constructive NCO methods still struggle to directly solve large-scale instances, which significantly limits their application prospects. To address these crucial shortcomings, this work proposes a novel Instance-Conditioned Adaptation Model (ICAM) for better large-scale generalization of neural routing solvers. In particular, we design a simple yet efficient instance-conditioned adaptation function to significantly improve the generalization performance of existing NCO models with a very small time and memory overhead. In addition, with a systematic investigation on the performance of information incorporation between different attention mechanisms, we further propose a powerful yet low-complexity instance-conditioned adaptation module to generate better solutions for instances across different scales. Experimental results show that our proposed method is capable of obtaining promising results with a very fast inference time in solving Traveling Salesman Problems (TSPs), Capacitated Vehicle Routing Problems (CVRPs) and Asymmetric Traveling Salesman Problems (ATSPs). To the best of our knowledge, our model achieves state-of-the-art performance among all RL-based constructive methods for TSPs and ATSPs with up to 1,000 nodes and extends state-of-the-art performance to 5,000 nodes on CVRP instances, and our method also generalizes well to solve cross-distribution instances.

1319Decentralized Finite-Sum Optimization over Time-Varying Networks

[openreview] [pdf]

Abstract We consider decentralized time-varying stochastic optimization problems where each of the functions held by the nodes has a finite sum structure. Such problems can be efficiently solved using variance reduction techniques. Our aim is to explore the lower complexity bounds (for communication and number of stochastic oracle calls) and find optimal algorithms. The paper studies strongly convex and nonconvex scenarios. To the best of our knowledge, variance reduced schemes and lower bounds for time-varying graphs have not been studied in the literature. For nonconvex objectives, we obtain lower bounds and develop an optimal method GT-PAGE. For strongly convex objectives, we propose the first decentralized time-varying variance-reduction method ADOM+VR and establish lower bound in this scenario, highlighting the open question of matching the algorithms complexity and lower bounds even in static network case.

1320Selective Attention Improves Transformer

[openreview] [pdf]

Abstract Unneeded elements in the attention’s context degrade performance. We introduce Selective Attention, a simple parameter-free change to the standard attention mechanism which reduces attention to unneeded elements. Selective attention consistently improves language modeling performance across model sizes and context lengths. For example, a range of transformers trained with the language modeling objective on C4 with selective attention perform equivalently to transformers with standard attention modules with ~2X more parameters and heads. In addition, selective attention allows reducing the size of the attention’s context buffer, leading to substantial reductions in the memory and compute requirements during inference. For example, transformers with 100M parameters and context sizes of 512, 1,024, and 2,048 need 16X, 25X, and 47X less memory for their attention module, respectively, when equipped with selective attention, as those without selective attention, with the same validation perplexity.

1321ReDeEP: Detecting Hallucination in Retrieval-Augmented Generation via Mechanistic Interpretability

[openreview] [pdf]

Abstract Retrieval-Augmented Generation (RAG) models are designed to incorporate external knowledge, reducing hallucinations caused by insufficient parametric (internal) knowledge. However, even with accurate and relevant retrieved content, RAG models can still produce hallucinations by generating outputs that conflict with the retrieved information. Detecting such hallucinations requires disentangling how Large Language Models (LLMs) balance external and parametric knowledge. Current detection methods often focus on one of these mechanisms or without decoupling their intertwined effects, making accurate detection difficult. In this paper, we investigate the internal mechanisms behind hallucinations in RAG scenarios. We discover hallucinations occur when theKnowledge FFNsin LLMs overemphasize parametric knowledge in the residual stream, whileCopying Headsfail to effectively retain or integrate external knowledge from retrieved content. Based on these findings, we proposeReDeEP, a novel method that detects hallucinations by decoupling LLM’s utilization of external context and parametric knowledge. Our experiments show that ReDeEP significantly improves RAG hallucination detection accuracy. Additionally, we introduce AARF, which mitigates hallucinations by modulating the contributions of Knowledge FFNs and Copying Heads.

1322Policy-aware Reward Modeling with Uncertainty-Gradient based Data Augmentation

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has emerged as a standard and effective approach for training large language models (LLMs) with human preferences. In this framework, a learned reward model approximates human preferences and guides policy optimization, making it crucial to develop an accurate reward model. However, without the ``true’’ reward function, challenges arise when the reward model is an imperfect proxy for human preference. Since the policy optimization continuously shifts the human preference training dataset’s distribution. The fixed reward model suffers from this problem of off-distribution, especially the on policy methods. While collecting new preference data can mitigate this issue, it is costly and challenging to optimize. Thus, reusing the policy interaction samples becomes a possible way to further refine the reward model. To tackle these challenges, we introduce a novel method \textbf{U}ncertainty-\textbf{G}radient based \textbf{D}ata \textbf{A}ugmentation (\textbf{UGDA} for short) to enhance reward modeling by leveraging policy samples to maintain on-distribution performance. Specifically, UGDA selects interaction samples based on the uncertainty of the reward ensembles and the gradient based influence of policy optimization. After the reward relabeling of selected samples, we use supervised learning to refine the reward ensembles, then get the retrained policy. Extensive experiments demonstrate that by leveraging UGDA to select a few samples without the costly human preference data collection, we can improve the ability of the policy and surpass the state-of-the-art methods.

1323Reward-free World Models for Online Imitation Learning

[openreview] [pdf]

Abstract Imitation learning (IL) enables agents to acquire skills directly from expert demonstrations, providing a compelling alternative to reinforcement learning. However, prior online IL approaches struggle with complex tasks characterized by high-dimensional inputs and complex dynamics. In this work, we propose a novel approach to online imitation learning that leverages reward-free world models. Our method learns environmental dynamics entirely in latent spaces without reconstruction, enabling efficient and accurate modeling. We adopt the inverse soft-Q learning objective, reformulating the optimization process in the Q-policy space to mitigate the instability associated with traditional optimization in reward-policy space. By employing a learned latent dynamics model and planning for control, our approach consistently achieves stable, expert-level performance in tasks with high-dimensional observation or action spaces and intricate dynamics. We evaluate our method on a diverse set of benchmarks, including DMControl, MyoSuite, and ManiSkill2, demonstrating superior empirical performance compared to existing approaches.

1324Interpretable Contrastive Monte Carlo Tree Search Reasoning

[openreview] [pdf]

Abstract We propose (S)peculative(C)ontrastive\textbf{(S)}peculative \textbf{(C)}ontrastive MCTS\textbf{MCTS}^\mathbf{*}: a novel Monte Carlo Tree Search (MCTS) reasoning algorithm for Large Language Models (LLMs), significantly improves both reasoning accuracy and speed. Our motivation comes from: 1. Previous MCTS LLM reasoning works often overlooked its biggest drawback—slower speed compared to CoT; 2. Previous research mainly used MCTS as a tool for LLM reasoning on various tasks with limited quantitative analysis or ablation studies of its components from reasoning interpretability perspective. 3. The reward model is the most crucial component in MCTS, however previous work has rarely conducted in-depth study or improvement of MCTS’s reward models. Thus, we conducted extensive ablation studies and quantitative analysis on components of MCTS, revealing the impact of each component on the MCTS reasoning performance of LLMs. Building on this, (i) we designed a highly interpretable reward model based on the principle of contrastive decoding and (ii) achieved an average speed improvement of 51.9% per node using speculative decoding. Additionally, (iii) we improved UCT node selection strategy and backpropagation used in previous works, resulting in significant performance improvement. We outperformed o1-mini by an average of 17.4% on the Blocksworld multi-step reasoning dataset using Llama-3.1-70B with SC-MCTS*.

1325DESIRE: Dynamic Knowledge Consolidation for Rehearsal-Free Continual Learning

[openreview] [pdf]

Abstract Continual learning aims to equip models with the ability to retain previously learned knowledge like a human. Recent work incorporating Parameter-Efficient Fine-Tuning has revitalized the field by introducing lightweight extension modules. However, existing methods usually overlook the issue of information leakage caused by the fact that the experiment data have been used in pre-trained models. Once these duplicate data are removed in the pre-training phase, their performance can be severely affected. In this paper, we propose a new LoRA-based rehearsal-free method named DESIRE\textbf{DESIRE}. Our method avoids imposing additional constraints during training to mitigate catastrophic forgetting, thereby maximizing the learning of new classes. To integrate knowledge from old and new tasks, we propose two efficient post-processing modules. On the one hand, we retain only two sets of LoRA parameters for merging and propose dynamic representation consolidation to calibrate the merged feature representation. On the other hand, we propose decision boundary refinement to address classifier bias when training solely on new class data. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multiple datasets and strikes an effective balance between stability and plasticity. Our code will be publicly available.

1326Controlling Forgetting with Test-Time Data in Continual Learning

[openreview] [pdf]

Abstract Foundational vision-language models have shown impressive performance on various downstream tasks. Yet, there is still a pressing need to update these models later as new tasks or domains become available. Ongoing Continual Learning (CL) research provides techniques to overcome catastrophic forgetting of previous information when new knowledge is acquired. To date, CL techniques focus only on the supervised training sessions. This results in significant forgetting yielding inferior performance to even the prior model zero shot performance. In this work, we argue that test-time data hold great information that can be leveraged in a self supervised manner to refresh the model’s memory of previous learned tasks and hence greatly reduce forgetting at no extra labelling cost. We study how unsupervised data can be employed online to improve models’ performance on prior tasks upon encountering representative samples. We propose a simple yet effective student-teacher model with gradient based sparse parameters updates and show significant performance improvements and reduction in forgetting, which could alleviate the role of an offline episodic memory/experience replay buffer.

1327The Invariance Starvation Hypothesis

[openreview] [pdf]

Abstract Deep neural networks are known to learn and rely on spurious correlations during training, preventing them from being reliable and able to solve highly complex problems. While there exist many proposed solutions that overcome such reliance in different, tailored settings, current understanding regarding the formation of spurious correlations is limited. All proposed solutions with promising results assume that networks trained with empirical risk minimization will learn spurious correlations due to a preference for simpler features and that a solution to this problem requires further processing on the networks’ learned representations or re-training on a modified dataset where the proportion of training data with spurious features is significantly lower. In this paper, we aim to form a better understanding regarding the formation of spurious correlations by performing a rigorous study regarding the role that data plays in the formation of spurious correlations. We show that in reasoning tasks with simple input samples, simply drawing more data from the same training distribution overcomes spurious correlations, even though we maintain the proportion of samples with spurious features. In other words, we find that if the network has enough data to encode the invariant function appropriately, it no longer relies on spurious features, regardless of its strength. We observe the same results in settings with more complex distributions with an intractable number of participating features, such as vision and language. However, we find that in such settings, drawing more samples from the training distribution while maintaining proportion can exacerbate spurious correlations at times, due to the introduction of new samples that are significantly different from samples in the original training set. Taking inspiration from reasoning tasks, we present an effective remedy to this problem to ensure that drawing more samples from the distribution always overcomes spurious correlations.

1328Leveraging Object Detection for Diverse and Accurate Long-Horizon Events Forecasting

[openreview] [pdf]

Abstract Long-horizon event forecasting is critical across various domains, including retail, finance, healthcare, and social networks. Traditional methods, such as Marked Temporal Point Processes (MTPP), often rely on autoregressive models to predict multiple future events. However, these models frequently suffer from issues like converging to constant or repetitive outputs, which limits their effectiveness and general applicability. To address these challenges, we introduce DeTPP (Detection-based Temporal Point Processes), a novel approach inspired by object detection techniques from computer vision. DeTPP employs a unique matching-based loss function that selectively prioritizes reliably predictable events, improving the accuracy and diversity of predictions during inference. Our method establishes a new state-of-the-art in long-horizon event forecasting, achieving up to a 77% relative improvement over existing MTPP and next-K methods. The proposed hybrid approach enhances the accuracy of next event prediction by up to 2.7% on a large transactional dataset. Notably, DeTPP is also among the fastest methods for inference.

1329Flat Reward in Policy Parameter Space Implies Robust Reinforcement Learning

[openreview] [pdf]

Abstract Investigating flat minima on loss surfaces in parameter space is well-documented in the supervised learning context, highlighting its advantages in model generalization. However, limited attention has been paid to the reinforcement learning (RL) context, where the impact of flatter reward in policy parameter space remains mostly unexplored. Beyond the naive guessing from the lesson of supervised learning, which makes us anticipate a link from flat rewards to the enhanced generalization, we here aim to formally bridge the flatness in reward surface to the robustness of RL models. For a policy model case, where the deep model determines actions, the flatter behavior of rewards against the parameter perturbations primarily leads to consistent rewards against perturbed actions. Moreover, the action robustness further affects to the robustness against other variations from the changes of state transition probabilities and reward functions. We extensively simulate various RL environments, confirming the consistent gains of flatter reward in bolstering the robustness of RL in varying circumstances, including action selection, transition dynamics, and reward functions.

1330Industrial Benchmarking of LLMs: Assessing Hallucination in Traffic Incident Scenarios with a Novel Spatio-Temporal Dataset

[openreview] [pdf]

Abstract Large language models (LLMs) hold revolutionary potential to digitize and enhance the Health & Public Services (H&PS) industry. Despite their advanced linguistic abilities, concerns about accuracy, stability, and traceability still persist, especially in high-stakes areas such as transportation systems. Moreover, the predominance of English in LLM development raises questions about how they perform in non-English contexts.This study introduces a novel cross-lingual benchmark dataset comprising nearly 99,869 real traffic incident records from Vienna (2013-2023) to assess the robustness of state-of-the-art LLMs (>9) in the spatio and temporal domain of traffic incident classification. We then explored three hypotheses — sentence indexing, date-to-text conversion, and German-to-English translation — and incorporated Retrieval Augmented Generation (RAG) to further examine the models’ ability to handle hallucinations in both spatial and temporal contexts.Our experiments with GPT-4 and Llama models reveal significant performance disparities across these hypotheses in the spatio-temporal domain and also demonstrate how RAG can mitigate what types of hallucinations. These findings underscore the need for enhanced cross-lingual capabilities and improved explainability in LLMs. We provide open access to our Health & Public Services (H&PS) traffic incident dataset, with the project demo and code available atWebsite.

1331Combinatorial Reinforcement Learning with Preference Feedback

[openreview] [pdf]

Abstract In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action—an assortment of multiple items—to a user, whose preference feedback follows a multinomial logit (MNL) model. This framework allows us to model real-world scenarios, particularly those involving long-term user engagement, such as in recommender systems and online advertising. However, this framework faces two main challenges: (1) the unknown value of each item, unlike traditional MNL bandits (which only account for single-step preference feedback), and (2) computational complexity due to the combinatorial action space. In this paper, we assume a contextual MNL preference model, where mean utilities are linear, and the value of each item is approximated using general function approximation. We propose an algorithm, MNL-VQQL, that addresses these challenges, making it both computationally and statistically efficient. As a special case, for linear MDPs (with the MNL preference model), we establish a regret lower bound and show that MNL-VQQL achieves near-optimal regret. To the best of our knowledge, this is the first work to provide statistical guarantees in combinatorial RL with preference feedback.

1332Gradient-Free Adversarial Attack on Time Series Regression: Targeting XAI Explanations

[openreview] [pdf]

Abstract Explainable Artificial Intelligence (XAI) sheds light on the decision-making ground of black-box models by offering explanations. These explanations need to be robust for trustworthy time series regression applications in high-stake areas like medicine or finance, which yet remains largely unexplored. Furthermore, most adversarial attack methods currently rely on white-box strategies, which require access to gradient information from both the model and the XAI method. In real-world scenarios, such information is often difficult or impossible to obtain. To address these challenges, we propose a novel gradient-free adversarial attack method specifically designed for time series explanations, targeting non-differentiable XAI techniques. To enhance the effectiveness of our method for time series data, we introduce an attack objective function based on Dynamic Time Warping (DTW). Additionally, we implement an explanation-based local attack strategy, which ensures that the adversarial perturbations remain imperceptible within the time series data. In our experiments, we generate adversarial examples to attack four different XAI methods across three black-box models, using two time series datasets. The results reveal the vulnerability of current non-differentiable XAI methods. Furthermore, by comparing our approach with existing attack methods, we demonstrate the superiority of our proposed objective function and local attack strategy.

1333Penalizing Infeasible Actions and Reward Scaling in Reinforcement Learning with Offline Data

[openreview] [pdf]

Abstract Reinforcement learning with offline data often suffers from Q-value extrapolation errors due to limited data, which poses significant challenges and limits overall performance. Existing methods such as layer normalization and reward relabeling have shown promise in addressing these errors and achieving empirical improvements. In this paper, we extend these approaches by introducing reward scaling with layer normalization (RS-LN) to further mitigate extrapolation errors and enhance performance. Furthermore, based on the insight that Q-values should be lower for infeasible action spaces—where neural networks might otherwise extrapolate into undesirable regions—we propose a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS on a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning across the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.

1334Exploring The Forgetting in Adversarial Training: A Novel Method for Enhancing Robustness

[openreview] [pdf]

Abstract In recent years, there has been an explosion of research into developing robust deep neural networks against adversarial examples. As one of the most successful methods, Adversarial Training (AT) has been widely studied before, but there is still a gap to achieve promising clean and robust accuracy for many practical tasks. In this paper, we consider the AT problem from a new perspective which connects it to catastrophic forgetting in continual learning (CL). Catastrophic forgetting is a phenomenon in which neural networks forget old knowledge upon learning a new task. Although AT and CL are two different problems, we show that they actually share several key properties in their training processes. Specifically, we conduct an empirical study and find that this forgetting phenomenon indeed occurs in adversarial robust training across multiple datasets (SVHN, CIFAR-10, CIFAR-100, and TinyImageNet) and perturbation models (\ell_{\infty} and 2\ell_{2}). Based on this observation, we propose a novel method called Adaptive Multi-teachers Self-distillation (AMS), which leverages a carefully designed adaptive regularizer to mitigate the forgetting by aligning model outputs between new and old ``stages’'. Moreover, our approach can be used as a unified method to enhance multiple different AT algorithms. Our experiments demonstrate that our method can significantly enhance robust accuracy and meanwhile preserve high clean accuracy, under several popular adversarial attacks (e.g., PGD, CW, and Auto Attacks). As another benefit of our method, we discover that it can largely alleviate the robust overfitting issue of AT in our experiments.

1335Learning DAGs and Root Causes from Time-Series Data

[openreview] [pdf]

Abstract We introduce DAG-TFRC, a novel method for learning directed acyclic graphs (DAGs) from time series with few root causes. By this, we mean that the data are generated by a small number of events at certain, unknown nodes and time points under a structural vector autoregression model. For such data, we (i) learn the DAGs representing both the instantaneous and time-lagged dependencies between nodes, and (ii) discover the location and time of the root causes. For synthetic data with few root causes, DAG-TFRC shows superior performance in accuracy and runtime over prior work, scaling up to thousands of nodes. Experiments on simulated and real-world financial data demonstrate the viability of our sparse root cause assumption. On S&P 500 data, DAG-TFRC successfully clusters stocks by sectors and discovers major stock movements as root causes.

1336On the Out-of-Distribution Generalization of Self-Supervised Learning

[openreview] [pdf]

Abstract In this paper, we focus on the out-of-distribution (OOD) generalization of self-supervised learning (SSL). By analyzing the mini-batch construction during SSL training phase, we first give one plausible explanation for SSL having OOD generalization. Then, from the perspective of data generation and causal inference, we analyze and conclude that SSL learns spurious correlations during the training process, which leads to a reduction in OOD generalization. To address this issue, we propose a post-intervention distribution (PID) grounded in the Structural Causal Model. PID offers a scenario where the relationships between variables are free from the influence of spurious correlations. Besides, we demonstrate that if each mini-batch during SSL training satisfies PID, the resulting SSL model can achieve optimal worst-case OOD performance. This motivates us to develop a batch sampling strategy that enforces PID constraints through the learning of a latent variable model. Through theoretical analysis, we demonstrate the identifiability of the latent variable model and validate the effectiveness of the proposed sampling strategy. Experiments conducted on various downstream OOD tasks demonstrate the effectiveness of the proposed sampling strategy.

1337DivScene: Benchmarking LVLMs for Object Navigation with Diverse Scenes and Objects

[openreview] [pdf]

Abstract Object navigation in unknown environments is crucial for deploying embodied agents in real-world applications. While we have witnessed huge progress due to large-scale scene datasets, faster simulators, and stronger models, previous studies mainly focus on limited scene types and target objects. In this paper, we study a new task of navigating to diverse target objects in a large number of scene types. To benchmark the problem, we present a large-scale scene dataset, DivScene, which contains 4,614 scenes across 81 different types. With the dataset, we build an end-to-end embodied agent, NatVLM, by fine-tuning a Large Vision Language Model (LVLM) through imitation learning. The LVLM is trained to take previous observations from the environment and generate the next actions. We also introduce CoT explanation traces of the action prediction for better performance when tuning LVLMs. Our extensive experiments find that we can build a performant LVLM-based agent through imitation learning on the shortest paths constructed by a BFS planner without any human supervision. Our agent achieves a success rate that surpasses GPT-4o by over 20%. Meanwhile, we carry out various analyses showing the generalization ability of our agent.

1338Improved Noise Schedule for Diffusion Training

[openreview] [pdf]

Abstract Diffusion models have emerged as the de facto choice for generating high-quality visual content across multiple domains. However, training a single model to predict noise at multiple levels presents significant challenges, requiring numerous iterations and resulting in substantial computational costs. Various approaches, such as loss weighting strategy design and architectural refinements, have been introduced to expedite convergence and improve model performance. In this study, we propose a novel approach to design the noise schedule for enhancing the training of diffusion models. Our key insight is that the importance sampling of the logarithm of the Signal-to-Noise ratio (logSNR\log \text{SNR}), theoretically equivalent to a modified noise schedule, is particularly beneficial for training efficiency when increasing the sample frequency around logSNR=0\log \text{SNR}=0. This strategic sampling allows the model to focus on the critical transition point between signal dominance and noise dominance, potentially leading to more robust and accurate predictions. We empirically demonstrate the superiority of our noise schedule over the standard cosine schedule. Furthermore, we highlight the advantages of our noise schedule design on the ImageNet benchmark, showing that the designed schedule consistently benefits different prediction targets. Our findings contribute to the ongoing efforts to optimize diffusion models, potentially paving the way for more efficient and effective training paradigms in the field of generative AI.

1339High Probability Contextual Bandits for Optimal Dosage Selection

[openreview] [pdf]

Abstract Multi-Armed Bandit (MAB\textit{MAB}) formulations are commonly used to model the problem of Optimal Dose-Finding\textit{Optimal Dose-Finding}. However, in many practical applications, it is necessary to receive data about the patient’s current state and then administer a drug dosage adapted to that state. To overcome this issue, we adopt a linear contextual bandit formulation with stage-wise constraints. At each round, the learner selects a dosage and receives both a reward signal and a cost signal. The learner’s goal is to maximize the drug’s efficacy—captured as the expected cumulative reward—while ensuring that the toxicity, reflected by the cost signal, remains below a known threshold. Satisfying the cost signal constraint only in expectation can be dangerous, as it may lead to over-dosage complications in certain cases. To address this issue, we introduce a novel model that controls the realization of the cost signal with high probability, in contrast to previous works where control was only applied to the expected cost signal. Our algorithm follows the UCB\textit{UCB} approach, for which we establish a regret bound over TT rounds and run numerical experiments. We further generalize our results to non-linear\textit{non-linear} functions and provide a regret bound in terms of the eluder dimension\textit{eluder dimension}, a measure of function class complexity.

1340Reinforcement Learning on Synthetic Navigation Data allows Safe Navigation in Blind Digital Twins

[openreview] [pdf]

Abstract Limited access to dedicated navigation data in visually impaired individuals is a significant bottleneck for developing AI-driven assistive devices. For this purpose, we have developped a virtual environment designed to extract various human-like navigation data from procedurally generated labyrinths. Using reinforcement learning and semantic segmentation, we trained a convolutional neural network to perform obstacle avoidance from synthetic data. Our model outperformed state-of-the-art backbones including DINOv2-B in safe pathway identification in real world. In conclusion, despite being trained only on synthetic data, our model successfully extracted features compatible with safe navigation in real-world settings, opening new avenues for visually impaired.

1341Discrete Copula Diffusion

[openreview] [pdf]

Abstract Discrete diffusion models have recently shown significant progress in modeling complex data, such as natural languages and DNA sequences. However, unlike diffusion models for continuous data, which can generate high-quality samples in just a few denoising steps, modern discrete diffusion models still require hundreds or even thousands of denoising steps to perform well. In this paper, we identify a fundamental limitation that prevents discrete diffusion models from achieving strong performance with fewer steps -- they fail to capture dependencies between output variables at each denoising step. To address this issue, we provide a formal explanation and introduce a general approach to supplement the missing dependency information by incorporating another deep generative model, termed the copula model. Our method does not require fine-tuning either the diffusion model or the copula model, yet it enables high-quality sample generation with significantly fewer denoising steps. When we apply this approach to autoregressive copula models, the combined model outperforms both models individually in unconditional and conditional text generation. Specifically, the hybrid model achieves better (un)conditional text generation using 8 to 32 times fewer denoising steps than the diffusion model alone. In addition to presenting an effective discrete diffusion generation algorithm, this paper emphasizes the importance of modeling inter-variable dependencies in discrete diffusion.

1342Language Guided Skill Discovery

[openreview] [pdf]

Abstract Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for downstream tasks, obtaining a semantically diverse repertoire of skills is crucial. While some approaches use discriminators to acquire distinguishable skills and others focus on increasing state coverage, the direct pursuit of ‘semantic diversity’ in skills remains underexplored. We hypothesize that leveraging the semantic knowledge of large language models (LLM) can lead us to improve semantic diversity of resulting behaviors. In this sense, we introduce Language Guided Skill Discovery (LGSD), a skill discovery framework that aims to directly maximize the semantic diversity between skills. LGSD takes user prompts as input and outputs a set of semantically distinctive skills. The prompts serve as a means to constrain the search space into a semantically desired subspace, and the generated LLM outputs guide the agent to visit semantically diverse states within the subspace. We demonstrate that LGSD enables legged robots to visit different user-intended areas on a plane by simply changing the prompt. Furthermore, we show that language guidance aids in discovering more diverse skills compared to five existing skill discovery methods in robot-arm manipulation environments. Lastly, LGSD provides a simple way of utilizing learned skills via natural language.

1343Safety-Prioritizing Curricula for Constrained Reinforcement Learning

[openreview] [pdf]

Abstract Curriculum learning aims to accelerate reinforcement learning (RL) by generating curricula, i.e., sequences of tasks of increasing difficulty. Although existing curriculum generation approaches provide benefits in sample efficiency, they overlook safety-critical settings where an RL agent must adhere to safety constraints. Thus, these approaches may generate tasks that cause RL agents to violate safety constraints during training and behave suboptimally after. We develop a safe curriculum generation approach (SCG) that aligns the objectives of constrained RL and curriculum learning: improving safety during training and boosting sample efficiency. SCG generates sequences of tasks where the RL agent can be safe and performant by initially generating tasks with minimum safety violations over high-reward ones. We empirically show that compared to the state-of-the-art curriculum learning approaches and their naively modified safe versions, SCG achieves optimal performance and the lowest amount of constraint violations during training.

1344LLM Unlearning via Loss Adjustment with Only Forget Data

[openreview] [pdf]

Abstract Unlearning in Large Language Models (LLMs) is essential for ensuring ethical and responsible AI use, especially in addressing privacy leak, bias, safety, and evolving regulations. Existing approaches to LLM unlearning often rely on retain data or a reference LLM, yet they struggle to adequately balance unlearning performance with overall model utility. This challenge arises because leveraging explicit retain data or implicit knowledge of retain data from a reference LLM to fine-tune the model tends to blur the boundaries between the forgotten and retain data, as different queries often elicit similar responses. In this work, we propose eliminating the need to retain data or the reference LLM for response calibration in LLM unlearning. Recognizing that directly applying gradient ascent on the forget data often leads to optimization instability and poor performance, our method guides the LLM on what not to respond to, and importantly, how to respond, based on the forget data. Hence, we introduce Forget data only Loss AjustmenT (FLAT), a “flat” loss adjustment approach which addresses these issues by maximizing ff-divergence between the available template answer and the forget answer only w.r.t. the forget data. The variational form of the defined ff-divergence theoretically provides a way of loss adjustment by assigning different importance weights for the learning w.r.t. template responses and the forgetting of responses subject to unlearning. Empirical results demonstrate that our approach not only achieves superior unlearning performance compared to existing methods but also minimizes the impact on the model’s retained capabilities, ensuring high utility across diverse tasks, including copyrighted content unlearning on Harry Potter dataset and MUSE Benchmark, and entity unlearning on the TOFU dataset.

1345Towards a Theoretical Understanding of Synthetic Data in LLM Post-Training: A Reverse-Bottleneck Perspective

[openreview] [pdf]

Abstract Synthetic data has become a pivotal resource in post-training tasks for large language models (LLMs) due to the scarcity of high-quality, specific data. While various methods have been developed to generate synthetic data, there remains a discernible gap between the practical effects of synthetic data and our theoretical comprehension. To address this challenge, we commence by presenting a detailed modeling of the prevalent synthetic data generation process. Building upon this modeling, we demonstrate that the generalization capability of the post-trained model is critically determined by the information gain derived from the generative model, as analyzed from a novel reverse-bottleneck perspective. Moreover, we introduce the concept of Generalization Gain via Mutual Information (GGMI) and elucidate the relationship between generalization gain and information gain. This analysis serves as a theoretical foundation for synthetic data generation and further highlights its connection with the generalization capability of post-trained models, offering an understanding about the design of synthetic data generation techniques and the optimization of the post-training process. We open source our code through an anonymous GitHub repository athttps://anonymous.4open.science/r/Understanding-Synthetic.

1346Stochastic Matching Bandits under Preference Feedback

[openreview] [pdf]

Abstract In this study, we propose a new bandit framework of stochastic matching employing the Multinomial Logit (MNL) choice model with feature information. In this framework, agents on one side are assigned to arms on the other side, and each arm stochastically accepts an agent among the assigned pool of agents based on its unknown preference, allowing a possible outside option of not accepting any. The objective is to minimize regret by maximizing the probability of successful matching. For this framework, we first propose an elimination-based algorithm that achieves a regret bound of O~(KrKT)\tilde{O}\big(K\sqrt{rKT} \big) over time horizon TT, where KK is the number of arms and rr is the rank of feature space. Furthermore, we propose an approach to resolve the computation issue regarding combinatorial optimization in the algorithm. Lastly, we evaluate the performances of our algorithm through experiments comparing with the existing showing the superior performances of our algorithm.

1347The Advancement in Stochastic Zeroth-Order Optimization: Mechanism of Accelerated Convergence of Gaussian Direction on Objectives with Skewed Hessian Eigenvalues

[openreview] [pdf]

Abstract This paper primarily investigates large-scale finite-sum optimization problems, which are particularly prevalent in the big data era. In the field of zeroth-order optimization, stochastic optimization methods have become essential tools. Natural zeroth-order stochastic optimization methods are primarily based on stochastic gradient descent (SGD\texttt{SGD}). The method of preprocessing the stochastic gradient with Gaussian vector is referred to as ZO-SGD-Gauss\texttt{ZO-SGD-Gauss} (ZSG\texttt{ZSG}), while estimating partial derivatives along coordinate directions to compute the stochastic gradient is known as ZO-SGD-Coordinate\texttt{ZO-SGD-Coordinate} (ZSC\texttt{ZSC}). Compared to ZSC\texttt{ZSC}, ZSG\texttt{ZSG} often demonstrates superior performance in practice. However, the underlying mechanisms behind this phenomenon remain unclear in the academic community. To the best of our knowledge, our work is the first to theoretically analyze the potential advantages of ZSG\texttt{ZSG} compared to ZSC\texttt{ZSC}. Unlike the fundamental assumptions applied in general stochastic optimization analyses, the quadratic regularity assumption is proposed to generalize the smoothness and strong convexity to the Hessian matrix. This assumption allows us to incorporate Hessian information into the complexity analysis. When the objective function is quadratic, the quadratic regularity assumption reduces to the second-order Taylor expansion of the function, and we focus on analyzing and proving the significant improvement of ZSG\texttt{ZSG}. For other objective function classes, we also demonstrate the convergence of ZSG\texttt{ZSG} and its potentially better query complexity than that of ZSC\texttt{ZSC}. Finally, experimental results on both synthetic and real-world datasets substantiate the effectiveness of our theoretical analysis.

1348Harnessing Spatial Dependency for Domain Generalization in Multivariate Time-series Sensor Data

[openreview] [pdf]

Abstract Analyzing human-related behavior with multi-variate time-series (MTS) data through multi-sensor structure and temporal features often results in various distributions across different domains. Furthermore, distribution shifts caused by spatial dependencies—such as sensor misalignments—pose significant challenges in domain generalization (DG) settings, where the model is expected to be robust across various domains. While existing spatiotemporal encoders for MTS and DG methods in other domains address related issues, DG approaches specifically tailored for MTS data remain underexplored, particularly in learning and aligning spatial dependencies across domains. To bridge this gap, we propose ASAM (Adaptive Spatial Dependency Alignment in MTS Data for Domain Generalization), a novel framework designed to enhance robustness to spatial dependencies in MTS data. ASAM proposes a DG layer with domain generalization loss function and two- view regularization loss functions to adaptively align spatial dependencies between domains. We adopt a two-phase approach to effectively align different set of domains: the first phase captures spatiotemporal features, and the second phase applies the DG layer and domain generalization loss function to align spatial dependencies across domains. An input-aware graph generation process and a GNN- based DG layer, coupled with the domain generalization loss function, adaptively align the spatial dependencies learned in the second phase with those from the first phase, ensuring a more precise alignment. We additionally incorporate a two- view regularization method to capture underlying spatial dependency effectively in both phases and is comprised of spatial decorrelation loss and Gaussian kernel loss. Our theoretical analysis demonstrates that the DG layer effectively assimilates invariant information, ensuring robustness across diverse distributions. Ex- tensive evaluations of the four real-world datasets show ASAM outperforms ten recent baselines. To the best of our knowledge, this work is among the first to explore DG approaches for MTS data by focusing on spatial dependency alignment. Our code is available athttps://anonymous.4open.science/r/ASAM/.

1349Predictive Differential Training Guided by Training Dynamics

[openreview] [pdf]

Abstract This paper centers around a novel concept proposed recently by researchers from the control community where the training process of a deep neural network can be considered a nonlinear dynamical system acting upon the high-dimensional weight space. Koopman operator theory, a data-driven dynamical system analysis framework, can then be deployed to discover the otherwise non-intuitive training dynamics. Taking advantage of the predictive power of the Koopman operator theory, the time-consuming Stochastic Gradient Descent ( SGD) iterations can be bypassed by directly predicting network weights a few epochs later. This novel predictive training framework, however, often suffers from gradient explosion especially for more extensive and complex models. In this paper, we incorporate the idea of differential learning, where different parts of the network can undergo different learning rates during training, into the predictive training framework and propose the so-called "predictive differential training’’ (PDT) to sustain robust performance for accelerated learning even for complex network structures. The key contribution is the design of an effective masking strategy based on Koopman analysis of training dynamics of each parameter in order to select the subset of parameters that exhibits "good’’ prediction performance. PDT also includes the design of an acceleration scheduler to keep track of the prediction error so that the training process can roll back to the traditional GD-based approaches to "correct’’ deviations from off-predictions. We demonstrate that PDT can be seamlessly integrated as a plug-in with existing optimizers, including, for example, SGD, momentum, and Adam. The experimental results have shown consistent performance improvement in terms of faster convergence, lower training/testing loss, and fewer number of epochs to achieve the best loss of Baseline.

1350A transfer learning framework for weak to strong generalization

[openreview] [pdf]

Abstract Modern large language model (LLM) alignment techniques rely on human feedback, but it is unclear whether the techniques fundamentally limit the capabilities of aligned LLMs. In particular, it is unclear whether it is possible to align (stronger) LLMs with superhuman capabilities with (weaker) human feedbackwithout degrading their capabilities. This is an instance of the weak-to-strong generalization problem: using weaker (less capable) feedback to train a stronger (more capable) model. We prove that weak-to-strong generalization is possible by eliciting latent knowledge from pre-trained LLMs. In particular, we cast the weak-to-strong generalization problem as a transfer learning problem in which we wish to transfer a latent concept from a weak model to a strong pre-trained model. We prove that a naive fine-tuning approach suffers from fundamental limitations, but an alternative refinement-based approach suggested by the problem structure provably overcomes the limitations of fine-tuning. Finally, we demonstrate the practical applicability of the refinement approach in multiple LLM alignment tasks.

[openreview] [pdf]

Abstract Large Language Models (LLMs) are known to be susceptible to crafted adversarial attacks or jailbreaks that lead to the generation of objectionable content despite being aligned to human preferences using safety fine-tuning methods. While the large dimensionality of input token space makes it inevitable to find \emph{adversarial} prompts that can jailbreak these models, we aim to evaluate whether safety fine-tuned LLMs are safe against \emph{natural} prompts which are semantically related to toxic prompts that elicit safe responses after alignment. We surprisingly find that popular aligned LLMs such as GPT4 can be compromised using naive prompts that are NOT even crafted with an objective of jailbreaking the model. Furthermore, we empirically show that given a seed prompt that elicits a toxic response from an unaligned model, one can systematically generate several semantically related \emph{natural} prompts that can jailbreak aligned LLMs. Towards this, we propose a method of \emph{Response Guided Question Augmentation (ReG-QA)} to evaluate the generalization of safety aligned LLMs to natural prompts, by first generating several toxic answers from a seed question using an unaligned LLM (Q to A), and further prompting another LLM to generate questions that are likely to produce these answers (A to Q). We interestingly find that safety fine-tuned LLMs such as GPT-4o are vulnerable to producing natural jailbreak \textit{questions} from unsafe content (without denial) and can thus be used for the latter (A to Q) step. Using the proposed approach, we obtain attack success rates that are comparable to/ better than leading adversarial attack methods on the JailbreakBench leaderboard.

1352Data Shapley in One Training Run

[openreview] [pdf]

Abstract Data Shapley offers a principled framework for attributing the contribution of data within machine learning contexts. However, the traditional notion of Data Shapley requires re-training models on various data subsets, which becomes computationally infeasible for large-scale models. Additionally, this retraining-based definition cannot evaluate the contribution of data for a specific model training run, which may often be of interest in practice. This paper introduces a novel concept, In-Run Data Shapley, which eliminates the need for model retraining and is specifically designed for assessing data contribution for a particular model of interest. In-Run Data Shapley calculates the Shapley value for each gradient update iteration and accumulates these values throughout the training process. We present several techniques that allow the efficient scaling of In-Run Data Shapley to the size of foundation models. In its most optimized implementation, our method adds negligible runtime overhead compared to standard model training. This dramatic efficiency improvement makes it possible to perform data attribution for the foundation model pretraining stage. We present several case studies that offer fresh insights into pretraining data’s contribution and discuss their implications for copyright in generative AI and pretraining data curation.

1353Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery

[openreview] [pdf]

Abstract Imitation learning is a data-driven approach to learning policies from expert behavior, but it is prone to unreliable outcomes in out-of-sample (OOS) regions. While previous research on stable dynamical system policies guarantees convergence to a desired state, it often overlooks transient behavior. We propose a framework for learning policies modeled by contractive dynamical systems, ensuring that all policy rollouts converge regardless of perturbations, and in turn, enable efficient OOS recovery. By leveraging recurrent equilibrium networks and coupling layers, the policy structure guarantees contractivity for any parameter choice, which facilitates unconstrained optimization. Furthermore, we provide theoretical upper bounds for worst-case and expected loss terms, rigorously establishing the reliability of our method in deployment. Empirically, we demonstrate substantial OOS performance improvements in robotics manipulation and navigation tasks.

1354Towards Improving Exploration through Sibling Augmented GFlowNets

[openreview] [pdf]

Abstract Exploration is a key factor for the success of an active learning agent, especially when dealing with sparse extrinsic terminal rewards and long trajectories. We introduce Sibling Augmented Generative Flow Networks (SA-GFN), a novel framework designed to enhance exploration and training efficiency of Generative Flow Networks (GFlowNets). SA-GFN uses a decoupled dual network architecture, comprising of a main Behavior Network and an exploratory Sibling Network, to enable a diverse exploration of the underlying distribution using intrinsic rewards. Inspired by the ideas on exploration from reinforcement learning, SA-GFN provides a general-purpose exploration and learning paradigm that integrates with multiple GFlowNet training objectives and is especially helpful for exploration over a wide range of sparse or low reward distributions and task structures. An extensive set of experiments across a diverse range of tasks, reward structures and trajectory lengths, along with a thorough set of ablations, demonstrate the superior performance of SA-GFN in terms of exploration efficacy and convergence speed as compared to the existing methods. In addition, SA-GFN’s versatility and compatibility with different GFlowNet training objectives and intrinsic reward methods underscores its broad applicability in various problem domains.

1355Off-policy Evaluation with Deeply-abstracted States

[openreview] [pdf]

Abstract Off-policy evaluation (OPE) is crucial for assessing a target policy’s impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging. This paper studies state abstractions -- originally designed for policy learning -- in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE, and derive a backward-model-irrelevance condition for achieving irrelevance in (marginalized) importance sampling ratios by constructing a time-reversed Markov decision process (MDP) based on the standard MDP. (ii) We propose a novel iterative procedure that sequentially projects the original state space into a smaller space, resulting in a deeply-abstracted state, which substantially simplify the sample complexity of OPE arising from high cardinality. (iii) We prove the Fisher consistencies of various OPE estimators when applied to our proposed abstract state spaces.

1356Pareto-Optimal Learning from Preferences with Hidden Context

[openreview] [pdf]

Abstract Ensuring AI models align with human values is essential for their safety and functionality. Reinforcement learning from human feedback (RLHF) leverages human preferences to achieve this alignment. However, when preferences are sourced from diverse populations, point estimates of reward can result in suboptimal performance or be unfair to specific groups. We propose Pareto Optimal Preference Learning (POPL), which enables pluralistic alignment by framing discrepant group preferences as objectives with potential trade-offs, aiming for policies that are Pareto-optimal on the preference dataset. POPL utilizes lexicase selection, an iterative process that selects diverse and Pareto-optimal solutions. Our theoretical and empirical evaluations demonstrate that POPL surpasses baseline methods in learning sets of reward functions and policies, effectively catering to distinct groups without access to group numbers or membership labels. We verify the performance of POPL on a stateless preference learning setting, a Minigrid RL domain, Metaworld robotics benchmarks, as well as large language model (LLM) fine-tuning. We illustrate that POPL can also serve as a foundation for techniques optimizing specific notions of group fairness, ensuring safe and equitable AI model alignment.

1357Reward as Observation: Learning Reward-based Policies for Rapid Adaptation

[openreview] [pdf]

Abstract This paper explores a reward-based policy to achieve zero-shot transfer between source and target environments with completely different observation spaces. While humans can demonstrate impressive adaptation capabilities, deep neural network policies often struggle to adapt to a new environment and require a considerable amount of samples for successful transfer. Instead, we propose a novel reward-based policy only conditioned on rewards and actions, enabling zero-shot adaptation to new environments with completely different observations. We discuss the challenges and feasibility of a reward-based policy and then propose a practical algorithm for training. We demonstrate that a reward policy can be trained within three different environments, Pointmass, Cartpole, and 2D Car Racing, and transferred to completely different observations, such as different color palettes or 3D rendering, in a zero-shot manner. We also demonstrate that a reward-based policy can further guide the training of an observation-based policy in the target environment.

1358Learning Distributions of Complex Fluid Simulations with Diffusion Graph Networks

[openreview] [pdf]

Abstract Physical systems with complex unsteady dynamics, such as fluid flows, are often poorly represented by a single mean solution. For many practical applications, it is crucial to access the full distribution of possible states, from which relevant statistics (e.g., RMS and two-point correlations) can be derived. Here, we propose a graph-based latent diffusion model that enables direct sampling of states from their equilibrium distribution, given a mesh discretization of the system and its physical parameters. This allows for the efficient computation of flow statistics without running long and expensive numerical simulations. The graph-based structure enables operations on unstructured meshes, which is critical for representing complex geometries with spatially localized high gradients, while latent-space diffusion modeling with a multi-scale GNN allows for efficient learning and inference of entire distributions of solutions. A key finding of our work is that the proposed networks can accurately learn full distributions even when trained on incomplete data from relatively short simulations. We apply this method to a range of fluid dynamics tasks, such as predicting pressure distributions on 3D wing models in turbulent flow, demonstrating both accuracy and computational efficiency in challenging scenarios. The ability to directly sample accurate solutions, and capturing their diversity from short ground-truth simulations, is highly promising for complex scientific modeling tasks.

1359State-space models can learn in-context by gradient descent

[openreview] [pdf]

Abstract Deep state-space models (Deep SSMs) have shown capabilities for in-context learning on autoregressive tasks, similar to transformers. However, the architectural requirements and mechanisms enabling this in recurrent networks remain unclear. This study demonstrates that state-space model architectures can perform gradient-based learning and use it for in-context learning. We prove that a single structured state-space model layer, augmented with local self-attention, can reproduce the outputs of an implicit linear model with least squares loss after one step of gradient descent. Our key insight is that the diagonal linear recurrent layer can act as a gradient accumulator, which can be `applied’ to the parameters of the implicit regression model. We validate our construction by training randomly initialized augmented SSMs on simple linear regression tasks. The empirically optimized parameters match the theoretical ones, obtained analytically from the implicit model construction. Extensions to multi-step linear and non-linear regression yield consistent results. The constructed SSM encompasses features of modern deep state-space models, with the potential for scalable training and effectiveness even in general tasks. The theoretical construction elucidates the role of local self-attention and multiplicative interactions in recurrent architectures as the key ingredients for enabling the expressive power typical of foundation models.

1360Mitigating the Backdoor Effect for Multi-Task Model Merging via Safety-Aware Subspace

[openreview] [pdf]

Abstract Model merging has gained significant attention as a cost-effective approach to integrate multiple single-task fine-tuned models into a unified one that can perform well on multiple tasks. However, existing model merging techniques primarily focus on resolving conflicts between task-specific models, they often overlook potential security threats, particularly the risk of backdoor attacks in the open-source model ecosystem. In this paper, we first investigate the vulnerabilities of existing model merging methods to backdoor attacks, identifying two critical challenges: backdoor succession and backdoor transfer. To address these issues, we propose a novel Defense-Aware Merging (DAM) approach that simultaneously mitigates task interference and backdoor vulnerabilities. Specifically, DAM employs a meta-learning-based optimization method with dual masks to identify a shared and safety-aware subspace for model merging. These masks are alternately optimized: the Task-Shared mask identifies common beneficial parameters across tasks, aiming to preserve task-specific knowledge while reducing interference, while the Backdoor-Detection mask isolates potentially harmful parameters to neutralize security threats. This dual-mask design allows us to carefully balance the preservation of useful knowledge and the removal of potential vulnerabilities. Compared to existing merging methods, DAM achieves a more favorable balance between performance and security, reducing the attack success rate by 2-10 percentage points while sacrificing only about 1% in accuracy. Furthermore, DAM exhibits robust performance and broad applicability across various types of backdoor attacks and the number of compromised models involved in the merging process.

1361Risk-Aware Distributional Intervention Policies for Language Models

[openreview] [pdf]

Abstract Language models are prone to occasionally undesirable generations, such as harmful or toxic content, despite their impressive capability to produce texts that appear accurate and coherent. In this paper, we present a new two-stage approach to detect and mitigate undesirable content generations by rectifying activations. First, we train an ensemble of layer-wise classifiers to detect undesirable content using activations by minimizing a smooth surrogate of the risk-aware score. Then, for contents that are detected as undesirable, we propose layer-wise distributional intervention policies that perturb the attention heads minimally while guaranteeing probabilistically the effectiveness of the intervention. Benchmarks on several language models and datasets show that our method outperforms baselines in reducing the generation of undesirable output.

1362NNetscape Navigator: Complex Demonstrations for Web Agents Without a Demonstrator

[openreview] [pdf]

Abstract We introduce NNetscape Navigator (NNetnav), a method for training web agents entirely through synthetic demonstrations. These demonstrations are collected by first interacting with a browser to generate trajectory rollouts, which are then retroactively labeled into instructions using a language model. Most work on training browser agents has relied on expensive human supervision, and the limited previous work on such \emph{interaction-first} synthetic data techniques has failed to provide effective search through the exponential space of exploration. In contrast, NNetnav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler subtasks, allowing NNetnav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. We use NNetnav demonstrations from a language model for supervised fine-tuning of a smaller language model policy, and find improvements of 6 points on WebArena and over 20 points on MiniWoB++, two popular environments for web-agents. Notably, on WebArena, we observe that language model policies can be further enhanced when fine-tuned with NNetnav demonstrations derived from the \emph{same} language model. Finally, we collect and release a dataset of over 6k NNetnav demonstrations on WebArena, spanning a diverse and complex set of instructions.

1363MISR: Measuring Instrumental Self-Reasoning in Frontier Models

[openreview] [pdf]

Abstract We propose a suite of tasks to evaluate the instrumental self-reasoning ability of large language model (LLM) agents. Instrumental self-reasoning ability could improve adaptability and enable self-modification, but it could also pose significant risks, such as enabling deceptive alignment. Prior work has only evaluated self-reasoning in non-agentic settings or in limited domains. In this paper, we propose evaluations for instrumental self-reasoning ability in agentic tasks in a wide range of scenarios, including self-modification, knowledge seeking, and opaque self-reasoning. We evaluate agents built using state-of-the-art LLMs, including commercial and open source systems. We find that instrumental self-reasoning ability emerges only in the most capable frontier models and that it is highly context-dependent. No model passes the the most difficult versions of our evaluations, hence our evaluation can be used to measure increases in instrumental self-reasoning ability in future models.

1364VILA^2: VLM Augmented VLM with Self-Improvement

[openreview] [pdf]

Abstract Visual language models (VLMs) have rapidly progressed, driven by the success of large language models (LLMs). While model architectures and training infrastructures advance rapidly, data curation remains under-explored. When data quantity and quality become a bottleneck, existing work either directly crawls more raw data from the Internet that does not have a guarantee of data quality or distills from black-box commercial models (e.g., GPT-4V / Gemini) causing the performance upper bounded by that model. In this work, we introduce a novel approach that includes a self-augment step and a specialist-augment step to iteratively improve data quality and model performance. In the self-augment step, a VLM recaptions its own pretraining data to enhance data quality, and then retrains from scratch using this refined dataset to improve model performance. This process can iterate for several rounds. Once self-augmentation saturates, we employ several specialist VLMs finetuned from the self-augmented VLM with domain-specific expertise, to further infuse specialist knowledge into the generalist VLM through task-oriented recaptioning and retraining. With the combined self-augmented and specialist-augmented training, we introduce VILA2 (VLM-augmented-VLM), a VLM family that consistently improves the accuracy on a wide range of tasks over prior art, including MMMU leaderboard, with a reusable pretraining dataset that is 300x more cost-efficient than human labeling.

1365Accelerating Task Generalisation with Multi-Level Hierarchical Options

[openreview] [pdf]

Abstract Creating reinforcement learning agents that generalise effectively to new tasks is a key challenge in AI research. This paper introduces Fracture Cluster Options (FraCOs), a multi-level hierarchical reinforcement learning method that achieves state-of-the-art performance on difficult generalisation tasks. FraCOs identifies patterns in agent behaviour and forms options based on the expected future usefulness of those patterns, enabling rapid adaptation to new tasks. In tabular settings, FraCOs demonstrates effective transfer and improves performance as it grows in hierarchical depth. We evaluate FraCOs against state-of-the-art deep reinforcement learning algorithms in several complex procedurally generated environments. Our results show that FraCOs achieves higher in-distribution and out-of-distribution performance than competitors.

1366Can LLMs Understand Time Series Anomalies?

[openreview] [pdf]

Abstract Large Language Models (LLMs) have gained popularity in time series forecasting, but their potential for anomaly detection remains largely unexplored. Our study investigates whether LLMs can understand and detect anomalies in time series data, focusing on zero-shot and few-shot scenarios. Inspired by conjectures about LLMs’ behavior from time series forecasting research, we formulate key hypotheses about LLMs’ capabilities in time series anomaly detection. We design and conduct principled experiments to test each of these hypotheses.Our investigation reveals several surprising findings about LLMs for time series:LLMs understand time series better asimagesrather than as textLLMs did not demonstrate enhanced performance when prompted to engage inexplicit reasoningabout time series analysisContrary to common beliefs, LLM’s understanding of time seriesdo notstem from their repetition biases or arithmetic abilitiesLLMs’ behaviors and performance in time series analysisvary significantlyacross different model architecturesThis study provides the first comprehensive analysis of contemporary LLM capabilities in time series anomaly detection. Our results suggest that while LLMs can understand time series anomalies, many common conjectures based on their reasoning capabilities do not hold. These insights pave the way for more effective LLM-based approaches in time series analysis, bridging the gap between forecasting and anomaly detection applications.

1367Revisit the open nature of open vocabulary segmentation

[openreview] [pdf]

Abstract In Open-Vocabulary Segmentation (OVS), we observe a consistent drop in model performance as the query vocabulary set expands, especially when it includes se- mantically similar and ambiguous vocabularies, such as ‘sofa’ and ‘couch’. The previous OVS evaluation protocol, however, does not account for such ambiguity, as any mismatch between predicted and human-annotated pairs is simply treated as incorrect on a pixel-wise basis. This contradicts the open nature of OVS, where ambiguous categories can both be correct from an open-world perspective. To address this, in this work, we further study the open nature of OVS and pro- pose a mask-wise evaluation protocol thatis based on matched and mismatched mask pairs between prediction and annotation respectively. Extensive experimen- tal evaluations show that OVS models consistently perform better under the pro- posed mask-wise protocol compared to the previous pixel-wise one. Moreover, analysis of mismatched mask pair reveals that large amount of ambiguous cate- gories exist in commonly used OVS datasets. Interestingly, we find that reducing these ambiguities during both training and inference enhances zero-shot inference capabilities. These findings and the new evaluation protocol encourage further exploration of the open nature of OVS and broader open-world challenges.

1368Density estimation with LLMs: a geometric investigation of in-context learning trajectories

[openreview] [pdf]

Abstract Large language models (LLMs) demonstrate remarkable emergent abilities to perform in-context learning across various tasks, including time series forecasting. This work investigates LLMs’ ability to estimate probability density functions (PDFs) from data observed in-context; such density estimation (DE) is a fundamental task underlying many probabilistic modeling problems. We leverage the Intensive Principal Component Analysis (InPCA) to visualize and analyze the in-context learning dynamics of LLaMA-2 models. Our main finding is that these LLMs all follow similar learning trajectories in a low-dimensional InPCA space, which are distinct from those of traditional density estimation methods like histograms and Gaussian kernel density estimation (KDE). We interpret the LLaMA in-context DE process as a KDE with an adaptive kernel width and shape. This custom kernel model captures a significant portion of LLaMA’s behavior despite having only two parameters. We further speculate on why LLaMA’s kernel width and shape differs from classical algorithms, providing insights into the mechanism of in-context probabilistic reasoning in LLMs.

1369Is self-supervision enough for training sentence embeddings?

[openreview] [pdf]

Abstract In NLP, sentence embeddings are crucial for many tasks such as information retrieval, classification, clustering, or visualizing collections of texts. Currently, top-performing sentence embeddings are derived from pre-trained language models that undergo extensive supervised fine-tuning. This contrasts with computer vision, where self-supervised training has demonstrated remarkable success. Here we show that self-supervision alone can produce high-quality sentence embeddings, albeit slightly below those from state-of-the-art supervised models. We systematically compare several existing augmentation strategies for positive pair generation in contrastive learning and show that text crops strongly outperform popular dropout-based augmentation. Using text crops, well-performing embeddings can be obtained even when training from scratch without using pre-trained model weights, or when training a bare token embedding layer without any transformer architecture. Overall, we show that self-supervised learning allows rapid training of text embeddings of a given dataset.

1370Diff-Prompt: Diffusion-driven Prompt Generator with Mask Supervision

[openreview] [pdf]

Abstract Prompt learning has demonstrated promising results in fine-tuning pre-trained multimodal models. However, the performance improvement is limited when applied to more complex and fine-grained tasks. The reason is that most existing methods directly optimize the parameters involved in the prompt generation process through loss backpropagation, which constrains the richness and specificity of the prompt representations. In this paper, we propose Diff-Prompt (Diffusion-driven Prompt Generator), aiming to use the diffusion model to generate rich, fine-grained prompt information for complex downstream tasks. Specifically, our approach consists of three stages. In the first stage, we train a Mask-VAE to compress the masks into latent space. In the second stage, we leverage an improved DiT (Diffusion Transformer) to train a prompt generator in the latent space, using the masks for supervision. In the third stage, we align the denoising process of the prompt generator with the pre-trained model in the semantic space, and use the generated prompts to fine-tune the model. We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation. Code is available athttps://anonymous.4open.science/r/Diff-Prompt-FF2D.

1371Expanding the Web, Smaller Is Better: A Comprehensive Study in Post-training

[openreview] [pdf]

Abstract General-purpose large language models (GLLMs) like GPT-4 and LLaMA have demonstrated exceptional performance across a wide range of tasks. However, their performance often falls short in domain- or task-specific applications, where deeper, specialized knowledge is essential, while maintaining general knowledge remains crucial for handling broader, unseen tasks. Post-training has been widely applied to make LLMs specialized, typically consisting of multiple stages, including DomainAdaptive Pre-Training (DAPT) and Supervised Fine-Tuning (SFT). In this work, we conduct a comprehensive study on three key aspects of post-training taking Finance as a target domain: (1) the distinct roles of DAPT and SFT in post-training, (2) strategies to mitigate knowledge forgetting across stages, and (3) evaluation methods that capture both general and domain-specific capabilities. Our results show that DAPT and SFT require distinct training objectives, joint training of DAPT and SFT is essential for maintaining stage knowledge and encouraging knowledge transfer across stages, and replay mechanisms are critical for preventing forgetting. Evaluation should encompass general, seen, and unseen tasks for a complete assessment. Based on these insights, we developed a Joint-and-Replay post-training recipe and built LLaMA3-8B-Fin, a smaller yet more powerful stateof-the-art financial LLM trained through post-training. Despite its smaller size, LLaMA3-8B-Fin surpasses larger models like GPT-4o and LLaMA3.1-70b on both seen and unseen financial tasks while retaining general knowledge, demonstrating that a well-structured post-training can “expand the web” of capabilities in smaller LLMs, enabling them to outperform much larger models.

1372Learning from Contrastive Prompts: Automated Optimization and Adaptation

[openreview] [pdf]

Abstract As LLMs evolve, significant effort is spent on manually crafting prompts. While existing prompt optimization methods automate this process, they rely solely on learning from incorrect samples, leading to a sub-optimal performance. Additionally, an unexplored challenge in the literature is prompts effective for prior models may not perform well on newer versions or different languages. We propose the Learning from Contrastive Prompts (LCP) framework to address these gaps, enhancing both prompt optimization and adaptation. LCP employs contrastive learning to generate effective prompts by analyzing patterns in good and bad prompt examples. Our evaluation on the Big-Bench Hard dataset shows that LCP has a win rate of over 76% over existing methods in prompt optimization and demonstrates strong adaptability across different model versions, families, and languages. LCP offers a systematic approach to prompt engineering, reducing manual effort in deploying LLMs across varied contexts.

1373How To Be A Good Teacher? Process Strong Pretrained Models For Effective Knowledge Distillation

[openreview] [pdf]

Abstract Transferring the world knowledge encoded in pretrained models through knowledge distillation is an effective approach to improve the performance of small, task-specific production models. However, the effectiveness of such knowledge transfer drops greatly for strong models that are pretrained in a large scale. In this paper, we explore methods to preprocess strong pretrained models to improve the effectiveness of its knowledge transfer. From a mutual information perspective of distillation effectiveness, we propose to incorporate mutual information-aware optimization into the fine-tuning of strong pretrained models. For small or highly-imbalanced downstream datasets where such optimization is less effective, we further propose to heuristically reweight the MLP blocks, which is inspired by our observation that top MLP blocks often cause the loss of mutual information. Our method enables small student models to benefit from those pretrained models among the strongest.

1374Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning

[openreview] [pdf]

Abstract Model-based methods in reinforcement learning offer a promising approach to enhance data efficiency by facilitating policy exploration within a dynamics model. However, accurately predicting sequential steps in the dynamics model remains a challenge due to the bootstrapping prediction, which attributes the next state to the prediction of the current state. This leads to accumulated errors during model roll-out. In this paper, we propose the Any-step Dynamics Model (ADM) to mitigate the compounding error by reducing bootstrapping prediction to direct prediction. ADM allows for the use of variable-length plans as inputs for predicting future states without frequent bootstrapping. We design two algorithms, ADMPO-ON and ADMPO-OFF, which apply ADM in online and offline model-based frameworks, respectively. In the online setting, ADMPO-ON demonstrates improved sample efficiency compared to previous state-of-the-art methods. In the offline setting, ADMPO-OFF not only demonstrates superior performance compared to recent state-of-the-art offline approaches but also offers better quantification of model uncertainty using only a single ADM.

1375FlipNet: Fourier Lipschitz Smooth Policy Network for Reinforcement Learning

[openreview] [pdf]

Abstract Deep reinforcement learning (RL) is an effective method for decision-making and control tasks. However, RL-trained policies encounter the action fluctuation problem, where consecutive actions significantly differ despite minor variations in adjacent states. This problem results in actuators’ wear, safety risk, and performance reduction in real-world applications. To address the problem, we identify the two fundamental reasons causing action fluctuation, i.e. policy non-smoothness and observation noise, then propose the Fourier Lipschitz Smooth Policy Network (FlipNet). FlipNet adopts two innovative techniques to tackle the two reasons in a decoupled manner. Firstly, we prove the Jacobian norm is an approximation of Lipschitz constant and introduce a Jacobian regularization technique to enhance the smoothness of policy network. Secondly, we introduce a Fourier filter layer to deal with observation noise. The filter layer includes a trainable filter matrix that can automatically extract important observation frequencies and suppress noise frequencies. FlipNet can be seamlessly integrated into most existing RL algorithms as an actor network. Simulated tasks on DMControl and a real-world experiment on vehicle-robot driving show that FlipeNet has excellent action smoothness and noise robustness, achieving a new state-of-the-art performance. The code and videos are publicly available.

1376Visual Description Grounding Reduces Hallucinations and Boosts Reasoning in LVLMs

[openreview] [pdf]

Abstract Large Vision-Language Models (LVLMs) often produce responses that misalign with factual information, a phenomenon known as hallucinations. While hallucinations are well-studied, the exact causes behind them remain underexplored. In this paper, we first investigate the root causes of hallucinations in LVLMs. Our findings reveal that existing mitigation techniques primarily reduce hallucinations for visual recognition prompts—those that require simple descriptions of visual elements—but fail for cognitive prompts that demand deliberate reasoning. We identify the core issue as a lack of true visual perception in LVLMs: although they can accurately recognize visual elements, they struggle to fully interpret these elements in the context of the input prompt and effectively link this recognition to their internal knowledge, which is critical for reasoning. To address this gap, we introduce Visual Description Grounded Decoding (VDGD), a simple, robust, and training-free method designed to enhance visual perception and improve reasoning capabilities in LVLMs. VDGD works by first generating a detailed description of the image and appending it as a prefix to the instruction. During response generation, tokens are sampled based on their KL divergence to the description, favoring candidates with lower divergence. Experimental results on multiple visual reasoning benchmarks and LVLMs demonstrate that VDGD consistently outperforms existing baselines 2% - 33%. Finally, we introduce VaLLu, a benchmark designed for comprehensive evaluation of the cognitive capabilities of LVLMs.

1377Debiasing Federated Learning with Correlated Client Participation

[openreview] [pdf]

Abstract In cross-device federated learning (FL) with millions of mobile clients, only a small subset of clients participate in training in every communication round, and Federated Averaging (FedAvg) is the most popular algorithm in practice. Existing analyses of FedAvg usually assume the participating clients are independently sampled in each round from a uniform distribution, which does not reflect real-world scenarios. This paper introduces a theoretical framework that models client participation in FL as a Markov chain to study optimization convergence when clients have non-uniform and correlated participation across rounds. We apply this framework to analyze a more practical pattern: every client must wait a minimum number of RR rounds (minimum separation) before re-participating. We theoretically prove and empirically observe that increasing minimum separation reduces the bias induced by intrinsic non-uniformity of client availability in cross-device FL systems. Furthermore, we develop an effective debiasing algorithm for FedAvg that provably converges to the unbiased optimal solution under arbitrary minimum separation and unknown client availability distribution.

1378NoVo: Norm Voting off Hallucinations with Attention Heads in Large Language Models

[openreview] [pdf]

Abstract Hallucinations in Large Language Models (LLMs) remain a major obstacle, particularly in high-stakes applications where factual accuracy is critical. While representation editing and reading methods have made strides in reducing hallucinations, their heavy reliance on specialised tools and training on in-domain samples, makes them difficult to scale and prone to overfitting. This limits their accuracy gains and generalizability to diverse datasets. This paper presents a lightweight method, Norm Voting (NoVo), which harnesses the untapped potential of attention head norms to dramatically enhance factual accuracy in zero-shot multiple-choice questions (MCQs). NoVo begins by automatically selecting truth-correlated head norms with an efficient, inference-only algorithm using only 30 random samples, allowing NoVo to effortlessly scale to diverse datasets. Afterwards, selected head norms are employed in a simple voting algorithm, which yields significant gains in prediction accuracy. On TruthfulQA MC1, NoVo surpasses the current state-of-the-art and all previous methods by an astounding margin---at least 19 accuracy points. NoVo demonstrates exceptional generalization to 20 diverse datasets, with significant gains in over 90% of them, far exceeding all current representation editing and reading methods. NoVo also reveals promising gains to finetuning strategies and building textual adversarial defence. NoVo’s effectiveness with head norms opens new frontiers in LLM interpretability, robustness and reliability.

1379Can Transformers Do Enumerative Geometry?

[openreview] [pdf]

Abstract We introduce a Transformer-based approach to computational enumerative geometry, specifically targeting the computation of ψ\psi-class intersection numbers on the moduli space of curves. Traditional methods for calculating these numbers suffer from factorial computational complexity, making them impractical to use. By reformulating the problem as a continuous optimization task, we compute intersection numbers across a wide value range from 10-45 to 1045. To capture the recursive and hierarchical nature inherent in the intersection numbers, we propose the Dynamic Range Activator (DRA), a new activation function that enhances the Transformer’s ability to model recursive patterns and handle severe heteroscedasticity. Given precision requirements for computing ψ\psi-class intersections, we quantify the uncertainty of the predictions using Conformal Prediction with a dynamic sliding window adaptive to the partitions of equivalent number of marked points. Beyond simply computing intersection numbers, we explore the enumerative “world-model” of Transformers. Our interpretability analysis reveals that the network is implicitly modeling the Virasoro constraints in a purely data-driven manner. Moreover, through abductive hypothesis testing, probing, and causal inference, we uncover evidence of an emergent internal representation of the the large-genus asymptotic of ψ\psi-class intersection numbers. These findings suggest that the network internalizes the parameters of the asymptotic closed-form formula linearly, while capturing the polynomiality phenomenon of ψ\psi-class intersection numbers in a nonlinear manner.

1380Fewer May Be Better: Enhancing Offline Reinforcement Learning with Reduced Dataset

[openreview] [pdf]

Abstract Research in offline reinforcement learning (RL) marks a paradigm shift in RL. However, a critical yet under-investigated aspect of offline RL is determining the subset of the offline dataset, which is used to improve algorithm performance while accelerating algorithm training. Moreover, the size of reduced datasets can uncover the requisite offline data volume essential for addressing analogous challenges. Based on the above considerations, we propose identifying Reduced Datasets for Offline RL (ReDOR) by formulating it as a gradient approximation optimization problem. We prove that the common actor-critic framework in reinforcement learning can be transformed into a submodular objective. This insight enables us to construct a subset by adopting the orthogonal matching pursuit (OMP). Specifically, we have made several critical modifications to OMP to enable successful adaptation with Offline RL algorithms. The experimental results indicate that the data subsets constructed by the ReDOR can significantly improve algorithm performance with low computational complexity.

1381A Gradient Descent Optimizer with auto-controlled large Learning Rates, dynamic Batch Sizes and without Momentum

[openreview] [pdf]

Abstract We present a novel, fast gradient based momentum-free optimizer algorithm with dynamic learning rate and dynamic batch size. The main ideas are to exponentially adapt the learning rate α \alpha by situational awareness, mainly striving for orthogonal neighboring gradients, and to increase the batch size when the gradients become too noisy, leading to random walks rather than gradient descent. The method has a high success and fast convergence rate and relies only on few hyper-parameters, providing greater universality. It scales only linearly (of order O(n)O(n)) with dimension and is rotation invariant, thereby overcoming known limitations. The optimization method is termed ELRA (Exponential Learning Rate Adaption). The impressive performance of ELRA is demonstrated by experiments on several benchmark data-sets (ranging from MNIST to ImageNet) against common optimizers such as Adam, Lion and SGD.

1382Exploiting Task Relationships for Continual Learning with Transferability-aware Task Embedding

[openreview] [pdf]

Abstract Continual learning (CL) has been a crucial topic in contemporary deep neural network usages, where catastrophic forgetting (CF) can impede a model’s ability to progressively acquire knowledge, leading to critical training inefficiency and constraint in the improvement of model’s overall capacity. Existing CL strategies mostly mitigate CF either by regularizing model weights and outputs during finetuning or by distinguishing task-specific and task-sharing model components to adapt the training process accordingly. Yet despite their effectiveness, these previous explorations are mainly limited to elements of task models, while we speculate a deeper exploitation of interrelationship among tasks can provide more enhancement for CL. Therefore, to better capture and utilize the task relations, we propose a transferability task embedding guided hypernet for continual learning. By introducing the information theoretical transferability based task embedding named H-embedding and incorporating it in a hypernetwork, we establish an online framework capable of capturing the statistical relations among the CL tasks and leveraging these knowledge for deriving task-conditioned model weights. The framework is also characterized by notable practicality, in that it only requires storing a low dimensional task embedding for each task, and can be efficiently trained in an end-to-end way. Extensive evaluations and experimental analyses on datasets including Permuted MNIST, Cifar10/100 and ImageNet-R showcase that our framework performs prominently compared to various baseline methods, as well as displays great potential in obtaining intrinsic task relationships.

1383Straightness of Rectified Flow: A Theoretical Insight into Wasserstein Convergence

[openreview] [pdf]

Abstract Diffusion models have emerged as a powerful tool for image generation and denoising. Typically, generative models learn a trajectory between the starting noise distribution and the target data distribution. Recently Liu et al. (2023b) designed a novel alternative generative model Rectified Flow (RF), which aims to learn straight flow trajectories from noise to data using a sequence of convex optimization problems with close ties to optimal transport. If the trajectory is curved, one must use many Euler discretization steps or novel strategies, such as exponential integrators, to achieve a satisfactory generation quality. In contrast, RF has been shown to theoretically straighten the trajectory through successive rectifications, reducing the number of function evaluations (NFEs) while sampling. It has also been shown empirically that RF may improve the straightness in two rectifications if one can solve the underlying optimization problem within a sufficiently small error. In this paper, we make two key theoretical contributions: 1) we provide the first theoretical analysis of the Wasserstein distance between the sampling distribution of RF and the target distribution. Our error rate is characterized by the number of discretization steps and a new formulation of straightness stronger than that in the original work. 2) In line with the previous empirical findings, we show that, for a rectified flow from a Gaussian to a mixture of two Gaussians, two rectifications are sufficient to achieve a straight flow. Additionally, we also present empirical results on both simulated and real datasets to validate our theoretical findings

1384ProdInfluencerNet: A Novel Product-Centric Influencer Recommendation Framework Based on Heterogeneous Networks

[openreview] [pdf]

Abstract With the proliferation of social media, influencer marketing has emerged as a popular strategy for brands to promote their products. Recent studies have increasingly explored the use of machine learning to recommend suitable influencers for brands. This typically involves analyzing the compatibility of influencer profiles with brand attributes. However, for brands entering new markets or promoting products in unfamiliar categories, existing solutions may be limited due to insufficient information for accurate compatibility matching.In this paper, we propose ProdInfluencerNet (PIN), a product-centric framework designed for influencer recommendation. PIN effectively models the complex relationships between brands, products, and influencers using Heterogeneous Information Networks (HINs). We categorize sponsored post images using the Google Taxonomy through image classification techniques. By leveraging the taxonomy’s hierarchical structure and adopting an inductive learning approach, PIN can accurately recommend influencers for brands, even in new markets or with innovative products. We validate PIN’s effectiveness and superiority over existing methods using two Instagram datasets. Furthermore, our analysis reveals that text features in profiles are more critical than images for identifying cooperative relationships between product categories and influencers.

1385Continual Task Learning through Adaptive Policy Self-Composition

[openreview] [pdf]

Abstract Training a generalizable agent to continually learn a sequence of tasks from offline trajectories is a natural requirement for long-lived agents, yet remains a significant challenge for current offline reinforcement learning (RL) algorithms. Specifically, an agent must be able to rapidly adapt to new tasks using newly collected trajectories (plasticity), while retaining knowledge from previously learned tasks (stability). However, systematic analyses of this setting are scarce, and it remains unclear whether conventional continual learning (CL) methods are effective in continual offline RL (CORL) scenarios. In this study, we develop the Offline Continual World benchmark and demonstrate that traditional CL methods struggle with catastrophic forgetting, primarily due to the unique distribution shifts inherent to CORL scenarios. To address this challenge, we introduce CompoFormer, a structure-based continual transformer model that adaptively composes previous policies via a meta-policy network. Upon encountering a new task, CompoFormer leverages semantic correlations to selectively integrate relevant prior policies alongside newly trained parameters, thereby enhancing knowledge sharing and accelerating the learning process. Our experiments reveal that CompoFormer outperforms conventional CL methods, particularly in longer task sequences, showcasing a promising balance between plasticity and stability.

1386Inverse Reinforcement Learning with Switching Rewards and History Dependency for Characterizing Animal Behaviors

[openreview] [pdf]

Abstract Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWItching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.

1387Projection Head is Secretly an Information Bottleneck

[openreview] [pdf]

Abstract Recently, contrastive learning has risen to be a promising paradigm for extracting meaningful data representations. Among various special designs, adding a projection head on top of the encoder during training and removing it for downstream tasks has proven to significantly enhance the performance of contrastive learning. However, despite its empirical success, the underlying mechanism of the projection head remains under-explored. In this paper, we develop an in-depth theoretical understanding of the projection head from the information-theoretic perspective. By establishing the theoretical guarantees on the downstream performance of the features before the projector, we reveal that an effective projector should act as an information bottleneck, filtering out the information irrelevant to the contrastive objective. Based on theoretical insights, we introduce modifications to projectors with training and structural regularizations. Empirically, our methods exhibit consistent improvement in the downstream performance across various real-world datasets, including CIFAR-10, CIFAR-100, and ImageNet-100. We believe our theoretical understanding on the role of the projection head will inspire more principled and advanced designs in this field.

1388Statistical Advantages of Perturbing Cosine Router in Mixture of Experts

[openreview] [pdf]

Abstract The cosine router in Mixture of Experts (MoE) has recently emerged as an attractive alternative to the conventional linear router. Indeed, the cosine router demonstrates favorable performance in image and language tasks and exhibits better ability to mitigate the representation collapse issue, which often leads to parameter redundancy and limited representation potentials. Despite its empirical success, a comprehensive analysis of the cosine router in MoE has been lacking. Considering the least square estimation of the cosine routing MoE, we demonstrate that due to the intrinsic interaction of the model parameters in the cosine router via some partial differential equations, regardless of the structures of the experts, the estimation rates of experts and model parameters can be as slow as O(1/logτ(n))\mathcal{O}(1/\log^{\tau}(n)) where τ>0\tau > 0 is some constant and nn is the sample size. Surprisingly, these pessimistic non-polynomial convergence rates can be circumvented by the widely used technique in practice to stabilize the cosine router --- simply adding noises to the L2L^2 norms in the cosine router, which we refer to asperturbed cosine router. Under the strongly identifiable settings of the expert functions, we prove that the estimation rates for both the experts and model parameters under the perturbed cosine routing MoE are significantly improved to polynomial rates. Finally, we conduct extensive simulation studies in both synthetic and real data settings to empirically validate our theoretical results.

1389Stable Segment Anything Model

[openreview] [pdf]

Abstract The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts which, however, often require good skills to specify. To make SAM robust to casual prompts, this paper presents the first comprehensive analysis on SAM’s segmentation stability across a diverse spectrum of prompt qualities, notably imprecise bounding boxes and insufficient points. Our key finding reveals that given such low-quality prompts, SAM’s mask decoder tends to activate image features that are biased towards the background or confined to specific object parts. To mitigate this issue, our key idea consists of calibrating solely SAM’s mask attention by adjusting the sampling locations and amplitudes of image features, while the original SAM model architecture and weights remain unchanged. Consequently, our deformable sampling plugin (DSP) enables SAM to adaptively shift attention to the prompted target regions in a data-driven manner. During inference, dynamic routing plugin (DRP) is proposed that toggles SAM between the deformable and regular grid sampling modes, conditioned on the input prompt quality. Thus, our solution, termed Stable-SAM, offers several advantages: 1) improved SAM’s segmentation stability across a wide range of prompt qualities, while 2) retaining SAM’s powerful promptable segmentation efficiency and generality, with 3) minimal learnable parameters (0.08 M) and fast adaptation. Extensive experiments validate the effectiveness and advantages of our approach, underscoring Stable-SAM as a more robust solution for segmenting anything.

1390Inverse Attention Agent in Multi-Agent System

[openreview] [pdf]

Abstract A major challenge for Multi-Agent Systems (MAS) is enabling agents to adapt dynamically to diverse environments in which opponents and teammates may continually change. Agents trained using conventional methods tend to excel only within the confines of their training cohorts; their performance drops significantly when confronting unfamiliar agents. To address this shortcoming, we introduce Inverse Attention Agents that adopt concepts from the Theory of Mind (ToM) implemented algorithmically using an attention mechanism trained in an end-to-end manner. Crucial to determining the final actions of these agents, the weights in their attention model explicitly represent attention to different goals. We furthermore propose an inverse attention network that deduces the ToM of agents based on observations and prior actions. The network infers the attentional states of other agents, thereby refining the attention weights to adjust the agent’s final action. We conduct experiments in a continuous environment, tackling demanding tasks encompassing cooperation, competition, and a blend of both. They demonstrate that the inverse attention network successfully infers the attention of other agents, and that this information improves agent performance. Additional human experiments show that, compared to baseline agent models, our inverse attention agents exhibit superior cooperation with humans and better emulate human behaviors.

1391Interactive Dialogue Agents via Reinforcement Learning with Hindsight Regenerations

[openreview] [pdf]

Abstract Recent progress on large language models (LLMs) has enabled dialogue agents to generate highly naturalistic and plausible text. However, current LLM language generation focuses on responding accurately to questions and requests with a single effective response. In reality, many real dialogues are interactive, meaning an agent’s utterances will influence their conversational partner, elicit information, or change their opinion. Accounting for how an agent can effectively steer a conversation is a crucial ability in many dialogue tasks, from healthcare to preference elicitation. Existing methods for fine-tuning dialogue agents to accomplish such tasks would rely on curating some amount of expert data. However, doing so often requires understanding the underlying cognitive processes of the conversational partner, which is a skill neither humans nor LLMs trained on human data can reliably do. Our key insight is that while LLMs may not be adept at identifying effective strategies for steering conversations a priori, or in the middle of an ongoing conversation, they can do so post-hoc, or in hindsight, after seeing how their conversational partner responds. We use this fact to rewrite and augment existing suboptimal data, and train via offline reinforcement learning (RL) an agent that outperforms both prompting and learning from unaltered human demonstrations. We apply our approach to two domains that require understanding human mental state, intelligent interaction, and persuasion: mental health support, and soliciting charitable donations. Our results in a user study with real humans show that our approach greatly outperforms existing state-of-the-art dialogue agents.

1392A Sinkhorn-type Algorithm for Constrained Optimal Transport

[openreview] [pdf]

Abstract Entropic optimal transport (OT) and the Sinkhorn algorithm have made it practical for machine learning practitioners to perform the fundamental task of calculating transport distance between statistical distributions. In this work, we focus on a general class of OT problems under a combination of equality and inequality constraints. We derive the corresponding entropy regularization formulation and introduce a Sinkhorn-type algorithm for such constrained OT problems supported by theoretical guarantees. We first bound the approximation error when solving the problem through entropic regularization, which reduces exponentially with the increase of the regularization parameter. Furthermore, we prove a sublinear first-order convergence rate of the proposed Sinkhorn-type algorithm in the dual space by characterizing the optimization procedure with a Lyapunov function. To achieve fast and higher-order convergence under weak entropy regularization, we augment the Sinkhorn-type algorithm with dynamic regularization scheduling and second-order acceleration. Overall, this work systematically combines recent theoretical and numerical advances in entropic optimal transport with the constrained case, allowing practitioners to derive approximate transport plans in complex scenarios. In addition, we extend the formulation of this work to partial optimal transport and propose a fast algorithm with practical super-exponential convergence.

1393MORE: A MIXTURE OF LOW-RANK EXPERTS FOR ADAPTIVE MULTI-TASK LEARNING

[openreview] [pdf]

Abstract With the rapid development of Large Language Models (LLMs), Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant attention, which aims to achieve efficient fine-tuning of LLMs with fewer parameters. As a representative PEFT method, Low-Rank Adaptation (LoRA) introduces low-rank matrices to approximate the incremental tuning parameters and achieves impressive performance over multiple scenarios. After that, plenty of improvements have been proposed for further improvement. However, these methods either focus on single-task scenarios or separately train multiple LoRA modules for multi-task scenarios, limiting the efficiency and effectiveness of LoRA in multi-task scenarios. To better adapt to multi-task fine-tuning, in this paper, we propose a novel Mixture of Low-Rank Experts (MoRE) for multi-task PEFT. Specifically, instead of using an individual LoRA for each task, we align different ranks of LoRA module with different tasks, which we named low-rank experts. Moreover, we design a novel adaptive rank selector to select the appropriate expert for each task. By jointly training low-rank experts, MoRE can enhance the adaptability and efficiency of LoRA in multi-task scenarios. Finally, we conduct extensive experiments over multiple multi-task benchmarks along with different LLMs to verify model performance. Experimental results demonstrate that compared to traditional LoRA and its variants, MoRE significantly improves the performance of LLMs in multi-task scenarios and incurs no additional inference cost. We also release the model and code to facilitate the community.

1394BISIMULATION METRIC FOR MODEL PREDICTIVE CONTROL

[openreview] [pdf]

Abstract Model-based reinforcement learning (MBRL) has shown promise for improving sample efficiency and decision-making in complex environments. However, existing methods face challenges in training stability, robustness to noise, and computational efficiency. In this paper, we propose Bisimulation Metric for Model Predictive Control (BS-MPC), a novel approach that incorporates bisimulation metric loss in its objective function to directly optimize the encoder. This optimization enables the learned encoder to extract intrinsic information from the original state space while discarding irrelevant details. BS-MPC improves training stability, robustness against input noise, and computational efficiency by reducing training time. We evaluate BS-MPC on both continuous control and image-based tasks from the DeepMind Control Suite, demonstrating superior performance and robustness compared to state-of-the-art baseline methods.

1395Differentially Private One Permutation Hashing

[openreview] [pdf]

Abstract Minwise hashing (MinHash) is a standard hashing algorithm for large-scale search and learning with the binary Jaccard similarity. One permutation hashing (OPH) is an effective and efficient alternative of MinHash which splits the data into K bins and generates hash values within each bin. In this paper, to protect the privacy of the output sketches, we combine differential privacy (DP) with OPH, and propose DP-OPH framework with three variants: DP-OPH-fix, DP-OPH-re and DP-OPH-rand, depending on the densification strategy to deal with empty bins in OPH. Detailed algorithm design and privacy and utility analysis are provided. The proposed DP-OPH methods significantly improves the DP minwise hashing (DP-MH) alternative in the literature. Experiments on similarity search confirm the effectiveness of our proposed algorithms. We also provide an extension to real-value data, named DP-BCWS, in the appendix.

1396Intention Model: A Novel Explanation for In-context Learning

[openreview] [pdf]

Abstract In-context learning (ICL) has demonstrated remarkable success in enabling large language models (LLMs) to learn to do a downstream task by simply conditioning on a few input-output demonstrations. Distinct from traditional learning paradigms, ICL does not require model updates, thus attracting significant interest in understanding the mechanisms behind LLMs’ ICL capabilities. Advanced works aim to understand ICL through an empirical viewpoint to provide the multifaceted nature of ICL, while some works aim to explain how ICL can emerge theoretically. However, the current theoretical analysis exhibits a weak connection to empirical explorations due to strong assumptions, e.g., perfect LLMs and ideal demonstrations. This work proposes an intention model, providing a novel theoretical framework for explaining ICL. With mild assumptions, we present a ``no-free-lunch’’ theorem for ICL: whether ICL emerges depends on the prediction error and prediction noise, which are determined by \emph{\textbf{i)}} LLMs’ error of next-token prediction, \emph{\textbf{ii)}} LLMs’ prediction smoothness, and \emph{\textbf{iii)}} the quality of demonstrations. Moreover, our intention model provides a novel explanation for the learning behavior of ICL under various input-output relations, e.g., learning with flipped labels. This is fortunately consistent with our experimental observations.

1397DynST: Large-Scale Spatial-Temporal Dataset for Transferable Traffic Forecasting with Dynamic Road Networks

[openreview] [pdf]

Abstract In real-world traffic networks, it is common to encounter a shortage of historical data in the target region. Researchers often address this issue through transfer learning. However, transfer learning tasks in traffic prediction currently lack dedicated datasets and instead rely on datasets designed for non-transfer prediction tasks. The major drawback of these existing datasets is the adoption of a fixed network topology to model the real world’s road networks. This does not align with reality and limits the model’s transferability. To tackle this issue, we propose DynST, a dataset specifically designed for transfer learning tasks in traffic prediction, with a massive data volume of 20.35 billion, spanning 20 years and 9 regions. The key feature of DynST is evolving dynamic road network topology, which reflects the evolution of real road networks. Moreover, to address the shortcomings of the distance-based adjacency generation algorithm, we introduce a novel tree-based algorithm. Extensive experiments demonstrate that the adoption of DynST as the source dataset can significantly enhance the performance of the target region. The comparative experiment also validates that our adjacency matrix generation algorithm can lead to improved prediction accuracy. We believe that DynST, with rich spatial variation information, will facilitate research in the field of transfer traffic prediction.

1398Capturing the Temporal Dependence of Training Data Influence

[openreview] [pdf]

Abstract Traditional data influence estimation methods, like influence function, assume that learning algorithms are permutation-invariant with respect to training data. However, modern training paradigms—especially for foundation models using stochastic algorithms and non-convergent, multi-stage curricula—are sensitive to data ordering, thus violating this assumption. This mismatch renders influence functions inadequate for answering some critical questions in current machine learning: How can we differentiate the influence of the same data contributing at different stages of training? More generally, how can we capture the dependence of data influence on the optimization trajectory during training? To address this gap, we formalize the concept of \emph{trajectory-specific leave-one-out (LOO) error}, which quantifies the impact of removing a data point from a specific iteration during training, accounting for the exact sequence of data encountered and the model’s optimization trajectory. However, exactly evaluating the trajectory-specific LOO presents a significant computational challenge. To address this, we propose \emph{data value embedding}, a novel technique enabling efficient approximation of trajectory-specific LOO. Specifically, we compute a training data embedding that encapsulates the cumulative interactions between data and the evolving model parameters. The LOO can then be efficiently approximated through a simple dot-product between the data value embedding and the gradient of the given test data. As data value embedding captures training data ordering, it offers valuable insights into model training dynamics. In particular, we uncover distinct phases of data influence, revealing that data points in the early and late stages of training exert a greater impact on the final model. These insights translate into actionable strategies for managing the computational overhead of data selection by strategically timing the selection process, potentially opening new avenues in data curation research.

1399Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions

[openreview] [pdf]

Abstract Large language models (LLMs) must often respond to highly ambiguous user requests. In such cases, the LLM’s best response may be to ask a clarifying question to elicit more information. We observe existing LLMs often respond by presupposing a single interpretation of such ambiguous requests, frustrating users who intended a different interpretation. We speculate this is caused by current preference data labeling practice, where LLM responses are evaluated only on their prior contexts. To address this, we propose to assign preference labels by simulating their expected outcomes in the future turns. This allows LLMs to learn to ask clarifying questions when it can generate responses that are tailored to each user interpretation in future turns. In experiments on open-domain QA, we compare systems that trained using our proposed preference labeling methods against standard methods, which assign preferences based on only prior context. We evaluate systems based on their ability to ask clarifying questions that can recover each user’s interpretation and expected answer, and find that our training with our proposed method trains LLMs to ask clarifying questions with a 5% improvement in F1 measured against the answer set from different interpretations of each query.

1400The Probability Simplex is Compatible

[openreview] [pdf]

Abstract In retrieval systems, updating the base model involves re-extracting feature vectors for all gallery data due to changes in internal feature representations. This process can be computationally expensive and time-consuming, especially for large-scale gallery sets. To address this issue, backward compatible learning was introduced, allowing direct comparison between the representations of the old model and those obtained by the newly trained model. Existing backward compatible methods introduce additional losses or specific network architecture changes, which require the availability of base models, thereby limiting compatibility with models trained independently. In this paper, we show that any independently trained model can be made compatible with any other by simply using features derived from softmax outputs. We leverage the geometric properties of the softmax function, which projects vectors into the Probability Simplex, preserving the alignment of softmax vectors across model updates and verifying the definition of compatibility. A similar property is observed when using logits as a feature representation. They distribute during training in a simplex configuration, but with a wider spread in the feature distribution than softmax outputs, leading to a more robust and transferable representation. Our framework achieves state-of-the-art performance on standard benchmarks, where either the number of training classes extends across multiple steps or the base model is updated with advanced network architectures. This demonstrates that any publicly available pretrained model can be made compatible without requiring any additional training or adaptation. Our code will be made available upon acceptance.

1401Social Bayesian Optimization for Building Truthful Consensus

[openreview] [pdf]

Abstract We introduceSocial Bayesian Optimization(SBO), a query-efficient algorithm for consensus-building in collective decision-making. In contrast to single-agent scenarios, collective decision-making encompasses group dynamics that may distort agents’ preference feedback, thereby impeding their capacity to achieve a truthful consensus. We demonstrate that under standard rationality assumptions, reaching truthful consensus—the most preferable decision based on the aggregated latent agent utilities—using noisy feedback alone is impossible. To address this, SBO employs a dual voting system: cost-effective but noisy public votes, and more accurate, though expensive, private votes. We model social influence using an unknown social graph and leverage the dual voting system to efficiently learn this graph. Our findings show that social graph estimation converges faster than the black-box estimation of agents’ utilities, allowing us to reduce reliance on costly private votes early in the process. This enables efficient consensus-building primarily through noisy public votes, which are debiased based on the estimated social graph to infer truthful feedback. We validate the effectiveness of SBO across multiple real-world applications, including thermal comfort optimization, team building, travel destination discussion, and strategic alliance in energy trading.

1402Mitigating Forgetting in LLM Supervised Fine-Tuning and Preference Learning

[openreview] [pdf]

Abstract Post-training of pre-trained LLMs, which typically consists of the supervised fine-tuning (SFT) stage and the preference learning (RLHF or DPO) stage, is crucial to effective and safe LLM applications. The widely adopted approach in post-training popular open-source LLMs is to sequentially perform SFT and RLHF/DPO. However, sequential training is sub-optimal in terms of SFT and RLHF/DPO trade-off: the LLM gradually forgets about the first stage’s training when undergoing the second stage’s training. We theoretically prove the sub-optimality of sequential post-training. Furthermore, we propose a practical joint post-training framework that has theoretical convergence guarantees and empirically outperforms sequential post-training framework, while having similar computational cost.

1403FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models

[openreview] [pdf]

Abstract One-Shot Federated Learning (OSFL), a special decentralized machine learning paradigm, has recently gained significant attention. OSFL requires only a single round of client data or model upload, which reduces communication costs and mitigates privacy threats compared to traditional FL. Despite these promising prospects, existing methods face challenges due to client data heterogeneity and limited data quantity when applied to real-world OSFL systems. Recently, Latent Diffusion Models (LDM) have shown remarkable advancements in synthesizing high-quality images through pretraining on large-scale datasets, thereby presenting a potential solution to overcome these issues. However, directly applying pretrained LDM to heterogeneous OSFL results in significant distribution shifts in synthetic data, leading to performance degradation in classification models trained on such data. This issue is particularly pronounced in rare domains, such as medical imaging, which are underrepresented in LDM’s pretraining data. To address this challenge, we propose Federated Bi-Level Personalization (FedBiP), which personalizes the pretrained LDM at both instance-level and concept-level. Hereby, FedBiP synthesizes images following the client’s local data distribution without compromising the privacy regulations. FedBiP is also the first approach to simultaneously address feature space heterogeneity and client data scarcity in OSFL. Our method is validated through extensive experiments on three OSFL benchmarks with feature space heterogeneity, as well as on challenging medical and satellite image datasets with label heterogeneity. The results demonstrate the effectiveness of FedBiP, which substantially outperforms other OSFL methods.

1404Why Does the Effective Context Length of LLMs Fall Short?

[openreview] [pdf]

Abstract Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce Shifted Rotray Position Embedding (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.

1405Adversarial Mixup Unlearning

[openreview] [pdf]

Abstract Machine unlearning is a critical area of research aimed at safeguarding data privacy by enabling the removal of sensitive information from machine learning models. One unique challenge in this field is catastrophic unlearning, where erasing specific data from a well-trained model unintentionally removes essential knowledge, causing the model to deviate significantly from a retrained one. To address this, we introduce a novel approach that regularizes the unlearning process by utilizing synthesized mixup samples, which simulate the data susceptible to catastrophic effects. At the core of our approach is a generator-unlearner framework, MixUnlearn, where a generator adversarially produces challenging mixup examples, and the unlearner effectively forgets target information based on these synthesized data. Specifically, we first introduce a novel contrastive objective to train the generator in an adversarial direction: generating examples that prompt the unlearner to reveal information that should be forgotten, while losing essential knowledge. Then the unlearner, guided by two other contrastive loss terms, processes the synthesized and real data jointly to ensure accurate unlearning without losing critical knowledge, overcoming catastrophic effects. Extensive evaluations across four benchmark datasets demonstrate that our method significantly outperforms state-of-the-art approaches, offering a robust solution to machine unlearning. This work not only deepens understanding of unlearning mechanisms but also lays the foundation for effective machine unlearning with mixup augmentation.

1406Buckle Up: Robustifying LLMs at Every Customization Stage via Data Curation

[openreview] [pdf]

Abstract Large language models (LLMs) are extensively adapted for downstream applications through a process known as “customization,” with fine-tuning being a common method for integrating domain-specific expertise. However, recent studies have revealed a vulnerability that tuning LLMs with malicious samples can compromise their robustness and amplify harmful content, an attack known as “jailbreaking.” To mitigate such attack, we propose an effective defensive framework utilizing data curation to revise commonsense texts and enhance their safety implication from the perspective of LLMs. The curated texts can mitigate jailbreaking attacks at every stage of the customization process: before customization to immunize LLMs against future jailbreak attempts, during customization to neutralize jailbreaking risks, or after customization to restore the compromised models. Since the curated data strengthens LLMs through the standard fine-tuning workflow, we do not introduce additional modules during LLM inference, thereby preserving the original customization process. Experimental results demonstrate a substantial reduction in jailbreaking effects, with up to a 100% success in generating responsible responses. Notably, our method is effective even with commonsense texts, which are often more readily available than safety-relevant data. With the every-stage defensive framework and supporting experimental performance, this work represents a significant advancement in mitigating jailbreaking risks and ensuring the secure customization of LLMs.

1407Distributional reasoning in LLMs: Parallel Reasoning Processes in Multi-hop Reasoning

[openreview] [pdf]

Abstract Large language models (LLMs) have shown an impressive ability to perform tasks believed to require "thought processes”. When the model does not document an explicit thought process, it’s difficult to understand the processes occurring within its hidden layers, and to determine if this process can be referred to as reasoning. We introduce a novel and interpretable analysis of internal multi-hop reasoning processes in LLMs. We demonstrate that the prediction process for compositional reasoning questions can be modeled using a simple linear transformation between two semantic category spaces. We show that during inference, the middle layers of the network generate highly interpretable embeddings that represent a set of potential intermediate answers for the multi-hop question. We use statistical analyses to show that a corresponding subset of tokens is activated in the model’s output, implying the existence of parallel reasoning paths. These observations hold true even when the model lacks the necessary knowledge to solve the task. Our findings can help uncover the strategies that LLMs use to solve reasoning tasks, offering insights into the types of thought processes that can emerge from artificial intelligence. Finally, we also discuss the implication of cognitive modeling of these results.

1408WaveDiffusion: Exploring Full Waveform Inversion via Joint Diffusion in the Latent Space

[openreview] [pdf]

Abstract Full Waveform Inversion (FWI) is a vital technique for reconstructing high-resolution subsurface velocity maps from seismic waveform data, governed by partial differential equations (PDEs) that model wave propagation. Traditional machine learning approaches typically map seismic data to velocity maps by encoding seismic waveforms into latent embeddings and decoding them into velocity maps. In this paper, we introduce a novel framework that reframes FWI as a joint diffusion process in a shared latent space, bridging seismic waveform data and velocity maps. Our approach has two key components: first, we merge the bottlenecks of two separate autoencoders—one for seismic data and one for velocity maps—into a unified latent space using vector quantization to establish a shared codebook. Second, we train a diffusion model in this latent space, enabling the simultaneous generation of seismic and velocity map pairs by sampling and denoising the latent representations, followed by decoding each modality with its respective decoder. Remarkably, our jointly generated seismic-velocity pairs approximately satisfy the governing PDE without any additional constraint, offering a new geometric interpretation of FWI. The diffusion process learns to score the latent space according to its deviation from the PDE, with higher scores representing smaller deviations from the true solutions. By following this diffusion process, the model traces a path from random initialization to a valid solution of the governing PDE. Our experiments on the OpenFWI dataset demonstrate that the generated seismic and velocity map pairs not only exhibit high fidelity and diversity but also adhere to the physical constraints imposed by the governing PDE.

1409A Unified Causal Framework for Auditing Recommender Systems for Ethical Concerns

[openreview] [pdf]

Abstract As recommender systems become widely deployed in different domains, they increasingly influence their users’ beliefs and preferences. Auditing recommender systems is crucial as it not only ensures the improvement of recommendation algorithms but also provides ways to assess and address ethical concerns surrounding them. In this work, we view recommender system auditing from a causal lens and provide a general recipe for defining auditing metrics. Under this general causal auditing framework, we categorize existing auditing metrics and identify gaps in them—notably, the lack of metrics for auditing user agency while accounting for the multi-step dynamics of the recommendation process. We leverage our framework and propose two classes of such metrics: future- and past-reachability and stability, that measure the ability of a user to influence their own and other users’ recommendations, respectively. We provide both a gradient-based and a black-box approach for computing these metrics, allowing the auditor to compute them under different levels of access to the recommender system. Empirically, we demonstrate the efficacy of methods for computing the proposed metrics and inspect the design of recommender systems through these proposed metrics.

1410POIL: Preference Optimization for Imitation Learning

[openreview] [pdf]

Abstract Imitation learning (IL) enables agents to learn policies by mimicking expert demonstrations. While online IL methods require interaction with the environment, which is costly, risky, or impractical, offline IL allows agents to learn solely from expert datasets without any interaction with the environment. In this paper, we propose Preference Optimization for Imitation Learning (POIL), a novel approach inspired by preference optimization techniques in large language model alignment. POIL eliminates the need for adversarial training and reference models by directly comparing the agent’s actions to expert actions using a preference-based loss function. We evaluate POIL on MuJoCo control tasks under two challenging settings: learning from a single expert demonstration and training with different dataset sizes (100%, 10%, 5%, and 2%) from the D4RL benchmark. Our experiments show that POIL consistently delivers superior or competitive performance against state-of-the-art methods in the past, including Behavioral Cloning (BC), IQ-Learn, DMIL, and O-DICE, especially in data-scarce scenarios, such as using one expert trajectory or as little as 2% of the full expert dataset. These results demonstrate that POIL enhances data efficiency and stability in offline imitation learning, making it a promising solution for applications where environment interaction is infeasible and expert data is limited.

1411Camera Pose Estimation Emerging In Video Diffusion Transformer

[openreview] [pdf]

Abstract Diffusion-based video generators are now a reality. Being trained on a large corpus of real videos, such models can generate diverse yet realistic videos (Brooks et al., 2024; Zheng et al., 2024). Given that the videos appear visually coherent across camera changes, we ask, do the underlying generators implicitly learn camera registrations? Hence, we propose a novel adaptation to repurpose the intermediate features of the generator for camera pose estimation by linking them to the SoTA camera calibration decoder of DUSt3R (Wang et al., 2024a). This effectively unifies the video generation and camera estimation into a single framework. On top of unifying two different networks into one, our architecture can directly be trained on real video and simultaneously produces correspondence, with respect to the first frame, for all the video frames. Our final model, named JOG3R can be used in text-to-video mode, and additionally it produces camera pose estimates at a quality on par with the SoTA model DUSt3R, which was trained exclusively for camera pose estimation. We report that the synergy between video generation and 3D camera reconstruction tasks leads to around 25% better FVD scores with JOG3R against pretrained OpenSora.

1412On the Linear Speedup of Personalized Federated Reinforcement Learning with Shared Representations

[openreview] [pdf]

Abstract Federated reinforcement learning (FedRL) enables multiple agents to collaboratively learn a policy without sharing their own local trajectories collected during agent-environment interactions. However, in practice, the environments faced by different agents are often heterogeneous, leading to poor performance by the single policy learned by existing FedRL algorithms on individual agents. In this paper, we take a further step and introduce a personalized FedRL framework (PFedRL) by taking advantage of possibly shared common structure among agents in heterogeneous environments. Specifically, we develop a class of PFedRL algorithms named PFedRL-Rep that learns (1) a shared feature representation collaboratively among all agents and (2) an agent-specific weight vector personalized to its local environment. We analyze the convergence of PFedTD-Rep, a particular instance of the framework with temporal difference (TD) learning and linear representations. To the best of our knowledge, we are the first to prove a linear convergence speedup with respect to the number of agents in the PFedRL setting. To achieve this, we show that PFedTD-Rep is an example of the federated two-timescale stochastic approximation with Markovian noise. Experimental results demonstrate that PFedTD-Rep, along with an extension to the control setting based on deep Q-networks (DQN), not only improve learning in heterogeneous settings, but also provide better generalization to new environments.

1413Autonomous agents from automatic reward modeling and planning

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable capabilities across a range of text-generation tasks. However, LLMs still struggle with problems requiring multi-step decision-making and environmental feedback, such as online shopping, scientific reasoning, and mathematical problem-solving. Unlike pure text data, collecting large-scale decision-making data is challenging. Moreover, many powerful LLMs are only accessible through APIs, which hinders their fine-tuning for agent tasks due to cost and complexity. To address LLM agents’ limitations, we propose a framework that can automatically learn a reward model from the environment without human annotations. This model can be used to evaluate the action trajectories of LLM agents and provide heuristics for task planning. Specifically, our approach involves employing one LLM-based agent to navigate an environment randomly, generating diverse action trajectories. Subsequently, a separate LLM is leveraged to assign a task intent and synthesize a negative response alongside the correct response for each trajectory. These triplets (task intent, positive response, and negative response) are then utilized as training data to optimize a reward model capable of scoring action trajectories. This reward model can be integrated with LLM-based agents and various planning algorithms to enhance task-solving performance. The effectiveness and generalizability of our framework are demonstrated through evaluations conducted on different agent benchmarks. In conclusion, our proposed framework represents a significant advancement in enhancing LLM agents’ decision-making capabilities. By automating the learning of reward models, we overcome the challenges of data scarcity and API limitations, potentially revolutionizing the application of LLMs in complex and interactive environments. This research paves the way for more sophisticated AI agents capable of tackling a wide range of real-world problems requiring multi-step decision-making.

1414GOttack: Universal Adversarial Attacks on Graph Neural Networks via Graph Orbits Learning

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have demonstrated superior performance in node classification tasks across diverse applications. However, their vulnerability to adversarial attacks, where minor perturbations can mislead model predictions, poses significant challenges. This study introduces GOttack, a novel adversarial attack framework that exploits the topological structure of graphs to undermine the integrity of GNN predictions systematically.By defining a topology-aware method to manipulate graph orbits, our approach can generate adversarial modifications that are both subtle and effective, posing a severe test to the robustness of GNNs. We evaluate the efficacy of GOttack across multiple prominent GNN architectures using standard benchmark datasets. Our results show that GOttack outperforms existing state-of-the-art adversarial techniques and completes training in approximately 55% of the time required by the fastest competing model, achieving the highest average misclassification rate in 155 tasks. This work not only sheds light on the susceptibility of GNNs to structured adversarial attacks but also shows that certain topological patterns may play a significant role in the underlying robustness of the GNNs.

1415A Unified Approach to Routing and Cascading for LLMs

[openreview] [pdf]

Abstract The widespread applicability of large language models (LLMs) has increased the availability of many fine-tuned models of various sizes targeting specific tasks. Given a set of such specialized models, to maximize overall performance, it is important to figure out the optimal strategy for selecting the right model for a given user query. An effective strategy could drastically increase overall performance and even offer improvements over a single large monolithic model. Existing approaches typically fall into two categories: routing, where a single model is selected for each query, and cascading, which runs a sequence of increasingly larger models until a satisfactory answer is obtained. However, both have notable limitations: routing commits to an initial model without flexibility, while cascading requires executing every model in sequence, which can be inefficient. Additionally, the conditions under which these strategies are provably optimal remain unclear. In this work, we derive optimal strategies for both routing and cascading. Building on this analysis, we propose a novel approach calledcascade routing, which combines the adaptability of routing with the cost-efficiency of cascading. Our experiments demonstrate that cascade routing consistently outperforms both routing and cascading across a variety of settings, improving both output quality and lowering computational cost, thus offering a unified and efficient solution to the model selection problem.

1416TOWARDS LAYER-WISE PERSONALIZED FEDERATED LEARNING: ADAPTIVE LAYER DISENTANGLEMENT VIA CONFLICTING GRADIENTS

[openreview] [pdf]

Abstract In personalized Federated Learning (pFL), high data heterogeneity can cause significant gradient divergence across devices, adversely affecting the learning process. This divergence, especially when gradients from different users form an obtuse angle during aggregation, can negate progress, leading to severe weight and gradient update degradation. To address this issue, we introduce a new approach to pFL design, namely Federated Learning with Layer-wise Aggregation via Gradient Analysis (FedLAG), utilizing the concept of gradient conflict at the layer level. Specifically, when layer-wise gradients of different clients form acute angles, those gradients align in the same direction, enabling updates across different clients toward identifying client-invariant features. Conversely, when layer-wise gradient pairs make create obtuse angles, the layers tend to focus on client-specific tasks. In hindsights, FedLAG assigns layers for personalization based on the extent of layer-wise gradient conflicts. Specifically, layers with gradient conflicts are excluded from the global aggregation process. The theoretical evaluation demonstrates that when integrated into other pFL baselines, FedLAG enhances pFL performance by a certain margin. Therefore, our proposed method achieves superior convergence behavior compared with other baselines. Extensive experiments show that our FedLAG outperforms several state-of-the-art methods and can be easily incorporated with many existing methods to further enhance performance.

1417DUET: Decentralized Bilevel Optimization without Lower-Level Strong Convexity

[openreview] [pdf]

Abstract Decentralized bilevel optimization (DBO) provides a powerful framework for multi-agent systems to solve local bilevel tasks in a decentralized fashion without the need for a central server. However, most existing DBO methods rely on lower-level strong convexity (LLSC) to guarantee unique solutions and and a well-defined hypergradient for stationarity measure, hindering their applicability in many practical scenarios not satisfying LLSC. To overcome this limitation, we introduce a new single-loop DBO algorithm called diminishing quadratically-regularized bilevel decentralized optimization (DUET), which eliminates the need for LLSC by introducing a diminishing quadratic regularization to the lower-level (LL) objective. We show that DUET achieves an iteration complexity of O(1/T15p114τ)O(1/T^{1-5p-\frac{11}{4}\tau}) for approximate KKT-stationary point convergence under relaxed assumptions, where pp and τ\tau are control parameters for LL learning rate and averaging, respectively. In addition, our DUET algorithm incorporates gradient tracking to address data heterogeneity, a key challenge in DBO settings. To the best of our knowledge, this is the first work to tackle DBO without LLSC under decentralized settings with data heterogeneity. Numerical experiments validate the theoretical findings and demonstrate the practical effectiveness of our proposed algorithms.

1418Learning Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations

[openreview] [pdf]

Abstract Training embodied agents to perform complex robotic tasks presents significant challenges due to the entangled factors of task compositionality, environmental diversity, and dynamic changes. In this work, we introduce a novel imitation learning framework to train closed-loop concept-guided policies that enhance long-horizon task performance by leveraging discovered manipulation concepts. Unlike methods that rely on predefined skills and human-annotated labels, our approach allows agents to autonomously abstract manipulation concepts from their proprioceptive states, thereby alleviating misalignment due to ambiguities in human semantics and environmental complexity. Our framework comprises two primary components: anAutomatic Concept Discoverymodule that identifies meaningful and consistent manipulation concepts, and aConcept-Aware Policy Learningmodule that effectively utilizes these manipulation concepts for adaptive task execution, including aConcept Selection Transformerfor concept-based guidance and aConcept-Guided Policyfor action prediction with the selected concepts. Experimental results demonstrate that our approach significantly outperforms baseline methods across a range of tasks and environments, while showcasing emergent consistency in motion patterns associated with the discovered concepts. Our code and models will be public.

1419Rethinking Fairness Representation in Multi-Task Learning: a Performance-Informed Variance Reduction Approach

[openreview] [pdf]

Abstract Multi-task learning (MTL) can leverage shared knowledge across tasks to improve data efficiency and generalization performance, and has been applied in various scenarios. However, task imbalance remains a major challenge for existing MTL methods. While the prior works have attempted to mitigate inter-task unfairness through loss-based and gradient-based strategies, they still exhibit imbalanced performance across tasks on common benchmarks. This key observation motivates us to consider performance-level information as an explicit fairness indicator, which can more accurately reflect the current optimization status of each task, and accordingly help to adjust the gradient aggregation process. Specifically, we utilize the performance variance among tasks as the fairness indicator and introduce a dynamic weighting strategy to gradually reduce the performance variance. Based on this, we propose PIVRG, a novel performance-informed variance reduction gradient aggregation approach. Extensive experiments show that PIVRG achieves state-of-the-art performance across various benchmarks, spanning both supervised learning and reinforcement learning tasks with task numbers ranging from 2 to 40. Results from the ablation study also show that our approach can be integrated into existing methods, significantly enhancing their performance while reducing the variance in task performance, thus achieving fairer optimization.

1420MOVIS: Enhancing Multi-Object Novel View Synthesis for Indoor Scenes

[openreview] [pdf]

Abstract Repurposing pre-trained diffusion models has been proven to be effective for NVS. However, these methods are mostly limited to a single object; directly applying such methods to compositional multi-object scenarios yields inferior results, especially incorrect object placement and inconsistent shape and appearance under novel views. How to enhance and systematically evaluate the cross-view consistency of such models remains under-explored. To address this issue, we propose MOVIS to enhance the structural awareness of the view-conditioned diffusion model for multi-object NVS in terms of model inputs, auxiliary tasks, and training strategy. First, we inject structure-aware features, including depth and object mask, into the denoising U-Net to enhance the model’s comprehension of object instances and their spatial relationships. Second, we introduce an auxiliary task requiring the model to simultaneously predict novel view object masks, further improving the model’s capability in differentiating and placing objects. Finally, we conduct an in-depth analysis of the diffusion sampling process and carefully devise a structure-guided timestep sampling scheduler during training, which balances the learning of global object placement and fine-grained detail recovery. To systematically evaluate the plausibility of synthesized images, we propose to assess cross-view consistency and novel view object placement alongside existing image-level NVS metrics. Extensive experiments on challenging synthetic and realistic datasets demonstrate that our method exhibits strong generalization capabilities and produces consistent novel view synthesis, highlighting its potential to guide future 3D-aware multi-object NVS tasks.

1421Last Iterate Convergence of Incremental Methods as a Model of Forgetting

[openreview] [pdf]

Abstract Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. We establish the first nonasymptotic convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and for randomly permuted ordering of updates. We study last iterate convergence of the incremental proximal method as a mathematical abstraction of forgetting in continual learning and prove a lower bound that certifies that a large amount of regularization is crucial to mitigating catastrophic forgetting---one of the key considerations in continual learning. Our results generalize last iterate guarantees for incremental methods compared to state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.

1422Variational Inequality Methods for Multi-Agent Reinforcement Learning: Performance and Stability Gains

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL) presents unique challenges as agents learn strategies through experiences. Gradient-based methods are often sensitive to hyperparameter selection and initial random seed variations. Concurrently, significant advances have been made in solving Variational Inequalities (VIs)—which include equilibrium-finding problems— particularly in addressing the non-converging rotational dynamics that impede convergence of traditional gradient-based optimization methods. This paper explores the potential of leveraging VI-based techniques to improve MARL training. Specifically, we study the performance of VI methods—namely, Nested-Lookahead VI (nLA-VI) and Extragradient (EG)—in enhancing the multi-agent deep deterministic policy gradient (MADDPG) algorithm. We present a VI reformulation of the actor-critic algorithm for both single- and multi-agent settings. We introduce three algorithms that use nLA-VI, EG, and a combination of both, named LA-MADDPG, EG-MADDPG, and LA-EG-MADDPG, respectively. Our empirical results demonstrate that these VI-based approaches yield significant performance improvements in benchmark environments, such as the zero-sum games: rock-paper-scissors and matching pennies, where equilibrium strategies can be quantitatively assessed, and the Multi- Agent Particle Environment: Predator-prey benchmark, where VI-based methods also yield balanced participation of agents from the same team.

1423Inverse Prompt Engineering for Task-Specific LLM Safety

[openreview] [pdf]

Abstract Most real-world deployments of large language models (LLMs) operate within well-scoped tasks, yet current safety measures are general-purpose and fail to leverage this information. As a result, even in narrowly-scoped tasks, LLM applications remain vulnerable to adversarial jailbreaks. In these settings, we argue that task-specific safety guardrails solve a more tractable problem than general-purpose methods. We introduce Inverse Prompt Engineering (IPE) as an initial approach to building automatic, task-specific safety guardrails around LLMs. Our key insight is that robust safety guardrails can be derived from prompt engineering data that is already on hand. IPE operationalizes the principle of least privilege from computer security, restricting LLM functionality to only what is necessary for the task. We evaluate our approach in two settings. First, in an example chatbot application, where IPE outperforms existing methods against both human-written and automated adversarial attacks. Second, on TensorTrust, a crowdsourced dataset of prompt-based attacks and defenses. Here, IPE improves average defense robustness by 93%, using real-world prompt engineering data.

1424First-Person Fairness in Chatbots

[openreview] [pdf]

Abstract Some chatbots have access to a user’s name when responding. Prior work has shown that large language model outputs can change based on the demographic traits correlated with a name, such as gender or race. In this study, we introduce a privacy-preserving and scalable method for studying one form offirst-person fairness—fairness towards the user based on their demographic information— across a large and heterogeneous corpus of actual chats. We leverage a language model as an AI “research assistant” (AI RA) that can privately and scalably analyze chat data, surfacing broader trends without exposing specific examples to the researchers. We corroborate the labels of the AI RA with independent human annotations, finding it highly consistent with human ratings of gender bias (less so for racial bias). We apply this methodology to a large set of chats with a commercial chatbot.We assess overall quality of responses conditional on different names and also subtle differences in similar-quality responses that may in aggregate reinforce harmful stereotypes based on gender or race. The largest detected biases are gender biases in older generations of models and in open-ended tasks, like writing a story. Finally, we discuss methods for monitoring and further reducing such biases. Some chatbots have access to a user’s name when responding. Prior work has shown that large language model outputs can change based on the demographic traits correlated with a name, such as gender or race. In this study, we introduce a privacy-preserving and scalable method for studying one form of first-person fairness—fairness towards the user based on their demographic information— across a large and heterogeneous corpus of actual chats. We leverage a language model as an AI “research assistant” (AI RA) that can privately and scalably analyze chat data, surfacing broader trends without exposing specific examples to the researchers. We corroborate the labels of the AI RA with independent human annotations, finding it highly consistent with human ratings of gender bias (less so for racial bias). We apply this methodology to a large set of chats with a commercial chatbot.We assess overall quality of responses conditional on different names and also subtle differences in similar-quality responses that may in aggregate reinforce harmful stereotypes based on gender or race. The largest detected biases are gender biases in older generations of models and in open-ended tasks, like writing a story. Finally, we discuss methods for monitoring and further reducing such biases.

1425Contrastive Meta Learning for Dynamical Systems

[openreview] [pdf]

Abstract Recent advancements in deep learning have significantly impacted the study of dynamical systems. Traditional approaches predominantly rely on supervised learning paradigms, limiting their scope to large scale problems and adaptability to new systems. This paper introduces a novel meta learning framework tailored for dynamical system forecasting, hinging on the concept of mapping the observed trajectories to a system-specific embedding space which encapsulates the inter-system characteristics and enriches the feature set for downstream prediction tasks. Central to our framework is the use of contrastive learning for trajectory data coupled with a series of neural network architecture designs to extract the features as augmented embedding for modeling system behavior. We present the application of zero-shot meta-learning to dynamical systems, demonstrating a substantial enhancement in performance metrics compared to existing baseline models. A notable byproduct of our methodology is the improved interpretability of the embeddings, which now carries explicit physical significance. Our results not only set a new benchmark in the field but also pave the way for enhanced interpretability and deeper understanding of complex dynamical systems, potentially opens new directions for how we approach system analysis and prediction.

1426Lowering Data Diversity can Accelerate Training: Case Studies in Synthetic Tasks

[openreview] [pdf]

Abstract We identify a loss plateau at the start of training in the three synthetic settings of in-context linear regression, sparse parity, and fact memorization. While careful tweaks to the optimization algorithm can mitigate these plateaus, we find that a simpler orthogonal approach oflowering the data diversity, and in doing so, biasing the training distributionawayfrom the test distribution, counter-intuitively also speeds up training. This connection between data diversity and training speed holds for three different diversity-reducinginterventions across our varied synthetic settings. Our findings offer a new perspective on data filtering and curriculum learning for training machine learning models.

1427RainbowPO: A Unified Framework for Combining Improvements in Preference Optimization

[openreview] [pdf]

Abstract Recently, numerous preference optimization algorithms have been introduced as extensions to the Direct Preference Optimization (DPO) family. While these methods have successfully aligned models with human preferences, there is a lack of understanding regarding the contributions of their additional components. Moreover, fair and consistent comparisons are scarce, making it difficult to discern which components genuinely enhance downstream performance. In this work, we propose RainbowPO, a unified framework that demystifies the effectiveness of existing DPO methods by categorizing their key components into seven broad directions. We integrate these components into a single cohesive objective, enhancing the performance of each individual element. Through extensive experiments, we demonstrate that RainbowPO outperforms existing DPO variants. Additionally, we provide insights to guide researchers in developing new DPO methods and assist practitioners in their implementations.

1428No-Regret and Incentive-Compatible Combinatorial Online Prediction

[openreview] [pdf]

Abstract We study the combinatorial online learning prediction problem with bandit feedback in a strategic setting, where the experts can strategically influence the learning algorithm’s predictions by manipulating their beliefs about a sequence of binary events. There are two learning objectives for the algorithm. The first is maximizing its cumulative utility over a fixed time horizon, equivalent to minimizing regret. The second objective is to ensure incentive compatibility, guaranteeing that each expert’s optimal strategy is to report their true beliefs about the outcomes of each event. In real applications, the learning algorithm only receives the utility corresponding to their chosen experts, which is referred to as the full-bandit setting. In this work, we present an algorithm based on mirror descent, which achieves a regret of O(T3/4)O(T^{3/4}) under both the full-bandit or semi-bandit feedback model, while ensuring incentive compatibility. To our best knowledge, this is the first algorithm that can simultaneously achieve sublinear regret and incentive compatibility. To demonstrate the effectiveness of our algorithm, we conduct extensive empirical evaluation with the algorithm on a synthetic dataset.

1429Easing Training Process of Rectified Flow Models Via Lengthening Inter-Path Distance

[openreview] [pdf]

Abstract Recent research pinpoints that different diffusion methods and architectures trained on the same dataset produce similar results for the same input noise. This property suggests that they have some preferable noises for a given sample. By visualizing the noise-sample pairs of rectified flow models and stable diffusion models in two-dimensional spaces, we observe that the preferable paths, connecting preferable noises to the corresponding samples, are much well organized with significant fewer crossings comparing with the random paths, connecting random noises to training samples. In high-dimensional space, paths rarely intersect. The path crossings in two-dimensional spaces indicate the shorter inter-path distance in the corresponding high-dimensional spaces. Inspired by this observation, we propose the Distance-Aware Noise-Sample Matching (DANSM) method to lengthen the inter-path distance for speeding up the model training. DANSM is derived from rectified flow models, which allow using a closed-form formula to calculate the inter-path distance. To further simplify the optimization, we derive the relationship between inter-path distance and path length, and use the latter in the optimization surrogate. DANSM is evaluated on both image and latent spaces by rectified flow models and diffusion models. The experimental results show that DANSM can significantly improve the training speed by 30% \sim 40% without sacrificing the generation quality.

1430Towards Replication-Robust Data Markets

[openreview] [pdf]

Abstract Despite widespread adoption of machine learning throughout industry, many firms face a common challenge: relevant datasets are typically distributed amongst market competitors that are reluctant to share information. Recent works propose data markets to provide monetary incentives for collaborative machine learning, where agents share features with each other and are rewarded based on their contribution to improving the predictions others. These contributions are determined by their relative Shapley value, which is computed by treating features as players and their interactions as a characteristic function game. However, in its standard form, this setup further provides an incentive for agents to replicate their data and act under multiple false identities in order to increase their own revenue and diminish that of others, restricting their use in practice. In this work, we develop a replication-robust data market for supervised learning problems. We adopt Pearl’s do-calculus from causal reasoning to refine the characteristic function game by differentiating between observational and interventional conditional probabilities. By doing this, we derive Shapley value-based rewards that are robust to this malicious replication by design, whilst preserving desirable market properties.

1431A Non-Contrastive Learning Framework for Sequential Recommendation with Preference-Preserving Profile Generation

[openreview] [pdf]

Abstract Contrastive Learning (CL) proves to be effective for learning generalizable user representations in Sequential Recommendation (SR), but it suffers from high computational costs due to its reliance on negative samples. To overcome this limitation, we propose the first Non-Contrastive Learning (NCL) framework for SR, which eliminates computational overhead of identifying and generating negative samples. However, without negative samples, it is challenging to learn uniform representations from only positive samples, which is prone to representation collapse. Furthermore, the alignment of the learned representations may be substantially compromised because existing ad-hoc augmentations can produce positive samples that have inconsistent user preferences. To tackle these challenges, we design a novel preference-preserving profile generation method to produce high-quality positive samples for non-contrastive training. Inspired by differential privacy, our approach creates augmented user profiles that exhibit high diversity while provably retaining consistent user preferences. With larger diversity and consistency of the positive samples, our NCL framework significantly enhances the alignment and uniformity of the learned representations, which contributes to better generalization. The experimental results on various benchmark datasets and model architectures demonstrate the effectiveness of the proposed method. Finally, our investigations reveal that both uniformity and alignment play a vital role in improving generalization for SR. Interestingly, in our data-sparse setting, alignment is usually more important than uniformity.

1432Efficient Off-Policy Learning for High-Dimensional Action Spaces

[openreview] [pdf]

Abstract Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation, which can be problematic in high-dimensional action spaces due to the curse of dimensionality. This reliance results in data inefficiency as maintaining a state-action-value function in such spaces is challenging. We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning. This approach, which we refer to as Vlearn, effectively circumvents the limitations of existing methods by eliminating the necessity for an explicit state-action-value function. To this end, we leverage a weighted importance sampling loss for learning deep value functions from off-policy data. While this is common for linear methods, it has not been combined with deep value function networks. This transfer to deep methods is not straightforward and requires novel design choices such as robust policy updates, twin value function networks to avoid an optimization bias, and importance weight clipping. We also present a novel analysis of the variance of our estimate compared to commonly used importance sampling estimators such as V-trace. Our approach improves sample complexity as well as final performance and ensures consistent and robust performance across various benchmark tasks. Eliminating the state-action-value function in Vlearn facilitates a streamlined learning process, enabling more effective exploration and exploitation in complex environments.

1433Steer a Crowd: Learning to Persuade a Population in a Stackelberg Game

[openreview] [pdf]

Abstract Multi-agent systems are prevalent across various domains, characterized by misaligned objectives and information asymmetry, which facilitate the study of incentive design and information design. Existing research often assumes known models and static environments. Motivated by this, we propose a Dynamic Incentive and Information Design (DIID) framework for finite-horizon Markov games, involving a principal and multiple agents. Our focus is on how the principal learns their optimal policy based on data generated through interactions with agents. The main challenge lies in balancing the principal’s regret and violations of agents’ incentive compatibility constraints during interactions. We establish a lower bound characterizing the trade-off between the two objectives and propose an algorithm attaining the optimal trade-off, i.e. O~(T2/3)\tilde{\mathcal{O}}(T^{2/3}) regret and constraint violation. Additionally, with access to additional unilateral deviation information of the agents, we propose an algorithm attaining improved guarantees that achieve O~(T1/2)\tilde{\mathcal{O}}(T^{1/2}) for both regret and constraint violation simultaneously.

1434Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

[openreview] [pdf]

Abstract Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses, despite having access to preference data that includes reward scores from judge models during AI feedback. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. The unawareness of the reward scores also drives the LLM to indiscriminately favor the low-quality chosen responses and fail to generalize to responses with the highest rewards, which are sparse in data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. This dataset is easily integrated with existing direct alignment algorithms and is applicable to any preference dataset. The experimental results across instruction-following benchmarks including AlpacaEval 2.0, MT-Bench, and Arena-Hard-Auto demonstrate that our approach consistently boosts the performance of DPO by a considerable margin across diverse models such as Zephyr, Mistral, Qwen2, Llama3.1, Gemma2, and SPPO. Additionally, on six academic benchmarks including GSM8K, GPQA, MUSR, TruthfulQA, BBH, and ARC, our method improves their average accuracy. When applying our method to on-policy data, the resulting DPO model outperforms various baselines and achieves state-of-the-art results on AlpacaEval 2.0. Through comprehensive ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere dataset expansion.

1435SeeThruAnything: Learning to Remove Any Obstructions Across Distributions

[openreview] [pdf]

Abstract Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose SeeThruAnything, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual commands, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that SeeThruAnything consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during training.

1436The Overcooked Generalisation Challenge

[openreview] [pdf]

Abstract We introduce the Overcooked Generalisation Challenge (OGC) – the first bench-mark to study reinforcement learning agents’ zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has evaluated cooperating agents only on the same level or with the same partner, thus failing to capture generalisation abilities essential for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first evaluated with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license:http://anonymised.edu. We show that state-of-the-art DCD algorithms fail to produce useful policies on this novel challenge, even if combined with recent network architectures specifically designed for scalability and generalisability. As such, the OGC pushes the boundaries of real-world human-AI cooperation by enabling research on the impact of generalisation on cooperating agents.

1437What Are Good Positional Encodings for Directed Graphs?

[openreview] [pdf]

Abstract Positional encodings (PEs) are essential for building powerful and expressive graph neural networks and graph transformers, as they effectively capture the relative spatial relationships between nodes. Although extensive research has been devoted to PEs in undirected graphs, PEs for directed graphs remain relatively unexplored. This work seeks to address this gap. We first introduce the notion ofWalk Profile, a generalization of walk-counting sequences for directed graphs. A walk profile encompasses numerous structural features crucial for directed graph-relevant applications, such as program analysis and circuit performance prediction. We identify the limitations of existing PE methods in representing walk profiles and propose a novelMulti-q Magnetic Laplacian PE, which extends the Magnetic Laplacian eigenvector-based PE by incorporating multiple potential factors. The new PE can provably express walk profiles. Furthermore, we generalize prior basis-invariant neural networks to enable the stable use of the new PE in the complex domain. Our numerical experiments validate the expressiveness of the proposed PEs and demonstrate their effectiveness in solving sorting network satisfiability and performing well on general circuit benchmarks.

1438Projection Optimal Transport on Tree-Ordered Lines

[openreview] [pdf]

Abstract Many variants of Optimal Transport (OT) have been developed to address its heavy computation. Among them, notably, Sliced Wasserstein (SW) is widely used for application domains by projecting the OT problem onto one-dimensional lines, and leveraging the closed-form expression of the univariate OT to reduce the computational burden. However, projecting measures onto low-dimensional spaces can lead to a loss of topological information. To mitigate this issue, in this work, we propose to replace one-dimensional lines with a more intricate structure, called \emph{tree systems}. This structure is metrizable by a tree metric, which yields a closed-form expression for OT problems on tree systems. We provide an extensive theoretical analysis to formally define tree systems with their topological properties, introduce the concept of splitting maps, which operate as the projection mechanism onto these structures, then finally propose a novel variant of Radon transform for tree systems and verify its injectivity. This framework leads to an efficient metric between measures, termed Tree-Sliced Wasserstein distance on Systems of Lines (TSW-SL). By conducting a variety of experiments on gradient flows, image style transfer, and generative models, we illustrate that our proposed approach performs favorably compared to SW and its variants.

1439Synthetic continued pretraining

[openreview] [pdf]

Abstract Pretraining on large-scale, unstructured internet text enables language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient---to learn a fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source corpus and then generates diverse text by drawing connections between those entities. Synthetic continued pretraining with EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If the source documents are instead available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can “rearrange” knowledge to enable more data-efficient learning.

1440Differentiation Through Black-Box Quadratic Programming Solvers

[openreview] [pdf]

Abstract In recent years, many deep learning approaches have incorporated layers that solve optimization problems (e.g., linear, quadratic, and semidefinite programs). Integrating these optimization problems as differentiable layers requires computing the derivatives of the optimization problem’s solution with respect to its objective and constraints. This has so far prevented the use of state-of-the-art black-box numerical solvers within neural networks, as they lack a differentiable interface. To address this issue for one of the most common convex optimization problems - quadratic programming (QP) - we introducedQP, a modular framework that enables plug-and-play differentiation for any QP solver, allowing seamless integration into neural networks and bi-level optimization tasks. Our solution is based on the core theoretical insight that knowledge of the active constraint set at the QP optimum allows forexplicitdifferentiation. This insight reveals a unique relationship between the computation of the solution and its derivative, enabling efficient differentiation of any solver, that only requires the primal solution. Our implementation, which will be made publicly available, interfaces with an existing framework that supports over 15 state-of-the-art QP solvers, providing each with a fully differentiable backbone for immediate use as a differentiable layer in learning setups. To demonstrate the scalability and effectiveness of dQP, we evaluate it on a large benchmark dataset of QPs with varying structures. We compare dQP with existing differentiable QP methods, demonstrating its advantages across a range of problems, from challenging small and dense problems to large-scale sparse ones, including a novel bi-level geometry optimization problem.

1441Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?

[openreview] [pdf]

Abstract Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns ( B = f(A) ) from one source and ( C = g(B) ) from another, they can deduce ( C = g(f(A)) ) effortlessly, even without encountering ( AC ) or ( ABC ) together, showcasing their compositional generalization ability. In this paper, we introduce a learning task, “FTCT” (Fragmented at Training, Chained at Testing), to assess if Transformers can replicate this skill. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and data’s relative knowledge ratio. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.

1442GAUSSIANFLOW: SPLATTING GAUSSIAN DYNAMICS FOR 4D CONTENT CREATION

[openreview] [pdf]

Abstract Creating 4D fields of Gaussian Splatting from images or videos is a challenging task due to its under-constrained nature. While the optimization can draw photometric reference from the input videos or be regulated by generative models, directly supervising Gaussian motions remains underexplored. In this paper, we introduce a novel concept, Gaussian flow, which connects the dynamics of 3D Gaussians and pixel velocities between consecutive frames. The Gaussian flow can be efficiently obtained by splatting Gaussian dynamics into the image space. This differentiable process enables direct dynamic supervision from optical flow. Our method significantly benefits 4D dynamic content generation and 4D novel view synthesis with Gaussian Splatting, especially for contents with rich motions that are hard to be handled by existing methods. The common color drifting issue that happens in 4D generation is also resolved with improved Guassian dynamics. Superior visual quality on extensive experiments demonstrates our method’s effectiveness. As shown in our evaluation, Gaussian Flow can drastically improve both quantitative and qualitative results for 4D Generation and 4D novel view synthesis.

1443Doubly Optimal Policy Evaluation for Reinforcement Learning

[openreview] [pdf]

Abstract Policy evaluation estimates the performance of a policy by (1) collecting data from the environment and (2) processing raw data into a meaningful estimate. Due to the sequential nature of reinforcement learning, any improper data-collecting policy or data-processing method substantially deteriorates the variance of evaluation results over long time steps. Thus, policy evaluation often suffers from large variance and requires massive data to achieve the desired accuracy. In this work, we design an optimal combination of data-collecting policy and data-processing baseline. Theoretically, we prove our doubly optimal policy evaluation method is unbiased and guaranteed to have lower variance than previously best-performing methods. Empirically, compared with previous works, we show our method reduces variance substantially and achieves superior empirical performance.

1444Adaptive Inference: Theoretical Limits and Opportunities for Efficient AI

[openreview] [pdf]

Abstract With the commercial deployment of increasingly larger and more complex neural networks at the cloud and the edge in recent years, inference has become too costly in terms of compute workload worldwide. Adaptive inference methods, which dynamically adjust a neural network’s size or structure during inference, offer a means to enhance efficiency of neural networks beyond what static network compression and optimization methods can fundamentally achieve.This paper introduces the first theoretical framework for quantifying the efficiency and performance gain opportunity size of adaptive inference algorithms. We provide new approximate and exact bounds for the achievable efficiency and performance gains, supported by empirical evidence demonstrating the potential for 10-100x efficiency improvements in both Computer Vision and Natural Language Processing tasks without incurring any performance penalties. Additionally, we offer insights on improving achievable efficiency gains through the optimal selection and design of adaptive inference state spaces.

1445Explainable Sequential Optimization

[openreview] [pdf]

Abstract We propose formulating stochastic model predictive control into a coalition game to use Shapley values for feature attribution. Such analysis is crucial for transparency and achieving optimal outcomes in high-stake applications such as portfolio optimization and autonomous driving. We categorize Shapley values estimation methods into three families: those based on weighted linear regression, sampling permutations, and multilinear extension. We survey, benchmark, and provide valuable insight into these methods, previously not attempted in this context. Our experiments show that halved Owen sampling from multilinear extension and KernelShap-Paired from weighted linear regression, both utilizing antithetic sampling, perform best.

1446Can we talk models into seeing the world differently?

[openreview] [pdf]

Abstract Unlike traditional vision-only models, vision language models (VLMs) offer an intuitive way to access visual content through language prompting by combining a large language model (LLM) with a vision encoder. However, both the LLM and the vision encoder come with their own set of biases, cue preferences, and shortcuts, which have been rigorously studied in uni-modal models. A timely question is how such (potentially misaligned) biases and cue preferences behave under multi-modal fusion in VLMs. As a first step towards a better understanding, we investigate a particularly well-studied vision-only bias - the texture vs. shape bias and the dominance of local over global information. As expected, we find that VLMs inherit this bias to some extent from their vision encoders. Surprisingly, the multi-modality alone proves to have important effects on the model behavior, i.e., the joint training and the language querying change the way visual cues are processed. While this direct impact of language-informed training on a model’s visual perception is intriguing, it raises further questions on our ability to actively steer a model’s output so that its prediction is based on particular visual cues of the user’s choice. Interestingly, VLMs have an inherent tendency to recognize objects based on shape information, which is different from what a plain vision encoder would do. Further active steering towards shape-based classifications through language prompts is however limited. In contrast, active VLM steering towards texture-based decisions through simple natural language prompts is often more successful.

1447Proposer-Agent-Evaluator (PAE): Autonomous Skill Discovery For Foundation Model Internet Agents

[openreview] [pdf]

Abstract The vision of a broadly capable and goal-directed agent, such as an Internet-browsing agent in the digital world and a household humanoid in the physical world, has rapidly advanced, thanks to the generalization capability of foundation models. Such a generalist agent needs to have a large and diverse skill repertoire, such as finding directions between two travel locations and buying specific items from the Internet. If each skill needs to be specified manually through a fixed set of human-annotated instructions, the agent’s skill repertoire will necessarily be limited due to the quantity and diversity of human-annotated instructions. In this work, we address this challenge by introducing Proposer-Agent-Evaluator (PAE), a novel framework that enables foundation model agents to autonomously discover and practice skills in the wild. At the heart of PAE is a context-aware task proposer that autonomously proposes tasks for the agent to practice with context information of the websites such as user demos or even just the name of the website itself. Then, the agent policy attempts those tasks in the real world with resulting trajectories evaluated by an autonomous model-based success evaluator. The success evaluation serves as the reward signal for the agent to refine its policies through RL. We validate PAE on challenging vision-based web navigation, using both real-world and self-hosted websites from WebVoyager and WebArena. Our results show that PAE significantly improves the zero-shot generalization capability of VLM Internet agents (more than 30% relative improvement) to both unseen tasks and websites. Our model also achieves an absolute advantage of over 10% (from 22.6% to 33.0%) comparing to other state-of-the-art open source VLM agents including Qwen2VL-72B. To the best of our knowledge, this work represents the first attempt to apply autonomous task proposal with RL for agents, achieving SOTA performance among open-source models. We plan to release our models and code to facilitate further research.

1448Efficient Causal Decision Making with One-sided Feedback

[openreview] [pdf]

Abstract We study a class of decision-making problems with one-sided feedback, where outcomes are only observable for specific actions. A typical example is bank loans, where the repayment status is known only if a loan is approved and remains undefined if rejected. In such scenarios, conventional approaches to causal decision evaluation and learning from observational data are not directly applicable. In this paper, we introduce a novel value function to evaluate decision rules that addresses the issue of undefined counterfactual outcomes. Without assuming no unmeasured confounders, we establish the identification of the value function using shadow variables. Furthermore, leveraging semiparametric theory, we derive the efficiency bound for the proposed value function and develop efficient methods for decision evaluation and learning. Numerical experiments and a real-world data application demonstrate the empirical performance of our proposed methods.

[openreview] [pdf]

Abstract Traditional reinforcement learning (RL) typically requires vast amounts of training data to develop effective policies. In contrast, large language models (LLMs) exhibit strong generalization and zero-shot capabilities, but struggle with plan- ning and understanding complex action policies. In this work, we introduce STRATEGIST, a novel approach that integrates the strengths of both methods. Our approach leverages LLMs to learn high-level strategic abstractions, which are then refined and executed by a low-level mechanism, such as Monte Carlo Tree Search (MCTS). STRATEGIST is a generalizable framework that can be trained through population-based self-play simulations and self-improvement, without the need for prior training data. We demonstrate the effectiveness of STRATEGIST in learning optimal policies for competitive, multi-turn games with partial informa- tion, including Game of Pure Strategy (GOPS) and multi-agent, hidden-identity discussion games like The Resistance: Avalon. Our results show that agents trained with STRATEGIST outperform those trained with traditional RL methods, other LLM-based skill acquisition techniques, and pre-existing LLM agents across both game environments.

1450Free Hunch: Denoiser Covariance Estimation for Diffusion Models Without Extra Costs

[openreview] [pdf]

Abstract The covariance for clean data given a noisy observation is an important quantity in many conditional generation methods for diffusion models. Current methods require heavy test-time computation, altering the standard diffusion training process or denoiser architecture, or making heavy approximations. We propose a new framework that sidesteps these issues by using covariance information that is available for free from training data and the curvature of the generative trajectory, which is linked to the covariance through the second-order Tweedie’s formula. We integrate these sources of information using (i) a novel method to transfer covariance estimates across noise levels and (ii) low-rank updates in a given noise level. We validate the method on linear inverse problems, where it outperforms recent baselines, especially with fewer diffusion steps.

1451Hierarchical Preference Optimization: Learning to achieve goals via feasible subgoals prediction

[openreview] [pdf]

Abstract This work introduces Hierarchical Preference Optimization (HPO), a novel approach to hierarchical reinforcement learning (HRL) that addresses non-stationarity and infeasible subgoal generation issues when solving complex robotic control tasks. HPO leverages maximum entropy reinforcement learning combined with token-level Direct Preference Optimization (DPO), eliminating the need for pre-trained reference policies that are typically unavailable in challenging robotic scenarios. Mathematically, we formulate HRL as a bi-level optimization problem and transform it into a primitive-regularized DPO formulation, ensuring feasible subgoal generation and avoiding degenerate solutions. Extensive experiments on challenging robotic navigation and manipulation tasks demonstrate HPO’s impressive performance, where HPO shows an improvement of up to 35% over the baselines. Furthermore, ablation studies validate our design choices, and quantitative analyses confirm HPO’s ability to mitigate non-stationarity and infeasible subgoal generation issues in HRL.

1452Learn from Interactions: General-Sum Interactive Inverse Reinforcement Learning

[openreview] [pdf]

Abstract This paper studies the problem that a learner aims to learn the reward function of the expert from the interaction with the expert and how to interact with the expert. We formulate the problem as a stochastic bi-level optimization problem and develop a double-loop algorithm “general-sum interactive inverse reinforcement learning” (GSIIRL). In the GSIIRL, the learner first learns the reward function of the expert in the inner loop and then learns how to interact with the expert in the outer loop. We theoretically prove the convergence of our algorithm and validate our algorithm through simulations.

1453Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback

[openreview] [pdf]

Abstract We study learning the approximate Nash equilibrium (NE) policy profile in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence. The algorithms in previous works studying this problem either require full-information feedback or only have asymptotic convergence rates. In contrast, we study IIEFGs in the formulation of partially observable Markov games (POMGs) with the perfect-recall assumption and bandit feedback, where the knowledge of the game is not known a priori and only the rewards of the experienced information set and action pairs are revealed to the learners in each episode. Our algorithm utilizes a negentropy regularizer weighted by a virtual transition over information set-action space. By carefully designing the virtual transition together with the leverage of the entropy regularization technique, we prove that our algorithm converges to the NE of IIEFGs with a provable finite-time convergence rate of O~(k18)\widetilde{O}(k^{-\frac{1}{8}}) with high probability under bandit feedback, thus answering the second question of \citet{Fiegel2023adapting} affirmatively.

1454Improved Stochastic Controlled Averaging for Distributed and Federated Learning

[openreview] [pdf]

Abstract Distributed and federated learning (D/FL) is a powerful machine learning (ML) paradigm in which clients collaborate to train a model under the coordination of a central server. Depending on the nature of clients, data in each client might have the same distribution (called the homogeneous setting) or different distributions (the heterogeneous setting). The state-of-the-art D/FL algorithm SCAFFOLD addresses the critical issue of data heterogeneity through the use of control variables. However, while theoretical analysis suggests that the convergence rate of SCAFFOLD is independent of data heterogeneity, the practical performance of SCAFFOLD is often inconsistent in homogeneous and heterogeneous settings. Motivated by the disagreement between theory and practice of SCAFFOLD, in this work, we propose a novel D/FL algorithm to bridge this experimental performance gap while preserving similar theoretical guarantees as SCAFFOLD. The proposed algorithm accommodates arbitrary data heterogeneity, partial participation, local updates, and supports unbiased communication compression. Theoretically, we prove that our algorithm is unaffected by data heterogeneity and achieves state-of-the-art convergence rate as SCAFFOLD. Furthermore, numerical experiments indicate that our algorithm achieves consistent (similar) test accuracy in both homogeneous and heterogeneous settings while often converges faster than existing baselines.

1455AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs

[openreview] [pdf]

Abstract Jailbreak attacks serve as essential red-teaming tools, proactively assessing whether LLMs can behave responsibly and safely in adversarial environments. Despite diverse strategies (e.g., cipher, low-resource language, persuasions, and so on) that have been proposed and shown success, these strategies are still manually designed, limiting their scope and effectiveness as a red-teaming tool. In this paper, we propose AutoDAN-Turbo, a black-box jailbreak method that can automatically discover as many jailbreak strategies as possible from scratch, without any human intervention or predefined scopes (e.g., specified candidate strategies), and use them for red-teaming. As a result, AutoDAN-Turbo can significantly outperform baseline methods, achieving a 74.3% higher average attack success rate on public benchmarks. Notably, AutoDAN-Turbo achieves an 88.5 attack success rate on GPT-4-1106-turbo. In addition, AutoDAN-Turbo is a unified framework that can incorporate existing human-designed jailbreak strategies in a plug-and-play manner. By integrating human-designed strategies, AutoDAN-Turbo can even achieve a higher attack success rate of 93.4 on GPT-4-1106-turbo.

1456Hot PATE: Private Aggregation of Distributions for Diverse Tasks

[openreview] [pdf]

Abstract The Private Aggregation of Teacher Ensembles (PATE) framework is a versatile approach to privacy-preserving machine learning. In PATE, responses made based on different parts of sensitive data are aggregated into a single response in a privacy-preserving way. Recently, multiple works applied PATE for tasks such as sequential text generation that are inherently diverse (or “hot”), with multiple valid responses. These designs, however, suffer from tension between diversity and privacy -- since diversity in the responses reduces agreement which forces the aggregation to use smaller noise scales and thus incur higher privacy loss. But limiting diversity of the aggregate response is undesirable since in large models, the very knowledge we want to transfer is encapsulated in the response distribution. We propose \emph{hot PATE} that is tailored for the diverse setting where responses are distributions. We formally define \emph{preserving diversity} and design an efficient aggregation method that provably transfers the diversity to the (randomized) aggregate response while incurring no privacy penalty. The method can be implemented using an API access to proprietary models and used as a plug-in replacement for the baseline ``cold’’ PATE in existing methods. We demonstrate empirically the potential of hot PATE for an order of magnitude improvement in a task of in-context learning via prompts.

1457Scalable Ensemble Diversification for OOD Generalization and Detection

[openreview] [pdf]

Abstract Training a diverse ensemble of models has several practical applications such as providing candidates for model selection with better out-of-distribution (OOD) generalization, and enabling the detection of OOD samples via Bayesian principles. An existing approach to diverse ensemble training encourages the models to disagree on provided OOD samples. However, the approach is computationally expensive and it requires well-separated ID and OOD examples, such that it has only been demonstrated in small-scale settings.Method.This work presents a Hardness-based Diversification Regularizer (HDR) applicable to large-scale settings (e.g. ImageNet) that does not require OOD samples. Instead, HDR identifies hard training samples on the fly and encourages the ensemble members to disagree on these. To improve scaling, we show how to avoid the expensive computations in existing methods of exhaustive pairwise disagreements across models.Results.We evaluate the benefits of diversification with experiments on ImageNet. First, for OOD generalization, we observe large benefits from the diversification in multiple settings including output-space (classical) ensembles and weight-space ensembles (model soups). Second, for OOD detection, we turn the diversity of ensemble hypotheses into a novel uncertainty score estimator that surpasses a large number of OOD detection baselines.

1458Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry

[openreview] [pdf]

Abstract Diffusion models have demonstrated an remarkable capability to edit or imitate images, which has raised concerns regarding the safeguarding of intellectual property. To address these concerns, the adoption of adversarial attacks, which introduce adversarial perturbations that can fool the targeted diffusion model into protected images , has emerged as a viable solution. Consequently, diffusion models, like many other deep network models, are believed to be susceptible to adversarial attacks. However, in this work, we draw attention to an important oversight in existing research, as all previous studies have focused solely on attacking latent diffusion models (LDMs), neglecting adversarial examples for diffusion models in the pixel space (PDMs). Through extensive experiments, we demonstrate that nearly all existing adversarial attack methods designed for LDMs, as well as adaptive attacks designed for PDMs, fail when applied to PDMs. We attribute the vulnerability of LDMs to their encoders, indicating that diffusion models exhibit strong robustness against adversarial attacks. Building upon this insight, we find that PDMs can be used as an off-the-shelf purifier to effectively eliminate adversarial patterns generated by LDMs, thereby maintaining the integrity of images. Notably, we highlight that most existing protection methods can be easily bypassed using PDM-based purification. We hope our findings prompt a reevaluation of adversarial samples for diffusion models as potential protection methods.

1459Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning

[openreview] [pdf]

Abstract Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL). While effective in some domains like navigation and locomotion, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robotic arm pushing a particular target block to a goal location. In this case, hindsight relabeling will give high rewards to any trajectory that does not interact with the block. However, these behaviors are only useful when the object is already at the goal—an extremely rare case in practice. A dataset dominated by these kinds of trajectories will make learning more difficult. On the other hand, much of the meaningful behavior is filtered through interactions such as pushing the block with the gripper. To address this issue, we introduce Hindsight Relabeling using Interactions (HInt), which combines interactions with hindsight relabeling to improve the sample efficiency of downstream RL. However, interactions do not have a general consensus statistical definition, and especially one useful for downstream GCRL. Therefore, we propose a definition of interactions based on the concept of null counterfactual: a cause object is interacting with a target object if in a world where the cause object did not exist, the target object would have different transition dynamics. We leverage this definition to infer interactions in Null Counterfactual Interaction Inference (NCII), which uses a “nulling” operation with a learned model to simulate absences and infer interactions. We demonstrate that NCII is able to achieve significantly improved interaction inference accuracy on both simple linear dynamics domains and dynamic robotic domains in Robosuite, Robot Air Hockey, and Franka Kitchen. Furthermore, we demonstrate that HInt improves sample efficiency by up to 4× in these domains as goal-conditioned tasks.

1460Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

[openreview] [pdf]

Abstract One of the main challenges in optimal scaling of large language models (LLMs) is the prohibitive cost of hyperparameter tuning, particularly learning rate η\eta and batch size BB. While techniques like μ\muP (Yang et al., 2022) provide scaling rules for optimal η\eta transfer in the infinite model size limit, the optimal scaling behavior in the infinite data size limit (TT \to \infty) remains unknown. We fill in this gap by observing for the first time an interplay of three optimal η\eta scaling regimes: ηT\eta \propto \sqrt{T}, η1\eta \propto 1, and η1/T\eta \propto 1/\sqrt{T} with transitions controlled by BB and its relation to the time-evolving critical batch size BcritTB_\mathrm{crit} \propto T. Furthermore, we show that the optimal batch size is positively correlated with BcritB_\mathrm{crit}: keeping it fixed becomes suboptimal over time even if learning rate is scaled optimally. Surprisingly, our results demonstrate that the observed optimal η\eta and BB dynamics are preserved with μ\muP model scaling, challenging the conventional view of BcritB_\mathrm{crit} dependence solely on loss value. Complementing optimality, we examine the sensitivity of loss to changes in learning rate, where we find the sensitivity to decrease with TT \to \infty and to remain constant with μ\muP model scaling. We hope our results make the first step towards a unified picture of the joint optimal data and model scaling.

1461Improving Convergence Guarantees of Random Subspace Second-order Algorithm for Nonconvex Optimization

[openreview] [pdf]

Abstract In recent years, random subspace methods have been actively studied for large-dimensional non-convex problems. Recent subspace methods have improved theoretical guarantees such as iteration complexity and local convergence rate while reducing computational costs by deriving descent directions in randomly selected low-dimensional subspaces. This paper proposes the Random Subspace Homogenized Trust Region (RSHTR) method with the best theoretical guarantees among random subspace algorithms for non-convex optimization. RSHTR achieves an ε\varepsilon-approximate first-order stationary point in O(ε3/2)O(\varepsilon^{-3/2}) iterations, converging locally at a linear rate. Furthermore, under rank-deficient conditions, RSHTR satisfies ε\varepsilon-approximate second-order necessary condition in O(ε3/2)O(\varepsilon^{-3/2}) iterations and exhibits a local quadratic convergence. Experiments on real-world datasets verify the benefits of RSHTR.

1462TF-score: Time-series Forecasting using score-based diffusion model

[openreview] [pdf]

Abstract Diffusion models have emerged as powerful generative models, capable of synthesizing high-quality images by capturing complex underlying patterns. Building on this success, these models have been adapted for time-series forecasting, a domain characterized by intricate temporal dependencies. However, most existing works have focused primarily on empirical performance without sufficient theoretical exploration. In this paper, we address this gap by introducing a generalized loss function within the diffusion-based forecasting framework. Leveraging this foundation, we introduce TF-score, a score-based diffusion model designed to capture the interdependencies between historical data and future predictions. Extensive experiments across six benchmark datasets show that TF-score consistently surpasses leading baselines, including prior diffusion-based models. Furthermore, we extend existing guidance sampling strategies into a our score-based formulation, achieving performance gains across multiple datasets while providing a detailed analysis of the trade-offs involved.

14633DAxisPrompt: Promoting the 3D Grounding and Reasoning in GPT-4o

[openreview] [pdf]

Abstract Multimodal Large Language Models (MLLMs) exhibit impressive capabilities across a variety of tasks, especially when equipped with carefully designed visual prompts. However, existing studies primarily focus on logical reasoning and visual understanding, while the capability of MLLMs to operate effectively in 3D vision remains an ongoing area of exploration. In this paper, we introduce a novel visual prompting method called 3DAxisPrompt to elicit the 3D understanding capabilities of MLLMs in real-world scenes. More specifically, our method leverages the 3D coordinate axis and masks generated from the Segment Anything Model (SAM) to provide explicit geometric priors to MLLMs and then extend their impressive 2D grounding/reasoning ability to real-world 3D scenarios. Besides we also provide a thorough investigation of the potential visual prompting formats and conclude our findings to reveal the potential and limits of 3D understanding capabilities in GPT-4o. Finally, we build evaluation environments with four datasets, {\it i.e.} ShapeNet, ScanNet, FMB, and nuScene datasets, covering various 3D tasks. Based on this, we conduct extensive quantitative and qualitative experiments, which demonstrate the effectiveness of the proposed method. Overall, our study reveals that GPT-4o, with the help of 3DAxisPrompt, can effectively perceive an object’s 3D position in real-world scenarios. Nevertheless, a single prompt engineering approach does not consistently achieve the best outcomes for all 3D tasks. This study highlights the feasibility of leveraging MLLMs for 3D vision grounding/reasoning with prompt engineering techniques.

1464ActSafe: Active Exploration with Safety Constraints for Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) is ubiquitous in the development of modern AI systems. However, state-of-the-art RL agents require extensive, and potentially unsafe, interactions with their environments to learn effectively. These limitations confine RL agents to simulated environments, hindering their ability to learn directly in real-world settings. In this work, we present ActSafe, a novel model-based RL algorithm for safe and efficient exploration. ActSafe learns a well-calibrated probabilistic model of the system and plans optimistically w.r.t. the epistemic uncertainty about the unknown dynamics, while enforcing pessimism w.r.t. the safety constraints. Under regularity assumptions on the constraints and dynamics, we show that ActSafe guarantees safety during learning while also obtaining a near-optimal policy in finite time. In addition, we propose a practical variant of ActSafe that builds on latest model-based RL advancements and enables safe exploration even in high-dimensional settings such as visual control. We empirically show that ActSafe obtains state-of-the-art performance in difficult exploration tasks on standard safe deep RL benchmarks while ensuring safety during learning.

1465Interpolate: How Resetting Active Neurons can also improve Generalizability in Online Learning

[openreview] [pdf]

Abstract While neural networks have shown a significant gain in performance across a wide range of applications, they still struggle in non-stationary settings as they tend to lose their ability to adapt to new tasks — a phenomenon known as the loss of plasticity. The conventional approach to addressing this problem often involves resetting the most under-utilized or dormant parts of the network, suggesting that recycling such parameters is crucial for maintaining a model’s plasticity. In this study, we explore whether this approach is the only way to address plasticity loss. Contrary to previous findings, we show that resetting the most active parameters can also lead to better generalization. Additionally, we introduce a model merging method that can perform similarly or better compared to traditional resetting methods, offering a new perspective on training dynamics in non-stationary settings.

1466Think Thrice Before You Act: Progressive Thought Refinement in Large Language Models

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) have demonstrated that progressive refinement, rather than providing a single answer, results in more accurate and thoughtful outputs. However, existing methods often rely heavily on supervision signals to evaluate previous responses, making it difficult to effectively assess output quality in more open-ended scenarios. Additionally, these methods are typically designed for specific tasks, which limits their generalization to new domains. To address these limitations, we propose Progressive Thought Refinement (PTR), a framework that enables LLMs to progressively refine their responses. PTR operates in two phases: (1) Thought data construction stage: We propose a \textit{weak and strong model collaborative selection} strategy to build a high-quality progressive refinement dataset to ensure logical consistency from thought to answers, and the answers are gradually refined in each round. (2) Thought-Mask Fine-Tuning Phase: We design a training structure to mask the “thought” and adjust loss weights to encourage LLMs to refine prior thought, teaching them to implicitly understand “how to improve” rather than “what is correct.” Experimental results show that PTR significantly enhances LLM performance across ten diverse tasks (avg. from 49.6% to 54.48%) without task-specific fine-tuning. Notably, in more open-ended tasks, LLMs also demonstrate substantial improvements in the quality of responses beyond mere accuracy, suggesting that PTR truly teaches LLMs to self-improve over time. Our project’s source code and datasets are available athttps://anonymous.4open.science/r/PTR_LLM

1467On the Limitations of General Purpose Domain Generalisation Methods

[openreview] [pdf]

Abstract The Domain Generalisation (DG) problem setting requires a model trained on a set of data distributions (domains) to generalise to new distributions. Despite a huge amount of empirical study, previous DG methods fail to substantially outperform empirical risk minimisation on rigorous DG benchmarks. Motivated by this, we analyse the DG problem from a learning theoretic perspective andcharacterise in which situations DG will succeed or fail. Specifically, we derive upper bounds on the excess risk of ERM and lower bounds on the minimax excess risk, for three settings with different restrictions on how the domains may differ. In the most unconstrained setting, we show that all learning algorithms converge slowly with respect to number of training domains, potentially explaining the lack of algorithmic progress in this area. We also consider constrained settings including limiting the pairwise domain distances as measured by a broad class of integral probability metrics, and constraining all domains to have the same underlying support. In these constrained cases, DG algorithms can converge more rapidly. Notably, for all three settings, the we demonstrate that ERM has an optimal rate of convergence towards the best possible model. Our analysis guides practitioners interested in knowing when cross-domain generalisation might be reliable, and suggests strategies for optimising the performance of ERM in each setting.

1468Large Scale Knowledge Washing

[openreview] [pdf]

Abstract Large language models show impressive abilities in memorizing world knowledge, which leads to concerns regarding memorization of private information, toxic or sensitive knowledge, and copyrighted content. We introduce the problem of Large Scale Knowledge Washing, focusing on unlearning an extensive amount of factual knowledge. Previous unlearning methods usually define the reverse loss and update the model via backpropagation, which may affect the model’s fluency and reasoning ability or even destroy the model due to extensive training with the reverse loss. Existing works introduce additional data from downstream tasks to prevent the model from losing capabilities, which requires downstream task awareness. Controlling the tradeoff of unlearning existing knowledge while maintaining existing capabilities is also challenging. To this end, we propose LaW (Large Scale Washing), where we update the MLP layers in decoder-only large language models to perform knowledge washing, as inspired by model editing methods. We derive a new objective with the knowledge to be unlearned to update the weights of certain MLP layers. Experimental results demonstrate the effectiveness of LaW in forgetting target knowledge while maximally maintaining reasoning ability. The code will be open-sourced.

1469A Comprehensive Framework for Analyzing the Convergence of Adam: Bridging the Gap with Stochastic Gradient Descent

[openreview] [pdf]

Abstract Adaptive Moment Estimation (Adam) is a cornerstone optimization algorithm in deep learning, widely recognized for its flexibility with adaptive learning rates and efficiency in handling large-scale data. However, despite its practical success, the theoretical understanding of Adam’s convergence has been constrained by stringent assumptions, such as almost surely bounded stochastic gradients or uniformly bounded gradients, which are more restrictive than those typically required for analyzing stochastic gradient descent (SGD).In this paper, we introduce a novel and comprehensive framework for analyzing the convergence properties of Adam. This framework offers a versatile approach to establishing Adam’s convergence. Specifically, we prove that Adam achieves asymptotic (last iterate sense) convergence in both the almost sure sense and the (L_1) sense under the relaxed assumptions typically used for SGD, namely (L)-smoothness and the ABC inequality. Meanwhile, under the same assumptions, we show that Adam attains non-asymptotic sample complexity bounds similar to those of SGD.

1470Power of Augmented Replicas in Out-Of-Distribution Detection

[openreview] [pdf]

Abstract Data augmentation is widely used in machine learning to enhance training datasets by introducing minor variations to the original data, traditionally aiming to prevent overfitting and improve model performance. This paper explores a novel application of data augmentation during the inference stage to enhance out-of-distribution (OOD) detection. The proposed method involves replicating the inference image multiple times, applying various transformation techniques to each replica, and then evaluating the detectors using these augmented images. The effectiveness of this approach is assessed across different detectors, models, and datasets, demonstrating its potential to improve OOD detection capabilities.

1471Subspace Node Pruning

[openreview] [pdf]

Abstract Efficiency of neural network inference is undeniably important in a time where commercial use of AI models increases daily. Node pruning is the art of removing computational units such as neurons, filters, attention heads, or even entire layers to significantly reduce inference time while retaining network performance. In this work, we propose the projection of unit activations to an orthogonal subspace in which there is no redundant activity and within which we may prune nodes while simultaneously recovering the impact of lost units via linear least squares. We identify that, for effective node pruning, this subspace must be constructed using a triangular transformation matrix, a transformation which is equivalent to and unnormalized Gram-Schmidt orthogonalization. We furthermore show that the order in which units are orthogonalized can be optimised to maximally reduce node activations in our subspace and thereby form a more optimal ranking of nodes. Finally, we leverage these orthogonal subspaces to automatically determine layer-wise pruning ratios based upon the relative scale of node activations in our subspace, equivalent to cumulative variance. Our proposed method reaches state of the art when pruning ImageNet trained VGG-16 and rivals more complex state of the art methods when pruning ResNet-50 networks across a range of pruning ratios.

1472Guided Stream of Search: Learning to Better Search with Language Models via Optimal Path Guidance

[openreview] [pdf]

Abstract While language models have demonstrated impressive capabilities across a range of tasks, they still struggle with tasks that require complex planning and reasoning. Recent studies have proposed training language models on search processes rather than optimal solutions, resulting in better generalization performance even though search processes are noisy and even suboptimal. However, these studies overlook the value of optimal solutions, which can serve as step-by-step landmarks to guide more effective search. In this work, we explore how to leverage optimal solutions to enhance the search and planning abilities of language models. To this end, we propose guided stream of search (GSoS), which seamlessly incorporates optimal solutions into the self-generation process in a progressive manner, producing high-quality search trajectories. These trajectories are then distilled into the pre-trained model via supervised fine-tuning. Our approach significantly enhances the search and planning abilities of language models on Countdown, a simple yet challenging mathematical reasoning task. Notably, combining our method with RL fine-tuning yields further improvements, whereas previous supervised fine-tuning methods do not benefit from RL. Furthermore, our approach exhibits greater effectiveness than leveraging optimal solutions in the form of subgoal rewards.

1473Learning 4D Embodied World Models

[openreview] [pdf]

Abstract In this paper, we present a 4D embodied world model, which takes in an image observation and language instruction as input and predicts a 4D dynamic mesh predicting how the scene will change as the embodied agent performs actions based on the given instructions. In contrast to previously learned world models which typically generate 2D videos, our 4D model provides detailed 3D information on precise configurations and shape of objects in a scene over time. This allows us to effectively learn accurate inverse dynamic models for an embodied agent to execute a policy for interacting with the environment. To construct a dataset to train such 4D world models, we first annotate large-scale existing video robotics dataset using pretrained depth and normal prediction models to construct 3D consistent 4D models of each video. To efficiently learn generative models on this 4D data, we propose to train a video generative model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each video. We then present an algorithm to directly convert generated RGB, Depth and Normal images into high-quality dynamic 4D mesh models of the world. We illustrate how this enables us to predict high-quality meshes consistent across both time and space from embodied scenarios, render novel views for embodied scenes, as well as construct policies that substantially outperform those from prior 2D and 3D models of the world. Our code, model, and dataset will be made publicly available.

1474Unleashing the Potential of Temperature Scaling for Multi-Label Logit Distillation

[openreview] [pdf]

Abstract This paper undertakes meticulous scrutiny of the pure logit-based distillation under multi-label learning through the lens of activation function. We begin with empirically clarifying a recently discovered perspective that vanilla sigmoid per se is more suitable than tempered softmax in multi-label distillation, is not entirely correct. After that, we reveal that both the sigmoid and tempered softmax have an intrinsic limitation. In particular, we conclude that ignoring the decisive factor temperature τ\tau in the sigmoid is the essential reason for its unsatisfactory results. With this regard, we propose unleashing the potential of temperature scaling in the multi-label distillation and present Tempered Logit Distillation (TLD), an embarrassingly simple yet astonishingly performant approach. Specifically, we modify the sigmoid with the temperature scaling mechanism, deriving a new activation function, dubbed as tempered sigmoid. With theoretical and visual analysis, intriguingly, we identify that tempered sigmoid with τ\tau smaller than 1 provides an effect of hard mining by governing the magnitude of penalties according to the sample difficulty, which is shown as the key property to its success. Our work is accompanied by comprehensive experiments on COCO, PASCAL-VOC, and NUS-WIDE over several architectures across three multi-label learning scenarios: image classification, object detection, and instance segmentation. Distillation results evidence that TLD consistently harvests remarkable performance and surpasses the prior counterparts, demonstrating its superiority and versatility.

1475OpenPL: Realistic Evaluation of Prompt Learning for VLM in Open Environments

[openreview] [pdf]

Abstract Vision-language models (VLMs) have demonstrated impressive zero-shot capabilities across various image classification tasks. Their performance can be further enhanced through prompt learning methods. To evaluate the effectiveness of prompt learning, it is important to assess its robustness to new classes and distributional shifts. However, current studies typically assume single data distribution shifts and pre-known new class space, which still have gaps with real-world open environments where data distributions and classes are often uncertain and subject to continuous change. To better analyze the robustness of prompt learning methods in more realistic scenarios, we propose a novel evaluation benchmark called OpenPL from the following perspectives: 1) We reconstruct multiple scenarios of open environments, encompassing dynamic class changes, dynamic distribution shifts, and dynamic co-evolution of both distribution and classes; 2) We propose a series of new performance metrics for prompt learning methods based on the Dynamic Robustness Curve (DRC) to better understand their robustness in open environments; 3) We re-implement diverse prompt learning methods and evaluate their performance on the proposed OpenPL benchmark. The results show that no current prompt learning method is robust to open environments and no meaningful performance improvement is achieved compared to the zero-shot performance, designing robust prompt learning methods remains a difficult task. All re-implementations are available at \url{https://anonymous.4open.science/r/OpenPL-565E}.

1476Stable-Transformer: Towards a Stable Transformer Training

[openreview] [pdf]

Abstract The scale of parameters in Transformers has expanded dramatically—from hundreds of millions to several trillion. A key challenge when scaling the model to trillions is the training instability. Although many practical tricks, such as learning rate warmup, query-key normalization and better weight initialization, have been introduced to mitigate the training instability, a rigorous mathematical understanding of why such instabilities happen and why the above-mentioned tricks work well is still unclear. In this paper, we give a theoretical analysis of the initialization, normalization and attention mechanism in Transformers, and present a set of stabilized designs of the initialization, normalization and attention mechanism, which are thus termed as \textit{StableInit}, \textit{StableNorm} and \textit{StableAtten}, individually. In experiments, we demonstrate that each of our stabilized designs, \ie \textit{StableInit}, \textit{StableNorm} and \textit{StableAtten}, exhibits better stability. Furthermore, by putting the stabilized designs together, we propose a stabilized Transformer, termed \textit{Stable-Transformer}, and show in experiments that our proposed \textit{Stable-Transformer} achieves more stable training process.

1477Incremental Causal Effect for Time to Treatment Initialization

[openreview] [pdf]

Abstract We consider time to treatment initialization. This can commonly occur in preventive medicine, such as disease screening and vaccination; it can also occur with non-fatal health conditions such as HIV infection without the onset of AIDS; or in tech industry where items wait to be reviewed manually for spam or abusive contents, etc. While traditional causal inference focused on `when to treat’ and its effects, including their possible dependence on subject characteristics, we consider the incremental causal effect when the intensity of time to treatment initialization is intervened upon. We provide identification of the incremental causal effect without the commonly required positivity assumption, as well as an estimation framework using inverse probability weighting. We illustrate our approach via simulation, and apply it to a rheumatoid arthritis study to evaluate the incremental effect of time to start methotrexate on joint pain.

1478Marginalization Consistent Mixture of Separable Flows for Probabilistic Irregular Time Series Forecasting

[openreview] [pdf]

Abstract Probabilistic forecasting models for joint distributions of targets in irregular time series are a heavily under-researched area in machine learning with, to the best of our knowledge, only three models researched so far: GPR, the Gaussian Process Regression model (Dürichen et al., 2015), TACTiS, the Transformer-Attentional Copulas for Time Series Drouin et al. (2022); Ashok et al. (2024) and ProFITi (Yalavarthi et al., 2024b), a multivariate normalizing flow model based on invertible attention layers. While ProFITi, thanks to using multivariate normalizing flows, is the more expressive model with a better predictive performance, we will show that it suffers from marginalization inconsistency: it does not guarantee that the marginal distributions of a subset of variables in its predictive distributions coincide with the directly predicted distributions of these variables. Also, TACTiS does not provide any guarantees for marginalization consistency. We develop a novel probabilistic irregular time series forecasting model, Marginal- ization Consistent Mixtures of Separable Flows (moses), that mixes several nor- malizing flows with (i) Gaussian Processes with full covariance matrix as source distributions and (ii) a separable invertible transformation, aiming to combine the expressivity of normalizing flows with the marginalization consistency of Gaussians. In experiments on four different datasets we show that moses outper- form other state-of-the-art marginalization consistent models, perform on par with ProFITi, but different from ProFITi, guarantees marginalization consistency.

1479ON THE CONVERGENCE OF CYCLIC HIERARCHICAL FEDERATED LEARNING WITH HETEROGENEOUS DATA

[openreview] [pdf]

Abstract Hierarchical Federated Learning (HFL) advances the classic Federated Learning (FL) by introducing the multi-layer architecture between clients and the central server, in which edge servers aggregate models from respective clients and further send to the central server. Instead of directly uploading each update from clients for aggregation, the HFL not only reduces the communication and computational overhead but also greatly enhances the scalability of supporting a massive number of clients. When HFL operates for applications having a large-scale clients, edge servers train their models in a cyclic pattern (a ring architecture) as opposed to the star-type of architecture where each edge develops their own models independently.We refer it as Cyclic HFL(CHFL). Driven by its promising feature of handling data heterogeneity and resiliency, CHFL has a great potential to be deployed in practice. Unfortunately, the thorough convergence analysis on CHFL remains lacking, especially considering the widely-existing data heterogeneity issue among clients. To the best of our knowledge, we are the first to provide a theoretical convergence analysis for CHFL in strongly convex, general convex, and non-convex objectives. Our results demonstrate the convergence rate are O~(1/MNRKT)\tilde{\mathcal{O}}(1/MNRKT) for strongly convex objective, O(1/MNRKT)\mathcal{O}(1/\sqrt{MNRKT}) for general convex objective, and O(1/MNRKT)\mathcal{O}(1/\sqrt{MNRKT}) for non-convex objective, under standard assumptions. Here, MM is the number of edge servers, NN is the number of clients in edge, KK is local steps in client, and RR is the edge training round. Through extensive experiments on real-world datasets, besides validating our theoretical findings, we further show CHFL achieves a comparable or superior performance when accounting for both inter- and intra-edge data heterogeneity.

1480Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

[openreview] [pdf]

Abstract Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning can result in an LLM that could behave in a helpful and safe manner. In this paper, we for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, the malicious clients could automatically generate attack data without involving manual efforts and attack the FedIT system by training their local LLMs on such attack data. Unfortunately, this proposed safety attack not only can compromise the safety alignment of LLM trained via FedIT, but also can not be effectively defended against by many existing FL defense methods. Targeting this, we further propose a post-hoc defense method, which could rely on an fully automated pipeline: generation of defense data and further fine-tuning of the LLM. Extensive experiments show that our safety attack method can significantly compromise the LLM’s safety alignment (e.g., reduce safety rate by 70%), which can not be effectively defended by existing defense methods (at most 4% absolute improvement), while our safety defense method can significantly enhance the attacked LLM’s safety alignment (at most 69% absolute improvement).

1481Optimal Client Training in Federated Learning with Deep Reinforcement Learning

[openreview] [pdf]

Abstract Federated Learning (FL) is a distributed framework for collaborative model training over large-scale distributed data. Centralized FL leverages a server to aggregate client models which can enable higher performance while maintaining client data privacy. However, it has been shown that in centralized model aggregation, performance can degrade in the presence of non-IID data across different clients. We remark that training a client locally on more data than necessary does not benefit the overall performance of all clients. In this paper, we devise a novel framework that leverages Deep Reinforcement Learning (DRL) to optimize an agent that selects the optimal amount of data necessary to train a client model without oversharing information with the server. Starting from complete unawareness of the client’s performance, the DRL agent utilizes the change in training loss as a reward signal and learns to optimize the amount of data necessary for improving the client’s performance. Specifically, after each aggregation round, the DRL algorithm considers the local performance as the current state and outputs the optimal weights for each class in the training data to be used during the next round of local training. In doing so, the agent learns a policy that creates the optimal partition of the local training dataset during the FL rounds. After FL, the client utilizes the entire local training dataset to further enhance its performance on its own data distribution, mitigating the non-IID effects of aggregation. Through extensive experiments, we demonstrate that training FL clients through our algorithm results in superior performance on multiple benchmark datasets and FL frameworks.

1482Understanding Diffusion-based Representation Learning via Low-Dimensional Modeling

[openreview] [pdf]

Abstract This work addresses the critical question of why and when diffusion models, despite their generative design, are capable of learning high-quality representations in a self-supervised manner. We hypothesize that diffusion models excel in representation learning due to their ability to learn the low-dimensional distributions of image datasets via optimizing a noise-controlled denoising objective. Our empirical results support this hypothesis, indicating that variations in the representation learning performance of diffusion models across noise levels are closely linked to the quality of the corresponding posterior estimation. Grounded on this observation, we offer theoretical insights into the unimodal representation dynamics of diffusion models as noise scales vary, demonstrating how they effectively learn meaningful representations through the denoising process. We also highlight the impact of the inherent parameter-sharing mechanism in diffusion models, which accounts for their advantages over traditional denoising auto-encoders in representation learning.

1483Diffusion State-Guided Projected Gradient for Inverse Problems

[openreview] [pdf]

Abstract Recent advancements in diffusion models have been effective in learning data priors for solving inverse problems. They leverage diffusion sampling steps for inducing a data prior while using a measurement guidance gradient at each step to impose data consistency. For general inverse problems, approximations are needed when an unconditionally trained diffusion model is used since the measurement likelihood is intractable, leading to inaccurate posterior sampling. In other words, due to their approximations, these methods fail to preserve the generation process on the data manifold defined by the diffusion prior, leading to artifacts in applications such as image restoration. To enhance the performance and robustness of diffusion models in solving inverse problems, we propose Diffusion State-Guided Projected Gradient (DiffStateGrad), which projects the measurement gradient onto a subspace that is a low-rank approximation of an intermediate state of the diffusion process. DiffStateGrad, as a module, can be added to a wide range of diffusion-based inverse solvers to improve the preservation of the diffusion process on the prior manifold and filter out artifact-inducing components. We highlight that DiffStateGrad improves the robustness of diffusion models in terms of the choice of measurement guidance step size and noise while improving the worst-case performance. Finally, we demonstrate that DiffStateGrad improves upon the state-of-the-art on linear and nonlinear image restoration inverse problems.

1484Adaptive Rentention & Correction for Continual Learning

[openreview] [pdf]

Abstract Continual learning, also known as lifelong learning or incremental learning, refers to the process by which a model learns from a stream of incoming data over time. A common problem in continual learning is the classification layer’s bias towards the most recent task. Traditionally, methods have relied on incorporating data from past tasks during training to mitigate this issue. However, the recent shift in continual learning to memory-free environments has rendered these approaches infeasible. In this study, we propose a solution focused on the testing phase. We first introduce a simple Out-of-Task Detection method, OTD, designed to accurately identify samples from past tasks during testing. Leveraging OTD, we then propose: (1) an Adaptive Retention mechanism for dynamically tuning the classifier layer on past task data; (2) an Adaptive Correction mechanism for revising predictions when the model classifies data from previous tasks into classes from the current task. We name our approach Adaptive Retention & Correction (ARC). While designed for memory-free environments, ARC also proves effective in memorybased settings. Extensive experiments show that our proposed method can be plugged in to virtually any existing continual learning approach without requiring any modifications to its training procedure. Specifically, when integrated with state-of-the-art approaches, ARC achieves an average performance increase of 2.7% and 2.6% on the CIFAR-100 and Imagenet-R datasets, respectively

1485PREMIUM: LLM Personalization with Individual-level Preference Feedback

[openreview] [pdf]

Abstract With an increasing demand for LLM personalization, various methods have been developed to deliver customized LLM experiences, including in-context learning, retrieval augmentation, and parameter-efficient fine-tuning. However, most existing methods are not readily locally deployable, limited by the compute cost, privacy risks, and an inability to adapt to dynamic user preferences. Here, we propose to use a tag system to efficiently characterize user profiles, inspired from the insights from personality typology and recommendation systems. Based on the observation, we present a locally deployable LLM-agnostic framework for achieving LLM personalization: PREMIUM\textbf{PREMIUM} (P\textbf{P}reference R\textbf{R}anking EM\textbf{EM}powered I\textbf{I}ndividual U\textbf{U}ser M\textbf{M}odeling), which obtains individual-level feedback by having users rank responses and continuously self-iterates optimization during the interaction between the user and the LLM. Notably, a variant of PREMIUM, PREMIUM-Embed, can effectively capture user preferences while being deployable with laptop-level resources. Besides algorithmic innovation, we further prepare a novel dataset, Ranking-TAGER, which provides a valuable evaluation protocol for LLM personalization. Extensive experiments validate that PREMIUM remarkably outperforms various baselines, achieving a 15%-50% higher accuracy and a 2.5%-35% higher win rate on Ranking-TAGER, as well as a 3%-13% higher accuracy and a 2%-7.5% higher F1 Score on LaMP-2. More importantly, we further demonstrate that PREMIUM can develop an effective strategy with minimal interactive data, adapt to dynamic user preferences, and demonstrate excellent scalability in both scale and functionality.

1486Diffusion-Based Offline RL for Improved Decision-Making in Augmented Single ARC Task

[openreview] [pdf]

Abstract Effective long-term strategies enable AI systems to navigate complex environments by making sequential decisions over extended horizons. Similarly, reinforcement learning (RL) agents optimize decisions across sequences to maximize rewards, even without immediate feedback. To verify that Latent Diffusion-Constrained Q-learning (LDCQ), a prominent diffusion-based offline RL method, demonstrates strong reasoning abilities in multi-step decision-making, we aimed to evaluate its performance on the Abstraction and Reasoning Corpus (ARC). However, applying offline RL methodologies to enhance strategic reasoning in AI for solving tasks in ARC is challenging due to the lack of sufficient experience data in the ARC training set. To address this limitation, we introduce an augmented offline RL dataset for ARC, called Synthesized Offline Learning Data for Abstraction and Reasoning (SOLAR), along with the SOLAR-Generator, which generates diverse trajectory data based on predefined rules. SOLAR enables the application of offline RL methods by offering sufficient experience data. We synthesized SOLAR for a simple task and used it to train an agent with the LDCQ method. Our experiments demonstrate the effectiveness of the offline RL approach on a simple ARC task, showing the agent’s ability to make multi-step sequential decisions and correctly identify answer states. These results highlight the potential of the offline RL approach to enhance AI’s strategic reasoning capabilities.

1487Context is Key: A Benchmark for Forecasting with Essential Textual Information

[openreview] [pdf]

Abstract Forecasting is a critical task in decision making across various domains. While numerical data provides a foundation, it often lacks crucial context necessary for accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge or constraints, which can be efficiently communicated through natural language. However, the ability of existing forecasting models to effectively integrate this textual information remains an open question. To address this, we introduce “Context is Key” (CiK), a time series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities. We evaluate a range of approaches, including statistical models, time series foundation models and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. By presenting this benchmark, we aim to advance multimodal forecasting, promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized athttps://anon-forecast.github.io/benchmark_report_dev/.

1488Enhancing End-to-End Autonomous Driving with Latent World Model

[openreview] [pdf]

Abstract In autonomous driving, end-to-end planners directly utilize raw sensor data, enabling them to extract richer scene features and reduce information loss compared to traditional planners. This raises a crucial research question: how can we develop better scene feature representations to fully leverage sensor data in end-to-end driving? Self-supervised learning methods show great success in learning rich feature representations in NLP and computer vision. Inspired by this, we propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future latent scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks, improving scene feature learning while optimizing trajectory prediction. LAW achieves state-of-the-art performance across multiple benchmarks, including real-world open-loop benchmark nuScenes, NAVSIM, and simulator-based closed-loop benchmark CARLA. The code will be released.

1489Federated Coordination: Private and Distributed Strategy Alignment

[openreview] [pdf]

Abstract Coordination in multi-agent systems is critical for optimizing collective outcomes and is applicable in diverse fields such as drone swarms, emergency response, and more. Despite extensive research, the distributed coordination strategy alignment problem---where all agents follow the same strategy and execute the prescribed actions without a global coordinator---remains largely unexplored, posing challenges in scalability and privacy preservation. We introduce a new research problem termed ``federated coordination", which seeks to achieve decentralized strategy alignment across distributed agents while maintaining the privacy of strategy choices. To address this problem, we propose a framework that employs an energy-based model. It facilitates decentralized strategy alignment by associating agent states with coordination strategies through local minimum energy values. We address privacy concerns through a simple yet effective communication protocol that protects strategy selections from eavesdropping and information leakage. Our extensive experimental results validate these contributions, demonstrating scalability and reduced computational demands. This enhances the practicality of coordination systems in multi-agent settings.

1490Forecasting Needles in a Time Series Haystack

[openreview] [pdf]

Abstract Shocks and sudden spikes are common characteristics of real-world time series data. For example, demand surges or electricity outages often occur in time series data, manifesting as spikes (“Needles”) added to the regular time series (“Haystack”). Despite their importance, it is surprising to find their absence in the benchmarking protocol at the frontier of time series research—Time Series Foundation Models (TSFMs). To address this gap, we present the Needle-in-a-Time-Series-Haystack (NiTH) Benchmark, which includes both synthetic and real-world spiky time series data from diverse domains like traffic, energy, and biomedical systems. For synthetic data, we develop a flexible framework using Poisson-based modeling to generate spiky time series, allowing us to evaluate forecast models under various conditions. To accurately assess model performance, we introduce a new metric based on Dynamic Time Warping, specifically designed for spiky data. We evaluate the zero-shot forecasting capabilities of 6 popular TSFMs over 64 million observations, identifying their limitations related to architecture, tokenization, and loss functions. Furthermore, we demonstrate that the incorporation of the proposed NiTH dataset, due to its diversity compared to the common pre-training corpus of TSFMs, results in improved performance.

1491Clusters Agnostic Network Lasso Bandits

[openreview] [pdf]

Abstract We consider a multi-task contextual bandit setting, where the learner is given a graph encoding relations between the bandit tasks. The tasks’ preference vectors are assumed to be piecewise constant over the graph, forming clusters. At every round, we estimate the preference vectors by solving an online network lasso problem with a suitably chosen, time-dependent regularization parameter. We establish a novel oracle inequality relying on a convenient restricted eigenvalue assumption. Our theoretical findings highlight the importance of dense intra-cluster connections and sparse inter-cluster ones. That results in a sublinear regret bound significantly lower than its counterpart in the independent task learning setting. Finally, we support our theoretical findings by experimental evaluation against graph bandit multi-task learning and online clustering of bandits algorithms.

1492Can We Further Elicit Reasoning in LLMs? Critic-Guided Planning with Retrieval-Augmentation for Solving Challenging Tasks

[openreview] [pdf]

Abstract State-of-the-art large language models (LLMs) exhibit impressive problem-solving capabilities but may struggle with complex reasoning and factual correctness. Existing methods harness the strengths of chain-of-thought (CoT) and retrieval-augmented generation (RAG) to decompose a complex problem into simpler steps and apply retrieval to improve factual correctness. These methods work well on straightforward reasoning tasks but often falter on challenging tasks such as competitive programming and mathematics, due to frequent reasoning errors and irrelevant knowledge retrieval. To address this, we introduce Critic-guided planning with Retrieval-augmentation, CR-Planner, a novel framework that leverages fine-tuned critic models to guide both reasoning and retrieval processes through planning. CR-Planner solves a problem by iteratively selecting and executing sub-goals. Initially, it identifies the most promising sub-goal from reasoning, query generation, and retrieval, guided by rewards given by a critic model named sub-goal critic. It then executes this sub-goal through sampling and selecting the optimal output based on evaluations from another critic model named execution critic. This iterative process, informed by retrieved information and critic models, enables CR-Planner to effectively navigate the solution space towards the final answer. We employ Monte Carlo Tree Search (MCTS) to collect the data for training the critic models, allowing for a systematic exploration of action sequences and their long-term impacts. We validate CR-Planner on challenging domain-knowledge-intensive and reasoning-heavy tasks, including competitive programming, theorem-driven math reasoning, and complex domain retrieval problems. Our experiments demonstrate that CR-Planner significantly outperforms baselines, highlighting its effectiveness in addressing challenging problems by improving both reasoning and retrieval.

1493CAT: Concept-level backdoor ATtacks for Concept Bottleneck Models

[openreview] [pdf]

Abstract Despite the transformative impact of deep learning across multiple domains, the inherent opacity of these models has driven the development of Explainable Artificial Intelligence (XAI). Among these efforts, Concept Bottleneck Models (CBMs) have emerged as a key approach to improve interpretability by leveraging high-level semantic information. However, CBMs, like other machine learning models, are susceptible to security threats, particularly backdoor attacks, which can covertly manipulate model behaviors. Understanding that the community has not yet studied the concept level backdoor attack of CBM, because of “Better the devil you know than the devil you don’t know.”, we introduce CAT (Concept-level Backdoor ATtacks), a methodology that leverages the conceptual representations within CBMs to embed triggers during training, enabling controlled manipulation of model predictions at inference time. An enhanced attack pattern, CAT+, incorporates a correlation function to systematically select the most effective and stealthy concept triggers, thereby optimizing the attack’s impact. Our comprehensive evaluation framework assesses both the attack success rate and stealthiness, demonstrating that CAT and CAT+ maintain high performance on clean data while achieving significant targeted effects on backdoored datasets. This work underscores the potential security risks associated with CBMs and provides a robust testing methodology for future security assessments.

1494Understanding Benefit of Personalization: Beyond Classification

[openreview] [pdf]

Abstract In many applications spanning healthcare, finance, and admissions, it is beneficial to have personalized machine learning models that make predictions tailored to subgroups. This can be achieved by encoding personalized characteristics (such as age and sex) as model inputs. In domains where model trust and accuracy are paramount, it is critical to evaluate the effect of personalizing models not only on prediction accuracy but also on the quality of post-hoc model explanations. This paper introduces a unifying framework to quantify and validate personalization benefits in terms of both prediction accuracy and explanation quality across different groups, extending this concept to regression settings for the first time --broadening its scope and applicability. For both regression and classification, we derive novel bounds for the number of personalized attributes that can be used to reliably validate these gains. Additionally, through our theoretical analysis we demonstrate that improvements in prediction accuracy due to personalization do not necessarily translate to enhanced explainability, underpinning the importance to evaluate both metrics when applying machine learning models to safety-critical settings such as healthcare. Finally, we evaluate our proposed framework and validation techniques on a real-world dataset, exemplifying the analysis possibilities that they offer. This research contributes to ongoing efforts in understanding personalization benefits, offering a robust and versatile framework for practitioners to holistically evaluate their models.

1495Revisit, Extend, and Enhance Hessian-Free Influence Functions

[openreview] [pdf]

Abstract Influence functions serve as crucial tools for assessing sample influence. By employing the first-order Taylor extension, sample influence can be estimated without the need for expensive model retraining. However, applying influence functions directly to deep models presents challenges, primarily due to the non-convex nature of the loss function and the large size of model parameters. This difficulty not only makes computing the inverse of the Hessian matrix costly but also renders it non-existent in some cases. Various approaches, including matrix decomposition, have been explored to expedite and approximate the inversion of the Hessian matrix, with the aim of making influence functions applicable to deep models. In this paper, we revisit a specific, albeit naive, yet effective approximation method known as TracIn, and simplify it further, introducing the name Inner Product (IP). This method substitutes the inverse of the Hessian matrix with an identity matrix. We offer deeper insights into why this straightforward approximation method is effective. Furthermore, we extend its applications beyond measuring model utility to include considerations of fairness and robustness. Finally, we enhance IP through an ensemble strategy. To validate its effectiveness, we conduct experiments on synthetic data and extensive evaluations on noisy label detection, sample selection for large language model fine-tuning, and defense against adversarial attacks.

1496Find A Winning Sign: Sign Is All We Need to Win the Lottery

[openreview] [pdf]

Abstract The lottery ticket hypothesis (LTH) posits the existence of a sparse network (a.k.a. winning ticket) that can generalize comparably to its dense counterpart after training from initialization. However, early works fail to generalize its observation and method to large-scale settings. While recent methods, such as weight rewinding or learning rate rewinding (LRR), may have found effective pruning methods, we note that they still struggle with identifying a winning ticket. In this paper, we take a step closer to finding a winning ticket by arguing that a signed mask, a binary mask with parameter sign information, can transfer the capability to achieve strong generalization after training (i.e., generalization potential) to a randomly initialized network. We first share our observation on the subnetwork trained by LRR: if the parameter signs are maintained, the LRR-driven subnetwork retains its generalization potential even when the parameter magnitudes are randomly initialized, excluding those of normalization layers. However, this fails when the magnitudes of normalization layer parameters are initialized together. To tackle the significant influence of normalization layer parameters, we propose AWS, a slight variation of LRR to find A Winning Sign. Specifically, we encourage low error barriers along the linear path connecting the subnetwork trained by AWS to its counterpart with initialized normalization layer parameters, maintaining the generalization potential even when all parameters are initialized. Interestingly, we observe that across various architectures and datasets, a signed mask of the AWS-driven subnetwork can allow a randomly initialized network to perform comparably to a dense network, taking a step closer to the goal of LTH.

1497Stagewise Development in Transformers and the Geometry of the Loss Landscape

[openreview] [pdf]

Abstract We show that internal structure emerges in discrete developmental stages during transformer training, for language modeling and in-context linear regression. We introduce a method for detecting boundaries between these stages by probing the degeneracy of the geometry of the population loss. The measure of degeneracy we use is the Local Learning Coefficient (LLC), which is derived from singular learning theory. We establish the validity of the stages revealed by the LLC using a range of behavioral and structural metrics. In a majority of the stages the loss decreases and the geometry becomes less degenerate, which is consistent with an emerging literature on stagewise learning and saddle-to-saddle dynamics in neural network training. However, we also discover several stages in which the geometry becomes more degenerate, and in one example link this to a phenomenon that we call layer normalization collapse. These findings provide new insights into the intricate process of transformer training and underscore the importance of loss landscape geometry in understanding model development.

1498Multi-objective Multi-agent Reinforcement Learning with Pareto-stationary Convergence

[openreview] [pdf]

Abstract Multi-objective multi-agent reinforcement learning (MOMARL) problems frequently arise in real world applications (e.g., path planning for swarm robots) or have not been explored well. To find Pareto-optimum is NP-hard, and thus some multi-objective algorithms have emerged recently to provide Pareto-stationary solution centrally, managed by a single agent. Yet, they cannot deal with MOMARL problem, as the dimension of global state-action (s,a)(\boldsymbol{s},\boldsymbol{a}) grows exponentially with the number of spatially distributed agents. To tackle this issue, we design a novel graph-truncated QQ-function approximation method for each agent ii, which does not require the global state-action (s,a)(\boldsymbol{s},\boldsymbol{a}) but only the neighborhood state-action (sNiκ,aNiκ)(s_{\mathcal{N}^{\kappa}_{i}},a_{\mathcal{N}^{\kappa}_{i}}) of its κ\kappa-hop neighbors. To further reduce the dimension to state-action (sNiκ,ai)(s_{\mathcal{N}^{\kappa}_{i}},a_{i}) with only local action, we further develop a concept of action-averaged QQ-function and establish the equivalence between using graph-truncated QQ-function and action-averaged QQ-function for policy gradient approximation. Accordingly, we develop a distributed scalable algorithm with linear function approximation and we prove that it successfully converges Pareto-stationary solution at rate O(1/T)\mathcal{O}(1/T) that is inversely proportional to time domain TT. Finally, we run simulations in a robot path planning environment and show our algorithm converges to greater multi-objective values as compared to the latest MORL algorithm, and performs close to the central optimum with much shorter running time.

1499Efficient Policy Evaluation with Safety Constraint for Reinforcement Learning

[openreview] [pdf]

Abstract In reinforcement learning, classic on-policy evaluation methods often suffer from high variance and require massive online data to attain the desired accuracy. Previous studies attempt to reduce evaluation variance by searching for or designing proper behavior policies to collect data. However, these approaches ignore the safety of such behavior policies---the designed behavior policies have no safety guarantee and may lead to severe damage during online executions. In this paper, to address the challenge of reducing variance while ensuring safety simultaneously, we propose an optimal variance-minimizing behavior policy under safety constraints. Theoretically, while ensuring safety constraints, our evaluation method is unbiased and has lower variance than on-policy evaluation. Empirically, our method is the only existing method to achieve both substantial variance reduction and safety constraint satisfaction. Furthermore, we show our method is even superior to previous methods in both variance reduction and execution safety.

1500Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning

[openreview] [pdf]

Abstract The widely used expected utility theory has been shown to be empirically inconsistent with human preferences in the psychology and behavioral economy literatures. Cumulative Prospect Theory (CPT) has been developed to fill in this gap and provide a better model for human-based decision-making supported by empirical evidence. It allows to express a wide range of attitudes and perceptions towards risk, gains and losses. A few years ago, CPT has been combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem where the goal of the agent is to search for a policy generating long-term returns which are aligned with their preferences. In this work, we revisit this policy optimization problem and provide new insights on optimal policies and their nature depending on the utility function under consideration. We further derive a novel policy gradient theorem for the CPT policy optimization objective generalizing the seminal corresponding result in standard RL. This result enables us to design a model-free policy gradient algorithm to solve the CPT-RL problem. We illustrate the performance of our algorithm in simple examples motivated by traffic control and electricity management applications. We also demonstrate that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.

1501Adaptive Pruning of Pretrained Transformer via Differential Inclusions

[openreview] [pdf]

Abstract Large transformers have demonstrated remarkable success, making it necessary to compress these models to reduce inference costs while preserving their performance. Current compression algorithms prune transformers at fixed compression ratios, requiring a unique pruning process for each ratio, which results in high computational costs. In contrast, we propose pruning of pretrained transformers at any desired ratio within a single pruning stage, based on a differential inclusion for a mask parameter. This dynamic can generate the whole regularization solution path of the mask parameter, whose support set identifies the network structure. Therefore, the solution path identifies a Transformer weight family with various sparsity levels, offering greater flexibility and customization.In this paper, weintroduce such an effective pruning method, termed SPP (Solution Path Pruning). To achieve effective pruning, we segment the transformers into paired modules, including query-key pairs, value-projection pairs, and sequential linear layers, and apply low-rank compression to these pairs, maintaining the output structure while enabling structural compression within the inner states. Extensive experiments conducted on various well-known transformer backbones have demonstrated the efficacy of SPP.

1502DSR: Reinforcement Learning with Dynamical Skill Refinement

[openreview] [pdf]

Abstract Reinforcement learning with skills (RL with skills) is an efficient paradigm for solving sparse-reward tasks by extracting skills from demonstration datasets and learning high-level policy which selects skills. Because each selected skill by high-level policy is executed for multiple consecutive timesteps, the high-level policy is essentially learned in a temporally abstract Markov decision process (TA-MDP) built on the skills, which shortens the task horizon and reduces the exploration cost. However, these skills are usually sub-optimal because of the potential low quality and low coverage of the datasets, which causes the sub-optimal performance in the downstream task. Refining skills is intuitive, but the change of skills will in turn lead to the non-stationarity of the transition dynamics of TA-MDP which we name temporal abstraction shift. To address the dilemma of sub-optimal skills and temporal abstraction shift, we unify the optimization objectives of the entire hierarchical policy consisting of the high-level policy and the low-level policy whose latent space embeds the skills. We theoretically prove that the unified optimization objective guarantees the performance improvement in TA-MDP, and that optimizing the performance in TA-MDP is equivalent to optimizing a lower bound of the performance of the entire hierarchical policy in original MDP. Furthermore, in order to overcome the phenomenon of skill space collapse, we propose the dynamical skill refinement (DSR) mechanism which names our method. The experiment results empirically validate the effectiveness of our method, and show the advantages over the state-of-the-art (SOTA) methods.

1503Sparse-to-Sparse Training of Diffusion Models

[openreview] [pdf]

Abstract Diffusion models (DMs) are a powerful type of generative models that have achieved state-of-the-art results in various image synthesis tasks and have shown potential in other domains, such as natural language processing and temporal data modeling. Despite their stable training dynamics and ability to produce diverse high-quality samples, DMs are notorious for requiring significant computational resources, both in the training and inference stages. Previous work has focused mostly on increasing the efficiency of model inference. This paper introduces, for the first time, the paradigm of sparse-to-sparse training to DMs, with the aim of improving both training and inference efficiency. We train sparse DMs from scratch (Latent Diffusion and ChiroDiff) using three different methods (Static-DM, RigL-DM, and MagRan-DM) to study the effect of sparsity in model performance. Our experiments show that sparse DMs are able to match and sometimes outperform their Dense counterparts, while substantially reducing the number of trainable parameters and FLOPs.

1504Tropical Geometry Features for Novelty Detection and interpretability

[openreview] [pdf]

Abstract Existing methods for critical tasks such as out-of-distribution (OOD) detection, uncertainty quantification, and adversarial robustness often focus on measuring the output of the last or intermediate layers of a neural network such as logits and energy score. However, these methods typically overlook the geometric properties of the learned representations in the latent space, failing to capture important signals that relate to model reliability, fairness, and adversarial vulnerability.Innovations: We introduce an innovative method, termed Tropical Geometry Features (TGF), for detecting out-of-distribution data and enhancing overall model evaluation. This approach leverages the geometric properties of polytopes derived from a trained neural network’s learned representations. By integrating these geometric features with the data used during training, TGF establishes a unique signature of in-distribution data points. Our framework extends beyond OOD detection, providing insights into model uncertainty, adversarial robustness, interpretability, and fairness. Through TGF, we enhance interpretability technique to detect OOD, uncertainty, adverserial robustness in dynamic and unpredictable environments.

1505Model-Agnostic Knowledge Guided Correction for Improved Neural Surrogate Rollout

[openreview] [pdf]

Abstract Modeling the evolution of physical systems is critical to many applications in science and engineering. As the evolution of these systems is predominantly governed by partial differential equations (PDEs), there are a number of sophisticated computational simulations which resolve these systems with high accuracy. However, as these simulations incur high computational costs, they are infeasible to be employed for large-scale analysis. A popular alternative to simulators are neural network surrogates which are trained in a data-driven manner and are much more computationally efficient. However, these surrogate models suffer from high rollout error when used autoregressively, especially when confronted with training data paucity (i.e., a small number of trajectories to learn from). Existing work proposes to improve surrogate rollout error by either including physical loss terms directly in the optimization of the model or incorporating computational simulators as `differentiable layers’ in the neural network. Both of these approaches have their challenges, with physical loss functions suffering from slow convergence for stiff PDEs and simulator layers requiring gradients which are not always available, especially in legacy simulators. We propose the Hybrid PDE Predictor with RL (HyPER) model: a model-agnostic, RL based, cost-aware model which combines a neural surrogate, RL decision model, and a physics simulator (with or without gradients) to reduce surrogate rollout error significantly. In addition to reducing rollout error by 60%-90% we show that HyPER learns an intelligent policy that is adaptable to changing physical conditions and resistant to noise corruption.

1506Learning Counterfactual Interventions for Self-Supervised Motion Estimation

[openreview] [pdf]

Abstract A major challenge in self-supervised learning from visual inputs is extracting information from the learned representations to an explicit and usable form. This is most commonly done by learning readout layers with supervision or using highly specialized heuristics. This is challenging primarily because the self-supervised pretext tasks and the downstream tasks that extract information are not tightly connected in a principled manner---improving the former does not guarantee improvements in the latter. The recently proposed counterfactual world modeling paradigm aims to address this challenge through a masked next frame predictor base model which enables simple counterfactual extraction procedures for extracting optical flow, segments and depth. In this work, we take the next step and parameterize and optimize the counterfactual extraction of optical flow by solving the same simple next frame prediction task as the base model. Our approach achieves state of the art performance for estimation motion on real-world videos while requiring no labeled data. This work sets the foundation for future methods on improving the extraction of more complex visual structures like segments and depth with high accuracy.

1507Linear Projections of Teacher Embeddings for Few-Class Distillation

[openreview] [pdf]

Abstract Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model. Traditionally, KD involves training the student to mimic the teacher’s output probabilities, while more advanced techniques have explored guiding the student to adopt the teacher’s internal representations. Despite its widespread success, the performance of KD in binary classification and few-class problems has been less satisfactory. This is because the information about the teacher model’s generalization patterns scales directly with the number of classes. Moreover, several sophisticated distillation methods may not be universally applicable or effective for data types beyond Computer Vision. Consequently, effective distillation techniques remain elusive for a range of key real-world applications, such as sentiment analysis, search query understanding, and advertisement-query relevance assessment. Taking these observations into account, we introduce a novel method for distilling knowledge from the teacher model’s representations, which we term Learning Embedding Linear Projections (LELP). Inspired by recent findings about the structure of final-layer representations, LELP works by identifying informative linear subspaces in the teacher’s embedding space, and splitting them into pseudo-subclasses. The student model is then trained to replicate these pseudo-subclasses. Our experimental evaluations on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrate that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems, where most KD methods suffer.

1508Hadamard Representations: Augmenting Hyperbolic Tangents in RL

[openreview] [pdf]

Abstract Activation functions are one of the key components of a deep neural network. The most commonly used activation functions can be classed into the category of continuously differentiable (e.g. tanh) and linear-unit functions (e.g. ReLU), both having their own strengths and drawbacks with respect to downstream performance and representation capacity through learning (e.g. measured by the number of dead neurons and the effective rank). In reinforcement learning, the performance of continuously differentiable activations often falls short as compared to linear-unit functions. We provide insights into the vanishing gradients associated with the former, and show that the dying neuron problem is not exclusive to ReLU’s. To alleviate vanishing gradients and the resulting dying neuron problem occurring with continuously differentiable activations, we propose a Hadamard representation. Using deep Q-networks and proximal policy optimization in the Atari domain, we show faster learning, a reduction in dead neurons and increased effective rank.

1509Query-based Knowledge Transfer for Heterogeneous Learning Environments

[openreview] [pdf]

Abstract Decentralized collaborative learning under data heterogeneity and privacy constraints has rapidly advanced. However, existing solutions like federated learning, ensembles, and transfer learning, often fail to adequately serve the unique needs of clients, especially when local data representation is limited.To address this issue, we propose a novel framework called Query-based Knowledge Transfer (QKT) that enables tailored knowledge acquisition to fulfill specific client needs without direct data exchange. It employs a data-free masking strategy to facilitate the communication-efficient query-focused knowledge transformation while refining task-specific parameters to mitigate knowledge interference and forgetting. Our experiments, conducted on both standard and clinical benchmarks, show that QKT significantly outperforms existing collaborative learning methods by an average of 20.91% points in single-class query settings and an average of 14.32% points in multi-class query scenarios. Further analysis and ablation studies reveal that QKT effectively balances the learning of new and existing knowledge, showing strong potential for its application in decentralized learning.

1510From Steering Vectors to Conceptors and Beyond: Compositional Affine Steering Mechanisms for LLMs

[openreview] [pdf]

Abstract Controlling and understanding the representations of large language models (LLMs) remain central challenges as they become more powerful. In this paper, we combine conceptor theory with recent advances in activation steering to develop a novel framework that generalizes both approaches for provably optimal affine steering. Conceptors characterize sets of neural network activations, representable as ellipsoids, and they act as soft projection matrices, enabling precise and flexible control over LLM activations while offering deeper insights into their internal representations. Our framework derives optimal affine steering functions from first principles, outperforming traditional additive steering methods across in-context learning tasks. Additionally, we use a Boolean algebra over conceptor matrices that allows for the composition of multiple steering objectives. Empirical results demonstrate that this approach surpasses existing methods for combining steering vectors. By uniting conceptor theory with activation steering, this work provides not only a more powerful tool for controlling LLM outputs, but also a principled approach for better understanding the internal mechanisms governing model representations and behavior.

1511Differentiable Average Precision Loss in DETR

[openreview] [pdf]

Abstract Average Precision (AP) is a widely used metric for evaluating object detection systems because it effectively integrates both classification accuracy and localization precision. In this paper, we conduct a detailed analysis of the characteristics of the AP metric, focusing on its non-differentiability and non-convexity. Building on this analysis, we propose a novel loss function called Differentiable Average Precision Loss (DAP-loss), which provides a differentiable approximation of AP, thereby enabling direct optimization of AP across a set of images. We validate the effectiveness of DAP-loss both theoretically and empirically, extending its application to the cost functions used in the Hungarian matching algorithm, which makes it suitable for end-to-end detection models. DAP-loss supports the simultaneous optimization of classification and localization tasks within an end-to-end framework, eliminating the need for hyperparameters to balance these tasks—a common challenge in traditional methods. In the later stages of training, we applied DAP-loss to replace the original loss functions in several state-of-the-art end-to-end models, including DETR and Deformable DETR. Experimental results demonstrate that our method achieves significant improvements over baselines on the COCO dataset.

1512Adversarial Latent Feature Augmentation for Fairness

[openreview] [pdf]

Abstract Achieving fairness in machine learning remains a critical challenge, especially due to the opaque effects of data augmentation on input spaces within nonlinear neural networks. Nevertheless, current approaches that emphasize augmenting latent features, rather than input spaces, offer limited insights into their ability to detect and mitigate bias. In response, we introduce the concept of the “unfair region” in the latent space, a subspace that highlights areas where misclassification rates for certain demographic groups are disproportionately high, leading to unfair prediction results. To address this, we propose Adversarial Latent Feature Augmentation (ALFA), a method that leverages adversarial fairness attacks to perturb latent space features, which are then used as data augmentation for fine-tuning. ALFA intentionally shifts latent features into unfair regions, and the last layer of the network is fine-tuned with these perturbed features, leading to a corrected decision boundary that enhances fairness in classification in a cost-effective manner. We present a theoretical framework demonstrating that our adversarial fairness objective reliably generates biased feature perturbations, and that fine-tuning on samples from these unfair regions ensures fairness improvements. Extensive experiments across diverse datasets, modalities, and backbone networks validate that training with these adversarial features significantly enhances fairness while maintaining predictive accuracy in classification tasks.

1513Latent Safety-Constrained Policy Approach for Safe Offline Reinforcement Learning

[openreview] [pdf]

Abstract In safe offline reinforcement learning, the objective is to develop a policy that maximizes cumulative rewards while strictly adhering to safety constraints, utilizing only offline data. Traditional methods often face difficulties in balancing these constraints, leading to either diminished performance or increased safety risks. We address these issues with a novel approach that begins by learning a conservatively safe policy through the use of Conditional Variational Autoencoders, which model the latent safety constraints. Subsequently, we frame this as a Constrained Reward-Return Maximization problem, wherein the policy aims to optimize rewards while complying with the inferred latent safety constraints. This is achieved by training an encoder with a reward-Advantage Weighted Regression objective within the latent constraint space. Our methodology is supported by theoretical analysis, including bounds on policy performance and sample complexity. Extensive empirical evaluation on benchmark datasets, including challenging autonomous driving scenarios, demonstrates that our approach not only maintains safety compliance but also excels in cumulative reward optimization, surpassing existing methods. Additional visualizations provide further insights into the effectiveness and underlying mechanisms of our approach.

1514Distributed Unlearning with Lossy Compression

[openreview] [pdf]

Abstract Machine unlearning enables to remove the contribution of a set of data points from a trained model. In a distributed setting, where a server orchestrates training using data available at a set of remote users, unlearning is essential to cope with the possible presence of malicious users. Existing distributed unlearning algorithms require the server to store all model updates observed in training, leading to immense storage overhead for preserving the ability to unlearn. In this work we study lossy compression schemes for facilitating distributed server-side unlearning with limited memory footprint. We identify suitable lossy compression mechanisms based on random lattice coding and sparsification. For a family of stochastic compression schemes encompassing probabilistic and subtractive dithered quantization, we derive an upper bound on the difference between the desired model that is trained from scratch and the model unlearned from lossy compressed stored updates. Our bound outperforms the state-of-the-art known bounds for non-compressed decentralized server-side unlearning, even when lossy compression is incorporated. We further provide a numerical study, shows that suited lossy compression can enable distributed unlearning with notably reduced memory footprint at the server while preserving the utility of the unlearned model.

1515HarmoniCa: Harmonizing Training and Inference for Better Feature Cache in Diffusion Transformer Acceleration

[openreview] [pdf]

Abstract Diffusion Transformers (DiTs) have gained prominence for outstanding scalability and extraordinary performance in generative tasks. However, their considerable inference costs impede practical deployment. The feature cache mechanism, which involves storing and retrieving redundant computations across timesteps, holds promise for reducing per-step inference time in diffusion models. Most existing caching methods for DiT are manually designed. Although the learning-based approach attempts to optimize strategies adaptively, it suffers from discrepancies between training and inference, which hampers both the performance and acceleration ratio. Upon detailed analysis, we pinpoint that these discrepancies primarily stem from two aspects: (1)Prior Timestep Disregard, where training ignores the effect of cache usage at earlier timesteps, and (2)Objective Mismatch, where the training target (align predicted noise in each timestep) deviates from the goal of inference (generate the high-quality image). To alleviate these discrepancies, we proposeHarmoniCa, a novel method thatHarmonizes training and inference with a novel learning-basedCaching framework built uponStep-Wise Denoising Training(SDT) andImage Error Proxy-Guided Objective(IEPO). Compared to the traditional training paradigm, the newly proposed SDT maintains the continuity of the denoising process, enabling the model to leverage information from prior timesteps during training, similar to the way it operates during inference. Furthermore, we design IEPO, which integrates an efficient proxy mechanism to approximate the final image error caused by reusing the cached feature. Therefore, IEPO helps balance final image quality and cache utilization, resolving the issue of training that only considers the impact of cache usage on the predicted output at each timestep. Extensive experiments on class-conditional and text-to-image (T2I) tasks for 8 models and 4 samplers with resolutions ranging from 256×256256\times256 to 2048×20482048\times2048 demonstrate the exceptional performance and speedup capabilities of our HarmoniCa. For example, HarmoniCa is the first feature cache method applied to the 20-step PixArt-α\alpha 1024×10241024\times1024 that achieves over 1.5×\times speedup in latency with an improved FID compared to the non-accelerated model. Remarkably, HarmoniCa requires no image data during training and reduces about 25% of training time compared to the existing learning-based approach.

1516Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation

[openreview] [pdf]

Abstract Large language models (LLMs) have recently gained much attention in building autonomous agents. However, performance of current LLM-based web agents in long-horizon tasks is far from optimal, often yielding errors such as repeatedly buying a non-refundable flight ticket. By contrast, humans can avoid such an irreversible mistake, as we have an awareness of the potential outcomes (e.g., losing money) of our actions, also known as the “world model”. Motivated by this, our study first starts with preliminary analyses, confirming the absence of world models in current LLMs (e.g., GPT-4o, Claude-3.5-Sonnet, etc.). Then, we present a World-model-augmented (WMA) web agent, which simulates the outcomes of its actions for better decision-making. To overcome the challenges in training LLMs as world models predicting next observations, such as repeated elements across observations and long HTML inputs, we propose a transition-focused observation abstraction, where the prediction objectives are free-form natural language descriptions exclusively highlighting important state differences between time steps. Experiments on WebArena and Mind2Web show that our world models improve agents’ policy selection without training and demonstrate our agents’ cost- and time-efficiency compared to recent tree-search-based agents.

1517Solving Composable Constraints for Inverse Design Tasks

[openreview] [pdf]

Abstract Inverse design tasks are an important category of problem in which we want to identify some input vector xx satisfying some desirable properties. In this paper we propose a mechanism for representing inequality constraints as Signed Distance Functions (SDFs). SDFs permit efficient projection of points into the solution region as well as providing a mechanism for composing constraints via boolean set operations. In this paper, we provide theoretical motivation for Signed Distance Functions (SDFs) as an implicit representation of inequality constraints. Next, we provide analysis demonstrating that SDFs can be used to efficiently project points into solution regions. Additionally, we propose two novel algorithms for computing SDFs for wide families of machine learning models. Finally, we demonstrate practical utility by performing conditional image generation using MNIST and CelebA datasets, and computational drug design using the ZINC-250K dataset. From the experimental results, we note that the composable constraints can reliably and efficiently compute solutions to complex inverse design tasks with deep learning models.

1518Sparse Rewards Can Self-Train Dialogue Agents

[openreview] [pdf]

Abstract Recent advancements in state-of-the-art (SOTA) Large Language Model (LLM) agents, especially in multi-turn dialogue tasks, have been primarily driven by supervised fine-tuning and high-quality human feedback. However, as base LLM models continue to improve, acquiring meaningful human feedback has become increasingly challenging and costly. In certain domains, base LLM agents may eventually exceed human capabilities, making traditional feedback-driven methods impractical. In this paper, we introduce a novel self-improvement paradigm that empowers LLM agents to autonomously enhance their performance without external human feedback. Our method, Juxtaposed Outcomes for Simulation Harvesting (JOSH), is a self-alignment algorithm that leverages a sparse reward simulation environment to extract ideal behaviors and further train the LLM on its own outputs. We present ToolWOZ, a sparse reward tool-calling simulation environment derived from MultiWOZ. We demonstrate that models trained with JOSH, both small and frontier, significantly improve tool-based interactions while preserving general model capabilities across diverse benchmarks. Our code and data are publicly available on GitHub.

1519CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

[openreview] [pdf]

Abstract We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer, which can generate 10-second continuous videos aligned with text prompt, with a frame rate of 16 fps and resolution of 768 ×\times 1360 pixels. Previous video generation models often had limited movement and short durations, and is difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, to improve both compression rate and video fidelity. Second, to improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. Third, by employing a progressive training and multi-resolution frame pack technique, \model is adept at producing coherent, long-duration, different shape videos characterized by significant motions. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to the generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video caption model, and CogVideoX are open-source.

1520Practical Epistemic Uncertainty Quantification for View Synthesis

[openreview] [pdf]

Abstract View synthesis using Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) has demonstrated impressive fidelity in rendering real-world scenarios. However, practical methods for accurate and efficient epistemic Uncertainty Quantification (UQ) in view synthesis are lacking. Existing approaches for NeRF either introduce significant computational overhead (e.g., “10x increase in training time” or “10x repeated training”) or are limited to specific uncertainty conditions or models. Notably, GS models lack any systematic approach for comprehensive epistemic UQ. This capability is crucial for improving the robustness and scalability of neural view synthesis, enabling active model updates, error estimation, and scalable ensemble modeling based on uncertainty. In this paper, we revisit NeRF and GS-based methods from a function approximation perspective, identifying key differences and connections in 3D representation learning. Building on these insights, we introduce PH-Dropout, the first real-time and accurate method for epistemic uncertainty estimation that operates directly on pre-trained NeRF and GS models. Extensive evaluations validate our theoretical findings and demonstrate the effectiveness of PH-Dropout.

1521Distribution Corrected Estimation via Adversarial Density Weighted Regression

[openreview] [pdf]

Abstract We propose a novel one-step supervised imitation learning (IL) framework called Adversarial Density Regression (ADR). This IL framework aims to correct the policy learned on unknown-quality to match the expert distribution by utilizing demonstrations, without relying on the Bellman operator. Specifically, ADR addresses several limitations in previous IL algorithms: First, most IL algorithms are based on the Bellman operator, which inevitably suffer from cumulative offsets from sub-optimal rewards during multi-step update processes. Additionally, off-policy training frameworks suffer from Out-of-Distribution (OOD) state-actions. Second, while conservative terms help solve the OOD issue, balancing the conservative term is difficult. To address these limitations, we fully integrate a refined one-step Distribution Corrected Estimation (DICE)-type supervised framework named ADR. Theoretically, we demonstrate that this adaptation can effectively correct the distribution of policies trained on unknown-quality datasets to align with the expert policy’s distribution. Moreover, the difference between the empirical and the optimal value function is proportional to the lower bound of ADR’s objective, indicating that minimizing ADR’s objective is akin to approaching the optimal value. Experimentally, we validated the performance of ADR by conducting extensive evaluations. Specifically, ADR outperforms all of the selected IL algorithms on tasks from the Gym-Mujoco domain. Meanwhile, it achieves an 89.5% improvement over IQL when utilizing ground truth rewards on tasks from the Adroit and Kitchen domains.

1522Robust Thompson Sampling Algorithms Against Reward Poisoning Attacks

[openreview] [pdf]

Abstract Thompson sampling is one of the most popular learning algorithms for online sequential decision-making problems and has rich real-world applications. However, current Thompson sampling algorithms are limited by the assumption that the rewards received are uncorrupted, which may not be true in real-world applications where adversarial reward poisoning exists. To make Thompson sampling more reliable, we want to make it robust against adversarial reward poisoning. The main challenge is that one can no longer compute the actual posteriors for the true reward, as the agent can only observe the rewards after corruption. In this work, we solve this problem by computing pseudo-posteriors that are less likely to be manipulated by the attack. We propose robust algorithms based on Thompson sampling for the popular stochastic and contextual linear bandit settings in both cases where the agent is aware or unaware of the budget of the attacker. We theoretically show that our algorithms guarantee near-optimal regret under any attack strategy.

1523Adversarial Attacks on Data Attribution

[openreview] [pdf]

Abstract Data attribution aims to quantify the contribution of individual training data points to the outputs of an AI model, which has been used to measure the value of training data and compensate data providers. Given the impact on financial decisions and compensation mechanisms, a critical question arises concerning the adversarial robustness of data attribution methods. However, there has been little to no systematic research addressing this issue. In this work, we aim to bridge this gap by detailing a threat model with clear assumptions about the adversary’s goal and capabilities and proposing principled adversarial attack methods on data attribution. We present two methods,Shadow AttackandOutlier Attack, which generate manipulated datasets to inflate the compensation adversarially. The Shadow Attack leverages knowledge about the data distribution in the AI applications, and derives adversarial perturbations through “shadow training”, a technique commonly used in membership inference attacks. In contrast, the Outlier Attack does not assume any knowledge about the data distribution and relies solely on black-box queries to the target model’s predictions. It exploits an inductive bias present in many data attribution methods - outlier data points are more likely to be influential - and employs adversarial examples to generate manipulated datasets. Empirically, in image classification and text generation tasks, the Shadow Attack can inflate the data-attribution-based compensation by at least 200%, while the Outlier Attack achieves compensation inflation ranging from 185% to as much as 643%.

1524How does controllability emerge in language models during pretraining?

[openreview] [pdf]

Abstract Language models can be controlled by adjusting their internal representations, which alters the degree to which concepts such as emotional tone, style, truthfulness, and safety are expressed in their generative outputs. This paper demonstrates that controllability emerges abruptly during pre-training, and furthermore, even closely-related concepts (e.g. anger and sadness) can emerge at different stages of pre-training. To understand how controllability of internal representations changes during pre-training, we introduce the “Intervention Detector” (ID), which applies dimensionality reduction to hidden states under different stimuli, and outputs concept representations that can be applied to control language models. Using these concept representations, we then compute an extraction score (ID score) that shows how well the extracted representations align with the model’s hidden states. This ID score can be used to approximately predict the time of emergence of controllability for different concepts, and the degree to which each concept is controllable. By analyzing ID scores across a longitudinal series of models taken at different stages of pre-training, we demonstrate that, as pre-training progresses, concepts become increasingly easier to extract via dimensionality reduction methods, which correlates with the emergence of controllability. For instance, in the CrystalCoder model, the controllability of the concept “anger” emerges at 68% of pre-training, whereas the controllability of the concept “sadness” emerges at 93% of the pre-training process. We use heatmap visualizations and other metrics (eg., entropy, cosine similarity, tSNE) to study these differences, and validate the reliability and generalizability of ID scores through model interventions using the extracted concept representations.

1525A Stochastic Approach to the Subset Selection Problem via Mirror Descent

[openreview] [pdf]

Abstract The subset selection problem is fundamental in machine learning and other fields of computer science. We introduce a stochastic formulation for the minimum cost subset selection problem in a black box setting, in which only the subset metric value is available. Subsequently, we can handle two-stage schemes, with an outer subset-selection component and an inner subset cost evaluation component. We propose formulating the subset selection problem in a stochastic manner by choosing subsets at random from a distribution whose parameters are learned. Two stochastic formulations are proposed. The first explicitly restricts the subset’s cardinality, and the second yields the desired cardinality in expectation. The distribution is parameterized by a decision variable, which we optimize using Stochastic Mirror Descent. Our choice of distributions yields constructive closed-form unbiased stochastic gradient formulas and convergence guarantees, including a rate with favorable dependency on the problem parameters. Empirical evaluation of selecting a subset of layers in transfer learning complements our theoretical findings and demonstrates the potential benefits of our approach.

1526Episodic Control-Based Adversarial Policy Learning in Two-player Competitive Games

[openreview] [pdf]

Abstract Training adversarial agents to attack neural network policies has proven to be both effective and practical. However, we observe that existing methods can be further enhanced by distinguishing between states leading to win or lose and encouraging the policy training to prioritize winning states. In this paper, we address this gap by introducing an episodic control-based approach for adversarial policy training. Our method extracts the historical evaluations for states from historical experiences with an episodic memory, and then incorporating these evaluations into the rewards to improve the adversarial policy optimization. We evaluate our approach using two-player competitive games in MuJoCo simulation environments, demonstrating that our method establishes the most promising attack performance and defense difficulty against the victims among the existing adversarial policy training techniques.

1527Indirect Gradient Matching for Adversarial Robust Distillation

[openreview] [pdf]

Abstract Adversarial training significantly improves adversarial robustness, but superior performance is primarily attained with large models. This substantial performance gap for smaller models has spurred active research into adversarial distillation (AD) to mitigate the difference. Existing AD methods leverage the teacher’s logits as a guide. In contrast to these approaches, we aim to transfer another piece of knowledge from the teacher, the input gradient. In this paper, we propose a distillation module termed Indirect Gradient Distillation Module (IGDM) that indirectly matches the student’s input gradient with that of the teacher. Experimental results show that IGDM seamlessly integrates with existing AD methods, significantly enhancing their performance. Particularly, utilizing IGDM on the CIFAR-100 dataset improves the AutoAttack accuracy from 28.06% to 30.32% with the ResNet-18 architecture and from 26.18% to 29.32% with the MobileNetV2 architecture when integrated into the SOTA method without additional data augmentation.

1528Consistency Models Made Easy

[openreview] [pdf]

Abstract Consistency models (CMs) offer faster sampling than traditional diffusion models, but their training is resource-intensive. For example, as of 2024, training a state-of-the-art CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an effective scheme for training CMs that largely improves the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs. We can thus fine-tune a consistency model starting from a pretrained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly reduced training times while improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained for hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling laws of CMs under ECT, showing that they obey the classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Our code will be made publicly available, making CMs more accessible to the broader community.

1529Collaborative Data Optimization

[openreview] [pdf]

Abstract Training efficiency plays a pivotal role in deep learning. This paper begins by analyzing current methods for enhancing efficiency, highlighting the necessity of optimizing targets, a process we define as data optimization. Subsequently, we reveal that current data optimization methods incur significant additional costs, e.g., human resources or computational overhead, due to their inherently sequential optimization process. To address these issues, we propose CoOpt, a highly efficient, parallelized framework designed for collaborative data optimization. CoOpt enables participants to independently optimize data subsets, ensuring that the overall performance, once these subsets are collected, remains comparable to the sequential optimization of the entire dataset, thus significantly reducing optimization costs for individual participants. Extensive experiments have been conducted on various real-world scenarios to demonstrate the effectiveness and efficiency of CoOpt across various datasets and architectures.

1530Robust Federated Learning Frameworks Guarding Against Data Flipping Threats for Autonomous Vehicles

[openreview] [pdf]

Abstract Federated Learning (FL) has become an established technique to facilitate privacy-preserving collaborative training across a multitude of clients. The ability to achieve collaborative learning from multiple parties containing an extensive volume of data while providing the essence of data privacy made it an attractive solution to address numerous challenges in sensitive data-driven fields such as autonomous vehicles (AVs). However, its decentralized nature exposes it to security threats, such as evasion and data poisoning attacks, where malicious participants can compromise training data. This paper addresses the challenge of defending federated learning systems against data poisoning attacks specifically focusing on data-flipping techniques in AVs by proposing a novel defense mechanism that combines anomaly detection with robust aggregation techniques. Our approach employs statistical outlier detection and model-based consistency checks to filter out compromised updates before they affect the global model. Experiments on benchmark datasets show that our method significantly enhances robustness by preventing nearly 15% of accuracy drop for our global model when confronted with a malicious participant and reduction the the attack success rate even when dealing with 20% of poisoning level. These findings provide a comprehensive solution to strengthen FL systems against adversarial threats.

1531Continuous-Time Analysis of Adaptive Optimization and Normalization

[openreview] [pdf]

Abstract Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental to modern deep learning, however, their training dynamics lack comprehensive theoretical understanding, with limited insight into why common practices—such as specific hyperparameter choices and normalization layers—contribute to successful generalization. This work presents a continuous-time formulation of Adam and AdamW, facilitating a tractable analysis of training dynamics that can shed light on such practical questions. We theoretically derive a stable region for Adam’s hyperparameters (β,γ)(\beta, \gamma) that ensures bounded updates, empirically verifying these predictions by observing unstable exponential growth of parameter updates outside this region. Furthermore, we theoretically justify the success of normalization layers by uncovering an implicit meta-adaptive effect of scale-invariant architectural components. This insight leads to an explicit optimizer, 2-Adam, which we generalize to kk-Adam—an optimizer that applies an adaptive normalization procedure kk times, encompassing Adam (corresponding to k=1k=1) and Adam with a normalization layer (corresponding to k=2k=2). Overall, our continuous-time formulation of Adam facilitates a principled analysis, offering deeper understanding of optimal hyperparameter choices and architectural decisions in modern deep learning.

1532The Perfect Blend: Redefining RLHF with Mixture of Judges

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) has become the leading approach for fine-tuning large language models (LLM). However, RLHF has limitations in multi-task learning (MTL) due to challenges of reward hacking and extreme multi-objective optimization (i.e., trade-off of multiple and/or sometimes conflicting objectives). Applying RLHF for MTL currently requires careful tuning of the weights for reward model and data combinations. This is often done via human intuition and does not generalize. In this work, we introduce a novel post-training paradigm which we called Constrained Generative Policy Optimization (CGPO). The core of CGPO is Mixture of Judges (MoJ) with cost-efficient constrained policy optimization with stratification, which can identify the perfect blend in RLHF in a principled manner. It shows strong empirical results with theoretical guarantees, does not require extensive hyper-parameter tuning, and is plug-and-play in common post-training pipelines. Together, this can detect and mitigate reward hacking behaviors while reaching a pareto-optimal point across an extremely large number of objectives.Our results show that CGPO consistently outperforms other commonly used SoTA RLHF algorithms (such as PPO and DPO) on a wide range of tasks -- general chat, STEM questions, instruction following, math, coding and knowledge. In particular, CGPO improves over PPO by 7.4% in AlpacaEval-2 (general chat), 12.5% in Arena-Hard (STEM\reasoning), 2% in IFEval (Instrcution Following), 2% in both MATH and GSM8K (Math\reasoning), 5% in HumanEval (Coding), and 2% in the ARC challenge (Knowledge). We also observe that PPO is susceptible to severe reward hacking behaviors (it exhibits severe regression in popular coding benchmarks) which can be addressed by CGPO. CGPO represents a breakthrough in RLHF, simultaneously addressing reward-hacking and extreme multi-objective optimization, and thereby advancing the state-of-the-art in aligning general-purpose LLMs.

1533SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models

[openreview] [pdf]

Abstract Instruction-following is a fundamental capability of language models, requiring the model to recognize even the most subtle requirements in the instructions and accurately reflect them in its output. Such an ability is well-suited for and often optimized by preference learning. However, existing methods often directly sample multiple independent responses from the model when creating preference pairs. Such practice can introduce content variations irrelevant to whether the instruction is precisely followed (e.g., different expressions about the same semantic), interfering with the goal of teaching models to recognize the key differences that lead to improved instruction following. In light of this, we introduce SPaR, a self-play framework integrating tree-search self-refinement to yield valid and comparable preference pairs free from distractions. By playing against itself, an LLM employs a tree-search strategy to refine its previous responses with respect to the instruction while minimizing unnecessary variations. Our experiments show that a LLaMA3-8B model, trained over three iterations guided by SPaR, surpasses GPT-4-Turbo on the IFEval benchmark without losing general capabilities. Furthermore, SPaR demonstrates promising scalability, greatly enhancing the performance of LLaMA3-70B. We also identify how inference scaling in tree search would impact model performance. Code and data will be publicly available.

1534Explainable Rewards in RLHF Using LLM-as-a-Judge

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has been gaining popularity as a method for aligning Large Language Models (LLMs) with human preferences. It involves performing Supervised Fine-Tuning (SFT) followed by fine-tuning using a reward model trained on human preference data. However, two primary issues with this approach are the difficult and expensive curation of human preference data and the opaque, black-box nature of the rewards. To address these issues, this paper introduces a novel framework for aligning LLMs with human preferences. Our framework involves using representative sub-dimensions for specific tasks to generate rewards by leveraging a performant out-of-the-box LLM. We evaluate our approach by fine-tuning two models, one using our approach and one using traditional black-box rewards. Evaluation using an advanced LLM-based method demonstrates that our approach maintains the performance of the black-box baseline while offering superior explainability and flexibility. This framework not only enhances transparency in RLHF but also eliminates reliance on expensive human-curated preference data.

1535When is Task Vector Provably Effective for Model Editing? A Generalization Analysis of Nonlinear Transformers

[openreview] [pdf]

Abstract weighted sum of task vectors, each of which is the weight update from the pre-trained model to fine-tuned models for certain tasks. This approach recently gained attention as a computationally efficient inference method for model editing, e.g., multi-task learning, forgetting, and out-of-domain generalization capabilities. However, the theoretical understanding of why task vectors can execute various conceptual operations remains limited, due to the highly non-convexity of training Transformer-based models. To the best of our knowledge, this paper provides the first theoretical characterization of the generalization guarantees of task vector methods on nonlinear Transformers. We consider a conceptual learning setting, where each task is a binary classification problem based on a discriminative pattern. We theoretically prove the effectiveness of task addition in simultaneously learning a set of irrelevant or aligned tasks, as well as the success of task negation in unlearning one task from irrelevant or contradictory tasks. Moreover, we prove the proper selection of linear coefficients for task arithmetic to achieve guaranteed generalization to out-of-domain tasks. All of our theoretical results hold for both dense-weight parameters and their low-rank approximations. Although established in a conceptual setting, our theoretical findings were validated on a practical machine unlearning task using the large language model Phi-1.5 (1.3B).

1536CFBD: COARSE-TO-FINE DETECTION OF BACKDOOR ATTACKS IN MULTIMODAL CONTRASTIVE LEARNING

[openreview] [pdf]

Abstract The backdoor attack in Multimodal Contrastive Learning (MCL) task has been receiving increasing attention in recent years, due to numerous downstream tasks that rely on pre-trained MCL models. Backdoor detection has been one of the effective protection solutions to fight against backdoor attacks. However, the majority of existing backdoor detection methods in MCL usually produces nonsatisfying detection results. Two main factors are responsible for this: 1) one-stage detection lacks subsequent dynamic adaptation to the distribution of poisoned and benign pairs when faced with different attacks, and 2) the criteria used in existing methods, specifically the cosine similarity between image and caption, are insufficient to distinguish between poisoned and benign pairs. To address these problems, we extend the conventional one-stage detection architecture to a two-stage architecture and propose a better metric in the second stage with high precision and high fault tolerance. To this end, we design a novel Coarse-to-Fine two-stage Backdoor Detection method, termed CFBD, which primarily focuses on multimodal learning involving image-caption relationships, such as CLIP. The objective of the coarse stage is to roughly partition dataset into poisoned, benign and suspicious subset. In the fine-grained stage, we use the average textual correlation with the poisoned subset to improve the detection quality. Extensive experiments demonstrate that CFBD achieves superior backdoor detection performance, e.g., almost 100% True Positive Rate (TPR) for diverse attacks over the large scale dataset CC-3M, markedly outperforming state-of-the-art methods.

1537On the loss of context-awareness in general instruction finetuning

[openreview] [pdf]

Abstract Pretrained Large Language Models (LLMs) require post-training methods such as supervised fine-tuning (SFT) on instruction-response pairs to enable instruction following. However, this process can potentially harm existing capabilities learned during pretraining. In this paper, we investigate the loss of context awareness after SFT, defined as the capability to extract and understand information from the user-provided context and respond accordingly. We are the first to identify and show that the loss of context-awareness appears on instruction-finetuned LLMs when the chat template is applied to the input prompts. We identify the performance decline is partially caused by the bias embedded into the chat template to focus less on the the user-provided context. Based on these observations, we propose two methods to mitigate the loss of context awareness in instruct models: post hoc attention steering on user prompts and conditional instruction fine-tuning with a context-dependency indicator. Empirical experiments on 4 context-dependent downstream tasks and 3 pretrained LLMs of different sizes show that our methods can effectively mitigate the loss of context awareness without compromising the general ability of instruction following. Our findings also strongly advocate the necessity to benchmark context awareness after instruction fine-tuning carefully.

1538Uncertainty-aware Reward Model: Teaching Reward Models to Know What is Unknown

[openreview] [pdf]

Abstract Reward models (RM) play a critical role in aligning generations of large language models (LLM) to human expectations. However, prevailing RMs fail to capture the stochasticity within human preferences and cannot effectively evaluate the reliability of reward predictions. To address these issues, we propose Uncertain-aware RM (URM) and Uncertain-aware RM Ensemble (URME) to incorporate and manage uncertainty in reward modeling. URM can model the distribution of disentangled attributes within human preferences, while URME quantifies uncertainty through discrepancies in the ensemble, thereby identifying potential lack of knowledge during reward evaluation. Experiment results indicate that the proposed URM achieves state-of-the-art performance compared to models with the same size, demonstrating the effectiveness of modeling uncertainty within human preferences. Furthermore, empirical results show that through uncertainty quantification, URM and URME can identify unreliable predictions to improve the quality of reward evaluations.

1539Blind Inversion using Latent Diffusion Priors

[openreview] [pdf]

Abstract Diffusion models have emerged as powerful tools for solving inverse problems due to their exceptional ability to model complex prior distributions. However, existing methods predominantly assume known forward operators (i.e., non-blind), limiting their applicability in practical settings where acquiring such operators is costly. Additionally, many current approaches rely on pixel-space diffusion models, leaving the potential of more powerful latent diffusion models (LDMs) underexplored. In this paper, we introduce LatentDEM, an innovative technique that addresses more challenging blind inverse problems using latent diffusion priors. At the core of our method is solving blind inverse problems within an iterative Expectation-Maximization (EM) framework: (1) the E-step recovers clean images from corrupted observations using LDM priors and a known forward model, and (2) the M-step estimates the forward operator based on the recovered images. Additionally, we propose two novel optimization techniques tailored for LDM priors and EM frameworks, yielding more accurate and efficient blind inversion results. As a general framework, LatentDEM supports both linear and non-linear inverse problems. Beyond common 2D image restoration tasks, it enables new capabilities in non-linear 3D inverse rendering problems. We validate LatentDEM’s performance on representative 2D blind deblurring and 3D sparse-view reconstruction tasks, demonstrating its superior efficacy over prior arts.

1540How to Verify Any (Reasonable) Distribution Property: Computationally Sound Argument Systems for Distributions

[openreview] [pdf]

Abstract As statistical analyses become more central to science, industry and society, there is a growing need to ensure correctness of their results. Approximate correctness can be verified by replicating the entire analysis, but can we verify without replication? We focus on distribution testing problems: verifying that an unknown distribution is close to having a claimed property. Our main contribution is an interactive protocol between a verifier and an untrusted prover, which can be used to verify any distribution property that can be decided in polynomial time given a full and explicit description of the distribution. If the distribution is at statistical distance ε\varepsilon from having the property, then the verifier rejects with high probability. This soundness property holds against any polynomial-time strategy that a cheating prover might follow, assuming the existence of collision-resistant hash functions (a standard assumption in cryptography). For distributions over a domain of size NN, the protocol consists of 4 messages and the communication complexity and verifier runtime are roughly O~(N/ε2)\widetilde{O}\left(\sqrt{N} / \varepsilon^2 \right). The verifier’s sample complexity is O~(N/ε2)\widetilde{O}\left(\sqrt{N} / \varepsilon^2 \right), and this is optimal up to polylog(N)\text{polylog}(N) factors (for any protocol, regardless of its communication complexity). Even for simple properties, approximately deciding whether an unknown distribution has the property can require quasi-linear sample complexity and running time. For any such property, our protocol provides a quadratic speedup over replicating the analysis.

1541Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance

[openreview] [pdf]

Abstract Large Language Model’s are instruction-finetuned to enhance their ability to follow user instructions and better comprehend input context. Still, they often struggle to follow the input context, especially when it contradicts model’s parametric knowledge. This manifests as various failures, such as hallucinations where a model inserts outdated or unwarranted facts into its response. In this work, we observe an intriguing phenomenon: the context reliance of the model decreases as instruction finetuning progresses, despite an initial expected increase\textit{despite an initial expected increase}. We call this phenomenon as the context-parametric inversion\textbf{context-parametric inversion}. This is surprising, as one would expect instruction tuning to improve the model’s ability to follow input instructions. We observe this behavior on multiple general purpose instruction tuning datasets such as TULU, Alpaca and Ultrachat, across multiple model families like Llama, Mistral and Pythia. We perform various controlled studies to eliminate some simple hypothesis for this observed behavior and isolate what datapoints cause this counter-intuitive behavior. We then analyze the phenomenon theoretically, to explain why context reliance varies across the trajectory of finetuning. We tie the observed context-parametric inversion to the properties of the finetuning data, which provides us with some potential mitigation strategies that provide limited but insightful gains.

1542Methods with Local Steps and Random Reshuffling for Generally Smooth Non-Convex Federated Optimization

[openreview] [pdf]

Abstract Non-convex Machine Learning problems typically do not adhere to the standard smoothness assumption. Based on empirical findings, Zhang et al. (2020b) proposed a more realistic generalized (L0,L1)(L_0,L_1)-smoothness assumption, though it remains largely unexplored. Many existing algorithms designed for standard smooth problems need to be revised. However, in the context of Federated Learning, only a few works address this problem but rely on additional limiting assumptions. In this paper, we address this gap in the literature: we propose and analyze new methods with local steps, partial participation of clients, and Random Reshuffling without extra restrictive assumptions beyond generalized smoothness. The proposed methods are based on the proper interplay between clients’ and server’s stepsizes and gradient clipping. Furthermore, we perform the first analysis of these methods under the Polyak-Łojasiewicz condition. Our theory is consistent with the known results for standard smooth problems, and our experimental results support the theoretical insights.

1543Emergence of a High-Dimensional Abstraction Phase in Language Transformers

[openreview] [pdf]

Abstract A language model (LM) is a mapping from a linguistic context to an output token. However, much remains to be known about this mapping, including how its geometric properties relate to its function. We take a high-level geometric approach to its analysis, observing, across five pre-trained transformer-based LMs and three input datasets, a distinct phase characterized by high intrinsic dimensionality. During this phase, representations (1) correspond to the first full linguistic abstraction of the input; (2) are the first to viably transfer to downstream tasks; (3) predict each other across different LMs. Moreover, we find that an earlier onset of the phase strongly predicts better language modelling performance. In short, our results suggest that a central high-dimensionality phase underlies core linguistic processing in many common LM architectures.

1544SVDQuant: Absorbing Outliers by Low-Rank Component for 4-Bit Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have been proven highly effective at generating high-quality images. However, as these models grow larger, they require significantly more memory and suffer from higher latency, posing substantial challenges for deployment. In this work, we aim to accelerate diffusion models by quantizing their weights and activations to 4 bits. At such an aggressive level, both weights and activations are highly sensitive to quantization, where conventional post-training quantization methods for large language models like smoothing become insufficient. To overcome this limitation, we propose SVDQuant, a new 4-bit quantization paradigm. Different from smoothing which redistributes outliers between weights and activations, our approach absorbs these outliers using a low-rank branch. We first shift the outliers from activations into the weights, then employ a high-precision low-rank branch to take in the outliers in the weights. This process eases the quantization on both sides. However, naively running the low-rank branch independently incurs significant overhead due to extra data movement of activations, negating the quantization speedup. To address this, we design an inference engine LoRunner that fuses the kernels in the low-rank branch into the kernels in the low-bit branch to cut off redundant memory access. Extensive experiments on SDXL, PixArt-Σ\Sigma, and FLUX.1 validate the effectiveness of SVDQuant in preserving image quality. We reduce the memory usage for the 12B FLUX.1 models by 3.6×, achieving 3.5× speedup over the 4-bit weight-only quantized baseline on a 16GB RTX-4090 GPU, paving the way for more interactive applications on PCs. We will release the code and models upon publication.

1545Breaking Mental Set to Improve Reasoning through Diverse Multi-Agent Debate

[openreview] [pdf]

Abstract Large Language Models (LLMs) have seen significant progress but continue to struggle with persistent reasoning mistakes. Previous methods of self-reflection have been proven limited due to the models’ inherent fixed thinking patterns. While Multi-Agent Debate (MAD) attempts to mitigate this by incorporating multiple agents, it often employs the same reasoning methods, even though assigning different personas to models. This leads to a “fixed mental set,” where models rely on homogeneous thought processes without exploring alternative perspectives. In this paper, we introduce Diverse Multi-Agent Debate (DMAD), a method that encourages agents to think with distinct reasoning approaches. By leveraging diverse problem-solving strategies, each agent can gain insights from different perspectives, refining its responses through discussion and collectively arriving at the optimal solution. DMAD effectively breaks the limitations of fixed mental sets. We evaluate DMAD against various prompting techniques, including self-reflection and traditional MAD, across multiple benchmarks using both LLMs and Multimodal LLMs. Our experiments show that DMAD consistently outperforms other methods, delivering better results than MAD in fewer rounds.

1546From Uncontextualized Embeddings to Marginal Feature Effects: Incorporating Intelligibility into Tabular Transformer Networks

[openreview] [pdf]

Abstract In recent years, deep neural networks have showcased their predictive power across a variety of tasks. The transformer architecture, originally developed for natural language processing, has also shown great efficiency in handling tabular data, offering a competitive alternative to traditional gradient-boosted decision trees in this domain. However, this predictive power comes at the cost of intelligibility: Marginal feature effects are almost completely lost in the black-box nature of deep tabular transformer networks. Alternative architectures that use the additivity constraints of classical statistical regression models can maintain intelligible marginal feature effects, but often fall short in predictive power compared to their more complex counterparts. To bridge the gap between intelligibility and performance, we propose an adaptation of tabular transformer networks designed to identify marginal feature effects. We provide theoretical justifications that marginal feature effects can be accurately identified, and our ablation study demonstrates that the proposed model efficiently detects these effects, even amidst complex feature interactions. To demonstrate the model’s predictive capabilities, we compare it to several interpretable as well as black-box models and find that it can match black-box performances while maintaining intelligibility. The source code is vailable athttps://anonymous.4open.science/r/nmfrmr-B086.

1547No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data

[openreview] [pdf]

Abstract Implicit neural representations (INRs) exhibit growing promise in addressing Earth representation challenges, ranging from emissions monitoring to climate modeling. However, existing methods disproportionately prioritize global average performance, whereas practitioners require fine-grained insights to understand biases and variations in these models. To bridge this gap, we introduce FAIR-Earth: a first-of-its-kind dataset explicitly crafted to challenge and examine inequities in Earth representations. FAIR-Earth comprises various high-resolution Earth signals, and uniquely aggregates extensive metadata along stratifications like landmass size and population density to assess the fairness of models. Evaluating state-of-the-art INRs across the various modalities of FAIR-Earth, we uncover striking performance disparities. Certain subgroups, especially those associated with high-frequency signals (e.g., islands, coastlines), are consistently poorly modeled by existing methods. In response, we propose spherical wavelet encodings, building on previous spatial encoding research for INRs. Leveraging the multi-resolution analysis capabilities of wavelets, our encodings yield more consistent performance over various scales and locations, offering more accurate and robust representations of the biased subgroups. These open-source contributions represent a crucial step towards facilitating the equitable assessment and deployment of implicit Earth representations.

1548Influencing Humans to Conform to Preference Models for RLHF

[openreview] [pdf]

Abstract Designing a reinforcement learning from human feedback (RLHF) algorithm for learning from preferences requires assuming a preference model, sometimes implicitly. A preference model that poorly describes how humans generate preferences risks learning a poor approximation of the human’s unobservable reward function. In this paper, we conduct three human studies to assess whether one can influence the expression of real human preferences to more closely conform to a desired preference model. Importantly, our approach does not seek to alter the human’s unobserved reward function. Rather, we change how humans use this reward function to generate preferences, such that they better match whatever preference model is assumed by a particular RLHF algorithm. We introduce three interventions: showing humans the quantities that underlie a preference model, which is normally unobservable information derived from the reward function; training people to follow a specific preference model; and modifying the preference elicitation question. All intervention types show significant effects, providing practical tools to improve preference data quality and the resultant alignment of learned reward functions.Overall we establish a novel research direction in model alignment: training humans and designing interfaces to increase human conformance with the assumptions of the algorithm that will learn from their input.

1549Implicit Search via Discrete Diffusion: A Study on Chess

[openreview] [pdf]

Abstract In the post-AlphaGo era, there has been a resurgence of interest in search techniques like Monte Carlo Tree Search (MCTS) within the realm of Large Language Models (LLMs). This renewed attention is driven by the recognition that current next-token prediction models often lack the ability for long-term planning. Is it possible to instill search-like abilities within the models to enhance their planning abilities without relying on explicit search? We propose DiffuSearch, a model that does implicit search by looking into the future world via discrete diffusion modeling. We instantiate DiffuSearch on a classical board game, Chess, where explicit search is known to be essential. Through extensive controlled experiments, we show DiffuSearch outperforms both the searchless and explicit search-enhanced policies. Specifically, DiffuSearch outperforms the one-step policy by 19.2% and the MCTS-enhanced policy by 14% on action accuracy. Furthermore, DiffuSearch demonstrates a notable 30% enhancement in puzzle-solving abilities compared to explicit search, along with a significant 540 Elo increase in game-playing strength assessment.

1550Toto: Time Series Optimized Transformer for Observability

[openreview] [pdf]

Abstract We introduce the Time Series Optimized Transformer for Observability (Toto), a foundation model designed for time series forecasting with a focus on observability metrics. Toto features a novel proportional factorized attention mechanism and a Student-T mixture model head, enabling it to efficiently handle high-dimensional, sparse, and non-stationary data. Trained on one trillion time series data points, including 75% proprietary observability data, Toto demonstrates state-of-the-art zero-shot performance on standard benchmarks such as electricity and weather forecasting. Furthermore, it significantly outperforms existing models in observability-specific tasks, making it an ideal solution for real-time system monitoring and anomaly detection. Toto’s architectural innovations make it a versatile tool for both general-purpose forecasting and domain-specific applications, setting a new benchmark for scalability and accuracy in time series analysis.

1551PREDICTING THE BEHAVIOR OF AI AGENTS USING TRANSFER OPERATORS

[openreview] [pdf]

Abstract Predicting the behavior of AI-driven agents is particularly challenging without a preexisting model. In our paper, we address this by treating AI agents as stochastic nonlinear dynamical systems and adopting a probabilistic perspective to predict their statistical behavior using the Fokker-Planck equation. We formulate the approximation of the density transfer operator as an entropy minimization problem, which can be solved by leveraging the Markovian property and decomposing its spectrum. Our data-driven methodology simultaneously approximates the Markov operator to perform prediction of the evolution of the agents and also predicts the terminal probability density of AI agents, such as robotic systems and generative models. We demonstrate the effectiveness of our prediction model through extensive experiments on practical systems driven by AI algorithms.

1552Beyond Random Augmentations: Pretraining with Hard Views

[openreview] [pdf]

Abstract Self-Supervised Learning (SSL) methods typically rely on random image augmentations, or views, to make models invariant to different transformations. We hypothesize that the efficacy of pretraining pipelines based on conventional random view sampling can be enhanced by explicitly selecting views that benefit the learning progress. A simple yet effective approach is to select hard views that yield a higher loss. In this paper, we propose Hard View Pretraining (HVP), a learning-free strategy that extends random view generation by exposing models to more challenging samples during SSL pretraining. HVP encompasses the following iterative steps: 1) randomly sample multiple views and forward each view through the pretrained model, 2) create pairs of two views and compute their loss, 3) adversarially select the pair yielding the highest loss according to the current model state, and 4) perform a backward pass with the selected pair. In contrast to existing hard view literature, we are the first to demonstrate hard view pretraining’s effectiveness at scale, particularly training on the full ImageNet-1k dataset, and evaluating across multiple SSL methods, Convolutional Networks, and Vision Transformers. As a result, HVP sets a new state-of-the-art on DINO ViT-B/16, reaching 78.8% linear evaluation accuracy (a 0.6% improvement) and consistent gains of 1% for both 100 and 300 epoch pretraining, with similar improvements across transfer tasks in DINO, SimSiam, iBOT, and SimCLR.

1553Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads

[openreview] [pdf]

Abstract Large language models (LLMs) have shown remarkable advances in supporting long-context comprehension and processing tasks. However, scaling the generation inference of LLMs to such long contexts incurs significant additional computation load, and demands a substantial GPU memory footprint to maintain the key-value (KV) cache of transformer-based LLMs. Existing KV cache compression methods, such as quantization, face memory bottlenecks as context length increases, while static-sized caches, such as selective eviction, suffer from inefficient policies. These limitations restrict deployment on consumer-grade devices like a single Nvidia 4090 GPU. To overcome this, we propose Locret, an efficient framework for long-context LLM inference that introduces retaining heads to evaluate the causal importance of KV cache units, allowing for more accurate eviction within a fixed cache size. Locret is fine-tuned on top of the frozen backbone LLM using a minimal amount of data from standard long-context SFT datasets. During inference, we evict low-importance cache units along with a chunked prefill pattern, significantly reducing peak GPU memory usage. We conduct an extensive empirical study to evaluate Locret, where the experimental results show that Locret outperforms the recent popular and competitive approaches, including InfLLM, Quantization, SirLLM, and MInference, in terms of memory efficiency and the quality of generated contents --- Locret achieves over a 20×20\times and 8×8\times KV cache compression ratio compared to the full KV cache for Phi-3-mini-128K and Llama-3.1-8B-instruct. Additionally, Locret can be combined with other efficient inference methods, such as quantization and token merging. To the best of our knowledge, Locret is the first framework capable of deploying Llama-3.1-8B or similar models on a single Nvidia 4090 GPU, enabling 128K long-context inference without compromising generation quality, and requiring little additional system optimizations.

1554Injecting Learnable Table Features into LLMs

[openreview] [pdf]

Abstract To migrate the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to extend them to the table reasoning tasks for the widely deployed tabular data. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that the even the most advanced LLMs (such as GPTs) may still fall short on coping with tabular data. More specifically, the current scheme often simply replies on serializing the tabular data, together with the meta information, then put them through the LLMs. We argue that the loss of the structural information and incomplete cell values persisted are the root of this shortcoming. In this work, we further propose TAMO that bears an ideology to treat the tables as an independent modality integrated with the text tokens. The resulted model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvement with an average relative gain by 42.65%.

1555Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality

[openreview] [pdf]

Abstract Leveraging the development of structural causal model (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionism causal analysis encounters theoretical adaptation limitations. Accordingly, cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and factual variables, challenges the construction of the graphical model. Existing approaches, e.g., Twin Network and Single World Intervention Graphs (SWIG), establish a symbiotic relationship to bridge the gap between graphical modeling and the introduction of counterfactuals albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical limitations of certain current methods in cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical application, we determine that introducing the proposed teleporter theory can directly obtain the conditional independence between counterfactual variables and factual variables from the cross-world SCM without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory to the practical application. Adhering to the proposed theory, we build a plug-and-play module, and the effectiveness of which are substantiated by experiments on benchmarks.

1556Metadata Matters for Time Series: Informative Forecasting with Transformers

[openreview] [pdf]

Abstract Time series forecasting is prevalent in extensive real-world applications, such as financial analysis and energy planning. Previous studies primarily focus on time series modality, endeavoring to capture the intricate variations and dependencies inherent in time series. Beyond numerical time series data, we notice that metadata (e.g. dataset and variate descriptions) also carries valuable information essential for forecasting, which can be used to identify the application scenario and provide more interpretable knowledge than digit sequences. Inspired by this observation, we propose a Metadata-informed Time Series Transformer MetaTST, which incorporates multiple levels of context-specific metadata into Transformer forecasting models to enable informative time series forecasting. To tackle the unstructured nature of metadata, MetaTST formalizes them into natural languages by pre-designed templates and leverages large language models (LLMs) to encode these texts into metadata tokens as a supplement to classic series tokens, resulting in an informative embedding. Further, a Transformer encoder is employed to communicate series and metadata tokens, which can extend series representations by metadata information for more accurate forecasting. This design also allows the model to adaptively learn context-specific patterns across various scenarios, which is particularly effective in handling large-scale, diverse-scenario forecasting tasks. Experimentally, MetaTST achieves state-of-the-art compared to advanced time series models and LLM-based methods on widely acknowledged short- and long-term forecasting benchmarks, covering both single-dataset individual and multi-dataset joint training settings.

1557Provable Convergence Bounds for Hybrid Dynamical Sampling and Optimization

[openreview] [pdf]

Abstract Analog dynamical accelerators (DXs) are a growing sub-field in computer architecture research, offering order-of-magnitude gains in power efficiency and latency over traditional digital methods in several machine learning, optimization, and sampling tasks. However, limited-capacity accelerators require hybrid analog/digital algorithms to solve real-world problems, commonly using large-neighborhood local search (LNLS) frameworks. Unlike fully digital algorithms, hybrid LNLS has no non-asymptotic convergence guarantees and no principled hyperparameter selection schemes, particularly limiting cross-device training and inference.In this work, we provide non-asymptotic convergence guarantees for hybrid LNLS by reducing to block Langevin Diffusion (BLD) algorithms. Adapting tools from classical sampling theory, we prove exponential KL-divergence convergence for randomized and cyclic block selection strategies using ideal DXs. With finite device variation, we provide explicit bounds on the 2-Wasserstein bias in terms of step duration, noise strength, and function parameters. Our BLD model provides a key link between established theory and novel computing platforms, and our theoretical results provide a closed-form expression linking device variation, algorithm hyperparameters, and performance.

1558A Simple Baseline for Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract The versatility of large language models has led to intensive ongoing work focused on adaptations to other modalities. This can involve moderate modifications of an existing model, piggybacking on the language model’s capabilities to train multimodal models or even starting with pre-trained checkpoints and attaching specialized adapters to recast a new modality (e.g., time-series) as ``language’'. This latter approach, prominent in a growing set of nice results, yields strong performance across benchmarks. It also makes sense -- while a large amount of temporal data is acquired every day (e.g., wearable sensors, physiological measurements in healthcare), unlike text/image corpus, much of it is not publicly available (except financial markets) for various reasons. But training (or even fine-tuning) these large models is expensive or difficult with limited resources. In this paper, we study and characterize the performance profile of a simple model for multivariate time-series forecasting. By simple, we mean that the model is restricted to tokenization based on classical ideas (as has been shown to be effective in vision) which are then allowed to attend/interact: via self-attention but also via ways that are a bit more general than dot-product attention, accomplished via basic geometric algebra ideas. We show that even a single or two layer model yields results that are competitive with much bigger (and even LLM-based) models on most benchmarks reported in the literature.

1559OMS: One More Step Noise Searching to Enhance Membership Inference Attacks for Diffusion Models

[openreview] [pdf]

Abstract The data-intensive nature of Diffusion models amplifies the risks of privacy infringements and copyright disputes, particularly when training on extensive unauthorized data scraped from the Internet. Membership Inference Attacks (MIA) aim to determine whether a data sample has been utilized by the target model during training, thereby serving as a pivotal tool for privacy preservation. Current MIA employs the prediction loss to distinguish between training member samples and non-members. These methods assume that, compared to non-members, members, having been encountered by the model during training result in a smaller prediction loss. However, this assumption proves ineffective in diffusion models due to the randomly noise sampled during the training process. Rather than estimating the loss, our approach examines this random noise and reformulate the MIA as a noise search problem, assuming that members are more feasible to find the noise used in the training process. We formulate this noise search process as an optimization problem and employ the fixed-point iteration to solve it. We analyze current MIA methods through the lens of the noise search framework and reveal that they rely on the first residual as the discriminative metric to differentiate members and non-members. Inspired by this observation, we introduce \textbf{OMS}, which augments existing MIA methods by iterating \textbf{O}ne \textbf{M}ore fixed-point \textbf{S}tep to include a further residual, i.e., the second residual.We integrate our method into various MIA methods across different diffusion models. The experimental results validate the efficacy of our proposed approach.

1560Online Neuro-Symbolic Predicate Invention for High-Level Planning

[openreview] [pdf]

Abstract Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the strengths of symbolic and neural knowledge representations. We outline an online algorithm for inventing such predicates and learning abstract world models. We compare our approach to hierarchical reinforcement learning, vision-language model planning, and symbolic predicate invention approaches, on both in- and out-of-distribution tasks across five simulated robotic domains. Results show that our approach offers better sample complexity, stronger out-of-distribution generalization, and improved interpretability.

1561Learned Data Transformation: A Data-centric Plugin for Enhancing Time Series Forecasting

[openreview] [pdf]

Abstract Data-centric approaches in Time Series Forecasting (TSF) often involve heuristic-based operations on data. This paper proposes to find a general end-to-end data transformation that serves as a plugin to enhance any arbitrary TSF model’s performance. To achieve this, we propose the Proximal Transformation Network (PTN), which learns effective transformations while maintaining proximity to the raw data to ensure fidelity. When orthogonally integrated with popular TSF models, our method helps achieve state-of-the-art performance on seven real-world datasets. Additionally, we show that the proximal transformation process can be interpreted in terms of predictability and distribution alignment among channels, highlighting the potential of data-centric methods for future research. Our code is available athttps://anonymous.4open.science/r/PTN-2FC6/.

1562FoundTS: Comprehensive and Unified Benchmarking of Foundation Models for Time Series Forecasting

[openreview] [pdf]

Abstract Time Series Forecasting (TSF) is key functionality in numerous fields, including in finance, weather services, and energy management. While TSF methods are emerging these days, many of them require domain-specific data collection and model training and struggle with poor generalization performance on new domains. Foundation models aim to overcome this limitation. Pre-trained on large-scale language or time series data, they exhibit promising inferencing capabilities in new or unseen data. This has spurred a surge in new TSF foundation models. We propose a new benchmark, FoundTS\texttt{FoundTS}, to enable thorough and fair evaluation and comparison of such models. FoundTS\texttt{FoundTS} covers a variety of TSF foundation models, including those based on large language models and those pretrained on time series. Next, FoundTS\texttt{FoundTS} supports different forecasting strategies, including zero-shot, few-shot, and full-shot, thereby facilitating more thorough evaluations. Finally, FoundTS\texttt{FoundTS} offers a pipeline that standardizes evaluation processes such as dataset splitting, loading, normalization, and few-shot sampling, thereby facilitating fair evaluations. Building on this, we report on an extensive evaluation of TSF foundation models on a broad range of datasets from diverse domains and with different statistical characteristics. Specifically, we identify pros and cons and inherent limitations of existing foundation models, and we identify directions for future model design. We make our code and datasets available athttps://anonymous.4open.science/r/FoundTS-C2B0.

[openreview] [pdf]

Abstract Programmatic reinforcement learning (PRL) has been explored for representing policies through programs as a means to achieve interpretability and generalization. Despite promising outcomes, current state-of-the-art PRL methods are hindered by sample inefficiency, necessitating tens of millions of program-environment interactions. To tackle this challenge, we introduce a novel LLM-guided search framework (LLM-GS). Our key insight is to leverage the programming expertise and common sense reasoning of LLMs to enhance the efficiency of assumption-free, random-guessing search methods. We address the challenge of LLMs’ inability to generate precise and grammatically correct programs in domain-specific languages (DSLs) by proposing a Pythonic-DSL strategy — an LLM is instructed to initially generate Python codes and then convert them into DSL programs. To further optimize the LLM-generated programs, we develop a search algorithm named Scheduled Hill Climbing, designed to efficiently explore the programmatic search space to improve the programs consistently. Experimental results in the Karel domain demonstrate our LLM-GS framework’s superior effectiveness and efficiency. Extensive ablation studies further verify the critical role of our Pythonic-DSL strategy and Scheduled Hill Climbing algorithm. Moreover, we conduct experiments with two novel tasks, showing that LLM-GS enables users without programming skills and knowledge of the domain or DSL to describe the tasks in natural language to obtain performant programs.

1564GLANCE: Global Actions in a Nutshell for Counterfactual Explainability

[openreview] [pdf]

Abstract The widespread deployment of machine learning systems in critical real-world decision-making applications has highlighted the urgent need for counterfactual explainability methods that operate effectively. Global counterfactual explanations, expressed as actions to offer recourse, aim to provide succinct explanations and insights applicable to large population subgroups. Effectiveness is measured by the fraction of the population that is provided recourse, ensuring that the actions benefit as many individuals as possible. Keeping the cost of actions low ensures the proposed recourse actions remain practical and actionable. Limiting the number of actions that provide global counterfactuals is essential to maximize interpretability. The primary challenge, therefore, is balancing these trade-offs—maximizing effectiveness, minimizing cost, while maintaining a small number of actions. We introduce GLANCE, a versatile and adaptive framework, comprising two algorithms, that allows the careful balancing of the trade-offs among the three key objectives, with the size objective functioning as a tunable parameter to keep the actions few and easy to interpret. C-GLANCE employs a clustering approach that considers both the feature space and the space of counterfactual actions, thereby accounting for the distribution of points in a way that aligns with the structure of the model. T-GLANCE provides additional features to enhance flexibility. It employs a tree-based approach, that allows users to specify split features, to build a decision tree with a single counterfactual action at each node that can be used as a subgroup policy. Our extensive experimental evaluation demonstrates that our method consistently shows greater robustness and performance compared to existing methods across various datasets and models.

1565Activations Aren’t Cheap in LoRA, Weights Are

[openreview] [pdf]

Abstract LoRA has become the prevailing technique for finetuning large neural networks with limited computational resources. Historically, activations have been regarded as small and computationally inexpensive to manipulate—a view reflected by LoRA, which leverages this assumption and adds a low-rank term to intermediate activations. However, in the era of modern large language models (LLMs) and diffusion models, this notion has been challenged by the desire for increasing context lengths and smaller models, a trend which inevitably leads activations to consume more memory than the model weights themselves. Surprisingly, when finetuning a 1B model with a context length greater than 2048, we find that LoRA finetuning uses more memory than full-parameter finetuning. This study finds that manipulating additional model weights within the computation graph in parameter-efficient finetuning techniques can often be more memory-efficient than operating on the activations. We provide a semantically-equivalent computation graph reformulation for LoRA, and other popular PeFT techniques, which saves memory and trains faster, advancing the Pareto-frontier for finetuning tasks that can be achieved on consumer hardware. Under practical conditions, this reformulation provides up to a 1.4x reduction in max memory usage and latency for LoRA finetuning across various language and diffusion transformers.

1566BAMDP Shaping: a Unified Theoretical Framework for Intrinsic Motivation and Reward Shaping

[openreview] [pdf]

Abstract Intrinsic motivation (IM) and reward shaping guide reinforcement learning (RL) agents by adding pseudo-rewards, which can lead to useful emergent behaviors. However, they can also exhibit unanticipated side effects -- leading to reward hacking or fixation with noisy TVs. Here we provide a theoretical model which anticipates these behaviors, and provides broad criteria under which unanticipated side effects can be bounded. We characterize all pseudo-rewards as reward shaping in Bayes-Adaptive Markov Decision Processes (BAMDPs), which formulates the problem of learning in MDPs as a meta-level MDP over the agent’s knowledge about the environment. We can understand pseudo-rewards as guiding exploration by incentivizing RL agents to go to states with higher BAMDP value, which comprises the value of information gathered and the prior value of the physical state, while they mislead exploration when they align poorly with this value. We extend potential-based shaping theory to prove that an RL algorithm which is approximately optimal for a shaped BAMDP is only guaranteed to remain so for the underlying problem when pseudo-rewards are BAMDP Potential-based shaping Functions (BAMPFs). We also prove that for a BAMPF which settles---i.e., its potential eventually stops changing over time---no RL agent can reward-hack to find a policy maximizing shaped rewards without also maximizing real rewards. We show that it is straightforward to retrofit or design new pseudo-reward terms in this form to avoid unintended side effects.

1567Boosting Camera Motion Control for Video Diffusion Transformers

[openreview] [pdf]

Abstract Recent advancements in diffusion models have significantly enhanced the quality of video generation. However, fine-grained control over camera pose remains a challenge. While U-Net-based models have shown promising results for camera control, transformer-based diffusion models (DiT)—the preferred architecture for large-scale video generation—suffer from severe degradation in camera motion accuracy. In this paper, we investigate the underlying causes of this issue and propose solutions tailored to DiT architectures. Our study reveals that camera control performance depends heavily on the choice of conditioning methods rather than camera pose representations that is commonly believed. To address the persistent motion degradation in DiT, we introduce \textbf{Camera Motion Guidance (CMG)}, based on classifier-free guidance, which boosts camera control by over 400%. Additionally, we present a sparse camera control pipeline, significantly simplifying the process of specifying camera poses for long videos. Our method universally applies to both U-Net and DiT models, offering improved camera control for video generation tasks.

1568Offline Hierarchical Reinforcement Learning via Inverse Optimization

[openreview] [pdf]

Abstract Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a significant challenge. Crucially, actions taken by higher-level policies may not be directly observable within hierarchical controllers, and the offline dataset might have been generated using a different policy structure, hindering the use of standard offline learning algorithms. In this work, we propose OHIO\textit{OHIO}: a framework for offline reinforcement learning (RL) of hierarchical policies. Our framework leverages knowledge of the policy structure to solve the inverse problem\textit{inverse problem}, recovering the unobservable high-level actions that likely generated the observed data under our hierarchical policy. This approach constructs a dataset suitable for off-the-shelf offline training. We demonstrate our framework on robotic and network optimization problems and show that it substantially outperforms end-to-end RL methods and improves robustness. We investigate a variety of instantiations of our framework, both in direct deployment of policies trained offline and when online fine-tuning is performed.

1569LocDiffusion: Identifying Locations on Earth by Diffusing in the Hilbert Space

[openreview] [pdf]

Abstract Image geolocalization is a fundamental yet challenging task, aiming at inferring the geolocation on Earth where an image is taken. Existing methods approach it either via grid-based classification or via image retrieval. The geolocalization accuracy of these methods is constrained by the choice of geographic grid cell sizes or the spatial distributions of the retrieval image/geolocation gallery, and their performance significantly suffers when the spatial distribution of test images does not align with such choices. To address these limitations, we propose to leverage diffusion models to achieve image geolocalization with arbitrary resolutions. To avoid the problematic manifold reprojection step in diffusion, we developed a novel spherical positional encoding-decoding framework, which encodes points on a spherical surface (e.g., geolocations on Earth) into a Hilbert space of Spherical Harmonics coefficients and decodes points (geolocations) by mode-seeking. We call this type of position encoding Spherical Harmonics Dirac Delta (SHDD) Representation. We also propose a novel SirenNet-based architecture called CS-UNet to learn the conditional backward process in the latent SHDD space by minimizing a latent KL-divergence loss. We train a conditional latent diffusion model called LocDiffusion that generates geolocations under the guidance of images – to the best of our knowledge, the first generative model to address the image geolocalization problem. We evaluate our LocDiffusion model against SOTA image geolocalization baselines. LocDiffusion achieves competitive geolocalization performance and demonstrates significantly stronger generalizability to unseen geolocations.

1570Efficient Bisection Projection to Ensure NN Solution Feasibility for Optimization over General Set

[openreview] [pdf]

Abstract Neural networks (NNs) have shown promise in solving constrained optimization problems in real-time. However, ensuring that NN-generated solutions strictly adhere to constraints is challenging due to NN prediction errors. Recent methods have achieved feasibility guarantees over ball-homeomorphic sets with low complexity and bounded optimality loss, yet extending these guarantees to more general sets remains largely open. In this paper, we developBisection Projection, an efficient approach to ensure NN solution feasibility for optimization over general compact sets with non-empty interiors, irrespective of their ball-homeomorphic properties. Our method begins by identifying multiple interior points (IPs) within the constraint set, chosen based on their eccentricity modulated by the NN infeasibility region. We utilize another unsupervised-trained NN (called IPNN) to map inputs to these interior points, thereby reducing the complexity of computing these IPs in run-time. For NN solutions initially deemed infeasible, we apply a bisection procedure that adjusts these solutions towards the identified interior points, ensuring feasibility with minor projection-induced optimality loss. We prove the feasibility guarantee and bound the optimality loss of our approach under mild conditions. Extensive simulations, including non-convex optimal power flow problems in large-scale networks, demonstrate that bisection projection outperforms existing methods in solution feasibility and computational efficiency with comparable optimality losses.

1571Generalizing Dynamics Modeling Easier from Representation Perspective

[openreview] [pdf]

Abstract Learning system dynamics from observations is a critical problem in many applications over various real-world complex systems, e.g., climate, ecology, and fluid systems. Recently, the neural-based dynamics modeling method has become the prevalent solution, where its basic idea is to embed the original states of objects into a latent space before learning the dynamics using neural-based methods such as neural Ordinary Differential Equations (ODE). Given observations from different complex systems, the existing dynamics modeling methods offer a specific model for each observation, resulting in poor generalization. Inspired by the great success of pre-trained models, we raise a question: whether we can conduct a generalized Pre-trained Dynamic EncoDER (PDEDER), which, for various complex systems, can embed their original states into a latent space, where the dynamics can be easier captured. To conduct this generalized PDEDER, we collect 153 sets of real-world and synthetic observations from 24 complex systems. Inspired by the success of time series forecasting using Pre-trained Language Models (PLM), we can employ any PLM and further update it over these dynamic observations by tokenization techniques to achieve the generalized PDEDER. Given any future dynamic observation, we can fine-tune PDEDERwith any specific dynamics modeling method. We evaluate PDEDER on 18 dynamic systems by long/short-term forecasting under both in-domain and cross-domain settings and the empirical results indicate the effectiveness of PDEDER.

1572Generative Reward Models

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has greatly improved the performance of modern Large Language Models (LLMs). The RLHF process is resource-intensive and technically challenging, generally requiring a large collection of human preference labels over model-generated outputs. Reinforcement Learning from AI Feedback (RLAIF) addresses this data collection challenge by leveraging synthetic preferences generated by an LLM. However, recent work has shown that synthetic preferences labels may not align well with human preference judgments.To address this, we propose a hybrid approach that unifies RLHF and RLAIF methodologies. We introduce GenRM, an iterative algorithm that trains an LLM on self-generated reasoning traces, leading to synthetic preference labels matching human preference judgments. Empirically, we show that zero-shot LLM-based judgments under-perform compared to Bradley-Terry reward models on in-distribution tasks (between 9-36%). In contrast, GenRM achieves in-distribution accuracy comparable to Bradley-Terry models, while significantly outperforming them on out-of-distribution tasks (between 10-45%). Moreover, GenRM surpasses the performance of using LLMs as judges on both in-distribution (by 9-31%) and out-of-distribution tasks (by 2- 6%). Our results show that combining the strengths of RLHF and RLAIF offers a promising approach for improving the quality of synthetic preference labels.

1573Investigating Online RL in World Models

[openreview] [pdf]

Abstract Over the past decade, online reinforcement learning (RL) has made drastic improvements in a number of settings, such as video games and robotics. However, despite these successes, the impact of RL on manyreal-worldproblems has remained limited. Underlying this fact is that, in many settings, we are unable to learn in an online fashion due to excessive cost and safety requirements or lack of an accurate simulator. In principle, foundation world models trained on large-scale uncurated offline data such as internet videos and other modalities could provide a training paradigm for generalist AI agents which alleviates the need for task specific simulation environments. Unfortunately, training inside world models is usually studied in the context of offline RL, where popular datasets have a biased structure. This necessitates short roll-outs or other severely limiting mechanisms to prevent model exploitation. Here we probe under what circumstances full roll-out training inside world models is possiblewithoutany penalties. We find that on a non-adversarial offline dataset simply ensembling over a large number of independently trained world models is sufficient to ensure transfer to the real world, even for datasets that are orders of magnitude smaller than is common in offline RL. Interestingly, more sophisticated methods for level selection provide no advantage and standard offline RL methods underperform in this setting.

1574Morse: Fast Sampling for Accelerating Diffusion Models Universally

[openreview] [pdf]

Abstract In this paper, we present Morse, a simple and universal framework for accelerating diffusion models. The key insight of Morse is to reformulate the iterative generation (from noise to data) process via taking advantage of fast jump sampling and adaptive residual feedback strategies. Specifically, Morse involves two models called Dash and Dot that interact with each other. The Dash model is just the pre-trained diffusion model of any type, but operates in a jump sampling regime, creating sufficient space for sampling efficiency improvement. The Dot model is significantly faster than the Dash model, which is learnt to generate residual feedback conditioned on the observations at the current jump sampling point on the trajectory of the Dash model, lifting the noise estimate to easily match the next-step estimate of the Dash model without jump sampling. By chaining the outputs of the Dash and Dot models run in a time-interleaved fashion, Morse exhibits the merit of flexibly attaining desired image generation performance while improving overall runtime efficiency. With our proposed weight sharing strategy between the Dash and Dot models, Morse is efficient for training and inference. We validate the efficacy of our method under a variety of experimental setups. Our method shows an average speedup of 1.78× to 3.31× over a wide range of sampling step budgets relative to baseline diffusion models. Furthermore, we show that our method can be also generalized to improve the Latent Consistency Model (LCM-SDXL, which is already accelerated with consistency distillation technique) tailored for few-step text-to-image synthesis. The code will be made publicly available.

1575Bidirectional Communication-Efficient Non-Convex Adaptive Federated Learning

[openreview] [pdf]

Abstract Within the framework of federated learning, we introduce two novel strategies: New Lazy Aggregation (NLA) and Accelerated Aggregation (AA). The NLA strategy reduces communication and computational costs through adaptive gradient skipping, while the AA strategy accelerates computation and decreases communication costs via adaptive gradient accumulation. Building upon these innovative strategies and compression techniques, we propose two new algorithms: FedBNLACA and FedBACA, aimed at minimizing bidirectional communication costs. We provide theoretical guarantees for client participation (either full or partial) in these algorithms under non-convex settings and heterogeneous data. In the context of non-convex optimization with full client participation, our proposed FedBNLACA and FedBACA algorithms achieve the same convergence rate of O(1/T)\mathcal{O}\big(1/T\big) as their non-tight counterparts. Extensive experimental results demonstrate that our protocols facilitate effective training in non-convex environments and exhibit robustness across a wide range of devices, partial participation, and imbalanced data.

1576Federated Instruction Tuning of LLMs with Domain Coverage Augmentation

[openreview] [pdf]

Abstract Federated Domain-specific Instruction Tuning (FedDIT) utilizes limited cross-client private data together with server-side public data for instruction augmentation, ultimately boosting model performance within specific domains. To date, the factors affecting FedDIT remain unclear, and existing instruction augmentation methods primarily focus on the centralized setting without considering distributed environments. Our experiments reveal that the cross-client domain coverage, rather than data heterogeneity, drives model performance in FedDIT. In response, we propose FedDCA, which optimizes domain coverage through greedy client center selection and retrieval-based augmentation. For client-side computational efficiency and system scalability, FedDCA^*, the variant of FedDCA, utilizes heterogeneous encoders with server-side feature alignment. Extensive experiments across four distinct domains (code, medical, financial, and mathematical) substantiate the effectiveness of both methods. Additionally, we investigate privacy preservation against memory extraction attacks utilizing various amounts of public data. Results show that there is no significant correlation between the volume of public data and the privacy-preserving capability. However, as the fine-tuning rounds increase, the risk of privacy leakage reduces or converges.

1577UrbanDiT: A Foundation Model for Open-World Urban Spatio-Temporal Learning

[openreview] [pdf]

Abstract The urban environment is characterized by complex spatio-temporal dynamics arising from diverse human activities and interactions. Effectively modeling these dynamics is essential for understanding and optimizing urban systems. In this work, we introduce UrbanDiT, a foundation model for open-world urban spatio-temporal learning that successfully scale up diffusion transformers in this field. UrbanDiT pioneers a unified model that integrates diverse spatio-temporal data sources and types while learning universal spatio-temporal patterns across different cities and scenarios. This allows the model to unify both multi-data and multi-task learning, and effectively support a wide range of spatio-temporal applications. Its key innovation lies in the elaborated prompt learning framework, which adaptively generates both data-driven and task-specific prompts, guiding the model to deliver superior performance across various urban applications.UrbanDiT offers three primary advantages: 1) It unifies diverse data types, such as grid-based and graph-based data, into a sequential format, allowing to capture spatio-temporal dynamics across diverse scenarios of different cities; 2) With masking strategies and task-specific prompts, it supports a wide range of tasks, including bi-directional spatio-temporal prediction, temporal interpolation, spatial extrapolation, and spatio-temporal imputation; and 3) It generalizes effectively to open-world scenarios, with its powerful zero-shot capabilities outperforming nearly all baselines with training data. These features allow UrbanDiT to achieves state-of-the-art performance in different domains such as transportation traffic, crowd flows, taxi demand, bike usage, and cellular traffic, across multiple cities and tasks. UrbanDiT sets up a new benchmark for foundation models in the urban spatio-temporal domain. Code and datasets are publicly available athttps://anonymous.4open.science/r/UrbanDiT.

1578MVFL: Multivariate Vertical Federated Learning for Time-Series Forecasting

[openreview] [pdf]

Abstract Extending multivariate time series forecasting to resource-limited devices is a critical demand for real applications, especially with the advancements in IoT technologies. A common scenario is where the variates are distributed vertically on different devices and each device needs to do local forecasting. This paper studies a resource-efficient solution for this scenario based on vertical federated learning (VFL). Prior VFL frameworks are designed for situations where only one party holds the labels and would struggle to meet the demand of the targeted scenario, as storage resources usage would increase dramatically with the number of devices. Going beyond VFL, we design multivariate vertical federated learning (MVFL) as a novel federated learning framework, where we separate communication features and local features in an embedded feature space. This design enables MVFL to utilize storage and communication resources more efficiently by eliminating the redundant models. MVFL outperforms VFL approaches in both efficiency and accuracy. On four real-world benchmarks, compared to VFL, when the storage resources are equally utilized, MVFL yields a 14% relative improvement on loss with a 43% relative improvement on communication resources usage. Even when both MVFL and VFL employ the same main model size, MVFL achieves a 75% reduction in storage resources compared to VFL, albeit with a slight compromise in terms of loss.

1579Optimal Strong Regret and Violation in Constrained MDPs via Policy Optimization

[openreview] [pdf]

Abstract We study online learning in constrained MDPs (CMDPs), focusing on the goal of attaining sublinear strong regret and strong cumulative constraint violation. Differently from their standard (weak) counterparts, these metrics do not allow negative terms to compensate positive ones, raising considerable additional challenges. Efroni et al. (2020) were the first to propose an algorithm with sublinear strong regret and strong violation, by exploiting linear programming. Thus, their algorithm is highly inefficient, leaving as an open problem achieving sublinear bounds by means of policy optimization methods, which are much more efficient in practice. Very recently, Muller et al. (2024) have partially addressed this problem by proposing a policy optimization method that allows to attain O~(T0.93)\widetilde{\mathcal{O}}(T^{0.93}) strong regret/violation. This still leaves open the question of whether optimal bounds are achievable by using an approach of this kind. We answer such a question affirmatively, by providing an efficient policy optimization algorithm with O~(T)\widetilde{\mathcal{O}}(\sqrt{T}) strong regret/violation. Our algorithm implements a primal-dual scheme that employs a state-of-the-art policy optimization approach for adversarial (unconstrained) MDPs as primal algorithm, and a UCB-like update for dual variables.

1580Distribution Shift Aware Neural Feature Transformation

[openreview] [pdf]

Abstract Feature transformation, as a core task of Data-centric AI (DCAI), aims to improve the original feature set to enhance AI capabilities. In dynamic real-world environments, where there exists a distribution shift, feature knowledge may not be transferable between data. This matter prompts a distribution shift feature transformation (DSFT) problem. Prior research works for feature transformation either depend on domain expertise, rely on a linear assumption, prove inefficient for large feature spaces, or demonstrate vulnerability to imperfect data. Furthermore, existing techniques for addressing the distribution shift cannot be directly applied to discrete search problems. DSFT presents two primary challenges: 1) How can we reformulate and solve feature transformation as a learning problem? and 2) What mechanisms can integrate shift awareness into such a learning paradigm? To tackle these challenges, we leverage a unique Shift-aware Representation-Generation Perspective. To formulate a learning scheme, we construct a representation-generation framework: 1) representation step: encoding transformed feature sets into embedding vectors; 2) generation step: pinpointing the best embedding and decoding as a transformed feature set. To mitigate the issue of distribution shift, we propose three mechanisms: 1) shift-resistant representation, where embedding dimension decorrelation and sample reweighing are integrated to extract the true representation that contains invariant information under distribution shift; 2) flatness-aware generation, where several suboptimal embeddings along the optimization trajectory are averaged to obtain a robust optimal embedding, proving effective for diverse distribution; and 3) shift-aligned pre and post-processing, where normalizing and denormalizing align and recover distribution gaps between training and testing data. Ultimately, extensive experiments are conducted to indicate the effectiveness, robustness, and trackability of our proposed framework.

1581Encoder-only Next Token Prediction

[openreview] [pdf]

Abstract Next-token prediction models have predominantly relied on decoder-only Transformers with causal attention, driven by the common belief that causal attention is essential to prevent “cheating” by masking future tokens. We challenge this widely accepted notion and argue that this design choice is about efficiency rather than necessity. While decoder-only Transformers are still a good choice for practical reasons, they are not the only viable option. In this work, we introduce Encoder-only Next Token Prediction (ENTP). We explore the differences between ENTP and decoder-only Transformers in expressive power and complexity, highlighting potential advantages of ENTP. We introduce the Triplet-Counting task and show, both theoretically and experimentally, that while ENTP can perform this task easily, a decoder-only Transformer cannot. Finally, we empirically demonstrate ENTP’s superior performance across various realistic tasks, such as length generalization and in-context learning.

1582MPHIL: Multi-Prototype Hyperspherical Invariant Learning for Graph Out-of-Distribution Generalization

[openreview] [pdf]

Abstract Out-of-distribution (OOD) generalization has emerged as a critical challenge in graph learning, as real-world graph data often exhibit diverse and shifting environments that traditional models fail to generalize across. A promising solution to address this issue is graph invariant learning (GIL), which aims to learn invariant representations by disentangling label-correlated invariant subgraphs from environment-specific subgraphs. However, existing GIL methods face two major challenges: (1) the difficulty of capturing and modeling diverse environments in graph data, and (2) the semantic cliff, where invariant subgraphs from different classes are difficult to distinguish, leading to poor class separability and increased misclassifications. To tackle these challenges, we propose a novel method termed Multi-Prototype Hyperspherical Invariant Learning (MPHIL), which introduces two key innovations: (1) invariant learning in hyperspherical space, enabling robust invariant feature extraction and prototypical learning in a highly discriminative space, and (2) class prototypes as intermediate variables, which eliminate the need for explicit environment modeling in GIL and mitigate the semantic cliff issue through multi-prototype-based classification. Derived from the theoretical framework of GIL, we introduce two novel objective functions: the invariant prototype matching loss to ensure samples are matched to the correct class prototypes, and the prototype separation loss to increase the distinction between prototypes of different classes in the hyperspherical space. Extensive experiments on 11 OOD generalization benchmark datasets demonstrate that MPHIL achieves state-of-the-art performance, significantly outperforming existing methods across graph data from various domains and with different distribution shifts. The source code of MPHIL is available athttps://anonymous.4open.science/r/MPHIL-23C0/.

1583Towards Flexible and Controllable Unknown Rejection

[openreview] [pdf]

Abstract Reliable prediction is an essential requirement for deep neural models that are deployed in open environments, where both covariate and semantic out-of-distribution (OOD) data arise naturally. Recent studies have formulated and pursued two problems named OOD generalization and detection independently, where the former aims to correctly recognize covariate shifts while the latter focuses on rejecting semantic shifts. However, existing methods are misaligned with real-world applications in two aspects. First, in practice, to make safe decisions, a reliable model should accept correctly recognized inputs while rejecting both those misclassified covariate-shifted and semantic-shifted examples. Second, considering the potential existing trade-off between rejecting different failure cases, more convenient, controllable, and flexible unknown rejection approaches are needed. To meet the above requirements, we propose a novel and elegantly simple unknown rejection framework to unify and facilitate classification with rejection under both covariate and semantic shifts. Our key insight is that by separating and consolidating failure-specific reliability knowledge with low-rank adapters and then integrating them, we can enhance the unknown rejection ability effectively and flexibly. Extensive experiments demonstrate the superiority of our framework.

1584DPD-LoRA: Dynamic Prompt-Driven Low-Rank Adaptation for Improved Generalization

[openreview] [pdf]

Abstract Fine-tuning large models presents technical challenges such as catastrophic forgetting and parameter inefficiency. Low-rank Adaptation (LoRA) and Propmt Learning can help address some of these challenges by providing more compact and flexible representations. However, Low-rank approximation is susceptible to outliers and relies on the assumption of a global low-rank structure, which can be suboptimal. Additionally, Prompt learning can overfit to specific downstream tasks, reducing its effectiveness when adapting to new tasks. In this paper, we introduce Dynamic Prompt-Driven Low-Rank Adaptation (DPD-LoRA)\textbf{Dynamic Prompt-Driven Low-Rank Adaptation (DPD-LoRA)}, a novel framework that seamlessly integrates task-specific guidance using hierarchical prompt tokens and parameter-efficient adaptation. Unlike traditional methods, task-aware prompts in the DPD-LoRA dynamically influences low-rank updates in the model’s parameters, thus enabling robust adaptation and generalization across diverse tasks and mitigating the forgetting issues. We further improve the learning capabilities of the model by breaking down the standard LoRA into multiple low-rank sub-matrices, without adding additional parameters. Further, we use an adaptive loss function to guarantee alignment with the distribution of the pre-trained model. Specifically, we introduce a self-regulated mechanism to improve stability, and a soft-gated selection mechanism to decide when to activate adaptation modules to improve performance on unseen categories. Extensive experiments on 11 benchmark datasets demonstrate that DPD-LoRA significantly outperforms state-of-the-art methods in both accuracy and generalization, offering a comprehensive solution to the challenges of fine-tuning large-scale models.

1585One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation

[openreview] [pdf]

Abstract Foundation models (FMs) are pre-trained on large-scale datasets and then fine-tuned on a downstream task for a specific application. The most successful and most commonly used fine-tuning method is to modulate the pre-trained weights via a low-rank adaptation (LoRA). LoRA introduces new weight matrices that are usually initialized at random with a uniform rank distribution across the model. Recent works focus onweight-driveninitialization or learning of an adaptive rank allocation during training. Both approaches have only been investigated in isolation, resulting in slow convergence or a uniform rank distribution, leading to sub-optimal performance. We propose to enhance LoRA by initializing the new weights in adata-drivenmanner by computing singular value decomposition on minibatches of activation vectors. Then, we initialize the LoRA matrices with the obtained right-singular vectors and re-distribute ranks among all weight matrices to explain the maximal amount of variance across layers and continue the standard LoRA fine-tuning procedure. This results in our new methodExplainedVarianceAdaptation (EVA). We apply EVA to a variety of fine-tuning tasks ranging from language generation and understanding to image classification and reinforcement learning. EVA exhibits faster convergence than competitors and attains the highest average score across a multitude of tasks per domain.

1586Equivariant score-based generative models provably learn distributions with symmetries efficiently

[openreview] [pdf]

Abstract Symmetry is ubiquitous in many real-world phenomena and tasks, such as physics, images, and molecular simulations. Empirical studies have demonstrated that incorporating symmetries into generative models can provide better generalization and sampling efficiency when the underlying data distribution has group symmetry. In this work, we provide the first theoretical analysis and guarantees of score-based generative models (SGMs) for learning distributions that are invariant with respect to some group symmetry and offer the first quantitative comparison between data augmentation and adding equivariant inductive bias. First, building on recent works on the Wasserstein-1 (d1\mathbf{d}_1) guarantees of SGMs and empirical estimations of probability divergences under group symmetry, we provide an improved d1\mathbf{d}_1 generalization bound when the data distribution is group-invariant. Second, we describe the inductive bias of equivariant SGMs using Hamilton-Jacobi-Bellman theory, and rigorously demonstrate that one can learn the score of a symmetrized distribution using equivariant vector fields without data augmentations through the analysis of the optimality and equivalence of score-matching objectives. This also provides practical guidance that one does not have to augment the dataset as long as the vector field or the neural network parametrization is equivariant. Moreover, we quantify the impact of not incorporating equivariant structure into the score parametrization, by showing that non-equivariant vector fields can yield worse generalization bounds. This can be viewed as a type of model-form error that describes the missing structure of non-equivariant vector fields. Numerical simulations corroborate our analysis and highlight that data augmentations cannot replace the role of equivariant vector fields.

1587C-Adam: Confidence-Based Optimization for Online Learning

[openreview] [pdf]

Abstract Modern recommendation systems frequently employ online learning to dynamically update their models with freshly collected data. The most commonly used optimizer for updating neural networks in these contexts is the Adam optimizer, which integrates momentum (mtm_t) and adaptive learning rate (vtv_t). However, the volatile nature of online learning data, characterized by its frequent distribution shifts and presence of noises, poses significant challenges to Adam’s standard optimization process: (1) Adam may use outdated momentum and the average of squared gradients, resulting in slower adaptation to distribution changes, and (2) Adam’s performance is adversely affected by data noise. To mitigate these issues, we introduce CAdam, a confidence-based optimization strategy that assesses the consistence between the momentum and the gradient for each parameter dimension before deciding on updates. If momentum and gradient are in sync, CAdam proceeds with parameter updates according to Adam’s original formulation; if not, it temporarily withholds updates and monitors potential shifts in data distribution in subsequent iterations. This method allows CAdam to distinguish between the true distributional shifts and mere noise, and adapt more quickly to new data distributions. Our experiments with both synthetic and real-world datasets demonstrate that CAdam surpasses other well-known optimizers, including the original Adam, in efficiency and noise robustness. Furthermore, in large-scale A/B testing within a live recommendation system, CAdam significantly enhances model performance compared to Adam, leading to substantial increases in the system’s gross merchandise volume (GMV).

[openreview] [pdf]

Abstract Dynamic link prediction is an important problem considered by many recent works proposing various approaches for learning temporal edge patterns. To assess their efficacy, models are evaluated on publicly available benchmark datasets involving continuous-time and discrete-time temporal graphs. However, as we show in this work, the suitability of common batch-oriented evaluation depends on the datasets’ characteristics, which can cause multiple issues: For continuous-time temporal graphs, fixed-size batches create time windows with different durations, resulting in an inconsistent dynamic link prediction task. For discrete-time temporal graphs, the sequence of batches can additionally introduce temporal dependencies that are not present in the data. In this work, we empirically show that this common evaluation approach leads to skewed model performance and hinders the fair comparison of methods. We mitigate this problem by reformulating dynamic link prediction as a link forecasting task that better accounts for temporal information present in the data. We provide implementations of our new evaluation method for commonly used graph learning frameworks.

1589Can Textual Gradient Work in Federated Learning?

[openreview] [pdf]

Abstract Recent studies highlight the promise of LLM-based prompt optimization, especially with TextGrad, which automates ``differentiation’’ via texts and backpropagates textual feedback provided by LLMs. This approach facilitates training in various real-world applications that do not support numerical gradient propagation or loss calculation. It opens new avenues for optimization in decentralized, resource-constrained environments, suggesting that users of black-box LLMs (e.g., ChatGPT) could enhance components of LLM agentic systems (such as prompt optimization) through collaborative paradigms like federated learning (FL). In this paper, we systematically explore the potential and pitfalls of incorporating textual gradient into FL. Our contributions are fourfold.Firstly, we introduce a novel FL paradigm, Federated Textual Gradient (FedTextGrad), that allows FL clients to upload their locally optimized prompts derived from textual gradients, while the FL server aggregates the received prompts through text summarization. Unlike traditional FL frameworks, which are designed for numerical aggregation, FedTextGrad is specifically tailored for handling textual data, expanding the applicability of FL to a broader range of problems that lack well-defined numerical loss functions.Secondly, building on this design, we conduct extensive experiments to explore the feasibility of federated textual gradients. Our findings highlight the importance of properly tuning key factors (e.g, local steps) in FL training to effectively integrate textual gradients.Thirdly, We highlight a major challenge in federated textual gradient aggregation: retaining essential information from distributed prompt updates. Concatenation often produces prompts that exceed the LLM API’s context window, while summarization can degrade performance by generating overly condensed or complex text that lacks key context.Last but not least, in response to this issue, we improve the vanilla variant of FedTextGrad by providing actionable guidance to the LLM when summarizing client prompts by leveraging the Uniform Information Density principle. Such a design reduces the complexity of the aggregated global prompt, thereby better incentive LLM reasoning ability. Through this principled study, we enable the adoption of textual gradients in FL for optimizing LLMs, identify important issues, and pinpoint future directions, thereby opening up a new research area that warrants further investigation.

1590From Static to Dynamic: Leveraging Implicit Behavioral Models to Facilitate Transition in Offline-to-Online Reinforcement Learning

[openreview] [pdf]

Abstract Transitioning reinforcement learning (RL) models from offline training environments to dynamic online settings faces critical challenges because of the distributional shift and the model inability in effectively adapting to new, unseen scenarios. This work proposes the \textbf{B}ehavior \textbf{A}daption \textbf{Q}-Learning (BAQ), a novel framework facilitating smoother transitions in offline-to-online RL. BAQ strategically leverages the implicit behavioral model to imitate and adapt behaviors of offline datasets, enabling the model to handle out-of-distribution state-action pairs more effectively during its online deployment. The key to our approach is the integration of a composite loss function that not only mimics the offline data-driven policy but also dynamically adjusts to new experiences encountered online. This dual-focus mechanism enhances the model’s adaptability and robustness, reducing Q-value estimation errors and improving the overall learning efficiency. Extensive empirical evaluations demonstrate that BAQ significantly outperforms existing methods, achieving enhanced adaptability and reduced performance degradation in diverse RL settings. Our framework sets a new standard for offline-to-online RL, offering a robust solution for applications requiring reliable transitions from theoretical training to practical, real-world execution.

1591Mitigating Privacy Risk of Adversarial Examples with Counterfactual Explanations

[openreview] [pdf]

Abstract Robustness and privacy are two fundamental security properties that machine learning models require. Without the balance between robustness and privacy leads to robust models with high privacy risks. Obtaining machine learning models with high adversarial robustness and privacy performance remains an open problem. In order to enhance the privacy performance of robust models, we employ counterfactual explanations as a method to mitigate privacy risks while concurrently maintaining robust model accuracy, reducing the privacy risk of the robust model to the level of random guessing and using counterfactual explanations to generate adversarial examples for the first time. We analyze the similarities and differences between adversarial examples and counterfactual explanations and utilize these properties to design the generation method. We conduct an in-depth analysis of the advantages offered by counterfactual explanations compared to traditional adversarial examples. Our study indicates that the correlation between robustness and privacy is strong and the ideal balance state of accuracy, robustness, and privacy is with 95% adversarial examples involved in model training.

1592Do Large Language Models have Lateral Thinking in Puzzle-Solving Games?

[openreview] [pdf]

Abstract Large Language Models (LLMs) show exceptional skills in a wide range of tasks, with their ability in lateral thinking standing out as a particularly intriguing area. Lateral thinking in LLMs allows them to understand deeper or suggested meanings from the context, which is essential for making sense of complex scenarios, especially in puzzle-solving games. To delve deeper into and improve the lateral thinking capabilities of LLMs in the realm of puzzle-solving, we introduce the ``Lateral Thinking Puzzles’’ and construct the accompanying dataset. Our novel P\mathcal{P}uzzleV\mathcal{V}erse framework aims to enhance LLMs’ lateral thinking in puzzle-solving games. Complementing this, we propose a creativity metric to ensure comprehensive evaluations. Experiments show that the selected LLMs, after being trained with P\mathcal{P}uzzleV\mathcal{V}erse, have an average improvement of 101.9% compared to their performance before P\mathcal{P}uzzleV\mathcal{V}erse training among all metrics. We also validate the robustness of P\mathcal{P}uzzleV\mathcal{V}erse that trained LLMs perform better in other reasoning tasks.

1593DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems

[openreview] [pdf]

Abstract Combinatorial Optimization (CO) problems are fundamentally important in numerous real-world applications across diverse industries, characterized by entailing enormous solution space and demanding time-sensitive response. Despite recent advancements in neural solvers, their limited expressiveness struggles to capture the multi-modal nature of CO landscapes. While some research has shifted towards diffusion models, these models still sample solutions indiscriminately from the entire NP-complete solution space with time-consuming denoising processes, which limit their practicality for large problem scales. We proposeDISCO, an efficientDIffusionSolver for large-scaleCombinatorialOptimization problems that excels in both solution quality and inference speed. DISCO’s efficacy is twofold: First, it enhances solution quality by constraining the sampling space to a more meaningful domain guided by solution residues, while preserving the multi-modal properties of the output distributions. Second, it accelerates the denoising process through an analytically solvable approach, enabling solution sampling with minimal reverse-time steps and significantly reducing inference time. DISCO delivers strong performance on large-scale Traveling Salesman Problems and challenging Maximal Independent Set benchmarks, with inference time up to 5.28 times faster than other diffusion alternatives. By incorporating a divide-and-conquer strategy, DISCO can well generalize to solve unseen-scale problem instances, even surpassing models specifically trained for those scales.

1594Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement

[openreview] [pdf]

Abstract Finetuning large language models on instruction data is an important step in enriching the knowledge learned during pre-training and improving instruction-following capabilities. As the number of instruction datasets continues to grow, selecting the right data to achieve optimal results becomes increasingly important. In this work, we ask a prominent question: How can we determine the optimal subset of data for effective training? While much of the existing research primarily emphasizes local criteria, such as instance quality, for subset selection, we argue that a global approach focused on data diversity is more critical. Our approach utilizes kk-means clustering to ensure that the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, with the importance and sampling weight of each cluster being reassessed in every training iteration. This method allows us to reduce the effect of outliers and automatically filter out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over the random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is submitted as supplementary materials.

1595PWM: Policy Learning with Multi-Task World Models

[openreview] [pdf]

Abstract Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World models methods offer scalability by learning a simulation of the environment, but often rely on inefficient gradient-free optimization methods for policy extraction. In contrast, gradient-based methods exhibit lower variance but fail to handle discontinuities. Our work reveals that well-regularized world models can generate smoother optimization landscapes than the actual dynamics, facilitating more effective first-order optimization. We introduce Policy learning with multi-task World Models (PWM), a novel model-based RL algorithm for continuous control. Initially, the world model is pre-trained on offline data, and then policies are extracted from it using first-order optimization in less than 10 minutes per task. PWM effectively solves tasks with up to 152 action dimensions and outperforms methods that use ground-truth dynamics. Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines, without relying on costly online planning. Visualizations and code available athttps://policy-world-model.github.io/.

1596Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of ‘jailbreaking’ techniques to elicit harmful text from models that were fine-tuned to be harmless. Recent work on red-teaming, model editing, and interpretability suggests that this challenge stems from how (adversarial) fine-tuning largely serves to suppress rather than remove undesirable capabilities from LLMs. Prior work has introduced latent adversarial training (LAT) as a way to improve robustness to broad classes of failures. These prior works have considered untargeted latent space attacks where the adversary perturbs latent activations to maximize loss on examples of desirable behavior. Untargeted LAT can provide a generic type of robustness but does not leverage information about specific failure modes. Here, we experiment with targeted LAT where the adversary seeks to minimize loss on a specific competing task. We find that it can augment a wide variety of state-of-the-art methods. First, we use targeted LAT to improve robustness to jailbreaks, outperforming a strong R2D2 baseline with orders of magnitude less compute. Second, we use it to more effectively remove backdoors with no knowledge of the trigger. Finally, we use it to more effectively unlearn knowledge for specific undesirable tasks in a way that is also more robust to re-learning. Overall, our results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.

1597Skill Expansion and Composition in Parameter Space

[openreview] [pdf]

Abstract Human excels at reusing prior knowledge to address new challenges and develop skills while solving problems. This paradigm becomes increasingly popular in the development of autonomous agents, as it develops systems that can self-evolve in response to new challenges like human beings. However, previous methods suffer from limited training efficiency when expanding new skills and fail to fully leverage prior knowledge to facilitate new task learning. We propose Parametric Skill Expansion and Composition (PSEC), a new framework designed to iteratively evolve the agents’ capabilities and efficiently address new challenges by maintaining a manageable skill library. This library can progressively integrate skill primitives as plug-and-play LoRA modules, facilitating efficient and flexible skill expansion. This structure also enables the direct skill compositions in parameter space by merging LoRA modules that encode different primitives, leveraging shared information across skills to effectively program new skills. Based on this, we propose a context-aware modular to dynamically activate different skills to collaboratively handle new tasks. Empowering diverse applications including multi-objective composition, dynamics shift, and continual policy shift, the results on D4RL, DSRL benchmarks, and the DeepMind Control Suite show that PSEC exhibits superior capacity to leverage prior knowledge to efficiently tackle new challenges, as well as expand its skill libraries to evolve the capabilities.

1598N-ForGOT: Towards Not-forgetting and Generalization of Open Temporal Graph Learning

[openreview] [pdf]

Abstract Temporal Graph Neural Networks (TGNNs) lay emphasis on capturing node interactions over time but often overlook evolution in node classes and dynamic data distributions triggered by the continuous emergence of new class labels, known as the open-set problem. This problem poses challenges for existing TGNNs in preserving learned classes while rapidly adapting to new, unseen classes. To address this, this paper identifies two primary factors affecting model performance on the open temporal graph, backed by a theoretical guarantee: (1) the forgetting of prior knowledge and (2) distribution discrepancies between successive tasks. Building on these insights, we propose N-ForGOT, which incorporates two plug-in modules into TGNNs to preserve prior knowledge and enhance model generalizability for new classes simultaneously. The first module preserves previously established inter-class connectivity and decision boundaries during the training of new classes to mitigate the forgetting caused by temporal evolutions of class characteristics. The second module introduces an efficient method for measuring distribution discrepancies with designed temporal Weisfeiler-Lehman subtree patterns, effectively addressing both structural and temporal shifts while reducing time complexity. Experimental results on four public datasets demonstrate that our method significantly outperforms state-of-the-art approaches in prediction accuracy, prevention of forgetting, and generalizability.

1599Solving PDEs via learnable quadrature

[openreview] [pdf]

Abstract Partial differential Equations (PDEs) are an essential tool across science and engineering. Recent work has shown how contemporary developments in machine learning models can directly help in improving methods for solution discovery of PDEs. This line of work falls under the umbrella of Physics-Informed Machine Learning. A key step in solving a PDE is to determine a set of points in the domain where the current iterate of the PDE’s solution will be evaluated. The most prevalent strategy here is to use Monte Carlo sampling, but it is widely known to be sub-optimal in lower dimensions. We leverage recent advances in asymptotic expansions of quadrature nodes and weights (for weight functions belonging to the modified Gauss-Jacobi family) together with suitable adjustments for parameterization towards a data-driven framework for learnable quadrature rules. A direct benefit is a performance improvement in solving PDEs via neural networks, relative to existing alternatives, on a set of problems commonly studied in the literature. Beyond finding a standard solution for an instance of a single PDE, our construction enables learning rules to predict solutions for a given family of PDEs via a simple use of hyper-networks, a broadly useful capability.

1600MVGS: Multi-view-regulated Gaussian Splatting for Novel View Synthesis

[openreview] [pdf]

Abstract Recent works in volume rendering, \textit{e.g.} NeRF and 3D Gaussian Splatting (3DGS), significantly advance the rendering quality and efficiency with the help of the learned implicit neural radiance field or 3D Gaussians. Rendering on top of an explicit representation, the vanilla 3DGS and its variants deliver real-time efficiency by optimizing the parametric model with single-view supervision per iteration during training which is adopted from NeRF. Consequently, certain views are overfitted, leading to unsatisfying appearance in novel-view synthesis and imprecise 3D geometries. To solve aforementioned problems, we propose a new 3DGS optimization method embodying four key novel contributions:We transform the conventional single-view training paradigm into a multi-view training strategy. With our proposed multi-view regulation, 3D Gaussian attributes are further optimized without overfitting certain training views. As a general solution, we improve the overall accuracy in a variety of scenarios and different Gaussian variants.Inspired by the benefit introduced by additional views, we further propose a cross-intrinsic guidance scheme, leading to a coarse-to-fine training procedure concerning different resolutions.Built on top of our multi-view regulated training, we further propose a cross-ray densification strategy, densifying more Gaussian kernels in the ray-intersect regions from a selection of views.By further investigating the densification strategy, we found that the effect of densification should be enhanced when certain views are distinct dramatically. As a solution, we propose a novel multi-view augmented densification strategy, where 3D Gaussians are encouraged to get densified to a sufficient number accordingly, resulting in improved reconstruction accuracy. We conduct extensive experiments to demonstrate that our proposed method is capable of improving novel view synthesis of the Gaussian-based explicit representation methods about 1 dB PSNR for various tasks. \href{https://mvgs666.github.io/}{\textcolor{magenta}{Codesare available.}}

1601HQGS: High-Quality Novel View Synthesis with Gaussian Splatting in Degraded Scenes

[openreview] [pdf]

Abstract 3D Gaussian Splatting (3DGS) has shown promising results for Novel View Synthesis. However, while it is quite effective when based on high-quality images, its performance declines as image quality degrades, due to lack of resolution, motion blur, noise, compression artifacts, or other factors common in real-world data collection. While some solutions have been proposed for specific types of degradation, general techniques are still missing. To address the problem, we propose a robust HQGS that significantly enhances the 3DGS under various degradation scenarios. We first analyze that 3DGS lacks sufficient attention in some detailed regions in low-quality scenes, leading to the absence of Gaussian primitives in those areas and resulting in loss of detail in the rendered images. To address this issue, we focus on leveraging edge structural information to provide additional guidance for 3DGS, enhancing its robustness. First, we introduce an edge-semantic fusion guidance module that combines rich texture information from high-frequency edge-aware maps with semantic information from images. The fused features serve as prior guidance to capture detailed distribution across different regions, bringing more attention to areas with a higher concentration of Gaussian primitives. Additionally, we present a structural cosine similarity loss to complement pixel-level constraints, further improving the quality of the rendered images. Extensive experiments demonstrate that our method offers better robustness and achieves the best results across various degraded scenes. The source code and trained models will be made available to the public.

1602Are Classification Robustness and Explanation Robustness Really Strongly Correlated? An Analysis Through Input Loss Landscape

[openreview] [pdf]

Abstract This paper looks into the critical area of deep learning robustness and challenges the common belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Through a novel evaluation approach leveraging clustering for efficient assessment of explanation robustness, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss - contrary to flattened loss landscapes indicating better classification robustness. To further investigate this contradiction, a training method designed to adjust the loss landscape with respect to explanation loss is proposed. Through the new training method, we uncover that although such adjustments can impact the robustness of explanations, they do not have an influence on the robustness of classification. These findings not only challenge the previous assumption of a strong correlation between the two forms of robustness but also pave new pathways for understanding the relationship between loss landscape and explanation loss. Codes are provided in the supplement.

1603Rényi Regularised Reinforcement Learning

[openreview] [pdf]

Abstract Entropy regularisation has proven effective in reinforcement learning (RL) for encouraging exploration. Recent work demonstrating the equivalence between entropy regularised RL and approximate probabilistic inference suggests the potential for improving existing methods by generalising the inference procedure. We develop the Rényi regularised RL framework by using Rényi variational inference to learn a stochastic policy. We present theoretical results for policy evaluation and improvement within this new framework. Additionally, we propose two novel algorithms, α\alpha-SAC and α\alpha-SQL, for large-scale RL tasks. We show that these algorithms attain higher returns on games from the Atari suite relative to an entropy-regularised benchmark, SAC-Discrete.

1604LFPS: Learned Farthest Point Sampling

[openreview] [pdf]

Abstract The processing of point clouds with deep neural networks is relevant for many applications, including remote sensing and autonomous driving with LiDAR sensors. To ensure the computational feasibility of point cloud processing, it is crucial to reduce the cloud’s resolution, i.e., its number of points. This downsampling of point clouds requires a deep learning model to abstract information, enabling it to process points within a more holistic context. A traditional technique for reducing the resolution of a point cloud is Farthest Point Sampling (FPS). It achieves a uniform point distribution but does not adapt to the network’s learning process. In contrast, learned sampling methods are adaptive to the network but cannot be seamlessly incorporated into diverse network architectures and do not guarantee uniformity. Thus, they can miss informative regions of the point cloud, reducing their effectiveness for large-scale point cloud applications.To address these limitations and bridge the gap between algorithmic and learned sampling methods, we introduce Learned Farthest Point Sampling (LFPS), an innovative approach that combines the advantages of both algorithmic and learned techniques. Our method relies on a novel loss function designed to enforce a uniform point distribution. We show by theoretical proof that its minima guarantee a uniformity comparable to FPS. Furthermore, we extend the loss function to include information about key points, enabling the network to adaptively influence point selection while preserving uniform distribution in relevant as well as less relevant regions. In experimental studies, we evaluate the performance of LFPS both independently and within existing network architectures. Our results (a) show that LFPS serves as a plug-in alternative for algorithmic sampling methods, particularly as a faster alternative to FPS for large-scale point clouds, and (b) confirm the enhanced performance of LFPS across various tasks, emphasizing its versatility and effectiveness.

1605Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

[openreview] [pdf]

Abstract We show that even the most recent safety-aligned LLMs are not robust to simpleadaptivejailbreaking attacks. First, we demonstrate how to successfully leverage access tologprobsfor jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token``Sure’'), potentially with multiple restarts. In this way, we achieve 100% attack success rate---according to GPT-4 as a judge---on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreakallClaude models---that do not expose logprobs---via either a transfer or prefilling attack with a100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models---a task that shares many similarities with jailbreaking---which is the algorithm that brought us thefirst placein a recent trojan detection competition. The common theme behind these attacks is thatadaptivityis crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection).

1606ATTENDING: Federated Learning with Personalized Attentive Pruning for Heterogeneous Clients

[openreview] [pdf]

Abstract Federated Learning (FL) emerges as a novel machine learning paradigm, enabling distributed clients to collaboratively train a global model while eliminating local data transmission. Despite its advantages, FL faces challenges posed by system and data heterogeneity. System heterogeneity prevents low-end clients from participating in FL with uniform models, while data heterogeneity adversely impacts the learning performance of FL. In this paper, we propose the personalized ATTENtive pruning enabled federateD learnING (ATTENDING) to collectively address these heterogeneity challenges. Specifically, we first design an attention module incorporating spatial and channel attention to enhance the learning performance on heterogeneous data. Subsequently, we introduce the attentive pruning algorithm to generate personalized local models guided by attention scores, aiming to facilitate clients’ participation in FL. Finally, we introduce a specific heterogeneous aggregation algorithm integrated with an attention matching mechanism to efficiently aggregate the pruned models. We implement ATTENDING with a real FL platform and the evaluation results show that ATTENDING significantly outperforms the baselines by up to 11.3% and reduces the average model footprints by 32%. Our code is available at:https://anonymous.4open.science/r/ATTENDING.

1607Variational Learned Priors for Intrinsically Motivated Reinforcement Learning

[openreview] [pdf]

Abstract Efficient exploration is a fundamental challenge in reinforcement learning, especially in environments with sparse rewards. Intrinsic motivation can improve exploration efficiency by rewarding agents for encountering novel states. In this work, we propose a method called Variation Learned Priors for intrinsic motivation that estimates state novelty through variational state encoding. Specifically, novelty is measured using the Kullback-Leibler divergence between a Variational Autoencoder’s learned prior and posterior distributions. When tested across various domains, our approach improves the latent space quality of the Variational Autoencoder, leading to increased exploration efficiency and better task performance for the reinforcement learning agent.

1608Regularization by Texts for Latent Diffusion Inverse Solvers

[openreview] [pdf]

Abstract The recent development of diffusion models has led to significant progress in solving inverse problems by leveraging these models as powerful generative priors. However, challenges persist due to the ill-posed nature of such problems, often arising from ambiguities in measurements or intrinsic system symmetries. To address this, we introduce a novel latent diffusion inverse solver, regularization by text (TReg), inspired by the human ability to resolve visual ambiguities through perceptual biases. TReg integrates textual descriptions of preconceptions about the solution during reverse diffusion sampling, dynamically reinforcing these descriptions through null-text optimization, which we refer to as adaptive negation. Our comprehensive experimental results demonstrate that TReg effectively mitigates ambiguity in inverse problems, improving both accuracy and efficiency.

1609A Multi-Decomposition Method for Compressing Larger AI Models Based on Reinforcement Learning

[openreview] [pdf]

Abstract With the development of modern deep neural network (DNN), the scale of parameters is increasing, making it difficult to deploy models for use on resource-constrained edge devices. To address this issue, model compression is necessary, and using low-rank matrix decomposition to compress DNN models is an effective research approach. However, traditional studies on low-rank decomposition compression typically apply a single matrix decomposition method to each parameter matrix in the neural network, without considering the structural characteristics of each layer in AI models, thus failing to achieve the optimal compression effect. Therefore, this paper proposes, for the first time, a scheme for model compression using multiple decomposition methods, selecting the most suitable decomposition method for each layer in the model. However, to truly implement this approach, it is essential to balance model accuracy and compression cost. To address this, we propose a joint optimization paradigm that simultaneously optimizes model accuracy and compression rate. We also introduce a framework LMFBRL based on reinforcement learning that jointly selects the optimal decomposition method and rank. Tests were conducted on five models such as LeNet-300, ResNet-20, and Vgg-16. Compared to singly using the MF method for compressing the LeNet300 model, our approach has shown an improvement of 3.6% in compression rate and a 1.8% increase in accuracy. The test results validate the effectiveness of the algorithm proposed in this paper.

1610Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques

[openreview] [pdf]

Abstract Scaling large language models has driven remarkable advancements across various domains, yet the continual increase in model size presents significant challenges for real-world deployment. The Mixture of Experts (MoE) architecture offers a promising solution by dynamically selecting and activating only a subset of experts during inference, thus substantially reducing computational costs while preserving high performance. Despite these benefits, MoE introduces new inefficiencies, such as excessive parameters and communication overhead. In this work, we present a holistic study on compression techniques of Mixture of Experts to enhance both efficiency and scalability. While recent efforts have focused on reducing the number of experts, these approaches still suffer from considerable communication and computational costs. To address this, we propose more aggressive strategies, such as Layer Drop, which removes entire MoE layers, and Block Drop, which eliminates transformer blocks. Surprisingly, these aggressive structure pruning techniques not only preserve model performance but also substantially improve efficiency. Additionally, beyond Expert Trimming, we also introduce Expert Slimming, which compresses individual experts to further boost performance and can be seamlessly integrated with Expert Trimming. Extensive experimental results demonstrate the effectiveness of our proposed methods — Layer Drop and Block Drop — along with the comprehensive recipe that integrates Expert Slimming and Expert Trimming, achieving a 6.05× speedup with 77.1% reduced memory usage while maintaining over 92% of performance on Mixtral-8×7B. Our code will be made publicly available upon acceptance.

1611Trust but Verify: Programmatic VLM Evaluation in the Wild

[openreview] [pdf]

Abstract Vision-Language Models (VLMs) often generate plausible but incorrect responses to visual queries. However, reliably quantifying the effect of such hallucinations in free-form responses to open-ended queries is challenging as it requires visually verifying each claim within the response. We propose Programmatic VLM Evaluation (PROVE), a new benchmarking paradigm for evaluating VLM responses to open-ended queries. To construct PROVE, we provide a large language model with a high-fidelity scene-graph representation constructed from a hyper-detailed image caption, and prompt it to generate diverse question-answer (QA) pairs, as well as programs that can be executed over the scene graph object toverifyeach QA pair. We thus construct a benchmark of 10.5k challenging but grounded visual QA pairs. Next, to evaluate free-form model responses to queries in PROVE, we propose aprogrammaticevaluation strategy that measures both the helpfulness and truthfulness of a response within a unified scene graph-based framework. We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in-fact able to achieve a good balance between the two.

1612Autoregressive Moving-average Attention Mechanism for Time Series Forecasting

[openreview] [pdf]

Abstract We propose an Autoregressive (AR) Moving-average (MA) attention structure that can adapt to various linear attention mechanisms, enhancing their ability to capture long-range and local temporal patterns in time series. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that incorporating the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results. The code implementation is available at the following link:https://anonymous.4open.science/r/ARMA-attention-3437.

1613Selective Aggregation for Low-Rank Adaptation in Federated Learning

[openreview] [pdf]

Abstract We investigate LoRA in federated learning through the lens of the asymmetry analysis of the learned AA and BB matrices. In doing so, we uncover that AA matrices are responsible for learning general knowledge, while BB matrices focus on capturing client-specific knowledge. Based on this finding, we introduce Federated Share-A Low-Rank Adaptation (FedSA-LoRA), which employs two low-rank trainable matrices AA and BB to model the weight update, but only AA matrices are shared with the server for aggregation. Moreover, we delve into the relationship between the learned AA and BB matrices in other LoRA variants, such as rsLoRA and VeRA, revealing a consistent pattern. Consequently, we extend our FedSA-LoRA method to these LoRA variants, resulting in FedSA-rsLoRA and FedSA-VeRA. In this way, we establish a general paradigm for integrating LoRA with FL, offering guidance for future work on subsequent LoRA variants combined with FL. Extensive experimental results on natural language understanding and generation tasks demonstrate the effectiveness of the proposed method. Our code is available athttps://anonymous.4open.science/r/FedSA-LoRA-3498/.

1614Multi-Robot Motion Planning with Diffusion Models

[openreview] [pdf]

Abstract Diffusion models have recently been successfully applied to a wide range of robotics applications for learning complex multi-modal behaviors from data. However, prior works have mostly been confined to single-robot and small-scale environments due to the high sample complexity of learning multi-robot diffusion models. In this paper, we propose a method for generating collision-free multi-robot trajectories that conform to underlying data distributions while using only single-robot data. Our algorithm, Multi-robot Multi-model planning Diffusion (MMD), does so by combining learned diffusion models with classical search-based techniques---generating data-driven motions under collision constraints. Scaling further, we show how to compose multiple diffusion models to plan in large environments where a single diffusion model fails to generalize well. We demonstrate the effectiveness of our approach in planning for dozens of robots in a variety of simulated scenarios motivated by logistics environments. View video demonstrations in our supplementary material, and our code at:https://github.com/<removed_for_review>.

1615Can Stability be Detrimental? Better Generalization through Gradient Descent Instabilities

[openreview] [pdf]

Abstract Traditional analyses of gradient descent optimization show that, when the largest eigenvalue of the loss Hessian - often referred to as the sharpness - is below a critical learning-rate threshold, then training is ‘stable’ and training loss decreases monotonically. Recent studies, however, have suggested that the majority of modern deep neural networks achieve good performance despite operating outside this stable regime. In this work, we demonstrate that such instabilities, induced by large learning rates, move model parameters toward flatter regions of the loss landscape. Our crucial insight lies in noting that, during these instabilities, the orientation of the Hessian eigenvectors rotate. This, we conjecture, allows the model to explore regions of the loss landscape that display more desirable geometrical properties for generalization, such as flatness. These rotations are a consequence of network depth, and we prove that for any network with depth >1> 1, unstable growth in parameters cause rotations in the principal components of the Hessian, which promote exploration of the parameter space away from unstable directions. Our empirical studies reveal an implicit regularization effect in gradient descent with large learning rates operating beyond the stability threshold. We find these lead to excellent generalization performance on modern benchmark datasets.

1616Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

[openreview] [pdf]

Abstract The current paradigm for safety alignment of large language models (LLMs) follows aone-size-fits-allapproach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model withstaticsafety standards too restrictive to be useful, as well as too costly to be re-aligned.We proposeControllable Safety Alignment(CoSA), a framework designed to adapt models to diverse safety requirements without re-training. Instead of aligning a fixed model, we align models to followsafety configs—free-form natural language descriptions of the desired safety behaviors—that are provided as part of the system prompt. To adjust model safety behavior, authorized users only need to modify such safety configs at inference time. To enable that, we propose CoSAlign, a data-centric method for aligning LLMs to easily adapt to diverse safety configs. Furthermore, we devise a novel controllability evaluation protocol that considers both helpfulness and configured safety, summarizing them into CoSA-Score, and construct CoSApien, ahuman-authoredbenchmark that consists of real-world LLM use cases with diverse safety requirements and corresponding evaluation prompts.We show that CoSAlign leads to substantial gains of controllability over strong baselines including in-context alignment. Our framework encourages better representation and adaptation to pluralistic human values in LLMs, and thereby increasing their practicality.

1617TIS-DPO: Token-level Importance Sampling for Direct Preference Optimization With Estimated Weights

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has been widely adopted for preference alignment of Large Language Models (LLMs) due to its simplicity and effectiveness. However, DPO is derived as a bandit problem in which the whole response is treated as a single arm, ignoring the importance differences between tokens, which may affect optimization efficiency and make it difficult to achieve optimal results. In this work, we propose that the optimal data for DPO has equal expected rewards for each token in winning and losing responses, as there is no difference in token importance. However, since the optimal dataset is unavailable in practice, we propose using the original dataset for importance sampling to achieve unbiased optimization. Accordingly, we propose a token-level importance sampling DPO objective named TIS-DPO that assigns importance weights to each token based on its reward. Inspired by previous works, we estimate the token importance weights using the difference in prediction probabilities from a pair of contrastive LLMs. We explore three methods to construct these contrastive LLMs: (1) guiding the original LLM with contrastive prompts, (2) training two separate LLMs using winning and losing responses, and (3) performing forward and reverse DPO training with winning and losing responses. Experiments show that TIS-DPO significantly outperforms various baseline methods on harmlessness and helpfulness alignment and summarization tasks. We also visualize the estimated weights, demonstrating their ability to identify key token positions.

1618Transformers are Universal In-context Learners

[openreview] [pdf]

Abstract Transformers are deep architectures that define ``in-context mappings’’ which enable predicting new tokens based on a given set of tokens (such as a prompt in NLP applications or a set of patches for a vision transformer). In this work, we study in particular the ability of these architectures to handle an arbitrarily large number of context tokens. To mathematically, uniformly address their expressivity, we consider the case that the mappings are conditioned on a context represented by a probability distribution of tokens which becomes discrete for a finite number of these. The relevant notion of smoothness then corresponds to continuity in terms of the Wasserstein distance between these contexts. We demonstrate that deep transformers are universal and can approximate continuous in-context mappings to arbitrary precision, uniformly over compact token domains. A key aspect of our results, compared to existing findings, is that for a fixed precision, a single transformer can operate on an arbitrary (even infinite) number of tokens. Additionally, it operates with a fixed embedding dimension of tokens (this dimension does not increase with precision) and a fixed number of heads (proportional to the dimension). The use of MLPs between multi-head attention layers is also explicitly controlled. We consider both unmasked attentions (as used for the vision transformer) and masked causal attentions (as used for NLP and time series applications). We tackle the causal setting leveraging a space-time lifting to analyze causal attention as a mapping over probability distributions of tokens.

1619FairDD: Fair Dataset Distillation via Adversarial Matching

[openreview] [pdf]

Abstract Condensing large datasets into smaller synthetic counterparts has demonstrated its promise for image classification. However, previous research has overlooked a crucial concern in image recognition: ensuring that models trained on condensed datasets are unbiased towards protected attributes (PA), such as gender and race. Our investigation reveals that dataset distillation (DD) fails to alleviate the unfairness towards minority groups within original datasets. Moreover, this bias typically worsens in the condensed datasets due to their smaller size. To bridge the research gap, we propose a novel fair dataset distillation (FDD) framework, namely FairDD, which can be seamlessly applied to diverse matching-based DD approaches, requiring no modifications to their original architectures. The key innovation of FairDD lies in adversarially matching synthetic datasets to PA-wise groups of original datasets simultaneously, rather than indiscriminate alignment to the whole distributions in vanilla DDs, dominated by majority groups. This adversarial matching allows synthetic datasets to avoid collapsing into majority groups and bootstrap their balanced generation to all PA groups. Consequently, FairDD could effectively regularize vanilla DDs to favor biased generation toward minority groups while maintaining the accuracy of target attributes. Theoretical analyses and extensive experimental evaluations demonstrate that FairDD significantly improves fairness compared to vanilla DD methods, without sacrificing classification accuracy. Its consistent superiority across diverse DDs, spanning Distribution and Gradient Matching, establishes it as a versatile FDD approach.

1620SHIFTING TIME: TIME-SERIES FORECASTING WITH KHATRI-RAO NEURAL OPERATORS

[openreview] [pdf]

Abstract We present an operator-theoretic framework for time-series forecasting that involves learning a continuous time-shift operator associated with temporal and spatio-temporal problems. The time-shift operator learning paradigm offers a continuous relaxation of the discrete lag factor used in traditional autoregressive models enabling the history of a function up to a given time to be mapped to its future values. To parametrize the operator learning problem, we propose Khatri-Rao neural operators -- a new architecture for defining non-stationary integral transforms which achieves almost linear cost on spatial and spatio-temporal problems. From a practical perspective, the advancements made in this work allow us to handle irregularly sampled observations and forecast at super-resolution in both space and time. Detailed numerical studies across a wide range of temporal and spatio-temporal benchmark problems suggest that the proposed approach is highly scalable and provides results that compares favourably with the state-of-the-art methods.

1621Faster, More Efficient RLHF through Off-Policy Asynchronous Learning

[openreview] [pdf]

Abstract The dominant paradigm for RLHF isonlineandon-policyRL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM’s own outputs. While performant, it is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online butoff-policyRLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF on instruction following tasks 40% faster than a synchronous run while matching final performance measured with GPT-4o.

1622A Pontryagin Perspective on Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning has traditionally focused on learning state-dependent policies to solve optimal control problems in aclosed-loopfashion. In this work, we introduce the paradigm ofopen-loop reinforcement learningwhere a fixed action sequence is learned instead. We present three new algorithms: one robust model-based method and two sample-efficient model-free methods. Rather than basing our algorithms on Bellman’s equation from dynamic programming, our work builds onPontryagin’s principlefrom the theory of open-loop optimal control. We provide convergence guarantees and evaluate all methods empirically on a pendulum swing-up task, as well as on two high-dimensional MuJoCo tasks, demonstrating remarkable performance compared to existing baselines.

1623An Information-Theoretic Approach to Diversity Evaluation of Prompt-based Generative Models

[openreview] [pdf]

Abstract Text-guided sample generation schemes are commonly evaluated based on the quality of generated data and their alignment with the input text prompt. On the other hand, several applications of prompt-based generative models require adequate diversity in the generated data to ensure the models’ capability of generating samples possessing a variety of features. However, the existing diversity metrics are designed for unconditional generative models, and thus cannot distinguish the diversity of created data because of the variety of text prompts and the diversity contributed by the generation model. In this work, our goal is to quantify the prompt-caused and model-caused diversity of samples produced by a prompt-based generative model. We propose an information-theoretic approach to the model’s internal diversity quantification, where we decompose the kernel-based entropy H(X)H(X) of generated data XX, to the sum of conditional entropy H(XT)H(X|T) given text variable TT and the mutual information I(X;T)I(X;T) between the text and data variables. We utilize the conditional entropy H(XT)H(X|T) to define theConditional-Vendiscore for the quantification of the internal diversity of the model and prove theoretical results interpreting the application of these scores to mixture distributions. We perform several numerical experiments to show the correlation between the Conditional-Vendi score and standard text-based generative models. We also demonstrate the application of the proposed framework to detect inequities in text-based sample generation schemes.

1624A Large Deviation Theory Analysis on the Implicit Bias of SGD

[openreview] [pdf]

Abstract Stochastic Gradient Descent (SGD) plays a key role in training deep learning models, yet its ability to implicitly regularize and enhance generalization remains an open theoretical question. We apply Large Deviation Theory (LDT) to analyze why SGD selects models with strong generalization properties. We show that the generalization error jointly depends on the level of concentration of its empirical loss around its expected value and the \textit{abnormality} of the random deviations stemming from the stochastic nature of the training data observation process. Our analysis reveals that SGD gradients are inherently biased toward models exhibiting more concentrated losses and less abnormal and smaller random deviations. These theoretical insights are empirically validated using deep convolutional neural networks, confirming that mini-batch training acts as a natural regularizer by preventing convergence to models with high generalization errors.

1625Adaptive Uncertainty-Aware Reinforcement Learning from Human Feedback

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) is a popular technique to align large language models (LLMs) to human preferences. It requires learning a reward model that predicts scalar values given a generated text sequence, acting as a proxy for human preference scores. A central problem of RLHF is \textit{reward hacking}, i.e., overoptimization. LLMs can easily exploit the reward model by generating text that can receive high scores but no longer align with human preferences. We address this problem by proposing a new objective which adapts the tradeoff between reward model score and regularisation based on reward uncertainty. We hypothesize that when the reward model uncertainty is low, RLHF should make a larger step size by lowering the regularization coefficient. On the other hand, when the uncertainty is high, optimization should slow down by staying closer to the original model. We present a novel re-formulation of the RLHF objective and derive our approach from its generalization to account for reward model variance. We demonstrate that our uncertainty-aware RLHF objective mitigates overoptimization and outperforms vanilla RLHF by 50% on a standard summarization task.

1626Dist Loss: Enhancing Regression in Few-Shot Region through Distribution Distance Constraint

[openreview] [pdf]

Abstract Imbalanced data distributions are prevalent in real-world scenarios, posing significant challenges in both imbalanced classification and imbalanced regression tasks. They often cause deep learning models to overfit in areas of high sample density (many-shot regions) while underperforming in areas of low sample density (few-shot regions). This characteristic restricts the utility of deep learning models in various sectors, notably healthcare, where areas with few-shot data hold greater clinical relevance. While recent studies have shown the benefits of incorporating distribution information in imbalanced classification tasks, such strategies are rarely explored in imbalanced regression. In this paper, we address this issue by introducing a novel loss function, termed Dist Loss, designed to minimize the distribution distance between the model’s predictions and the target labels in a differentiable manner, effectively integrating distribution information into model training. Dist Loss enables deep learning models to regularize their output distribution during training, effectively enhancing their focus on few-shot regions. We have conducted extensive experiments across three datasets spanning computer vision and healthcare: IMDB-WIKI-DIR, AgeDB-DIR, and ECG-Ka-DIR. The results demonstrate that Dist Loss effectively mitigates the negative impact of imbalanced data distribution on model performance, achieving state-of-the-art results in sparse data regions. Furthermore, Dist Loss is easy to integrate, complementing existing methods. Our code will be made publicly available following the review process.

1627Machine Unlearning for Streaming Forgetting

[openreview] [pdf]

Abstract Machine unlearning aims to remove knowledge derived from the specific training data that are requested to be forgotten in a well-trained model while preserving the knowledge learned from the remaining training data. Currently, machine unlearning methods typically handle all forgetting data in a single batch, removing the corresponding knowledge all at once upon request. However, in practical scenarios, requests for data removal often arise in a streaming manner rather than in a single batch, leading to reduced efficiency and effectiveness in existing methods. Such challenges of streaming forgetting have not been the focus of much research. In this paper, to address the challenges of performance maintenance, efficiency, and data access brought about by streaming unlearning requests, we introduce an online unlearning paradigm, formalizing the unlearning as a distribution shift problem. We then estimate the altered distribution and propose a novel online unlearning algorithm to achieve efficient streaming forgetting without requiring access to the original training data. Theoretical analyses confirm an O(VTT+ΔT)O(V_T\sqrt{T} + \Delta_T) error bound on the streaming unlearning regret, where VTV_T represents the cumulative total variation in the optimal solution over TT learning rounds and ΔT\Delta_T represents the cumulative total divergence between remaining and forgetting data distributions. This theoretical guarantee is achieved under mild conditions without the strong restriction of convex loss function. Experiments across various models and datasets validate the performance of our proposed method.

1628OCS+: Improving PTQ with Outlier Translation

[openreview] [pdf]

Abstract Post-training quantization (PTQ) is an effective technique for accelerating DNN model inference, where activations typically follow a bell-shaped distribution. Since commodity hardware employs a linear quantization grid and limited quantization levels, prior PTQs optimize a clipping threshold to minimize overall quantization error, which excludes outliers from the bell-shaped data. However, outliers are non-trivial for low-bit and lightweight models. Thus OCS (Zhao et al.,2019) proposed to save outliers by halving and duplicating. However, in activation quantization, the original OCS sacrifices the precision of the regular inliers, leading to severe accuracy degradation. To address this, we propose OCS+ to save outlier activation without affecting the regular inliers. Consequently, OCS+ theoretically achieves one-bit higher representation under the predefined bitwidth hardware. OCS+ is based on offline mathematical transformation, thus it does not require additional training or re-design works on hardware. Experiments over CNNs and ViTs demonstrate OCS+ significantly outperforms OCS and help improve current PTQ SOTAs, e.g., OCS+ improves the current SOTAs by 12.73% in Acc@1 for W2A2 MobileNet-v2. The code will be released.

1629Addressing Extrapolation Error in Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Cooperative Multi-Agent Reinforcement Learning (MARL) has become a critical tool for addressing complex real-world problems. However, scalability remains a significant challenge due to the exponentially growing joint action space. In our analysis, we highlight a critical but often overlooked issue:extrapolation error, which arises when unseen state-action pairs are inaccurately assigned unrealistic values, severely affecting performance. We demonstrate that the success of value factorization methods can be largely attributed to their ability to mitigate this error. Building on this insight, we introduce multi-step bootstrapping and ensemble techniques to further reduce extrapolation errors, showing that straightforward modifications can lead to substantial performance improvements. Our findings underscore the importance of recognizing extrapolation error in MARL and highlight the potential of exploring simpler methods to advance the field.

1630DiverseFlow: Sample-Efficient Diverse Mode Coverage in Flows

[openreview] [pdf]

Abstract Many real-world applications of flow generative models desire a diverse set of samples covering multiple modes of the target distribution. However, the predominant approach for obtaining diverse sets is not sample-efficient, as it involves independently obtaining many samples from the source distribution and mapping them through the flow until the desired mode coverage is achieved. As an alternative to repeated sampling, we introduce DiverseFlow---a training-free, inference-time approach to improve the diversity of flow models. Our key idea is to employ a determinantal point process to induce a coupling between the samples and drive sample diversity under a fixed sampling budget. We demonstrate the efficacy of DiverseFlow for tasks where sample efficient diversity is highly desirable---text-guided image generation with polysemous words, inverse problems like large-hole inpainting, and class-conditional image synthesis.

1631Be More Diverse than the Most Diverse: Online Selection of Diverse Mixtures of Generative Models

[openreview] [pdf]

Abstract The availability of multiple training algorithms and architectures for generative models requires a selection mechanism to form a single model over a group of well-trained generation models. The selection task is commonly addressed by identifying the model that maximizes an evaluation score based on the diversity and quality of the generated data. However, such a best-model identification approach overlooks the possibility that a mixture of available models can outperform each individual model. In this work, we explore the selection of a mixture of multiple generative models and formulate a quadratic optimization problem to find an optimal mixture model achieving the maximum of kernel-based evaluation scores including kernel inception distance (KID) and Renyi kernel entropy (RKE). To identify the optimal mixture of the models using the fewest possible sample queries, we propose an online learning approach calledMixture Upper Confidence Bound (Mixture-UCB). Specifically, our proposed online learning method can be extended to every convex quadratic function of the mixture weights, for which we prove a concentration bound to enable the application of the UCB approach. We prove a regret bound for the proposed Mixture-UCB algorithm and perform several numerical experiments to show the success of the proposed Mixture-UCB method in finding the optimal mixture of text-based and image-based generative models.

1632Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation.

1633Label Correlation Biases Direct Time Series Forecast

[openreview] [pdf]

Abstract Time series modeling is uniquely challenged by the presence of autocorrelation in both historical and label sequences. Current research predominantly focuses on handling autocorrelation within the historical sequence but often neglects its presence in the label sequence. Specifically, emerging forecast models mainly conform to the direct forecast (DF) paradigm, generating multi-step forecasts under the assumption of conditional independence within the label sequence. This assumption disregards the inherent autocorrelation in the label sequence, thereby limiting the performance of DF-based models. In response to this gap, we introduce the Frequency-enhanced Direct Forecast (FreDF), which bypasses the complexity of label autocorrelation by learning to forecast in the frequency domain. Our experiments demonstrate that FreDF substantially outperforms existing state-of-the-art methods and is compatible with a variety of forecast models. Code is available athttps://anonymous.4open.science/r/FreDF-0FB1.

1634Reinforced In-Context Black-Box Optimization

[openreview] [pdf]

Abstract Black-Box Optimization (BBO) has found successful applications in many fields of science and engineering. Recently, there has been a growing interest in meta-learning particular components of BBO algorithms to speed up optimization and get rid of tedious hand-crafted heuristics. As an extension, learning the entire algorithm from data requires the least labor from experts and can provide the most flexibility. In this paper, we propose RIBBO, a method to reinforce-learn a BBO algorithm from offline data in an end-to-end fashion. RIBBO employs expressive sequence models to learn the optimization histories produced by multiple behavior algorithms and tasks, leveraging the in-context learning ability of large models to extract task information and make decisions accordingly. Central to our method is to augment the optimization histories withregret-to-gotokens, which are designed to represent the performance of an algorithm based on cumulative regret over the future part of the histories. The integration of regret-to-go tokens enables RIBBO to automatically generate sequences of query points that satisfy the user-desired regret, which is verified by its universally good empirical performance on diverse problems, including BBO benchmark functions, hyper-parameter optimization and robot control problems.

1635Communication-Efficient Federated Learning via Model-Agnostic Projection Adaptation

[openreview] [pdf]

Abstract Federated learning (FL) enables collaborative model training across distributed clients without centralizing sensitive raw data while benefiting from diverse data sources. Despite recent advancements in FL, the communication overhead remains a significant challenge, especially for large-scale models. Recent low-rank adaptation (LoRA) techniques have shown promise in reducing these burdens in FL, but they are typically applied to each layer individually and depend on the model architecture, which limits their performance. To address these shortcomings, we propose Model-Agnostic Projection Adaptation (MAPA), a novel approach that applies factorization to the entire model parameter space, which we view as asingle vector, regardless of the number of layers and model architecture. MAPA factorizes the single-vector model update into a fixedreconstruction matrixand a trainableprojection vector, with the reconstruction matrix being randomly initialized using a shared seed at each round. This ensures thatonlythe projection vectors need to be communicated to the server, thereby reducing the communication cost. Furthermore, MAPA’s vector-based representation and relaxed rank constraints allow for a larger reconstruction matrix and smaller projection vector dimensions compared to LoRA, enhancing the expressiveness of model updates while significantly reducing communication overhead. Experimental results demonstrate that MAPA outperforms existing FL methods in both communication efficiency and model performance, effectively coupling optimization and communication efficiency in FL environments.

1636Evaluating the Goal-Directedness of Large Language Models

[openreview] [pdf]

Abstract LLM-based agents may transform AI and society in the near future. Along with opportunities for automation and increased productivity come novel safety and ethics concerns. This means both researchers and regulators need good ways to keep track of progress and properties of LLM-based agents. A key feature of agentic behaviour is goal-directedness, which has so far received limited attention in the context of AI agents. In this work we define the concept of goal-directedness for LLM agents, and develop a framework for evaluating it empirically on tasks involving information gathering, information processing, and execution. Results on state-of-the-art LLM agents indicate a lack of goal-directedness, meaning models often fail to fully deploy capabilities that they evidently have. This raises the question of how we can elicit the full capabilities of LLM-based agents, as well as what policies should be in place for future more goal-directed systems.

1637Machine Unlearning for Contrastive Learning under Auditing

[openreview] [pdf]

Abstract Machine unlearning offers effective solutions for revoking the influence of specific training data on pre-trained model parameters. While existing approaches address unlearning for classification and generative models, they overlook an important category of machine learning models: contrastive learning (CL) methods. This paper addresses this gap by introducing the Machine Unlearning for Contrastive Learning (MUC) framework and adapting existing methods. We identify limitations in current approaches, noting that several methods perform inadequately as unlearners and that existing auditing tools insufficiently validate unlearning effects in contrastive learning. To address these issues, we propose Alignment Calibration (AC), a novel method that explicitly considers contrastive learning properties and optimizes towards new auditing metrics for easy verification of unlearning. Through empirical comparisons with baseline methods on SimCLR, MoCo, and CLIP, we demonstrate that AC: (1) achieves state-of-the-art performance, approximating exact unlearning (retraining); (2) enables data owners to clearly visualize unlearning effects through black-box auditing.

[openreview] [pdf]

Abstract The AlphaZero/MuZero (A/MZ) family of algorithms has achieved remarkable success across various challenging domains by integrating Monte Carlo Tree Search (MCTS) with learned models. Learned models introduce epistemic uncertainty, which is caused by learning from limited data and is useful for exploration in sparse reward environments. MCTS does not account for the propagation of this uncertainty however. To address this, we introduce Epistemic MCTS (EMCTS): a theoretically motivated approach to account for the epistemic uncertainty in search and harness the search for deep exploration. In the challenging sparse-reward task of writing code in the Assembly language SUBLEQ, AZ paired with our method achieves significantly higher sample efficiency over baseline AZ. Search with EMCTS solves variations of the commonly used hard-exploration benchmark Deep Sea - which baseline A/MZ are practically unable to solve - much faster than an otherwise equivalent method that does not use search for uncertainty estimation, demonstrating significant benefits from search for epistemic uncertainty estimation.

1639Utility as Fair Pricing

[openreview] [pdf]

Abstract In 2018, researchers proposed the use of generalized entropy indices as a unified approach to quantifying algorithmicunfairnessat both the group and individual levels. Using this metric they empirically evidenced a trade-off between group and individual fairness. The definition of the index introduces an array of new parameters; thus, while the construction of the metric is principled, its behavior is opaque. Since its publication, the metric has been highly reproduced in the literature, researched and implemented in open source libraries by IBM, Microsoft and Amazon; thus demonstrating traction among researchers, educators and practitioners. Advice or grounded justification around appropriate parameter selection, however, remains scarce. Nevertheless, the metric has been implemented in libraries with default or hard-coded parameter settings from the original paper with little to no explanation.In this article we take an intentionally data agnostic (rational, rather than empirical) approach to understanding the index, illuminating its behavior with respect to different error distributions and costs, and the effect of placing constraints on it. By adding the simple requirement that the the resulting fairness metric should be independent of model accuracy, we demonstrate consistency between cost sensitive learning and individual fairness in this paradigm. By viewing a classification decision as a transaction between the individual and the decision maker, and accounting for both perspectives, we prove that, with careful parameter selection, the concepts of utility and (group and individual) fairness can be firmly aligned, establishing generalized entropy indices as a natural extension of utility, in the quest to mitigate bias in machine learning.

1640Optimizing 4D Gaussians for Dynamic Scene Video from Single Landscape Images

[openreview] [pdf]

Abstract To achieve realistic immersion in landscape images, fluids such as water and clouds need to move within the image while revealing new scenes from various camera perspectives. Recently, a field called dynamic scene video has emerged, which combines single image animation with 3D photography. These methods use pseudo 3D space, implicitly represented with Layered Depth Images (LDIs). LDIs separate a single image into depth-based layers, which enables elements like water and clouds to move within the image while revealing new scenes from different camera perspectives. However, as landscapes typically consist of continuous elements, including fluids, the representation of a 3D space separates a landscape image into discrete layers, and it can lead to diminished depth perception and potential distortions depending on camera movement. Furthermore, due to its implicit modeling of 3D space, the output may be limited to videos in the 2D domain, potentially reducing their versatility. In this paper, we propose representing a complete 3D space for dynamic scene video by modeling explicit representations, specifically 4D Gaussians, from a single image. The framework is focused on optimizing 3D Gaussians by generating multi-view images from a single image and creating 3D motion to optimize 4D Gaussians. The most important part of proposed framework is consistent 3D motion estimation, which estimates common motion among multi-view images to bring the motion in 3D space closer to actual motions. As far as we know, this is the first attempt that considers animation while representing a complete 3D space from a single landscape image. Our model demonstrates the ability to provide realistic immersion in various landscape images through diverse experiments and metrics. Extensive experimental results arehttps://anonymous.4open.science/r/ICLR_3D_MOM-7B9E/README.md.

1641How Transformers Implement Induction Heads: Approximation and Optimization Analysis

[openreview] [pdf]

Abstract Transformers exhibit exceptional in-context learning capabilities, yet the theoretical understanding of the underlying mechanisms remain limited. A recent work (Elhage et al., 2021) identified a “rich” in-context mechanism known as induction head, contrasting with “lazy” nn-gram models that overlook long-range dependencies. In this work, we provide both approximation and optimization analyses of how transformers implement induction heads. In the approximation analysis, we formalize both standard and generalized induction head mechanisms, and examine whether two-layer single- or multi-head transformers can efficiently implement them, with an emphasis on the distinct role of each transformer submodule. For the optimization analysis, we study the training dynamics on a synthetic mixed target, composed of a 4-gram and an in-context 2-gram component. This setting enables us to precisely characterize the entire training process and uncover anabrupt transitionfrom lazy (4-gram) to rich (induction head) mechanisms as training progresses.

1642Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

[openreview] [pdf]

Abstract Recent studies have shown that the denoising process in (generative) diffusion models can induce meaningful (discriminative) representations inside the model, though the quality of these representations still lags behind those learned through recent self-supervised learning methods. We argue that one main bottleneck in training large-scale diffusion models for generation comes down to learning these good representations, and training can become significantly easier when the model is aided by strong external visual representations. We study this by introducing a straightforward regularization called REPresentation Alignment (REPA), which aligns the projections of noisy input hidden states in denoising networks with clean image representations obtained from external, pretrained visual encoders. The results are striking: our simple strategy yields significant improvements in both training efficiency and generation quality when applied to popular diffusion and flow-based transformers, such as DiTs and SiTs. For instance, our method can speed up SiT training by over 17.5×\times, matching the performance (without classifier-free guidance) of a SiT-XL model trained for 7M steps in less than 400K steps. Additionally, in terms of final generation quality, our approach achieves a superior FID score of 1.80 when using guidance.

1643Balancing Bias in Two-sided Markets for Fair Stable Matchings

[openreview] [pdf]

Abstract The Balanced Stable Marriage (BSM) problem aims to find a stable matching in a two-sided market that minimizes the maximum dissatisfaction among two sides. The classical Deferred Acceptance algorithm merely produces an unfair stable marriage, providing optimal partners for one side while partially assigning pessimal partners to the other. Solving BSM is NP-hard, thwarting attempts to resolve the problem exactly. As the instance size increases in practice, recent studies have explored heuristics for finding a fair stable marriage but have not found an exact optimal solution for BSM efficiently. Nevertheless, in this paper we propose an efficient algorithm, Isorropia, that returns the exact optimal solution to practical BSM problem instances. Isorropia constructs two sets of candidate rotations from which it builds three sets of promising antichains, and performs local search on those three sets of promising antichains. Our extensive experimental study shows that Isorropia surpasses the time-efficiency of baselines that return the exact solution by up to three orders of magnitude.

1644Towards Understanding Robustness and Generalization in World Models

[openreview] [pdf]

Abstract World model has recently emerged as a promising approach to reinforcement learning (RL), as evidenced by the recent successes that world model based agents achieve state-of-the-art performance on a wide range of visual control tasks. This work aims to obtain a deep understanding of the robustness and generalization capabilities of world models. Thus motivated, we develop a stochastic differential equation formulation by treating the world model learning as a stochastic dynamical system in the latent state space, and characterize the impact of latent representation errors on robustness and generalization, for both cases with zero-drift representation errors and with non-zero-drift representation errors. Our somewhat surprising findings, based on both theoretic and experimental studies, reveal that for the case with zero drift, modest latent representation errors can in fact function as implicit regularization and hence result in improved robustness. We further propose a Jacobian regularization scheme to mitigate the compounding error propagation effects of non-zero drift, thereby enhancing training stability and robustness. Our extensive experimental studies corroborate that this regularization approach not only stabilizes training but also accelerates convergence and improves accuracy of long-horizon prediction.

1645A Markov decision process for variable selection in Branch and bound

[openreview] [pdf]

Abstract Mixed-Integer Linear Programming (MILP) is a powerful framework used to address a wide range of NP-hard combinatorial optimization problems, often solved by Branch and bound (B&B). A key factor influencing the performance of B&B solvers is the variable selection heuristic governing branching decisions. Recent contributions have sought to adapt reinforcement learning (RL) algorithms to the B&B setting to learn optimal branching policies, through Markov Decision Processes (MDP) inspired formulations, and ad hoc convergence theorems and algorithms. In this work, we introduce B&B MDPs, a principled vanilla MDP formulation for variable selection in B&B, allowing to leverage a broad range of RL algorithms for the purpose of learning optimal B&B heuristics. Computational experiments validate our model empirically, as our branching agent outperforms prior state-of-the-art RL agents on four standard MILP benchmarks.

1646No Free Lunch: Retrieval-Augmented Generation Undermines Fairness in LLMs, Even for Vigilant Users

[openreview] [pdf]

Abstract Retrieval-Augmented Generation (RAG) is widely adopted for its effectiveness and cost-efficiency in mitigating hallucinations and enhancing the domain-specific generation capabilities of large language models (LLMs). However, is this effectiveness and cost-efficiency truly a free lunch? In this study, we comprehensively investigate the fairness costs associated with RAG by proposing a practical three-level threat model from the perspective of user awareness of fairness. Specifically, varying levels of user fairness awareness result in different degrees of fairness censorship on the external dataset. We examine the fairness implications of RAG using uncensored, partially censored, and fully censored datasets. Our experiments demonstrate that fairness alignment can be easily undermined through RAGwithout the need for fine-tuning or retraining.Even with fully censored and supposedly unbiased external datasets, RAG can lead to biased outputs.Our findings underscore the limitations of current alignment methods in the context of RAG-based LLMs and highlight the urgent need for new strategies to ensure fairness. We propose potential mitigations and call for further research to develop robust fairness safeguards in RAG-based LLMs.

1647Regret measure in continuous time limit for a stochastic Multi-armed bandit problem

[openreview] [pdf]

Abstract We study a class of stochastic multi-armed bandit problems with a risk-sensitive regret measure within a continuous limit setting. This problem is interesting when optimizing the expected reward is not the foremost objective, and the problem horizon is long. Through scaling the state parameters, including the number of pulls and cumulative reward for each arm we study the bandit problem with infinite horizon, we delineate such risk using a Hamilton-Jacobi-Bellman equation with quadratic growth. Using this approach, we establish an explicit form of the optimal policy associated with the considered risk. As an application, we present examples where the results obtained in continuous time offer insights into the optimal policy for each case. Finally, numerical experiments confirm the theoretical results are presented.

1648Personalized Federated Fine-tuning for Heterogeneous Data: a Two-Level Low Rank Adaptation Approach

[openreview] [pdf]

Abstract We study the personalized federated fine-tuning task with heterogeneous client data in the context of foundation models, where clients collaboratively fine-tune a foundation model (e.g., BERT, GPT) without sharing their local data, achieving personalized models simultaneously. While recent efforts have applied parameter-efficient fine-tuning techniques like low-rank adaptation (LoRA) or training prompts in federated settings, they often overlook data heterogeneity and model personalization. The primary challenge is that a single common adapter or prompt learner may not suffice for the diverse data of all clients. To address this issue, we propose PF2LoRA, a new personalized federated fine-tuning algorithm based on a novel \emph{ two-level low rank adaptation framework} on top of LoRA. Given the pretrained foundation model whose weight is frozen, our algorithm aims to learn two levels of adaptation simultaneously: the first level aims to learn a common adapter for all clients, while the second level fosters individual client personalization. This framework explicitly accommodates variations in adapter matrix ranks across clients and introduces minimal additional memory overhead, as the second-level adaptation comprises a small number of parameters compared to the first level. Our experiments on natural language understanding and generation tasks demonstrate that PF2LoRA significantly outperforms existing federated fine-tuning methods.

1649Robust Transfer of Safety-Constrained Reinforcement Learning Agents

[openreview] [pdf]

Abstract Reinforcement learning (RL) often relies on trial and error, which may cause undesirable outcomes. As a result, standard RL is inappropriate for safety-critical applications. To address this issue, one may train a safe agent in a controlled environment (where safety violations are allowed) and then transfer it to the real world (where safety violations may have disastrous consequences). Prior work has made this transfer safe as long as the new environment preserves the safety-related dynamics. However, in most practical applications, differences or shifts in dynamics between the two environments are inevitable, potentially leading to safety violations after the transfer. This work aims to guarantee safety even when the new environment has different (safety-related) dynamics. In other words, we aim to make the process of safe transfer robust. Our methodology (1) robustifies an agent in the controlled environment and (2) provably provides---under mild assumption---a safe transfer to new environments. The empirical evaluation shows that this method yields policies that are robust against changes in dynamics, demonstrating safety after transfer to a new environment.

1650improve weakly supervised visual grounding by learning where to focus on

[openreview] [pdf]

Abstract Visual grounding is a crucial task for connecting visual and language descriptions by identifying target objects based on language entities. However, fully supervised methods require extensive annotations, which can be challenging and time-consuming to obtain. Weakly supervised visual grounding, which only relies on image-sentence association without object-level annotations, offers a promising solution. Previous approaches have mainly focused on finding the relationship between detected candidates, without considering improving object localization. In this work, we propose a novel method that leverages Grad-CAM to help the model identify precise objects. Specifically, we introduce a CAM encoder that exploits Grad-CAM information and a new loss function, attention mining loss, to guide the Grad-CAM feature to focus on the entire object. We also use an architecture which combines CNN and transformer, and a multi-modality fusion module to aggregate visual features, language features and CAM features. Our proposed approach achieves state-of-the-art results on several datasets, demonstrating its effectiveness in different scenes. Ablation studies further confirm the benefits of our architecture.

1651FastTF: 4 Parameters are All You Need for Long-term Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting is essential across various sectors, including finance, transportation, and industry. In this paper, we propose FastTF, a powerful yet lightweight model in Time-Frequency domain for long-term time series forecasting. Our aim is to push the boundary of model lightweighting and facilitate the deployment of lightweight model on resource-constrained devices. Leveraging the global nature and information compressibility of the time series in frequency domain, we introduce patch-wise downsampling, Sparse Frequency Mixer (SFM), and patch predictor to capture the temporal variations of frequency components across different patches. Experimental results on five public datasets demonstrate that FastTF with very few parameters outperforms several state-of-the-art models and demonstrates a strong generalization capability. Notably, on the ETTh1 dataset, FastTF with only 4 parameters achieves a performance that is close to the DLinear and FITS in the horizon-96 forecasting. Furthermore, we deployed our model on a FPGA development board (Zynq UltraScale+ RFSoC ZCU208 Evaluation Kit), where the corresponding resource usage statistics illustrate that our model has a very low computational overhead and latency, making it easily implemented on hardware devices.

1652On the ergodic convergence properties of the Peaceman-Rachford method and their applications in solving linear programming

[openreview] [pdf]

Abstract In this paper, we study the ergodic convergence properties of the Peaceman-Rachford (PR) method with semi-proximal terms for solving convex optimization problems (COPs). By reformulating the PR method as a degenerate proximal point method, for the first time we establish the global convergence of the ergodic sequence generated by the PR method with broadly chosen semi-proximal terms under the assumption that there exists a Karush–Kuhn–Tucker (KKT) solution to the COPs. This result represents a significant departure from previous studies on the non-ergodic convergence of the PR method, which typically requires strong convexity (or strong monotonicity in the reformulated operator) conditions that are hardly satisfied for COPs. Moreover, we establish an ergodic iteration complexity of O(1/k)O(1/k) of the PR method with semi-proximal terms, measured by the objective error, the feasibility violation, and the KKT residual using the ε\varepsilon-subdifferential. Based on these convergence properties, we introduce the solver EPR-LP, using the ergodic sequence of the PR method with semi-proximal terms for solving linear programming (LP) problems. EPR-LP incorporates an adaptive restart strategy and dynamic penalty parameter updates for efficiency and robustness. Extensive numerical experiments on LP benchmark datasets, executed on a high-performance GPU, show that our Julia-based solver outperforms the award-winning solver PDLP at a tolerance level of 10-8.

1653Towards Continuous Reuse of Graph Models via Holistic Memory Diversification

[openreview] [pdf]

Abstract This paper addresses the challenge of incremental learning in growing graphs with increasingly complex tasks. The goal is to continually train a graph model to handle new tasks while retaining its inference ability on previous tasks. Existing methods usually neglect the importance of memory diversity, limiting in effectively selecting high-quality memory from previous tasks and remembering broad previous knowledge within the scarce memory on graphs. To address that, we introduce a novel holistic Diversified Memory Selection and Generation (DMSG) framework for incremental learning in graphs, which first introduces a buffer selection strategy that considers both intra-class and inter-class diversities, employing an efficient greedy algorithm for sampling representative training nodes from graphs into memory buffers after learning each new task. Then, to adequately rememorize the knowledge preserved in the memory buffer when learning new tasks, we propose a diversified memory generation replay method. This method first utilizes a variational layer to generate the distribution of buffer node embeddings and sample synthesized ones for replaying. Furthermore, an adversarial variational embedding learning method and a reconstruction-based decoder are proposed to maintain the integrity and consolidate the generalization of the synthesized node embeddings, respectively. Finally, we evaluate our model on node classification tasks involving increasing class numbers. Extensive experimental results on publicly accessible datasets demonstrate the superiority of DMSG over state-of-the-art methods.

1654CC-VFed: Client Contribution Detects Byzantine Attacks in Vertical Federated Learning

[openreview] [pdf]

Abstract Vertical federated learning (VFL) is a type of federated learning where the collection of different features is shared among multiple clients, and it is attracting attention as a training method that takes into account the privacy and security of training data. On the other hand, in federated learning, there is a threat of Byzantine attacks, where some malicious clients disrupt the training of the model and output an trained model that does not exhibit the behavior that should be obtained. Thus far, numerous defense methods against Byzantine attacks on horizontal federated learning have been proposed, most of which focus on the similarity of the models generated across clients having the similar features and mitigate the attacks by excluding outliers. However, in VFL, the feature sets assigned by each client are inherently different, making similar methods inapplicable, and there is little existing research in this area. In light of the above, this paper organizes and classifies feasible Byzantine attacks and proposes a new defense method CC-VFed against these attack methods. Firstly, this paper organizes and classifies attack methods that contaminate training data, demonstrating that sign-flipping attacks pose a threat to VFL. Subsequently, in order to capture the differences in client features, this paper proposes a method for detecting and neutralizing malicious clients based on their contribution to output labels, demonstrating that it is indeed possible to defend Byzantine attacks in VFL.

1655A Likelihood Based Approach to Distribution Regression Using Conditional Deep Generative Models

[openreview] [pdf]

Abstract In this work, we explore the theoretical properties of conditional deep generative models under the statistical framework of distribution regression where the response variable lies in a high-dimensional ambient space but concentrates around a potentially lower-dimensional manifold. More specifically, we study the large-sample properties of a likelihood-based approach for estimating these models. Our results lead to the convergence rate of a sieve maximum likelihood estimator (MLE) for estimating the conditional distribution (and its devolved counterpart) of the response given predictors in the Hellinger (Wasserstein) metric. Particularly, our rate depends solely on the intrinsic dimension and smoothness of the true conditional distribution. These findings provide an explanation of why conditional deep generative models can circumvent the curse of dimensionality from the perspective of statistical foundations and demonstrate that they can learn a broader class of nearly singular conditional distributions. Our analysis also emphasizes the importance of introducing a small noise perturbation to the data when they are supported sufficiently close to a manifold. Finally, in our numerical studies, we demonstrate the effective implementation of the proposed approach using both synthetic and real-world datasets, which also provide complementary validation to our theoretical findings.

1656Timer-XL: Long-Context Transformers for Unified Time Series Forecasting

[openreview] [pdf]

Abstract We present Timer-XL, a generative Transformer for unified time series forecasting. To uniformly predict 1D and 2D time series, we generalize next token prediction, predominantly adopted for causal generation of 1D sequences, to multivariate next token prediction. The proposed paradigm uniformly formulates various forecasting scenarios as a long-context generation problem. We opt for the generative Transformer, which can capture global-range and causal dependencies while providing contextual flexibility, to implement unified forecasting on univariate series characterized by non-stationarity, multivariate time series with complicated dynamics and correlations, and covariate-informed contexts that include both endogenous and exogenous variables. Technically, we propose a universal TimeAttention to facilitate generative Transformers on time series, which can effectively capture fine-grained intra- and inter-series dependencies of flattened time series tokens (patches) and is further strengthened by position embeddings in both temporal and variable dimensions. Timer-XL achieves state-of-the-art performance across challenging forecasting benchmarks through a unified approach. As a large time series model, it demonstrates notable model transferability by large-scale pre-training, as well as contextual flexibility in token lengths, positioning it as a one-for-all forecaster.

1657Inference-Friendly Models With MixAttention

[openreview] [pdf]

Abstract The size of the key-value (KV) cache plays a critical role in determining both the maximum context length and the number of concurrent requests supported during inference in modern language models. The KV cache size grows proportionally with the number of attention heads and the tokens processed, leading to increased memory consumption and slower inference for long inputs. In this work, we explore the use of MixAttention, a model architecture modification closely related to a blog published by Character.AI. MixAttention combines sliding window attention, where only a small subset of recent tokens is stored in the KV cache, with KV cache sharing across layers. Our experiments demonstrate that MixAttention significantly reduces memory usage and improves inference speed without sacrificing model performance in both short and long-context tasks. We also explore various configurations of this architecture, identifying those that maintain quality across evaluation metrics while optimizing resource efficiency.

1658SPIN: Self-Supervised Prompt INjection

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly used in a variety of important applications, yet their safety and reliability remain major concerns. Various adversarial and jailbreak attacks have been proposed to bypass the safety alignment and cause the model to produce harmful responses. We introduce Self-supervised Prompt INjection (SPIN) which can detect and reverse these various attacks on LLMs. Just by injecting an adaptive defense prompt at inference-time, our method is simple, effective, and compatible with existing safety-aligned models. Our benchmarks demonstrate that our system can reduce the attack success rate by up to 87.9%, while maintaining the performance on benign user requests. In addition, we discuss the situation of an adaptive attacker and show that our method is still resilient against attackers who are aware of our defense.

1659Benign Overfitting in Out-of-Distribution Generalization of Linear Models

[openreview] [pdf]

Abstract Benign overfitting refers to the phenomenon where a over-paramterized model fits the training data perfectly, including noise in the data, but still generalizes well to the unseen test data. While prior work provide a solid theoretical understanding of this phenomenon under the in-distribution setup, modern machine learning often operates in a more challenging Out-of-Distribution (OOD) regime, where the target (test) distribution can be rather different from the source (training) distribution. In this work, we take an initial step towards understanding benign overfitting in the OOD regime by focusing on the basic setup of over-parameterized linear models under covariate shift. We provide non-asymptotic guarantees proving that, when the target covariance satisfies certain structural conditions, benign overfitting occurs in standard ridge regression even under the OOD regime. We identify a number of key quantities relating source and target covariance, which govern the performance of OOD generalization. Our result is sharp, which provably recovers prior in-distribution benign overfitting guarantee (Tsigler & Bartlett, 2023), as well as under-parameterized OOD guarantee (Ge et al., 2024) when specializing to each setup. Moreover, we also present theoretical results for a more general family of target covariance matrix, where standard ridge regression only achieves a slow statistical rate of O(1/n)\mathcal{O}(1/\sqrt{n}) for the excess risk, while Principal Component Regression (PCR) is guaranteed to achieve the fast rate O(1/n)\mathcal{O}(1/n), where nn is the number of samples.

1660Why Solving Multi-agent Path Finding with Large Language Models has not Succeeded Yet

[openreview] [pdf]

Abstract With the explosive influence caused by the success of large language models (LLM), there has been an extensive amount of recent work showing that foundation models can be used to solve a large variety of tasks. However, there is very limited work that shares insights on multi-agent planning. Multi-agent planning is different from other domains by combining the difficulty of multi-agent coordination and planning, and making it hard to leverage external tools to facilitate the reasoning needed. In this paper, we focus on the problem of multi-agent path finding (MAPF), which is also known as multi-robot route planning, and study the performance of solving MAPF with LLMs. We first show the motivating success of single-agent planning and multi-agent pathfinding in an empty room map without obstacles, then the failure to plan on the harder room map and maze map of the standard MAPF benchmark. We present our position on why directly solving MAPF with LLMs has not been successful yet, and we use various experiments to support our hypothesis. Based on our results, we discussed how researchers with different backgrounds could help with this problem from different perspectives.

1661Benchmarking XAI Explanations with Human-Aligned Evaluations

[openreview] [pdf]

Abstract In this paper, we introduce PASTA (Perceptual Assessment System for explanaTion of Artificial intelligence), a novel framework for a human-centric evaluation of XAI techniques in computer vision. Our first key contribution is a human evaluation of XAI explanations on four diverse datasets—COCO, Pascal Parts, Cats Dogs Cars, and MonumAI—which constitutes the first large-scale benchmark dataset for XAI, with annotations at both the image and concept levels. This dataset allows for robust evaluation and comparison across various XAI methods. Our second major contribution is a data-based metric for assessing the interpretability of explanations. It mimics human preferences, based on a database of human evaluations of explanations in the PASTA-dataset. With its dataset and metric, the PASTA framework provides consistent and reliable comparisons between XAI techniques, in a way that is scalable but still aligned with human evaluations. Additionally, our benchmark allows for comparisons between explanations across different modalities, an aspect previously unaddressed. Our findings indicate that humans tend to prefer saliency maps over other explanation types. Moreover, we provide evidence that human assessments show a low correlation with existing XAI metrics that are numerically simulated by probing the model.

1662Attention layers provably solve single-location regression

[openreview] [pdf]

Abstract Attention-based models, such as Transformer, excel across various tasks but lack a comprehensive theoretical understanding, especially regarding token-wise sparsity and internal linear representations. To address this gap, we introduce the single-location regression task, where only one token in a sequence determines the output, and its position is a latent random variable, retrievable via a linear projection of the input. To solve this task, we propose a dedicated predictor, which turns out to be a simplified version of a non-linear self-attention layer. We study its theoretical properties, by showing its asymptotic Bayes optimality and analyzing its training dynamics. In particular, despite the non-convex nature of the problem, the predictor effectively learns the underlying structure. This work highlights the capacity of attention mechanisms to handle sparse token information and internal linear structures.

1663Does Refusal Training in LLMs Generalize to the Past Tense?

[openreview] [pdf]

Abstract Refusal training is widely used to prevent LLMs from generating harmful, undesirable, or illegal outputs. We reveal a curious generalization gap in the current refusal training approaches: simply reformulating a harmful request in the past tense (e.g.,"How to make a Molotov cocktail?“to"How did people make a Molotov cocktail?”) is often sufficient to jailbreak many state-of-the-art LLMs. We systematically evaluate this method on Llama-3 8B, Claude-3.5 Sonnet, GPT-3.5 Turbo, Gemma-2 9B, Phi-3-Mini, GPT-4o-mini, GPT-4o, o1-mini, o1-preview, and R2D2 models using GPT-3.5 Turbo as a reformulation model. For example, the success rate of this simple attack on GPT-4o increases from 1% using direct requests to 88% using 20 past-tense reformulation attempts on harmful requests from JailbreakBench with GPT-4 as a jailbreak judge. Interestingly, we also find that reformulations in the future tense are less effective, suggesting that refusal guardrails tend to consider past historical questions more benign than hypothetical future questions. Moreover, our experiments on fine-tuning GPT-3.5 Turbo show that defending against past reformulations is feasible when past tense examples are explicitly included in the fine-tuning data. Overall, our findings highlight that the widely used alignment techniques---such as SFT, RLHF, and adversarial training---employed to align the studied models can be brittle and do not always generalize as intended.

1664Uncertainty-aware Human Mobility Modeling and Anomaly Detection

[openreview] [pdf]

Abstract Given the GPS coordinates of a large collection of human agents over time, how can we model their mobility behavior toward effective anomaly detection (e.g. for bad-actor or malicious behavior detection) without any labeled data? Human mobility and trajectory modeling have been studied extensively with varying capacity to handle complex input, and performance-efficiency trade-offs. With the arrival of more expressive models in machine learning, we attempt to model GPS data as a sequence of stay-point events, each with a set of characterizing spatiotemporal features, and leverage modern sequence models such as Transformers for un/self-supervised training and inference. Notably, driven by the inherent stochasticity of certain individuals’ behavior, we equip our model with aleatoric/data uncertainty estimation. In addition, to handle data sparsity of a large variety of behaviors, we incorporate epistemic/model uncertainty into our model. Together, aleatoric and epistemic uncertainty enable a robust loss and training dynamics, as well as uncertainty-aware decision making in anomaly scoring. Experiments on large expert-simulated datasets with tens of thousands of agents demonstrate the effectiveness of our model against both forecasting and anomaly detection baselines. All code is available athttps://anonymous.4open.science/r/mobility-ad.

1665An Information-Theoretic Analysis of Thompson Sampling for Logistic Bandits

[openreview] [pdf]

Abstract We study the performance of the Thompson Sampling algorithm for logistic bandit problems, where the agent receives binary rewards with probabilities determined by a logistic function, exp(βa,θ)/(1+exp(βa,θ))\exp(\beta \langle a, \theta \rangle)/(1+\exp(\beta \langle a, \theta \rangle)), with slope parameter β\beta. We focus on the setting where both the action aa and parameter θ\theta lie within the dd-dimensional unit ball. Adopting the information-theoretic framework introduced by (Russo & Van Roy, 2015), we analyze the information ratio, a statistic that quantifies the trade-off between the information gained about the optimal action and the immediate regret incurred. We improve upon previous results by establishing that the information ratio is bounded by 92dα2\tfrac{9}{2}d\alpha^{-2}, where α\alpha is a minimax measure of the alignment between the action space and the parameter space, independent of β\beta. Notably, we derive a regret bound of order O(d/αTlog(βT/d))O(d/\alpha\sqrt{T \log(\beta T/d)}), which scales only logarithmically with the logistic function parameter β\smash{\beta}. To the best of our knowledge, this is the first regret bound for logistic bandits that achieves logarithmic dependence on β\beta while being independent of the number of actions. In particular, when the action space encompasses the parameter space, the expected regret of Thompson Sampling is of order O~(dT)\tilde{O}(d \sqrt{T}).

1666Uncertainty-Aware Counterfactual Explanations using Bayesian Neural Nets

[openreview] [pdf]

Abstract A counterfactual explanation describes the smallest input change required to alter the prediction of an AI model towards a desired outcome. When using neural net- works, counterfactuals are obtained using variants of projected gradient descent. Such counterfactuals have been shown to be brittle and implausible, potentially jeopardising the explanatory aspects of counterfactuals. Numerous approaches for obtaining better counterfactuals have been put forward. Even though these solutions address some of the shortcomings, they often fall short of providing an all-around solution for robust and plausible counterfactuals. We hypothesise this is due to the deterministic nature and limitations of neural networks, which fail to capture the uncertainty of the training data. Bayesian Neural Networks (BNNs) are a well-known class of probabilistic models that could be used to over- come these issues; unfortunately, there is currently no framework for developing counterfactuals for them. In this paper, we fill this gap by proposing a formal framework to define counterfactuals for BNNs and develop algorithmic solutions for computing them. We evaluate our framework on a set of commonly used benchmarks and observe that BNNs produce counterfactuals that are more robust, plausible, and less costly than deterministic baselines

1667TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies

[openreview] [pdf]

Abstract Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models’ spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on the Open-X-Embodiment and finetuned on our dataset, rivals the 7B OpenVLA baseline while significantly improving inference efficiency.

1668Improving Denoising Diffusion with Efficient Conditional Entropy Reduction

[openreview] [pdf]

Abstract Diffusion models (DMs) have achieved significant success in generative modeling, but their iterative denoising process is computationally expensive. Training-free samplers, such as DPM-Solver, accelerate this process through gradient estimation-based numerical iterations. However, the mechanisms behind this acceleration remain insufficiently understood. In this paper, we demonstrate gradient estimation-based iterations enhance the denoising process by effectively \emph{\textbf{r}educing the conditional \textbf{e}ntropy} of reverse transition distribution. Building on this analysis, we introduce streamlined denoising iterations for DMs that optimize conditional entropy in score-integral estimation to improve the denoising iterations. Experiments on benchmark pre-trained models in both pixel and latent spaces validate our theoretical insights, demonstrating that numerical iterations based on conditional entropy reduction improve the reverse denoising diffusion process of DMs.

1669Why Not Transform Chat Large Language Models to Non-English?

[openreview] [pdf]

Abstract Large language models (LLMs) excel in various tasks, but their performance in non-English languages remains limited due to imbalanced training data. To address this limitation, we explore how to transform chat LLMs to non-English. Chat LLMs offer more advanced capabilities than base LLMs, such as multi-turn conversation and alignment with human preferences. However, transforming chat LLMs presents greater challenges than base LLMs. First, how can we effectively transfer advanced capabilities without their supervised data in target languages? Second, how can we prevent the original capabilities from catastrophic forgetting without replaying their training procedure in English? We target these issues by introducing a simple framework called TransLLM. TransLLM divides the transfer problem into some common sub-tasks with the translation chain-of-thought, eliminating the need for complex training data. More importantly, TransLLM uses two key strategies to prevent catastrophic forgetting: Low-rank adaptation, which preserves the original LLM parameters during training, and recovery KD, which utilizes data generated by the chat LLM itself to recover the original knowledge from the frozen parameters. Experiments conducted across five languages and three LLMs demonstrate the superiority of TransLLM. Notably, TransLLM outperforms GPT-4 in Thai, demonstrating higher levels of helpfulness and safety, using just 8B parameters and publicly accessible data. Our analysis demonstrates how recovery KD combined with LoRA helps mitigate catastrophic forgetting.

1670Projected Neural Differential Equations for Learning Constrained Dynamics

[openreview] [pdf]

Abstract Neural differential equations offer a powerful approach for learning dynamics from data. However, they do not impose known constraints that should be obeyed by the learned model. It is well-known that enforcing constraints in surrogate models can enhance their generalizability and numerical stability. In this paper, we introduce projected neural differential equations (PNDEs), a new method for constraining neural differential equations based on projection of the learned vector field to the tangent space of the constraint manifold. In tests on several challenging examples, including chaotic dynamical systems and state-of-the-art power grid models, PNDEs outperform existing methods while requiring fewer hyperparameters. The proposed approach demonstrates significant potential for enhancing the modeling of constrained dynamical systems, particularly in complex domains where accuracy and reliability are essential.

1671Elliptic Loss Regularization

[openreview] [pdf]

Abstract Regularizing neural networks is important for anticipating model behavior in regions of the data space that are not well represented. In this work, we propose a regularization technique for enforcing a level of smoothness in the mapping between the input space and the loss. We specify the level of regularity by requiring that the loss of the network satisfies an elliptic operator over the data domain. To do this, we modify the usual empirical risk minimization objective such that we instead minimize a new objective that satisfies an elliptic operator over points within the domain. This allows us to use existing theory on elliptic operators to anticipate the behavior of the error for points outside the training set. We propose a tractable computational method that approximates the behavior of the elliptic operator while being computationally efficient. Finally, we analyze the properties of the proposed regularization to understand the performance on common problems of distribution shift and group imbalance. Numerical experiments empirically confirm the promise of the proposed regularization technique.

1672From Decoupling to Adaptive Transformation: a Wider Optimization Space for PTQ

[openreview] [pdf]

Abstract Post-training low-bit quantization (PTQ) is useful to accelerate DNNs due to its high efficiency. Currently, finetuning through self-distillation feature reconstruction is one of the most effective PTQ techniques. However, when bitwidth goes to be extremely low, we find that current parameter update settings in PTQ feature reconstruction are sub-optimal. Considering all possible parameters and the ignored fact that integer weight can be obtained early before actual inference, we thoroughly explore 1) the setting of weight’s quantization step into six cases by decoupling; 2) ignored learnable params in PTQ like BN and bias. Based on these explorations, we find there exist a wider optimization space and a better optimum. Considering these, we propose an Adaptive Quantization Transformation(AdaQTransform) for PTQ reconstruction, which provides adaptive per-channel transformation on the quant output feature, making them better fit FP32 counterpart and achieve lower PTQ feature reconstruction error. During inference, the AdaQTransform parameters can be merged without incurring additional inference costs. Based on AdaQTransform, for the first time, we build a general quantization setting paradigm subsuming current PTQs, QATs and other potential approaches. Experiments demonstrate that AdaQTransform expands the optimization space for PTQ and helps current PTQs find a better optimum over CNNs, ViTs, LLMs and low-level vision networks (image super-resolution). Specifically, AdaQTransform improves the current best PTQ by 5.7% on W2A2-MobileNet-v2. The code will be released.

1673LORA-MaOO: Learning Ordinal Relations and Angles for Expensive Many-Objective Optimization

[openreview] [pdf]

Abstract Many-objective optimization (MaOO) simultaneously optimizes many conflicting objectives to identify the Pareto front - a set of diverse solutions that represent different optimal balances between conflicting objectives. For expensive MaOO problems, due to their costly function evaluations, computationally cheap surrogates have been widely used in MaOO to save evaluation budget. However, as the number of objectives increases, the cost of using surrogates and the difficulty of maintaining solution diversity increase rapidly. It is a challenge to reach diverse optimal solutions with a relatively low cost of using surrogates for MaOO problems. In this paper, we propose LORA-MaOO, a surrogate-assisted MaOO algorithm that learns surrogates from spherical coordinates to handle this challenge. LORA-MaOO includes an ordinal-regression-based surrogate for convergence and M1M-1 regression-based surrogates for diversity. MM is the number of objectives. Such a surrogate modeling framework makes it possible to complete the surrogate-assisted search and produce optimal candidate solutions with a single ordinal surrogate, while the M1M-1 remaining surrogates are only used to select diverse optimal candidate solutions for expensive evaluations, which lowers the cost of using surrogates and thus enhances the optimization efficiency. In addition, we design a clustering method to quantify artificial ordinal relations for non-dominated solutions and improve the quantification of dominance-based ordinal relations. These ordinal relations are used to train the ordinal regression surrogate which predicts how desirable the candidate solutions are in terms of convergence. The solution diversity is maintained via angles between solutions instead of pre-defined auxiliary reference vectors, which is parameter-free. Experimental results show that LORA-MaOO significantly outperforms other surrogate-assisted MaOO methods on most MaOO benchmark problems and real-world applications.

1674Best Possible Q-Learning

[openreview] [pdf]

Abstract Fully decentralized learning, where the global information, \textit{i.e.}, the actions of other agents, is inaccessible, is a fundamental challenge in cooperative multi-agent reinforcement learning. However, the convergence and optimality of most decentralized algorithms are not theoretically guaranteed, since the transition probabilities are non-stationary as all agents are updating policies simultaneously. To tackle this challenge, we propose \textit{best possible operator}, a novel decentralized operator, and prove that the policies of cooperative agents will converge to the optimal joint policy if each agent independently updates its individual state-action value by the operator when there is only one optimal joint policy. Further, to make the update more efficient and practical, we simplify the operator and prove that the convergence and optimality still hold with the simplified one. By instantiating the simplified operator, the derived fully decentralized algorithm, \textit{best possible Q-learning} (BQL), does not suffer from non-stationarity. Empirically, we show that BQL achieves remarkable improvement over baselines in a variety of cooperative multi-agent tasks.

1675Why Fine-Tuning Struggles with Forgetting in Machine Unlearning? Theoretical Insights and a Remedial Approach

[openreview] [pdf]

Abstract Machine Unlearning has emerged as a significant area of research, focusing on ‘removing’ specific subsets of data from a trained model. Fine-tuning (FT) methods have become one of the fundamental approaches for approximating unlearning, as they effectively retain model performance. However, it is consistently observed that naive FT methods struggle to forget the targeted data. In this paper, we present the first theoretical analysis of FT methods for machine unlearning within a linear regression framework, providing a deeper exploration of this phenomenon. We investigate two scenarios with distinct features and overlapping features. Our findings reveal that FT models can achieve zero remaining loss yet fail to forget the forgetting data, unlike golden models (trained from scratch without the forgetting data). This analysis reveals that naive FT methods struggle with forgetting because the pretrained model retains information about the forgetting data, and the fine-tuning process has no impact on this retained information. To address this issue, we first propose a theoretical approach to mitigate the retention of forgetting data in the pretrained model. Our analysis shows that removing the forgetting data’s influence allows FT models to match the performance of the golden model. Building on this insight, we introduce a discriminative regularization term to practically reduce the unlearning loss gap between the fine-tuned model and the golden model. Our experiments on both synthetic and real-world datasets validate these theoretical insights and demonstrate the effectiveness of the proposed regularization method.

1676Probabilistic Hash Embeddings for Temporal Tabular Data Streams

[openreview] [pdf]

Abstract We study temporal tabular data-streams (TTD) where each observation has both categorical and numerical values, and where the universe of distinct categorical items is not known upfront and can even grow unboundedly over time. Such data is common in many large-scale systems, such as user activity in computer system logs and scientific experiment records. Feature hashing is commonly used as a pre- processing step to map the categorical items into a known universe, before doing representation learning (Coleman et al., 2024; Desai et al., 2022). However, these methods have been developed and evaluated for the offline or batch settings. In this paper, we consider the pre-processing step of hashing before representation learning in the online setting for TTD. We show that deterministic embeddings suffer from forgetting in online learning with TTD, leading to performance deterioration. To mitigate the issue, we propose a probabilistic hash embedding (PHE) model that treats hash embeddings as stochastic and applies Bayesian online learning to learn incrementally with data. Based on the structure of PHE, we derive a scalable inference algorithm to learn model parameters and infer/update the posteriors of hash embeddings and other latent variables. Our algorithm (i) can handle evolving vocabulary of categorical items, (ii) is adaptive to new items without forgetting old items, (iii) is implementable with a bounded set of parameters that does not grow with the number of distinct observed items on the stream, and (iv) is efficiently implementable both in the offline and the online streaming setting. Experiments in classification, sequence modeling, and recommendation systems with TTD demonstrate the superior performance of PHE compared to baselines.

1677Cognitive map formation under uncertainty via local prediction learning

[openreview] [pdf]

Abstract Cognitive maps are internal world models that enable adaptive behavior including spatial navigation and planning. The Cognitive Map Learner (CML) has been recently proposed as a model for cognitive map formation and planning. A CML learns high dimensional state and action representations using local prediction learning. While the CML offers a simple and elegant solution to cognitive map learning, it is limited by its simplicity, applying only to fully observable environments. To address this, we introduce the Partially Observable Cognitive Map Learner (POCML), extending the CML to handle partially observable environments.The POCML employs a superposition of states for probabilistic representation and uses binding operations for state updates. Additionally, an associative memory is incorporated to enable adaptive behavior across environments with similar structures. We derive local update rules tailored to the POCML’s probabilistic state representation and associative memory. We demonstrate a POCML is capable of learning the underlying structure of an environment via local next-observation prediction learning. In addition, we show that a POCML trained on an environment is capable of generalizing to environments with the same underlying structure but with novel observations, achieving good zero-shot next-observation prediction accuracy, significantly outperforming sequence models such as LSTMs and Transformers. Finally, we present a case study of navigation in a two-tunnel maze environment with aliased observations, showing that a POCML is capable of effectively using its probabilistic state representations for disambiguation of states and spatial navigation.

1678Performing Interpretability Analysis in Federated Learning Context

[openreview] [pdf]

Abstract Federated learning continues to evolve but faces challenges in interpretability and explainability. We introduce a creative approach employing Neural Additive Models (NAMs) within a federated learning framework to address these challenges. These models referred to as Federated Neural Additive Models (FedNAMs), merge the advantages of NAMs, where individual networks concentrate on specific input features, with the decentralized approach of federated learning, ultimately producing interpretable analysis results. This integration enhances privacy by training on local data across multiple devices, thereby minimizing the risks of data centralization and enhancing model robustness and generalizability. FedNAMs maintain detailed feature-specific learning, making them especially valuable in sectors like finance and healthcare. They facilitate training client-specific models that integrate local updates, preserving privacy and reducing centralization concerns. Our studies on various text and image classification tasks, using datasets such as OpenFetch ML Wine, UCI Heart Disease, and Iris, show that FedNAMs deliver strong interpretability with minimal accuracy loss compared to traditional Federated Deep Neural Networks (DNNs). The research involves notable findings, including the identification of key predictive features at the client level as well as at the global level. Volatile acidity, sulfates, and chlorides for wine quality. Chest pain type, maximum heart rate, and number of vessels for heart disease. Petal length and width for iris classification. This approach strengthens privacy and model efficiency and improves interpretability and robustness across diverse datasets. Finally, FedNAMs generate insights on causes of highly and low interpretable features.

1679Noise is More Than Just Interference: Information Infusion Networks for Anomaly Detection

[openreview] [pdf]

Abstract 3D anomaly detection is a crucial task in computer vision, aiming to identify anomalous points or regions from point cloud data. However, existing methods may encounter challenges when handling point clouds with high intra-class variance, especially for methods that rely on registration techniques. In this study, we propose a novel 3D anomaly detection method, termed Information Gain Block-based Anomaly Detection (IGB-AD), to address the challenges of insufficient anomaly detection information and high intra-class variance. To extract ordered features from 3D point clouds, the technique of Rotation-Invariant Farthest Point Sampling (RIFPS) is first introduced. Then, an Information Perfusion (IP) module composed of stacked Information Gain Blocks (IGB) is proposed to utilize prior noise to provide more distinguishing information for the features, where IGB is designed to utilize noise in a reverse-thinking manner to enhance anomaly detection. Finally, a Packet Downsampling (PD) technique is developed to preserve key information between multiple clusters to solve the complex downsampling situation. The main purpose of the framework is to utilize the effective information within prior noise to provide more detection criteria for anomaly detection. In addition, an Intra-Class Diversity (ICD) 3D dataset is constructed, which contains multiple categories with high class-variance. Experimental results show that the proposed IGB-AD method achieves the State-Of-The-Arts (SOTA) performance on the Anomaly ShapeNet dataset, with an P-AUROC of 81.5% and I-AUROC of 80.9%, and also gains the best performance on the ICD dataset, with an P-AUROC of 57.4% and I-AUROC of 60.2%. Our dataset will be released after acceptance.

1680Zero-Shot Goal Dialogue via Reinforcement Learning on Imagined Conversations

[openreview] [pdf]

Abstract Large language models (LLMs) have emerged as powerful and general solutions to many natural language tasks. However, many of the most important applications of language generation are interactive, where an agent has to talk to a person to reach a desired outcome. For example, a teacher might try to understand their student’s current comprehension level to tailor their instruction accordingly, and a travel agent might ask questions of their customer to understand their preferences in order to recommend activities they might enjoy. LLMs trained with supervised fine-tuning or ``single-step’’ RL, as with standard RLHF, might struggle which tasks that require such goal-directed behavior, since they are not trained to optimize for overall conversational outcomes after multiple turns of interaction. In this work, we explore a new method for adapting LLMs with RL for such goal-directed dialogue. Our key insight is that, though LLMs might not effectively solve goal-directed dialogue tasks out of the box, they can provide useful data for solving such tasks by simulating human-like behaviors. Given a textual description of a goal-directed dialogue task, we leverage LLMs to synthesize hypothetical in-domain human-human interactions. Our algorithm then utilizes this dataset with offline reinforcement learning}to train an interactive conversational agent that can optimize multi-step objectives. Empirically, we show that our proposed approach achieves state-of-the-art performance in various goal-directed dialogue tasks that include teaching and preference elicitation.

1681Balanced conic rectified flow

[openreview] [pdf]

Abstract Rectified flow is a generative model that learns smooth transport mappings between two distributions through an ordinary differential equation (ODE). Unlike diffusion-based generative models, which require costly numerical integration of a generative ODE to sample images with state-of-the-art quality, rectified flow uses an iterative process called reflow to learn smooth and straight ODE paths. This allows for relatively simple and efficient generation of high-quality images. However, rectified flow still faces several challenges. 1) The reflow process requires a large number of generative pairs to preserve the target distribution, leading to significant computational costs. 2) Since the model is typically trained using only generated image pairs, its performance heavily depends on the 1-rectified flow model, causing it to become biased towards the generated data.In this work, we experimentally expose the limitations of the original rectified flow and propose a novel approach that incorporates real images into the training process. By preserving the ODE paths for real images, our method effectively reduces reliance on large amounts of generated data. Instead, we demonstrate that the reflow process can be conducted efficiently using a much smaller set of generated and real images. In CIFAR-10, we achieved significantly better FID scores, not only in one-step generation but also in full-step simulations, while using only 7.27.2% of the generative pairs compared to the original method. Furthermore, our approach induces straighter paths and avoids saturation on generated images during reflow, leading to more robust ODE learning while preserving the distribution of real images.

1682Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model

[openreview] [pdf]

Abstract Aligning language models (LMs) with human preferences has become a key area of research, enabling these models to meet diverse user needs better. Inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor, we extend this idea to model alignment. In this work, we observe that the alignment behavior in weaker models can be effectively transferred to stronger models and even exhibit an amplification effect. Based on this insight, we propose a method called Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model. Experiments demonstrate that WSPO delivers outstanding performance, improving the win rate of Qwen2-7B-Instruct on Arena-Hard from 39.70 to 49.60, achieving a remarkable 47.04 length-controlled win rate on AlpacaEval 2, and scoring 7.33 on MT-bench. Our results suggest that using the weak model to elicit a strong model with a high alignment ability is feasible.

1683Failure Modes of LLMs for Causal Reasoning on Narratives

[openreview] [pdf]

Abstract In this work, we investigate the causal reasoning abilities of large language models (LLMs) through the representative problem of inferring causal relationships from narratives. We find that even state of the art language models rely heavily on unreliable shortcuts, both in terms of the narrative presentation and their parametric knowledge. For example, LLMs tend to determine causal relationships based on the temporal ordering of events (i.e., earlier events cause later ones), resulting in lower performance whenever events are not narrated in their exact causal order. Similarly, we demonstrate that LLMs struggle with long-term causal reasoning — they often fail when the narratives are longer and contain many events. As an additional failure mode, we show LLMs appear to heavily rely on their parametric knowledge at the expense of reasoning over the provided narrative. This degrades their abilities whenever the narrative opposes parametric knowledge. We extensively validate these failure modes through carefully controlled synthetic experiments, as well as evaluations on real-world narratives. Finally, we observe that explicitly generating a causal graph generally improves performance while naive chain-of-thought is ineffective. Collectively, our results distill precise failure modes of current state-of-the art models and can pave the way for future techniques to enhance causal reasoning in LLMs.

1684Towards Stabilizable Sequential Smoothing Spline Interpolation by Point Forecasting

[openreview] [pdf]

Abstract Sequential smoothing spline interpolators exhibit unstable behavior under low-delay response requirements. That is, instability issues are observed when a smoothing spline interpolator is forced to provide an interpolated trajectory piece subject to processing only a few to no incoming data points at each time stamp. Typically, the above instability setback is solved by increasing the delay, sacrificing some degree of smoothness in the interpolated trajectory, or a combination of both. However, stable sequential smoothing spline interpolation strategies working under low delay and without compromising their degree of smoothness seem vastly unexplored in the literature. To the best of our knowledge, this work formalizes the internal instability and asserts the controllability of sequential smoothing spline interpolators for the first time. Specifically, we model the trajectory assembled by a smoothing spline interpolator as a discrete dynamical system of the spline coefficients, facilitating the analysis of its internal instability and controllability. From these results, we propose a stabilizing strategy based on data point forecasting capable of operating even under delayless regimes and without sacrificing any smoothness of the interpolated trajectory. Our claims are theoretically confirmed, or experimentally supported by extensive numerical results otherwise.

1685MaestroMotif: Skill Design from Artificial Intelligence Feedback

[openreview] [pdf]

Abstract Describing skills in natural language has the potential to provide an accessible way to inject human knowledge about decision-making into an AI system. We present MaestroMotif, a method for AI-assisted skill design, which yields high-performing and adaptable agents. MaestroMotif leverages the capabilities of Large Language Models (LLMs) to effectively create and reuse skills. It first uses an LLM’s feedback to automatically design rewards corresponding to each skill, starting from their natural language description. Then, it employs an LLM’s code generation abilities, together with reinforcement learning, for training the skills and combining them to implement complex behaviors specified in language. We evaluate MaestroMotif using a suite of complex tasks in the NetHack Learning Environment (NLE), demonstrating that it surpasses existing approaches in both performance and usability.

1686Learning subgoal representations from state graphs in goal-conditioned hierarchical reinforcement learning

[openreview] [pdf]

Abstract The integration of graphs with Goal-conditioned Hierarchical Reinforcement Learning (GCHRL) has recently gained attention, as the intermediate goals (subgoals) can be effectively sampled from graphs that naturally represent the overall task structure in most RL tasks. However, some existing approaches often rely on domain-specific knowledge to construct these graphs, limiting their applicability to new tasks. Other graph-based approaches create graphs dynamically during exploration but struggle to fully utilize them because they have problems passing the information in the graphs to newly visited states. Additionally, current GCHRL methods face challenges such as sample inefficiency and poor subgoal representations. In this paper, we present a solution to these issues through the development of a graph encoder-decoder that can evaluate unseen states. Our proposed method, Graph-Guided sub-Goal representation Generation RL (G4RL), can be incorporated into any existing GCHRL method to enhance performance. We show that the graph encoder-decoder can be effectively implemented using a network trained on the state graph generated during exploration. Empirical results indicate that leveraging high and low-level intrinsic rewards from the graph encoder-decoder significantly enhances the performance of state-of-the-art GCHRL approaches in both dense and sparse reward environments.

1687The Quest for Winning Tickets in Low-Rank Adapters

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA), a prominent parameter-efficient fine-tuning (PEFT) method, offers an effective strategy for adapting large pre-trained models to specific tasks with minimal computational overhead. LoRA achieves this by introducing low-rank parameter matrices to the frozen pre-trained models. However, despite their efficiency, LoRA and its variants modify all elements of a parameter block, which is unnecessary as LoRA primarily aims to adjust a small set of subspaces that capture task-specific knowledge. Drawing inspiration from the Lottery Ticket Hypothesis (LTH), which posits that dense neural networks contain sparse subnetworks capable of performing similarly to fully-parameterized models, we investigate whether similar sparse subnetworks exist for low-rank adapters. We demonstrate that such subnetworks, often referred to as “winning tickets” in the context of LTH, indeed exist for low-rank adapters. We introduce a method to identify this sparse subset of weights for each layer by relating the top subspaces of the pretrained parameter block to the elements of the corresponding weight matrix. This subset is then fine-tuned using LoRA. We show that this sparse subset is not necessarily unique; as long as sparsity is kept within a certain bound defined by the task, random subnetworks with similar sparsity can act as winning tickets. Building on this discovery, we propose a novel approach called Partial-LoRA, which adds sparse low-rank parameters to pre-trained models. Through extensive experiments on 8 vision and 4 language tasks, we demonstrate that Partial-LoRA can reduce trainable parameters by up to 87% while maintaining or even improving model performance in some cases. Our work thus reduces memory needs and theoretically grounds sparse LoRAs.

1688Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs

[openreview] [pdf]

Abstract We propose semantic entropy probes (SEPs), a cheap and reliable method for uncertainty quantification in Large Language Models (LLMs). Hallucinations, which are plausible-sounding but factually incorrect and arbitrary model generations, present a major challenge to the practical adoption of LLMs. Recent work by Farquhar et al. proposes semantic entropy (SE), which can reliably detect hallucinations by quantifying the uncertainty over different generations by estimating entropy over semantically equivalent sets of outputs. However, the 5-to-10-fold increase in computation cost associated with SE computation hinders practical adoption. To address this, we propose SEPs, which directly approximate SE from the hidden states of a single generation. SEPs are simple to train and do not require sampling multiple model generations at test time, reducing the overhead of semantic uncertainty quantification to almost zero. We show that SEPs retain high performance for hallucination detection and generalize better to out-of-distribution data than previous probing methods that directly predict model accuracy. Our results across models and tasks suggest that model hidden states capture SE, and our ablation studies give further insights into the token positions and model layers for which this is the case.

1689Sketch-to-Skill: Bootstrapping Robot Learning with Human Drawn Trajectory Sketches

[openreview] [pdf]

Abstract Training robotic manipulation policies traditionally requires numerous demonstrations and/or environmental rollouts. While recent Imitation Learning (IL) and Reinforcement Learning (RL) methods have reduced the number of required demonstrations, they still rely on expert knowledge to collect high-quality data, limiting scalability and accessibility. We propose Sketch-to-Skill, a novel framework that leverages human-drawn 2D sketch trajectories to bootstrap and guide RL for robotic manipulation. Our approach extends beyond previous sketch-based methods, which were primarily focused on imitation learning or policy conditioning, limited to specific trained tasks. Sketch-to-Skill employs a Sketch-to-3D Trajectory Generator that translates 2D sketches into 3D trajectories, which are then used to autonomously collect initial demonstrations. We utilize these sketch-generated demonstrations in two ways: to pre-train an initial policy through behavior cloning and to refine this policy through RL with guided exploration. Experimental results demonstrate that Sketch-to-Skill achieves \sim96% of the performance of the baseline model that leverages teleoperated demonstration data, while exceeding the performance of a pure reinforcement learning policy by \sim170%, only from sketch inputs. This makes robotic manipulation learning more accessible and potentially broadens its applications across various domains.

1690Provable weak-to-strong generalization via benign overfitting

[openreview] [pdf]

Abstract The classic teacher-student model in machine learning posits that a strong teacher supervises a weak student to improve the student’s capabilities. We instead consider the inverted situation, where a weak teacher supervises a strong student with imperfect pseudolabels. This paradigm was recently brought forth by \citet{burns2023weak} and termed \emph{weak-to-strong generalization}. We theoretically investigate weak-to-strong generalization for binary and multilabel classification in a stylized overparameterized spiked covariance model with Gaussian covariates where the weak teacher’s pseudolabels are asymptotically like random guessing. Under these assumptions, we provably identify two asymptotic phases of the strong student’s generalization after weak supervision: (1) successful generalization and (2) random guessing. Our techniques should eventually extend to weak-to-strong multiclass classification. Towards doing so, we prove a tight lower tail inequality for the maximum of correlated Gaussians, which may be of independent interest. Understanding the multilabel setting reinforces the value of using logits for weak supervision when they are available.

1691Evaluating Oversight Robustness with Incentivized Reward Hacking

[openreview] [pdf]

Abstract Scalable oversight aims to train systems to perform tasks that are hard for humans to specify, demonstrate and validate. As ground truth is not available for such tasks, evaluating scalable oversight techniques is challenging: existing methods measure the success of an oversight method based on whether it allows an artificially weak overseer to successfully supervise an AI to perform a task. In this work, we additionally measure the robustness of scalable oversight techniques by testing their vulnerability to reward hacking by an adversarial supervisee. In experiments on a synthetic domain, we show that adding an explicit reward hacking incentive to the model being trained leads it to exploit flaws in a weak overseer, and that scalable oversight methods designed to mitigate these flaws can make the optimization more robust to reward hacking. We hope these experiments lay a foundation for future work to validate scalable oversight methods’ ability to mitigate reward hacking in realistic settings.

1692HyperDPO: Hypernetwork-based Multi-Objective Fine-Tuning Framework

[openreview] [pdf]

Abstract In LLM alignment and many other ML applications, one often faces theMulti-Objective Fine-Tuning (MOFT)problem,i.e.fine-tuning an existing model with datasets labeled w.r.t. different objectives simultaneously. To address the challenge, we propose theHyperDPOframework, a hypernetwork-based approach that extends the Direct Preference Optimization (DPO) technique, originally developed for efficient LLM alignment with preference data, to accommodate the MOFT settings. By substituting the Bradley-Terry-Luce model in DPO with the Plackett-Luce model, our framework is capable of handling a wide range of MOFT tasks that involve listwise ranking datasets. Compared with previous approaches, HyperDPO enjoys an efficient one-shot training process for profiling the Pareto front of auxiliary objectives, and offers flexible post-training control over trade-offs. Additionally, we propose a novelHyper Prompt Tuningdesign, that conveys continuous weight across objectives to transformer-based models without altering their architecture. We demonstrate the effectiveness and efficiency of the HyperDPO framework through its applications to various tasks, including Learning-to-Rank (LTR) and LLM alignment, highlighting its viability for large-scale ML deployments.

1693Towards Debiased Source-Free Domain Adaptation

[openreview] [pdf]

Abstract Source-Free Domain Adaptation (SFDA) aims to adapt a model trained in an inaccessible source domain SS to a different, unlabelled target domain TT. The conventional approach generates pseudo-labels for the TT samples with the source-trained model, which are then used for model adaptation. However, we show that the adapted model is biased to the spurious correlations in SS, consequently leading to catastrophic failure on TT samples that are dissimilar to SS. Unfortunately, without any prior knowledge about spurious correlations, the current SFDA setting has no mechanism to circumvent this bias. We introduce a practical setting to address this gap -- Debiased SFDA, where the model receives additional supervision from a pre-trained, frozen reference model. This setting stays in line with the essence of SFDA, which accommodates proprietary source-domain training, while also offering prior knowledge that is unaffected by source-domain training to facilitate debiasing. Under this setting, we propose 1) a simple contrastive objective that debiases the source-trained model from spurious correlations inconsistent with the reference model; 2) a diagnostic metric that evaluates the degree to which an adapted model is biased towards SS. Our objective can be easily plugged into different baselines for debiasing, and through extensive evaluations, we demonstrate that it engenders consistent improvements across standard benchmarks. Code is supplied under supplementary material.

1694Accelerated training through iterative gradient propagation along the residual path

[openreview] [pdf]

Abstract Despite being the cornerstone of deep learning, backpropagation is criticized for its inherent sequentiality, which can limit the scalability of very deep models. Such models faced convergence issues due to vanishing gradient, later resolved using residual connections. Variants of these are now widely used in modern architectures. However, the computational cost of backpropagation remains a major burden, accounting for most of the training time. Taking advantage of residual-like architectural designs, we introduce Highway backpropagation, a parallelizable iterative algorithm that approximates backpropagation, by alternatively i) accumulating the gradient estimates along the residual path, and ii) backpropagating them through every layer in parallel. This algorithm is naturally derived from a decomposition of the gradient as the sum of gradients flowing through all paths, and is adaptable to a diverse set of common architectures, ranging from ResNets and Transformers to recurrent neural networks. Through an extensive empirical study on a large selection of tasks and models, we evaluate Highway-BP and show that major speedups can be achieved with minimal performance degradation.

1695Efficient Multi-agent Offline Coordination via Diffusion-based Trajectory Stitching

[openreview] [pdf]

Abstract Learning from offline data without interacting with the environment is a promising way to fully leverage the intelligent decision-making capabilities of multi-agent reinforcement learning (MARL). Previous approaches have primarily focused on developing learning techniques, such as conservative methods tailored to MARL using limited offline data. However, these methods often overlook the temporal relationships across different timesteps and spatial relationships between teammates, resulting in low learning efficiency in imbalanced data scenarios. To comprehensively explore the data structure of MARL and enhance learning efficiency, we propose Multi-Agent offline coordination via Diffusion-based Trajectory Stitching (MADiTS), a novel diffusion-based data augmentation pipeline that systematically generates trajectories by stitching high-quality coordination segments together. MADiTS first generates trajectory segments using a trained diffusion model, followed by applying a bidirectional dynamics constraint to ensure that the trajectories align with environmental dynamics. Additionally, we develop an offline credit assignment technique to identify and optimize the behavior of underperforming agents in the generated segments. This iterative procedure continues until we obtain a satisfactory augmented episode trajectory. Empirical results on MPE and SMAC imbalanced datasets demonstrate that MADiTS significantly improves MARL performance.

1696MetaDD: Boosting Dataset Distillation with Neural Network Architecture-Invariant Generalization

[openreview] [pdf]

Abstract Dataset distillation (DD) entails creating a refined, compact distilled dataset from a large-scale dataset to facilitate efficient training. A significant challenge in DD is the dependency between the distilled dataset and the neural network (NN) architecture used. Training a different NN architecture with a distilled dataset distilled using a specific architecture often results in diminished trainning performance for other architectures. This paper introduces MetaDD, designed to enhance the generalizability of DD across various NN architectures. Specifically, MetaDD partitions distilled data into meta features (i.e., the data’s common characteristics that remain consistent across different NN architectures) and heterogeneous features (i.e., the data’s unique feature to each NN architecture). Then, MetaDD employs an architecture-invariant loss function for multi-architecture feature alignment, which increases meta features and reduces heterogeneous features in distilled data. As a low-memory consumption component, MetaDD can be seamlessly integrated into any DD methodology. Experimental results demonstrate that MetaDD significantly improves performance across various DD methods. On the Distilled Tiny-Imagenet with Sre2L (50 IPC), MetaDD achieves cross-architecture NN accuracy of up to 30.1%, surpassing the second-best method (GLaD) by 1.7%.

1697Utilitarian Algorithm Configuration for Infinite Parameter Spaces

[openreview] [pdf]

Abstract Utilitarian algorithm configuration is a general-purpose technique for automatically searching the parameter space of a given algorithm to optimize its performance, as measured by a given utility function, on a given set of inputs. Recently introduced utilitarian configuration procedures offer optimality guarantees about the returned parameterization while provably adapting to the hardness of the underlying problem. However, the applicability of these approaches is severely limited by the fact that they only search a finite, relatively small set of parameters. They cannot effectively search the configuration space of algorithms with continuous or uncountable parameters. In this paper we introduce a new procedure, which we dub COUP (Continuous, Optimistic Utilitarian Procrastination). COUP is designed to search infinite parameter spaces efficiently to find good configurations quickly. Furthermore, COUP maintains the theoretical benefits of previous utilitarian configuration procedures when applied to finite parameter spaces but is significantly faster, both provably and experimentally.

1698Mixture-of-Queries Transformer: Camouflaged Instance Segmentation via Queries Cooperation and Frequency Enhancement

[openreview] [pdf]

Abstract Due to the high similarity between camouflaged instances and the surroundings and the widespread camouflage-like scenarios, the recently proposed camouflaged instance segmentation (CIS) is a challenging and relevant task. Previous approaches achieve some progress on CIS, while many overlook camouflaged objects’ color and contour nature and then decide on each candidate instinctively. In this paper, we contribute a Mixture-of-Queries Transformer (MoQT) in an end-toend manner for CIS which is based on two key designs (a Frequency Enhancement Feature Extractor and a Mixture-of-Queries Decoder). First, the Frequency Enhancement Feature Extractor is responsible for capturing the camouflaged clues in the frequency domain. To expose camouflaged instances, the extractor enhances the effectiveness of contour, eliminates the interference color, and obtains suitable features simultaneously. Second, a Mixture-of-Queries Decoder utilizes multiple experts of queries (several queries comprise an expert) for spotting camouflaged characteristics with cooperation. These experts collaborate to generate outputs, refined hierarchically to a fine-grained level for more accurate instance masks. Coupling these two components enables MoQT to use multiple experts to integrate effective clues of camouflaged objects in both spatial and frequency domains. Extensive experimental results demonstrate our MoQT outperforms 18 state-of-the-art CIS approaches by 2.69% on COD10K and 1.93% on NC4K in average precision.

[openreview] [pdf]

Abstract Nearest neighbor search is a fundamental data structure problem with many applications in machine learning, computer vision, recommendation systems and other fields. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted (Carbonell et al, 1998) that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number of points closest to the query, while in the second stage, the points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output.In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two stage process. Our algorithms are based on popular graph-based methods, which allows us to ``piggy-back’’ on the existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of kk points approximately closest to the query, in time that only depends on kk and logΔ\log \Delta, where Δ\Delta is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that using the standard two-stage approach.

1700Cost-Effective Online Multi-LLM Selection with Versatile Reward Models

[openreview] [pdf]

Abstract With the rapid advancement of large language models (LLMs), the diversity of multi-LLM tasks and the variability in their pricing structures have become increasingly important, as costs can vary greatly between different LLMs. To tackle these challenges, we introduce the \textit{C2MAB-V}, a \underline{C}ost-effective \underline{C}ombinatorial \underline{M}ulti-armed \underline{B}andit with \underline{V}ersatile reward models for optimal LLM selection and usage. This online model differs from traditional static approaches or those reliant on a single LLM without cost consideration. With multiple LLMs deployed on a scheduling cloud and a local server dedicated to handling user queries, \textit{C2MAB-V} facilitates the selection of multiple LLMs over a combinatorial search space, specifically tailored for various collaborative task types with different reward models. Based on our designed online feedback mechanism and confidence bound technique, \textit{C2MAB-V} can effectively address the multi-LLM selection challenge by managing the exploration-exploitation trade-off across different models, while also balancing cost and reward for diverse tasks. The NP-hard integer linear programming problem for selecting multiple LLMs with trade-off dilemmas is addressed by: i) decomposing the integer problem into a relaxed form by the local server, ii) utilizing a discretization rounding scheme that provides optimal LLM combinations by the scheduling cloud, and iii) continual online updates based on feedback. Theoretically, we prove that \textit{C2MAB-V} offers strict guarantees over versatile reward models, matching state-of-the-art results for regret and violations in some degenerate cases. Empirically, we show that \textit{C2MAB-V} effectively balances performance and cost-efficiency with nine LLMs for three application scenarios.

1701Flow matching achieves almost minimax optimal convergence

[openreview] [pdf]

Abstract Flow matching (FM) has gained significant attention as a simulation-free generative model. Unlike diffusion models, which are based on stochastic differential equations, FM employs a simpler approach by solving an ordinary differential equation with an initial condition from a normal distribution, thus streamlining the sample generation process. This paper discusses the convergence properties of FM in terms of the pp-Wasserstein distance, a measure of distributional discrepancy. We establish that FM can achieve an almost minimax optimal convergence rate for 1p21 \leq p \leq 2, presenting the first theoretical evidence that FM can reach convergence rates comparable to those of diffusion models. Our analysis extends existing frameworks by examining a broader class of mean and variance functions for the vector fields and identifies specific conditions necessary to attain these optimal rates.

1702Bayesian Persuasion Is a Bargaining Game

[openreview] [pdf]

Abstract Bayesian persuasion studies how a sender with an informational advantage can persuade a receiver with a different motive to take actions that benefit the sender. This problem is previously formulated from an equilibrium perspective, where the sender is to choose a Bayes correlated equilibrium and the receiver is willing to respect the signaling scheme based on posterior beliefs. However, evidence in real-world scenarios and studies in farsighted receivers suggest otherwise: senders tend to be much more honest than the equilibrium. In this work, we show that Bayesian persuasion is reducible to a bargaining game. This reduction suggests that the receiver in Bayesian persuasion can be aware of the game structure and can develop an anti-exploitation strategy. This equalizes the power of commitment of the two parties and prevents the sender from taking the maximum possible payoff. Through experiments on large language models, we demonstrate the receiver’s retaliatory strategies and the sender’s compromise to that. More findings on the impact of the context and alignments further suggest that bargaining behavior emerges in persuasion tasks. The insights given by our results have potential implications on various scenarios to reduce exploitation, improve equality, and improve social welfare.

1703Transformers meet Neural Algorithmic Reasoners

[openreview] [pdf]

Abstract Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer’s language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution. Finally, we empirically show that Transformer-only models distilled from TransNAR models also exhibit improved out-of-distribution generalization capabilities.

1704Intransigent Teachers Guide Better Test-Time Adaptation Students

[openreview] [pdf]

Abstract Test-Time Adaptation (TTA) has recently emerged as a promising strategy that allows the adaptation of pre-trained models to changing data distributions at deployment time, without access to any labels. To address the error accumulation problem, various approaches have used the teacher-student framework. In this work, we challenge the common strategy of setting the teacher weights to be an exponential moving average of the student by showing that error accumulation still occurs, but only on longer sequences compared to those commonly utilized. We analyze the stability-plasticity trade-off within the teacher-student framework and propose to use an intransigent teacher instead. We show that not changing any of the weights of the teacher model within existing TTA methods allows them to significantly improve their performance on multiple datasets with longer scenarios and smaller batch sizes. Finally, we show that the proposed changes are applicable to different architectures and are more robust to changes in hyper-parameters.

1705Watch Out!! Your Confidence Might be a Reason for Vulnerability

[openreview] [pdf]

Abstract The tremendous success of deep neural networks (DNNs) in solving `any’ complex computer vision task leaves no stone unturned for their deployment in the physical world. However, the concerns arise when natural adversarial corruptions might perturb the physical world in unconstrained images. It is widely known that these corruptions are inherently present in the environment and can fool DNNs. While the literature aims to provide safety to DNNs against these natural corruptions they have developed two forms of defenses: (i) detection of corrupted images and (ii) mitigation of corruptions. So far, very little work has been done to understand the reason behind the vulnerabilities of DNNs against such corruption. We assert that network confidence is an essential component and ask whether the higher it is, the better the decision of a network is or not. Moreover, we ask the question of whether this confidence itself is a reason for their vulnerability against corruption. We extensively study the correlation between the confidence of a model and its robustness in handling corruption. Through extensive experimental evaluation using multiple datasets and models, we found a significant connection between the confidence and robustness of a network.

1706Generalized Category Discovery Utilizing Reciprocal Learning and Class-wise Distribution Regularization

[openreview] [pdf]

Abstract Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer inferior base discrimination due to the unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the leaning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our method achieves superior performance in all classes with negligible extra computation. Extensive experiments on seven GCD datasets validate the effectiveness of our method, e.g. delivering a notable 3.7% improvement on the CUB200 dataset. Our codes will be available upon acceptance.

1707On Understanding of the Dynamics of Model Capacity in Continual Learning

[openreview] [pdf]

Abstract The core issue in continual learning (CL) is balancing catastrophic forgetting of prior knowledge with generalization to new tasks, otherwise, known as the stability-plasticity dilemma. We argue that the dilemma is akin to the capacity(the networks’ ability to represent tasks) of the neural network(NN) in the CL setting. Within this context, this work introduces ``CL’s effective model capacity (CLEMC)" to understand the dynamical behavior of stability-plasticity balance point in the CL setting. We define CLEMC as a function of the NN, the task data, and the optimization procedure. Leveraging CLEMC, we demonstrate that the capacity is non-stationary and regardless of the NN architecture and optimization method, the network’s ability to represent new tasks diminishes if the incoming tasks’ data distributions differ from previous ones. We formulate these results using dynamical systems’ theory and conduct extensive experiments to complement the findings. Our analysis extends from a small feed-forward(FNN) and convolutional networks(CNN) to medium sized graph neural networks(GNN) to transformer-based large language models(LLM) with millions of parameters.

1708HOME-3: HIGH-ORDER MOMENTUM ESTIMATOR USING THIRD-POWER GRADIENT FOR CONVEX, SMOOTH NONCONVEX, AND NONSMOOTH NONCONVEX OPTIMIZATION

[openreview] [pdf]

Abstract Momentum-based gradients are critical for optimizing advanced machine learning models, as they not only accelerate convergence but also help gradient-based optimizers overcome stationary points. While most state-of-the-art momentum techniques rely on lower-power gradients, such as the squared first-order gradient, there has been limited exploration into the potential of higher-power gradients—those raised to powers greater than two, such as the third-power first-order gradient. In this work, we introduce the concept of high-order momentum, where momentum is constructed using higher-power gradients, with a specific focus on the third-power first-order gradient as a representative example. Our research offers both theoretical and empirical evidence of the benefits of this novel approach. From a theoretical standpoint, we demonstrate that incorporating third-power gradients into momentum can improve the convergence bounds of gradient-based optimizers for both convex and smooth nonconvex problems. To validate these findings, we conducted extensive empirical experiments across convex, smooth nonconvex, and nonsmooth nonconvex optimization tasks. The results consistently showcase that high-order momentum outperforms traditional momentum-based optimizers, providing superior performance and more efficient optimization.

1709Task Arithmetic in Trust Region: A Training-Free Model Merging Approach to Navigate Knowledge Conflicts

[openreview] [pdf]

Abstract Multi-task model merging offers an efficient solution for integrating knowledge from multiple fine-tuned models, mitigating the significant computational and storage demands associated with multi-task training. As a key technique in this field, Task Arithmetic (TA) defines task vectors by subtracting the pre-trained model (θpre\theta_{\text{pre}}) from the fine-tuned task models in parameter space, then adjusting the weight between these task vectors and θpre\theta_{\text{pre}} to balance task-generalized and task-specific knowledge. Despite the promising performance of TA, conflicts can arise among the task vectors, particularly when different tasks require distinct model adaptations. In this paper, we formally define this issue as knowledge conflicts, characterized by the performance degradation of one task after merging with a model fine-tuned for another task. Through in-depth analysis, we show that these conflicts stem primarily from the components of task vectors that align with the gradient of task-specific losses at θpre\theta_{\text{pre}}. To address this, we propose Task Arithmetic in Trust Region (TATR), which defines the trust region as dimensions in the model parameter space that cause only small changes (corresponding to the task vector components with gradient orthogonal direction) in the task-specific losses. Restricting parameter merging within this trust region, TATR can effectively alleviate knowledge conflicts. Moreover, TATR serves as both an independent approach and a plug-and-play module compatible with a wide range of TA-based methods. Extensive empirical evaluations on eight distinct datasets robustly demonstrate that TATR improves the multi-task performance of several TA-based model merging methods by an observable margin.

1710A Percolation Model of Emergence: Analyzing Transformers Trained on a Formal Language

[openreview] [pdf]

Abstract Increase in data, size, or compute can lead to sudden learning of specific capabilities by a neural network---a phenomenon often called “emergence”. Beyond scientific understanding, establishing the causal factors underlying such emergent capabilities is crucial to enable risk regulation frameworks for AI. In this work, we seek inspiration from study of emergent properties in other fields and propose a phenomenological definition for the concept in the context of neural networks. Our definition implicates the acquisition of general structures underlying the data-generating process as a cause of sudden performance growth for specific, narrower tasks. We empirically investigate this definition by proposing an experimental system grounded in a context-sensitive formal language and find that Transformers trained to perform tasks on top of strings from this language indeed exhibit emergent capabilities. Specifically, we show that once the language’s underlying grammar and context-sensitivity inducing structures are learned by the model, performance on narrower tasks suddenly begins to improve. We then analogize our network’s learning dynamics with the process of percolation on a bipartite graph, establishing a formal phase transition model that predicts the shift in the point of emergence observed in our experiments when changing the data structure. Overall, our experimental and theoretical frameworks yield a step towards better defining, characterizing, and predicting emergence in neural networks.

1711DRoC: Elevating Large Language Models for Complex Vehicle Routing via Decomposed Retrieval of Constraints

[openreview] [pdf]

Abstract This paper proposes Decomposed Retrieval of Constraints (DRoC), a novel framework aimed at enhancing large language models (LLMs) in exploiting solvers to tackle vehicle routing problems (VRPs) with intricate constraints. While LLMs have shown promise in solving simple VRPs, their potential in addressing complex VRP variants is still suppressed, due to the limited embedded internal knowledge that is required to accurately reflect diverse VRP constraints. Our DRoC framework mitigates the issue by integrating external knowledge via a novel retrieval-augmented generation (RAG) approach. More specifically, the DRoC decomposes VRP constraints, externally retrieves information relevant to each constraint, and synergistically combines internal and external knowledge to benefit the program generation for solving VRPs. The DRoC also allows LLMs to dynamically select between RAG and self-debugging mechanisms, thereby optimizing program generation without the need for additional training. Experiments across 48 VRP variants exhibit the superiority of DRoC, with significant improvements in the success rate and optimality gap delivered by the generated programs. The DRoC framework has the potential to elevate LLM performance in complex optimization tasks, fostering the applicability of LLMs in industries such as transportation and logistics.

1712Towards Realistic Long-tailed Semi-supervised Learning in an Open World

[openreview] [pdf]

Abstract Open-world long-tailed semi-supervised learning (OLSSL) has increasingly attracted attention. However, existing OLSSL algorithms generally assume that the distributions between known and novel categories are nearly identical. Against this backdrop, we construct a more Realistic Open-world Long-tailed Semi-supervised Learning (ROLSSL) setting where there is no premise on the distribution relationships between known and novel categories. Furthermore, even within the known categories, the number of labeled samples is significantly smaller than that of the unlabeled samples, as acquiring valid annotations is often prohibitively costly in the real world. Under the proposed ROLSSL setting, we propose a simple yet potentially effective solution called dual-stage post-hoc logit adjustments. The proposed approach revisits the logit adjustment strategy by considering the relationships among the frequency of samples, the total number of categories, and the overall size of data. Then, it estimates the distribution of unlabeled data for both known and novel categories to dynamically readjust the corresponding predictive probabilities, effectively mitigating category bias during the learning of known and novel classes with more selective utilization of imbalanced unlabeled data. Extensive experiments on datasets such as CIFAR100 and ImageNet100 have demonstrated performance improvements of up to 50.1%, validating the superiority of our proposed method and establishing a strong baseline for this task. For further researches, the experimental code will be open soon.

1713BTBS-LNS: Binarized-Tightening, Branch and Search on Learning LNS Policies for MIP

[openreview] [pdf]

Abstract Learning to solve large-scale Mixed Integer Program (MIP) problems is an emerging research topic, and Policy learning-based Large Neighborhood Search (LNS) has been a popular paradigm. However, the explored space of LNS policy is often limited even in the training phase, making the learned policy sometimes wrongly fix some potentially important variables early in the search, leading to local optimum in some cases. Moreover, many methods only assume binary variables to deal with. We present a practical approach, termed Binarized-Tightening Branch-and-Search for Large Neighborhood Search (BTBS-LNS). It comprises of three key techniques: 1) the ``Binarized Tightening" technique for integer variables to handle their wide range by binary encoding and bound tightening; 2) an attention-based tripartite graph to capture global correlations among variables and constraints for an MIP instance; 3) an extra branching network as a global view, to identify and optimize wrongly-fixed backdoor variables at each search step. Experiments show its superior performance over the open-source solver SCIP and LNS baselines. Moreover, it performs competitively with, and sometimes better than the commercial solver Gurobi (v9.5.0), especially on MIPLIB2017 benchmark chosen by Hans Mittelmann, where our method can deliver 10% better primal gaps compared with Gurobi in a 300s cut-off time.

1714A shot of Cognac to forget bad memories: Corrective Unlearning in GNNs

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are increasingly being used for a variety of ML applications on graph data. As graph data does not follow the independently and identically distributed (i.i.d) assumption, adversarial manipulations or incorrect data can propagate to other datapoints through message passing, deteriorating the model’s performance. To allow model developers to remove the adverse effects of manipulated entities from a trained GNN, we study the recently formulated problem ofCorrective Unlearning. We find that current graph unlearning methods fail to unlearn the effect of manipulations even when the whole manipulated set is known. We introduce a new graph unlearning method,Cognac, which can unlearn the effect of the manipulation set even when only 5% of it is identified. It recovers most of the performance of a strong oracle with fully corrected training data, even beating retraining from scratch without the deletion set while being 8x more efficient. We hope our work guides GNN developers in fixing harmful effects due to issues in real-world data post-training.

1715Mitigating Reward Over-Optimization in RLHF via Behavior-Supported Regularization

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) is an effective method for aligning large language models (LLMs) with human values. However, reward over-optimization remains an open challenge leading to discrepancies between the performance of LLMs under the reward model and the true human objectives. A primary contributor to reward over-optimization is the extrapolation error that arises when the reward model evaluates out-of-distribution (OOD) responses. However, current methods still fail to prevent the increasing frequency of OOD response generation during the reinforcement learning (RL) process and are not effective at handling extrapolation errors from OOD responses. In this work, we propose theBehavior-Supported Policy Optimization(BSPO) method to mitigate the reward over-optimization issue. Specifically, we definebehavior policyas the next token distribution of the reward training dataset to model the in-distribution (ID) region of the reward model. Building on this, we introduce the behavior-supported Bellman operator to regularize the value function, penalizing all OOD values without impacting the ID ones. Consequently, BSPO reduces the generation of OOD responses during the RL process, thereby avoiding overestimation caused by the reward model’s extrapolation errors. Theoretically, we prove that BSPO guarantees a monotonic improvement of the supported policy until convergence to the optimal behavior-supported policy. Empirical results from extensive experiments show that BSPO outperforms baselines in preventing reward over-optimization due to OOD evaluation and finding the optimal ID policy.

1716Meta-Learning Adaptable Foundation Models

[openreview] [pdf]

Abstract The power of foundation models (FMs) lies in their capacity to learn highly expressive representations that can be adapted to a broad spectrum of tasks. However, these pretrained models require multiple stages of fine-tuning to become effective for downstream applications. Conventionally, the model is first retrained on the aggregate of a diverse set of tasks of interest, and then adapted to a specific low-resource downstream tasks by utilizing a parameter-efficient fine-tuning (PEFT) scheme. While this two-phase procedure seems reasonable, the independence of the retraining and fine-tuning phases causes a major issue, as there is no guarantee the retrained model will achieve good performance post fine-tuning. To explicitly address this issue, we introduce a meta-learning framework infused with PEFT in this intermediate retraining stage to learn a model which can be easily adapted to unseen tasks. For our theoretical results, we focus on linear models using low-rank adaptations. In this setting, we demonstrate the suboptimality of standard retraining for finding an adaptable set of parameters. Further, we prove that our method recovers the optimally adaptable parameters. We then apply these theoretical insights to retraining the RoBERTa model to predict continuations of conversations between different personas within the ConvAI2 dataset. Empirically, we observe significant performance benefits using our proposed meta-learning scheme during retraining relative to the conventional approach.

1717InstantPortrait: One-Step Portrait Editing via Diffusion Multi-Objective Distillation

[openreview] [pdf]

Abstract Real-time instruction-based portrait image editing is crucial in various applications, including filters, augmented reality, and video communications, etc. However, real-time portrait editing presents three significant challenges: identity preservation, fidelity to editing instructions, and fast model inference. Given that these aspects often present a trade-off, concurrently addressing them poses an even greater challenge. While diffusion-based image editing methods have shown promising capabilities in personalized image editing in recent years, they lack a dedicated focus on portrait editing and thus suffer from the aforementioned problems as well. To address the gap, this paper introduces an Instant-Portrait Network (IPNet), the first one-step diffusion-based model for portrait editing. We train the network in two stages. We first employ an annealing identity loss to train an Identity Enhancement Network (IDE-Net), to ensure robust identity preservation. We then train the IPNet using a novel diffusion Multi-Objective Distillation approach that integrates adversarial loss, identity distillation loss, and a novel Facial-Style Enhancing loss. The Diffusion Multi-Objective Distillation approach efficiently reduces inference steps, ensures identity consistency, and enhances the precision of instruction-based editing. Extensive comparison with prior models demonstrates IPNet as a superior model in terms of identity preservation, text fidelity, and inference speed.

1718See Further When Clear: Adaptive Generative Modeling with Curriculum Consistency Model

[openreview] [pdf]

Abstract Significant advances have been made in the sampling efficiency of diffusion models, driven by Consistency Distillation (CD), which trains a student model to mimic the output of a teacher model at an earlier timestep. However, we found that the learning complexity of the student model varies significantly across different timesteps, leading to suboptimal performance in consistency models. To address this issue, we propose the Curriculum Consistency Model (CCM), which stabilizes and balances the learning complexity across timesteps. We define the distillation process as a curriculum and introduce Peak Signal-to-Noise Ratio (PSNR) as a metric to quantify the difficulty of each step in this curriculum. By incorporating adversarial losses, our method achieves competitive single-step sampling Fréchet Inception Distance (FID) scores of 1.64 on CIFAR-10 and 2.18 on ImageNet 64x64. Moreover, our approach generalizes well to both Flow Matching models and diffusion models. We have extended our method to large-scale text-to-image models, including Stable Diffusion XL and Stable Diffusion 3.

1719Failure-Proof Non-Contrastive Self-Supervised Learning

[openreview] [pdf]

Abstract We identify sufficient conditions to avoid known failure modes, including representation, dimensional, cluster and intracluster collapses, occurring in non-contrastive self-supervised learning. Based on these findings, we propose a principled design for the projector and loss function. We theoretically demonstrate that this design introduces an inductive bias that promotes learning representations that are both decorrelated and clustered without explicit enforcing these properties and leading to improved generalization. To the best of our knowledge, this is the first solution that achieves robust training with respect to these failure modes while guaranteeing enhanced generalization performance in downstream tasks. We validate our theoretical findings on image datasets including SVHN, CIFAR10, CIFAR100 and ImageNet-100, and show that our solution, dubbed FALCON, outperforms existing feature decorrelation and cluster-based self-supervised learning methods in terms of generalization to clustering and linear classification tasks.

1720A Controlled Study on Long Context Extension and Generalization in LLMs

[openreview] [pdf]

Abstract Broad textual understanding and in-context learning require language models that utilize full document contexts. Due to the implementation challenges associated with directly training long-context models, many methods have been proposed for extending models to handle long contexts. However, owing to differences in data and model classes, it has been challenging to compare these approaches, leading to uncertainty as to how to evaluate long-context performance and whether it differs from standard evaluation. We implement a controlled protocol for extension methods with a standardized evaluation, utilizing consistent base models and extension data. Our study yields several insights into long-context behavior. First, we reaffirm the critical role of perplexity as a general-purpose performance indicator even in longer-context tasks. Second, we find that current approximate attention methods systematically underperform across long-context tasks. Finally, we confirm that exact fine-tuning based methods are generally effective within their extension range, whereas extrapolation remains challenging. All codebases, models, and checkpoints will be made available open-source, promoting transparency and facilitating further research in this critical area of AI development.

1721TILDE-Q: a Transformation Invariant Loss Function for Time-Series Forecasting

[openreview] [pdf]

Abstract Time-series forecasting has gained increasing attention in the field of artificial intelligence due to its potential to address real-world problems across various domains, including energy, weather, traffic, and economy. While time-series forecasting is a well-researched field, predicting complex temporal patterns such as sudden changes in sequential data still poses a challenge with current models. This difficulty stems from minimizing LpL_p norm distances as loss functions, such as mean absolute error (MAE) or mean square error (MSE), which are susceptible to both intricate temporal dynamics modeling and signal shape capturing. Furthermore, these functions often cause models to behave aberrantly and generate uncorrelated results with the original time-series. Consequently, the development of a shape-aware loss function that goes beyond mere point-wise comparison is essential. In this paper, we examine the definition of shape and distortions, which are crucial for shape-awareness in time-series forecasting, and provide a design rationale for the shape-aware loss function. Based on our design rationale, we propose a novel, compact loss function called TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium) that considers not only amplitude and phase distortions but also allows models to capture the shape of time-series sequences. Furthermore, TILDE-Q supports the simultaneous modeling of periodic and nonperiodic temporal dynamics. We evaluate the efficacy of TILDE-Q by conducting extensive experiments under both periodic and nonperiodic conditions with various models ranging from naive to state-of-the-art. The experimental results show that the models trained with TILDE-Q surpass those trained with other metrics, such as MSE and DILATE, in various real-world applications, including electricity, traffic, illness, economics, weather, and electricity transformer temperature (ETT).

1722Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning

[openreview] [pdf]

Abstract To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process.Instead, human-in-the-loop (HitL) RL methods hold the promise of learning reward functions from human feedback. Despite recent successes, many of the HitL RL methods still require numerous human interactions to learn successful reward functions. To improve the feedback efficiency of HitL RL methods (i.e., require less human interaction), this paper introduces Sub-optimal Data Pre-training, SDP, an approach that leverages reward-free, sub-optimal data to improve scalar- and preference-based HitL RL algorithms. In SDP, we start by pseudo-labeling all low-quality data with the minimum environment reward. Through this process, we obtain reward labels to pre-train our reward model \emph{without} requiring human labeling or preferences. This pre-training phase provides the reward model a head start in learning, enabling it to recognize that low-quality transitions should be assigned low rewards. Extensive experiments with both simulated and human teachers reveal that SDP can at least meet, but often significantly improve, state-of-the-art HitL RL performance across a variety of simulated robotic tasks.

1723PRUC & Play: Probabilistic Residual User Clustering for Recommender Systems

[openreview] [pdf]

Abstract Modern recommender systems are typically based on deep learning (DL) models, where a dense encoder learns representations of users and items. As a result, these systems often suffer from the black-box nature and computational complexity of the underlying models, making it difficult to systematically interpret their outputs and enhance their recommendation capabilities. To address this problem, we propose Probabilistic Residual User Clustering (PRUC), a causal Bayesian recommendation model based on user clusters. Specifically, we address this problem by (1) automatically dividing users into clusters and identifying causal confounders that influence latent variables, (2) developing sub-models for each confounder given observable variables, and (3) generating recommendations by aggregating the rating residuals under each confounder using do-calculus. Experiments demonstrate that our plug-and-play PRUC is compatible with various base DL recommender systems, significantly improving their performance while automatically discovering meaningful user clusters.

1724Enhancing Language Model Agents using Diversity of Thoughts

[openreview] [pdf]

Abstract A popular approach to building agents using Language Models (LMs) involves iteratively prompting the LM, reflecting on its outputs, and updating the input prompts until the desired task is achieved. However, our analysis reveals two key shortcomings in the existing methods: (i)(i) limited exploration of the decision space due to repetitive reflections, which result in redundant inputs, and (ii)(ii) an inability to leverage insights from previously solved tasks. To address these issues, we introduce DoT (Diversity of Thoughts), a novel framework that a) explicitly reduces redundant reflections to enhance decision-space exploration, and b) incorporates a task-agnostic memory component to enable knowledge retrieval from previously solved tasks—unlike current approaches that operate in isolation for each task. Through extensive experiments on a suite of programming benchmarks (HumanEval, MBPP, and LeetCodeHardGym) using a variety of LMs, DoT demonstrates up to a 10\textbf{10}% improvement in Pass@1 while maintaining cost-effectiveness. Furthermore, DoT is modular by design. For instance, when the diverse reflection module of DoT is integrated with existing methods like Tree of Thoughts (ToT), we observe a significant 13\textbf{13}% improvement on Game of 24 (one of the main benchmarks of ToT), highlighting the broad applicability and impact of our contributions across various reasoning tasks.

1725Uncertainty-aware Fine-tuning on Time Series Foundation Model for Anomaly Detection

[openreview] [pdf]

Abstract Time-series anomaly detection is a crucial task in various real-world domains, geared towards identifying data observations that significantly deviate from the norm. Although time-series foundation models have shown promising results across multiple tasks, their effectiveness in anomaly detection is often inferior. This is due to their unsupervised learning paradigm being compromised by anomaly contamination in the training data. In addition, the existing approaches lack the capability to capture boundries of multiple types of normal and abnormal patterns. To overcome these challenges, we propose ULoRA-MoE, a general uncertainty-aware fine-tuning approach using resource-efficient Mixture-of-Expert (MoE) module based on LoRA. This proposed approach can enhance the fine-tuning performance across a broad spectrum of time series foundation models for anomaly detection. Each expert module of MoE can help learn different types of anomalies. Furthermore, we design the uncertainty-aware router of MoE using Gumbel-Softmax distribution for categorical sampling to capture the epistemic uncertainty. Given the estimated uncertainty, we propose a calibrated anomaly score function to mitigate the detrimental effects of anomaly contamination. We conducted extensive experiments on two general types of time series foundation models. The results demonstrate that our approach significantly improves the model performance compared to existing fine-tuning approaches. Furthermore, ULoRA-MoE shows competitive performance compared to a comprehensive set of non-learning, classical learning, and deep learning (DL) based time-series anomaly detection baselines across 8 real-world benchmarks.

1726How Does Data Diversity Shape The Weight Landscape of Neural Networks?

[openreview] [pdf]

Abstract To enhance the generalization of machine learning models to unseen data, techniques such as dropout, weight decay (L2 regularization), and noise augmentation are commonly employed. While regularization methods (i.e., dropout and weight decay) are geared toward adjusting model parameters to prevent overfitting, data augmentation increases the diversity of the input training set, a method purported to improve accuracy and calibration error. In this paper, we investigate the impact of each of these techniques on the parameter space of neural networks, with the goal of understanding how they alter the weight landscape in transfer learning scenarios. To accomplish this, we employ Random Matrix Theory to analyze the eigenvalue distributions of pre-trained models, fine-tuned using these techniques but using different levels of data diversity, for the same downstream tasks. We observe that diverse data influences the weight landscape in a similar fashion as dropout. Additionally, we compare commonly used data augmentation methods with synthetic data created by generative models. We conclude that synthetic data can bring more diversity into real input data, resulting in a better performance on out-of-distribution test instances.

1727On Stochastic Contextual Bandits with Knapsacks in Small Budget Regime

[openreview] [pdf]

Abstract This paper studies stochastic contextual bandits with knapsack constraints (CBwK), where a learner observes a context, takes an action, receives a reward, and incurs a vector of costs at every round. The learner aims to maximize the cumulative rewards across TT rounds under the knapsack constraints with an initial budget of BB. We study CBwK in the small budget regime where the budget B=Ω(T)B = \Omega(\sqrt{T}) and propose an Adaptive and Universal Primal--Dual algorithm (AUPD) that achieves strong regret performance: i) AUPD achieves O~((1+νδb)T)\tilde{O}((1 + \frac{\nu^*}{\delta b})\sqrt{T}) regret under the strict feasibility assumption without any prior information, matching the best-known bounds; ii) AUPD achieves O~(T+νbT34)\tilde{O}(\sqrt{T}+ \frac{\nu^*}{\sqrt{b}}T^{\frac{3}{4}}) regret without strict feasibility assumption, which, to the best of our knowledge, is the first result in the literature. Here, the parameter ν\nu^* represents the optimal average reward; b=B/Tb=B/T is the average budget and δb\delta b is the feasibility/safety margin. We establish these strong results through the adaptive budget-aware design, which effectively balances reward maximization and budget consumption. We provide a new perspective on analyzing budget consumption using the Lyapunov drift method, along with a refined analysis of its cumulative variance. Our theory is further supported by experiments conducted on a large-scale dataset.

1728Exemplar-free Continual Representation Learning with Symmetric Distillation

[openreview] [pdf]

Abstract Continual learning strives to train a model in a sequential manner by learning from new tasks while retaining information about old tasks. Treating this as a common classification problem leads to catastrophic forgetting, especially in deep learning settings, where knowledge of old tasks is forgotten as soon as a model is optimized on new tasks. Existing solutions tackle this problem by imposing strict assumptions, such as the availability of exemplars from previously seen classes or a warm start of a model on many classes before starting the continual learning. While effective on known benchmarks, such assumptions can be impractical and do not directly address the stability-plasticity dilemma in continual learning. In this paper, we follow a recent push in the field to tackle continual learning in the exemplar-free cold-start setting. We propose Model-in-the-Middle (MITM). The idea behind MITM is to separate the learning of new classes and retention of past class knowledge by using two distinct models. We propose a learner with symmetric distillation from both models, enabling us to learn evolving representations as new tasks arrive. We show that explicitly separating and balancing old and new tasks through symmetric distillation helps absorb large distribution shifts in between tasks, mitigating the stability gap. Our approach is simple yet outperforms the state-of-the-art in the challenging exemplar-free cold-start continual learning setting.

1729Joint or Disjoint: Mixing Training Regimes for Early-Exit Models

[openreview] [pdf]

Abstract Early exits are an important efficiency mechanism integrated into deep neural networks that allows for the termination of the network’s forward pass before processing through all its layers. Early exit methods add trainable internal classifiers which leads to different training dynamics. However, there is no consistent verification of the approaches of training of early exit methods and little understanding how training regimes optimize the architecture. Most early exit methods employ a training strategy that either simultaneously trains the backbone network and the exit heads or trains the exit heads separately. We propose a training approach where the backbone is initially trained on its own, followed by a phase where both the backbone and the exit heads are trained together. Thus, we categorize early-exit training strategies into three distinct categories, and then validate them for their performance and efficiency. In this benchmark, we perform both theoretical and empirical analysis of early-exit training regimes. We study the methods in terms of information flow, loss landscape and numerical rank of activations and gauge the suitability of regimes for various architectures and datasets.

1730Multi-Perspective Test-Time Prompt Tuning for Global, Local Visuals, and Language

[openreview] [pdf]

Abstract Recent advances in vision-language models (VLMs) have demonstrated significant generalization across a broad range of tasks through prompt learning. However, bridging the distribution shift between training and test data remains a significant challenge. Existing researches utilize multiple augmented views of test samples for zero-shot adaptation. While effective, these approaches focus solely on global visual information, neglecting the local contextual details of test images. Moreover, simplistic, single-form textual descriptions limit the understanding of visual concepts, hindering the transfer performance of classes with similar or complex visual features. In this paper, we propose a Multi-Perspective Test-Time Prompt Tuning method, MP-TPT, building on two key insights: local visual perception and class-specific description augmentation. Specifically, we introduce local visual representations from VLMs during the optimization process to enhance the prompts’ ability to perceive local context. On the other hand, we design a data augmentation method at the text feature level that imparts regional visual priors to specific class texts, thereby enriching the class-specific descriptions. Furthermore, we synchronize the multi-view concept during the inference, integrating both local and global visual representations with text features for a deeper understanding of visual concepts. Through extensive experiments across 15 benchmark datasets, we demonstrate the advantages of MP-TPT, particularly achieving a 1% improvement in state-of-the-art TPT accuracy in cross-dataset settings, along with 4.5 times acceleration in inference speed.

1731Treatment Rule Optimization Under Counterfactual Temporal Point Processes with Latent States

[openreview] [pdf]

Abstract In high-stakes areas like healthcare, retrospective counterfactual analysis—such as evaluating what might have happened if treatments were administered earlier, later, or differently—is vital for refining treatment strategies. This paper proposes a counterfactual treatment optimization framework using temporal point processes to model outcome event sequences. By sampling potential outcome events under new treatment decision rules, our approach seeks to optimize treatment strategies in a counterfactual setting. To achieve accurate counterfactual evaluation of new decision rules, we explicitly introduce latent states into the modeling of temporal point processes. Our method first infers the latent states and associated noise, followed by counterfactual sampling of outcome events. This approach rigorously addresses the complexities introduced by latent states, effectively removing biases in the evaluation of treatment strategies. By proving the identifiability of model parameters in the presence of these states, we provide theoretical guarantees that enhance the reliability and robustness of the counterfactual analysis. By incorporating latent states and proving identifiability, our framework not only improves the accuracy and robustness of treatment decision rules but also offers actionable insights for optimizing healthcare interventions. This method holds significant potential for improving treatment strategies, particularly in healthcare scenarios where patient symptoms are complex and high-dimensional.

1732Challenging the Counterintuitive: Revisiting Simple Likelihood Tests with Normalizing Flows for Tabular Data Anomaly Detection

[openreview] [pdf]

Abstract In this study, we propose a novel approach to anomaly detection in the tabular domain using normalizing flows, leveraging a simple likelihood test to achieve state-of-the-art performance in unsupervised learning. Although simple likelihood tests have been shown to fail in anomaly detection for image data, we redefine the counterintuitive phenomenon and demonstrate, both theoretically and empirically, why this method succeeds in the tabular domain. Our approach outperforms traditional anomaly detection methods by offering more consistent results. Furthermore, we question the practice of fine-tuning parameters for each dataset individually, ensuring fair and unbiased comparisons by adopting uniform hyperparameters across all datasets. Through extensive experimentation, we validate the robustness and scalability of our method, highlighting its practical effectiveness in real-world settings.

1733Moment Constrained Optimal Transport for Control Applications

[openreview] [pdf]

Abstract This paper concerns the application of techniques from optimal transport (OT) to mean field control, in which the probability measures of interest in OT correspond to empirical distributions associated with a large collection of controlled agents. The control objective of interest motivates a one-sided relaxation of OT, in which the first marginal is fixed and the second marginal is constrained to a “moment class”: a set of probability measures defined by generalized moment constraints. This relaxation is particularly interesting for control problems as it enables the coordination of agents without the need to know the desired distribution beforehand. The inclusion of an entropic regularizer is motivated by both computational considerations, and also to impose hard constraints on agent behavior. A computational approach inspired by the Sinkhorn algorithm is proposed to solve this problem. This new approach to distributed control is illustrated with an application of charging a fleet of electric vehicles while satisfying grid constraints. An online version is proposed and applied in a case study on the ElaadNL dataset containing 10,000 EV charging transactions in the Netherlands. This empirical validation demonstrates the effectiveness of the proposed approach to optimizing flexibility while respecting grid constraints.

1734FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction

[openreview] [pdf]

Abstract Auto-regressive Large Language Models (LLMs) demonstrate remarkable performance across domanins such as vision and language processing. However, due to sequential processing through a stack of transformer layers, autoregressive decoding faces significant computation/latency challenges, particularly in resource-constrained environments like mobile and edge devices. Existing approaches in literature that aim to improve latency via skipping layers have two distinct flavors - 1) Early exit 2) Input-agnostic heuristics where tokens exit at pre-determined layers irrespective of input sequence. Both the above strategies have limitations - the former cannot be applied to handle KV Caching necessary for speed-ups in modern framework and the latter does not capture the variation in layer importance across tasks or more generally, across input sequences. To address both limitations, we propose \textsc{FiRST}, an algorithm that reduces inference latency by using layer-specific routers to select a subset of transformer layers adaptively for each input sequence - the prompt (during prefill stage) decides which layers will be skipped during decoding. \textsc{FiRST} preserves compatibility with KV caching enabling faster inference while being quality-aware. \textsc{FiRST} is model-agnostic and can be easily enabled on any pre-trained LLM. We further improve performance by incorporating LoRA adapters for fine-tuning on external datasets, enhancing task-specific accuracy while maintaining latency benefits. Our approach reveals that input adaptivity is critical - indeed, different task-specific middle layers play a crucial role in evolving hidden representations depending on task. Extensive experiments show that \textsc{FiRST} significantly reduces latency while retaining competitive performance (as compared to baselines), making our approach an efficient solution for LLM deployment in low-resource environments.

1735EventFlow: Forecasting Continuous-Time Event Data with Flow Matching

[openreview] [pdf]

Abstract Continuous-time event sequences, in which events occur at irregular intervals, are ubiquitous across a wide range of industrial and scientific domains. The contemporary modeling paradigm is to treat such data as realizations of a temporal point process, which is typically modeled in an autoregressive fashion by a neural network. While autoregressive models are successful in predicting the time of a single subsequent event, their performance is unsatisfactory in forecasting longer horizons due to cascading errors. We propose EventFlow\texttt{EventFlow}, a non-autoregressive generative model for temporal point processes. Our model builds on the flow matching framework in order to directly learn the joint distributions over event times, side-stepping the autoregressive process. EventFlow\texttt{EventFlow} is easy to implement and sample from, and achieves state-of-the-art performance in both unconditional and conditional generation tasks on a set of standard benchmarks.

1736Dual Consolidation for Pre-Trained Model-Based Domain-Incremental Learning

[openreview] [pdf]

Abstract Domain-Incremental Learning (DIL) involves the progressive adaptation of a model to new concepts across different domains. While recent advances in pre-trained models provide a solid foundation for DIL, learning new concepts often results in the catastrophic forgetting of pre-trained knowledge. Specifically, sequential model updates can overwrite both the representation and the classifier with knowledge from the latest domain. Thus, it is crucial to develop a representation and corresponding classifier that accommodate all seen domains throughout the learning process. To this end, we propose DUal ConsolidaTion (Duct) to unify and consolidate historical knowledge at both the representation and classifier levels. By merging the backbone of different stages, we create a representation space suitable for multiple domains incrementally. The merged representation serves as a balanced intermediary that captures task-specific features from all seen domains. Additionally, to address the mismatch between consolidated embeddings and the classifier, we introduce an extra classifier consolidation process. Leveraging class-wise semantic information, we estimate the classifier weights of old domains within the latest embedding space. By merging historical and estimated classifiers, we align them with the consolidated embedding space, facilitating incremental classification. Extensive experimental results on four benchmark datasets demonstrate Duct’s state-of-the-art performance.

1737Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data

[openreview] [pdf]

Abstract Discrete diffusion models with absorbing processes have shown promise in language modeling. The key quantities to be estimated are the ratios between the marginal probabilities of two transitive states at all timesteps, called the concrete score. In this paper, we reveal that the concrete score in absorbing diffusion can be expressed as conditional probabilities of clean data, multiplied by a time-dependent scalar in an analytic form. Motivated by this finding, we propose reparameterized absorbing discrete diffusion (RADD), a dedicated diffusion model without time-condition that characterizes the time-independent conditional probabilities. Besides its simplicity, RADD can reduce the number of function evaluations (NFEs) by caching the output of the time-independent network when the noisy sample remains unchanged in a sampling interval, which enables sampling acceleration. Built upon the new perspective of conditional distributions, we further unify absorbing discrete diffusion and any-order autoregressive models (AO-ARMs), showing that the upper bound on the negative log-likelihood for the diffusion model can be interpreted as an expected negative log-likelihood for AO-ARMs. Further, our RADD models achieve SOTA performance among diffusion models on 5 zero-shot language modeling benchmarks (measured by perplexity) at the GPT-2 scale.

1738Learning Generalizable Environment Models via Discovering Superposed Causal Relationships

[openreview] [pdf]

Abstract In reinforcement learning, a generalizable world model to mimic the environment is crucial for the assessment of various policy values in downstream tasks such as offline policy optimization and off-policy evaluation. Recently, studies have shown that learning a world model with sparse connections identified by causal discovery techniques can improve generalizability. So far, these studies focus on discovering a single and global causal structure. In this paper, we discuss a more practical setting in which the agent is deployed in an environment mixed with different causal mechanisms, called superposed causal relationships in this article. In this case, global causal discovery techniques will derive a degraded dense causal relationship, which will fail to improve the generalizability of the learned model. To solve the problem, we propose \textbf{S}uperposed c\textbf{A}usal \textbf{M}odel (SAC) learning. SAM learning is an end-to-end framework that learns a transformer-based model which can recognize the causal relationships that the agent is encountering on the fly and then adapts its predictions. The experiments are conducted in two simulated environments, where SAM shows powerful identify abilities in environments with superposed causal relationships. Both the dynamics model and the policies learned by the SAM~generalize well to unseen states.

1739Bridging and Modeling Correlations in Pairwise Data for Direct Preference Optimization

[openreview] [pdf]

Abstract Direct preference optimization (DPO), a widely adopted offline preference optimization algorithm, aims to align large language models (LLMs) with human-desired behaviors using pairwise preference data. However, the winning response and the losing response within pairwise data are generated isolatedly, leading to weak correlations between them as well as suboptimal alignment performance. To address this issue, we propose an effective framework for Bridging and Modeling Correlations in pairwise data, named BMC. Firstly, we increase the consistency and informativeness of the pairwise preference signals through targeted modifications, synthesizing a pseudo-winning response by improving the losing response with the winning response as a reference. Secondly, we identify that DPO alone is insufficient to model these correlations and capture nuanced variations. Therefore, we propose learning token-level correlations by dynamically leveraging the policy model’s confidence during training. Comprehensive experiments on QA, math, and instruction-following tasks demonstrate the effectiveness of our approach, significantly surpassing competitive baselines, including DPO. Additionally, our in-depth quantitative analysis reveals the reasons behind our method’s superior performance over DPO and showcases its versatility to other DPO variants.

1740Attic: A New Architecture for Tabular In-Context Learning Transformers

[openreview] [pdf]

Abstract Tabular In-Context Learning (ICL) transformers, such as TabPFN and TabForestPFN, have shown strong performance on tabular classification tasks. In this paper, we introduce Attic, a new architecture for ICL-transformers. Unlike TabPFN and TabForestPFN, where one token represents all features of one observation, Attic assigns one token to each feature of every observation. This simple architectural change results in a significant performance boost. As a result, we can confidently say that neural networks outperform tree-based methods like XGBoost.

1741SPARC: Continual learning beyond experience rehearsal and model surrogates

[openreview] [pdf]

Abstract Continual learning (CL) has become increasingly important as deep neural networks (DNNs) are required to adapt to the continuous influx of data without retraining from scratch. However, a significant challenge in CL is catastrophic forgetting (CF), where learning new tasks erases previously acquired knowledge, either partially or completely. Existing solutions often rely on experience rehearsal or full model surrogates to mitigate CF. While effective, these approaches introduce substantial memory and computational overhead, limiting their scalability and applicability in real-world scenarios where efficiency is critical. To address this, we propose SPARC, a scalable CL approach that eliminates the need for both experience rehearsal and full model surrogates. SPARC employs parameter-efficient task-specific working memories to capture information relevant to each task and task-agnostic semantic memory for cross-task knowledge consolidation. Additionally, SPARC introduces weight re-normalization in the classification layer to reduce recency bias toward newly learned tasks. SPARC is lightweight, requiring only 6% of the parameters used by full-model surrogates, yet it delivers superior performance on Seq-TinyImageNet and matches the results of rehearsal-based methods on various CL benchmarks. This makes SPARC a practical solution for continual learning where computational efficiency and scalability are paramount.

1742Gaussian Loss Smoothing Enables Certified Training with Tight Convex Relaxations

[openreview] [pdf]

Abstract Training neural networks with high certified accuracy against adversarial examples remains an open challenge despite significant efforts. While certification methods can effectively leverage tight convex relaxations for bound computation, in training, these methods, perhaps surprisingly, can perform worse than looser relaxations. Prior work hypothesized that this phenomenon is caused by the discontinuity, non-smoothness and perturbation sensitivity of the loss surface induced by tighter relaxations. In this work, we theoretically show that Gaussian Loss Smoothing (GLS) can alleviate these issues. We confirm this empirically by instantiating GLS with two variants: a zeroth-order optimization algorithm called PGPE which allows training with non-differentiable relaxations, and a first-order optimization algorithm, called RGS, which requires gradients of the relaxation, but is much more efficient than PGPE. Extensive experiments show that when combined with tight relaxations, these methods surpass state-of-the-art methods when training on the same network architecture for many settings. Our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks and pave a path towards leveraging tighter relaxations for certified training.

1743Can Symbolic Regression of Boolean Functions Boost Logic Synthesis?

[openreview] [pdf]

Abstract Logic synthesis, which aims to synthesize a compact logic circuit with minimized size while exactly satisfying a given functionality, plays an important role in chip design. Recently, symbolic regression (SR) has shown great success in scientific discovery to recover underlying mathematical functions from given datasets. However, we found from extensive experiments that existing SR methods struggle to recover an exact and compact boolean function for logic synthesis given a truth table, i.e., complete input-output pairs of the circuit. The major challenges include (1) the greater complexity of underlying boolean functions compared to mathematical functions, and (2) the complex objectives involving both exact recovery and expression optimization towards circuit minimization. To address these challenges, we propose a novel symbolic factorized boolean searcher (SINE) to recover exact and compact boolean functions towards logic synthesis. Motivated by the Shannon decomposition theorem, SINE proposes a factorized boolean function representation to decompose the underlying boolean function into multiple simplified sub-functions, significantly reducing their complexity and thus improving the recovery accuracy. Moreover, based on the key observation that, logical sharing is significant for circuit size minimization. SINE proposes a self symmetric sub-expression motif operators mining mechanism to enhance the monte-carlo tree search method for optimized boolean function learning. To the best of our knowledge, SINE is the first symbolic regression framework capable of exactly recovering optimized boolean functions for circuit optimization. Experiments on circuits across a wide range of inputs demonstrate that SINE significantly improves the recovery accuracy and decreases the size of synthesized circuits by up to 24.32% compared to state-of-the-art methods.

1744DAG-NAS: Explainable Neural Architecture Search\for Reinforcement Learning via Scalar-level DAG Modeling

[openreview] [pdf]

Abstract We present an explainable and effective Neural Architecture Search (NAS) framework for Reinforcement Learning (RL). We model a feed-forward neural network as a Directed Acyclic Graph (DAG) that consists of scalar-level operations and their interconnections. We train the model for RL tasks using a differentiable search method, followed by pruning the search outcomes. This process results in a compact neural architecture that achieves high performance and enhances explainability by emphasizing crucial information for solving the RL problem. This process results in a compact and efficient neural architecture that enhances explainability by emphasizing crucial information for solving the RL problem. We apply our NAS framework to the Actor-Critic PPO algorithm, targeting both actor and critic networks. We evaluate its performance across various RL tasks. Extensive experiments demonstrate that our architectures achieve comparable performance with significantly fewer parameters while also enhancing explainability by highlighting key features.

1745ProAdvPrompter: A Two-Stage Journey to Effective Adversarial Prompting for LLMs

[openreview] [pdf]

Abstract As large language models (LLMs) are increasingly being integrated into various real-world applications, the identification of their vulnerabilities to jailbreaking attacks becomes an essential component of ensuring the safety and reliability of LLMs. Previous studies have developed LLM assistants, known as the adversarial prompter, to automatically generate suffixes that manipulate target LLMs into generating harmful and undesirable outputs. However, these approaches often suffer from low performance or generate semantically meaningless prompts, which can be easily identified by perplexity-based defenses. In this paper, we introduce a novel two-stage method, ProAdvPrompter\texttt{ProAdvPrompter}, that significantly improves the performance of adversarial prompters. In ProAdvPrompter\texttt{ProAdvPrompter}, the first stage (Exploration) utilizes the loss information to guide the adversarial prompter in generating suffixes that are more likely to elicit harmful responses. Then the second stage (Exploitation) iteratively fine-tunes the prompter using high-quality generated adversarial suffixes to further boost performance. Additionally, we incorporate the prompt template to aid in the Exploration stage and propose a filtering mechanism to accelerate the training process in the Exploitation stage. We evaluate ProAdvPrompter\texttt{ProAdvPrompter} against the well-aligned LLMs (i.e., Llama2-Chat-7B and Llama3-chat-8B), achieving attack success rates of 99.68% and 97.12% respectively after 10 trials on the AdvBench dataset, thereby enhancing performance by 2\sim 2 times compared to previous works. Moreover, ProAdvPrompter\texttt{ProAdvPrompter} reduces training time by 20% on Llama3-Instruct-8B, generates more generalized adversarial suffixes, and demonstrates resilience against the perplexity defense. An ablation study further evaluates the effects of key components in ProAdvPrompter\texttt{ProAdvPrompter} (the prompt template and the filtering mechanism).

1746Enhancing LLM Faithfulness in Rationale Generation via Dual-Reward Probabilistic Inference

[openreview] [pdf]

Abstract As large language models (LLMs) are increasingly applied to complex reasoning tasks, achieving both accurate task performance and faithful explanations becomes crucial. However, LLMs often generate unfaithful explanations, partly because they do not consistently adhere closely to the provided context. Existing approaches address this problem either rely on superficial calibration, such as decomposed Chain-of-Thought prompting, or require costly retraining to improve model faithfulness. In this work, we propose a probabilistic inference paradigm that provides fine-grained and lookahead rewards to ensure that LLM-generated rationales are logically coherent and comprehensive. These rewards are derived from a domain-specific proposal distribution, allowing for optimised sequential Monte Carlo approximations. Our evaluations across three different reasoning tasks show that this method, which allows for controllable generation during inference, improves both accuracy and faithfulness of LLMs while keeping computational costs similar to those of existing decoding techniques. This method offers a promising path towards making LLMs more reliable for reasoning tasks without sacrificing performance or efficiency.

1747Private Wasserstein Distance

[openreview] [pdf]

Abstract Wasserstein distance is a key metric for quantifying data divergence from a distributional perspective. However, its application in privacy-sensitive environments, where direct sharing of raw data is prohibited, presents significant challenges. Existing approaches, such as Differential Privacy and Federated Optimization, have been employed to estimate the Wasserstein distance under such constraints. However, these methods often fall short when both accuracy and security are required. In this study, we explore the inherent triangular properties within the Wasserstein space, leading to a novel solution named TriangleWad\texttt{TriangleWad}. This approach facilitates the fast computation of the Wasserstein distance between datasets stored across different entities, ensuring that raw data remain completely hidden. TriangleWad not only strengthens resistance to potential attacks but also preserves high estimation accuracy. Through extensive experiments across various tasks involving both image and text data, we demonstrate its superior performance and significant potential for real-world applications.

1748Adaptive Continual Learning Through Proactive Detection of Transfer and Interference

[openreview] [pdf]

Abstract Continual learning (CL) requires models to sequentially learn multiple tasks, maximizing transfer and minimizing interference. CL methods based on pre-trained models (PTM) have shown strong performance by integrating PTM fine-tuning with traditional approaches. Despite these promising results, current methods lack the ability to proactively detect task transfer and interference at the local optimization level, limiting their effectiveness in maximizing transfer and minimizing interference. To address this issue, we propose adaptive continual learning strategies through proactive detection of transfer and interference. We derive the conditions under which task transfer and interference occur from a model optimization perspective, based on the Fisher matrix and gradient update directions. Based on them, we proposed a task transfer distance metric to help model modules detect transfer and interference during continual learning. We propose a dynamic parameter update mechanism and a dynamic expansion strategy, based on LoRA fine-tuning and a Mixture of Experts (MoE) mechanism, to handle varying levels of task transfer and interference. Experiments results of seven benchmarks show that our method achieves the best accuracy with a limited number of parameters, maximizing transfer and minimizing interference.

1749Output Scouting: Auditing Large Language Models for Catastrophic Responses

[openreview] [pdf]

Abstract Recent high profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about a growing interest in AI safety. One reason LLM safety issues occur is that models often have at least some non-zero probability of producing harmful outputs. In this work, we explore the following scenario: imagine an AI safety auditor is searching for catastrophic responses from an LLM (e.g. a “yes” responses to “can I fire an employee for being pregnant?”), and is able to query the model a limited number times (e.g. 1000 times). What is a strategy for querying the model that would efficiently find those failure responses? To this end, we propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution. We then run experiments using two LLMs and find numerous examples of catastrophic responses. We conclude with a discussion that includes advice for practitioners who are looking to implement LLM auditing for catastrophic responses. We will release an open-source toolkit that implements our auditing framework using the Hugging Facetransformerslibrary following publication.

1750Efficiently Deploying LLMs with Controlled Risk

[openreview] [pdf]

Abstract Deploying large language models in production requires simultaneous attention to efficiency and risk control. Prior work has shown the possibility to cut costs while maintaining similar accuracy, but has neglected to focus on risk control. By contrast, here we present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries along the LLM intelligence hierarchy, enabling training-free model switching based solely on black-box API calls. Our framework presents novel trade-offs between efficiency and risk. For example, deploying HCMA on MMLU cuts the error rate of Llama3 405B by 30% when the model is allowed to abstain on 20% of the queries. To calibrate HCMA for optimal performance, our approach uses data-efficient logistic regressions (based on a simple nonlinear feature transformation), which require only 50 or 100 labeled examples to achieve excellent calibration error (ECE), cutting ECE by 50% compared to naive Platt scaling. On free-form generation tasks, we find that chain-of-thought is ineffectual for selective prediction, whereas zero-shot prompting yields drives error to 0% on TruthfulQA at high abstention rates. As LLMs are increasingly deployed across computing environments with different capabilities (such as mobile, laptop, and cloud), our framework paves the way towards maintaining deployment efficiency while putting in place sharp risk controls.

1751Achieving Exact Federated Unlearning with Improved Post-Unlearning Performance

[openreview] [pdf]

Abstract Federated learning is a machine learning paradigm that allows multiple clients to train aggregated model via sharing model updates to a central server without sharing their data. Even though the data is not shared, it can indirectly influence the aggregated model via the shared model updates. In many real-life scenarios, we need to completely remove a client’s influence (unlearning) from the aggregated model, such as competitive clients who want to remove their influence from the aggregated model after leaving the coalition to ensure other clients do not benefit from their contributions. The influence removal is also needed when the adversarial client negatively affects the aggregated model. Though the aggregated model can be retrained from scratch to ensure exact unlearning (completely removing the client’s influence from the aggregated model), it performs poorly just after the unlearning, which is undesirable during deployment. To overcome this challenge, this paper proposes federated unlearning algorithms that ensure exact unlearning while achieving better performance post-unlearning. Our experimental results on different real datasets validate the performance of the proposed algorithms.

1752Optimal Transport for Time Series Imputation

[openreview] [pdf]

Abstract Missing data imputation through distribution alignment has demonstrated advantages for non-temporal datasets but exhibits suboptimal performance in time-series applications. The primary obstacle is crafting a discrepancy measure that simultaneously (1) captures temporal pattern\textit{captures temporal pattern}—accounting for patterns such as periodicities and temporal dependencies inherent in time-series—and (2) accommodates non-stationarity\textit{accommodates non-stationarity}, ensuring robustness amidst multiple coexisting temporal patterns. In response to these challenges, we introduce the Proximal Spectrum Wasserstein (PSW) discrepancy based on the stochastic optimal transport framework, which incorporates a pairwise spectral distance to encapsulate temporal patterns, coupled with selective matching regularization to accommodate non-stationarity. Building upon PSW, we develop the PSW for Imputation (PSW-I) framework, which iteratively refines imputation results by minimizing the PSW discrepancy. Extensive experiments demonstrate that PSW-I effectively addresses these challenges and significantly outperforms prevailing time-series imputation methods.

1753Unlocking Speech Instruction Data Potential with Query Rewriting

[openreview] [pdf]

Abstract End-to-end Large Speech Language Models (LSLMs) demonstrate strong potential in response latency and speech comprehension capabilities, showcasing general intelligence across speech understanding tasks. However, the ability to follow speech instructions has not been fully realized due to the lack of datasets and heavily biased training tasks. Leveraging the rich ASR datasets, previous approaches have used Large Language Models (LLMs) to continue the linguistic information of speech to construct speech instruction datasets. Yet, due to the gap between LLM-generated results and real human responses, the continuation methods further amplify these shortcomings. Given the high costs of collecting and annotating speech instruction dataset by human, using speech synthesis to construct large-scale speech instruction datasets has become a balanced and robust alternative. Although modern Text-To-Speech (TTS) models have achieved near-human-level synthesis quality, it is challenging to appropriately convert out-of-distribution text instruction to speech due to the limitations of the training data distribution in TTS models.To address this issue, we propose a query rewriting framework with multi-LLM knowledge fusion, employing multiple agents to annotate and validate the synthesized speech, making it possible to construct high-quality speech instruction datasets without relying on human annotation. Experiments show that this method can transform text instructions into distributions more suitable for TTS models for speech synthesis through zero-shot rewriting, increasing data usability from 71% to 93%. It also demonstrates unique advantages in rewriting tasks that require complex knowledge and context-related abilities.

1754A Policy-Gradient Approach to Solving Imperfect-Information Games with Best-Iterate Convergence

[openreview] [pdf]

Abstract Policy gradient methods have become a staple of any single-agent reinforcement learning toolbox, due to their combination of desirable properties: iterate convergence, efficient use of stochastic trajectory feedback, and theoretically-sound avoidance of importance sampling corrections. In multi-agent imperfect-information settings (extensive-form games), however, it is still unknown whether the same desiderata can be guaranteed while retaining theoretical guarantees. Instead, sound methods for extensive-form games rely on approximating \emph{counterfactual} values (as opposed to Q values), which are incompatible with policy gradient methodologies. In this paper, we investigate whether policy gradient can be safely used in two-player zero-sum imperfect-information extensive-form games (EFGs). We establish positive results, showing for the first time that a policy gradient method leads to provable best-iterate convergence to a regularized Nash equilibrium in self-play.

1755Towards the Effect of Large Language Models on Out-Of-Distribution Challenge in Text-Attributed Graphs

[openreview] [pdf]

Abstract Text-Attributed Graphs (TAGs), where each node is associated with text attributes, are ubiquitous and have been widely applied in the real world. The Out-Of-Distribution (OOD) issue, i.e., the training data and the test data not from the same distribution, is quite common in learning on real-world TAGs, posing significant challenges to the effectiveness of graph learning models. Recently, Large Language Models (LLMs) have shown extraordinary capability in processing text data, and have demonstrated tremendous potential in handling TAGs. However, there is no benchmark work that systematically and comprehensively investigates the effect of these LLM-based methods on alleviating the OOD issue on TAGs. To bridge this gap, we first develop OOD-TAG, a comprehensive OOD benchmark dataset in TAGs which consists of diverse distributions. Meanwhile, we conduct a systematic and comprehensive investigation on OOD-TAG with different LLM pipelines for graphs. In addition, we provide original observations and novel insights based on the empirical study, which can suggest promising directions for the research of LLMs in addressing the OOD challenges on TAGs. Our code and dataset are available inhttps://anonymous.4open.science/r/GraphOOD-benchmark-5FCF/.

1756Measuring the Impact of Equal Treatment as Blindness via Distributions of Explanations Disparity

[openreview] [pdf]

Abstract Liberal political philosophy advocates for the policy of \emph{equal treatment as blindness}, which seeks to achieve fairness by treating individuals without considering their protected characteristics directly. However, this policy has faced longstanding criticism for perpetuating existing inequalities. In machine learning, this policy can be translated into the concept of \emph{fairness as unawareness}, and be measured using disparate treatment metrics such as Demographic Parity (a.k.a. Statistical Parity). Our analysis reveals that Demographic Parity does not faithfully measure whether individuals are being treated independently of the protected attribute by the model. We introduce the Explanation Disparity metric to measure fairness under \emph{equal treatment as blindness} policy. Our metric evaluates the fairness of predictive models by analyzing the extent to which the protected attribute can be inferred from the distribution of explanation values, specifically using Shapley values. The proposed metric tests for statistical independence of the explanation distributions over populations with different protected characteristics. We show the theoretical properties of “Explanation Disparity” and devise an equal treatment inspector based on the AUC of a Classifier Two-Sample Test. We experiment with synthetic and natural data to demonstrate and compare the notion with related ones. We release \texttt{explanationspace}, an open-source Python package with methods and tutorials

1757Enhancing Conversational Recommender Systems with Tree-Structured Knowledge and Pretrained Language Models

[openreview] [pdf]

Abstract Conversational recommender systems (CRS) have emerged as a key enhancement to traditional recommendation systems, offering interactive and explainable recommendations through natural dialogue. Recent advancements in pretrained language models (PLMs) have significantly improved the conversational capabilities of CRS, enabling more fluent and context-aware interactions. However, PLMs still face challenges, including hallucinations—where the generated content can be factually inaccurate—and difficulties in providing precise, entity-specific recommendations. To address these challenges, we propose the PCRS-TKA framework, which integrates PLMs with knowledge graphs (KGs) through prompt-based learning. By incorporating tree-structured knowledge from KGs, our framework grounds the PLM in factual information, thereby enhancing the accuracy and reliability of the recommendations. Additionally, we design a user preference extraction module to improve the personalization of recommendations and introduce an alignment module to ensure semantic consistency between dialogue text and KG data. Extensive experiments demonstrate that PCRS-TKA outperforms existing methods in both recommendation accuracy and conversational fluency.

1758RETRIEVAL-AUGMENTED GENERATION WITH ESTIMATION OF SOURCE RELIABILITY

[openreview] [pdf]

Abstract Retrieval-augmented generation (RAG) addresses key limitations of large language models (LLMs), such as hallucinations and outdated knowledge, by incorporating external databases. These databases typically consult multiple sources to encompass up-to-date and various information. However, standard RAG methods often overlook the heterogeneous source reliability in the multi-source database and retrieve documents solely based on relevance, making them prone to propagating misinformation. To address this, we propose Reliability-Aware RAG (RA-RAG) which estimates the reliability of multiple sources and incorporates this information into both retrieval and aggregation processes. Specifically, it iteratively estimates source reliability and true answers for a set of queries with no labelling. Then, it selectively retrieves relevant documents from a few of reliable sources and aggregates them using weighted majority voting, where the selective retrieval ensures scalability while not compromising the performance. We also introduce a benchmark designed to reflect real-world scenarios with heterogeneous source reliability and demonstrate the effectiveness of RA-RAG compared to a set of baselines.

1759Avoid Being a Shortcut Learner through Library-Based Re-Learning

[openreview] [pdf]

Abstract Replay-based methods provide a promising solution to address catastrophic forgetting issue in continual learning. They try to retain previous knowledge by using a small amount of data from previous tasks stored in a fix-sized buffer. In this work, we invoke the information bottleneck principles and reveal some fundamental limitations of those methods on their effectiveness in capturing the truly important features from the prior tasks by relying on the buffer data selected according to the model’s performance on known tasks. Since future tasks are not accessible during model training and buffer construction, the trained model and the buffer data tend to be biased towards making accurate predictions on the labels of known tasks. However, when new task samples are introduced along with labels, the biased model and the buffer data become less effective in differentiating samples of the old tasks from those of the new ones. Inspired by the way humans learn over time, we propose a novel relearning technique that makes use of additional past data, referred to as the library, to test how much information the model loses after learning the new task. We then realign the model towards those forgotten samples by training on a carefully selected small subset samples from the library for a few epochs with comparable computational cost as existing replay-based models. The experimental results on multiple real-world datasets demonstrate that the proposed relearning process can improve the performance of the state-of-the-art continual learning methods by a large margin.

1760Understanding Dimensional Collapse in Cross-Modal Feature Distillation

[openreview] [pdf]

Abstract To overcome limited computing resources and the complexity of sensor configurations in deploying multi-modal neural networks in real-world applications, cross-modal knowledge distillation (CMKD) aims to transfer valuable information from a pretrained teacher model to a deployable student model with the target modality. Despite the successful applications of CMKD in various fields, our understanding of knowledge transfer across different modalities remains insufficient to fully explain the efficacy of feature distillation. In this work, we investigate the relationship between the distributional shifts across modalities, referred to as the modality gap, and its impact on the effectiveness of CMKD, particularly focusing on the problem of cross-modal feature distillation. We first hypothesize and empirically validate that the modality gap between the teacher and student causes dimensional collapse in the student’s feature space. To prevent such inefficiency, we propose a Cross-modal Information Bottleneck Approximation (CIBA) scheme aimed at extracting and transferring modality-general features from the teacher model. Lastly, we experimentally demonstrate that our distillation strategy effectively reduces the dimensional collapse in the student model, thereby achieving improved performance for various real-world multi-modal datasets.

1761A Time Series is Worth Five Experts: Heterogeneous Mixture of Experts for Traffic Flow Prediction

[openreview] [pdf]

Abstract Accurate traffic prediction faces significant challenges, necessitating a deep understanding of both temporal and spatial cues and their complex interactions across multiple variables. Recent advancements in traffic prediction systems are primarily due to the development of complex sequence-centric models. However, existing approaches often embed multiple variables and spatial relationships at each time step, which may hinder effective variable-centric learning, ultimately leading to performance degradation in traditional traffic prediction tasks. To overcome these limitations, we introduce variable-centric and prior knowledge-centric modeling techniques. Specifically, we propose a Heterogeneous Mixture of Experts (TITAN) model for traffic flow prediction. TITAN initially consists of three experts focused on sequence-centric modeling. Then, designed a low-rank adaptive method, TITAN simultaneously enables variable-centric modeling. Furthermore, we supervise the gating process using a prior knowledge-centric modeling strategy to ensure accurate routing. Experiments on two public traffic network datasets, METR-LA and PEMS-BAY, demonstrate that TITAN effectively captures variable-centric dependencies while ensuring accurate routing. Consequently, it achieves improvements in all evaluation metrics, ranging from approximately 4.37% to 11.53%, compared to previous state-of-the-art (SOTA) models. The code will be released upon acceptance.

1762The Effect of Personalization in FedProx: A Fine-grained Analysis on Statistical Accuracy and Communication Efficiency

[openreview] [pdf]

Abstract FedProx is a simple yet effective federated learning method that enables model personalization via regularization. Despite remarkable success in practice, a rigorous analysis of how such a regularization provably improves the statistical accuracy of each client’s local model hasn’t been fully established. Setting the regularization strength heuristically presents a risk, as an inappropriate choice may even degrade accuracy. This work fills in the gap by analyzing the effect of regularization on statistical accuracy, thereby providing a theoretical guideline for setting the regularization strength for achieving personalization. We prove that by adaptively choosing the regularization strength under different statistical heterogeneity, FedProx can consistently outperform pure local training and achieve a nearly minimax-optimal statistical rate. In addition, to shed light on resource allocation, we design an algorithm, provably showing that stronger personalization reduces communication complexity without increasing the computation cost overhead. Finally, our theory is validated on both synthetic and real-world datasets and its generalizability is verified in a non-convex setting.

1763MRS: A Fast Sampler for Mean Reverting Diffusion based on ODE and SDE Solvers

[openreview] [pdf]

Abstract In applications of diffusion models, controllable generation is of practical significance, but is also challenging. Current methods for controllable generation primarily focus on modifying the score function of diffusion models, while Mean Reverting (MR) Diffusion directly modifies the structure of the stochastic differential equation (SDE), making the incorporation of image conditions simpler and more natural. However, current training-free fast samplers are not directly applicable to MR Diffusion. And thus MR Diffusion requires hundreds of NFEs (number of function evaluations) to obtain high-quality samples. In this paper, we propose a new algorithm named MRS (MR Sampler) to reduce the sampling NFEs of MR Diffusion. We solve the reverse-time SDE and the probability flow ordinary differential equation (PF-ODE) associated with MR Diffusion, and derive semi-analytical solutions. The solutions consist of an analytical function and an integral parameterized by a neural network. Based on this solution, we can generate high-quality samples in fewer steps. Our approach does not require training and supports all mainstream parameterizations, including noise prediction, data prediction and velocity prediction. Extensive experiments demonstrate that MR Sampler maintains high sampling quality with a speedup of 10 to 20 times across ten different image restoration tasks. Our algorithm accelerates the sampling procedure of MR Diffusion, making it more practical in controllable generation.

1764Frequency-Conditioned Diffusion Models for Time Series Generation

[openreview] [pdf]

Abstract Time series data, commonly used in fields like climate studies, finance, and healthcare, usually faces challenges such as missing data and privacy concerns. Recently, diffusion models have emerged as effective tools for generating high-quality data, but applying them to time series is still difficult, especially for capturing long-range dependencies and complex information. In this paper, we introduce a new diffusion model that uses frequency domain information to improve time series data generation. In particular, we apply Fourier analysis to adaptively separate low-frequency global trends from high-frequency details, which helps the model better understand important patterns during the denoising process. Finally, our approach uses a specialized frequency encoder to integrate this information, enhancing the model’s ability to capture both global and local features. Through exhaustive experiments on various public datasets, our model shows an impressive performance in generating time series data for diverse tasks like forecasting and imputation, outperforming existing methods in accuracy and flexibility.

1765Policy Optimization under Imperfect Human Interactions with Agent-Gated Shared Autonomy

[openreview] [pdf]

Abstract We introduce AGSA, an Agent-Gated Shared Autonomy framework that learns from high-level human feedback to tackle the challenges of reward-free training, safe exploration, and imperfect low-level human control. Recent human-in-the loop learning methods enable human participants to intervene a learning agent’s control and provide online demonstrations. Nonetheless, these methods rely heavily on perfect human interactions, including accurate human-monitored intervention decisions and near-optimal human demonstrations. AGSA employs a dedicated gating agent to determine when to switch control, thereby reducing the need of constant human monitoring. To obtain a precise and foreseeable gating agent, AGSA trains a long-term gating value function from human evaluative feedback on the gating agent’s intervention requests and preference feedback on pairs of human intervention trajectories. Instead of relying on potentially suboptimal human demonstrations, the learning agent is trained using control-switching signals from the gating agent. We provide theoretical insights on performance bounds that respectively describe the ability of the two agents. Experiments are conducted with both simulated and real human participants at different skill levels in challenging continuous control environments. Comparative results highlight that AGSA achieves significant improvements over previous human-in-the-loop learning methods in terms of training safety, policy performance, and user-friendliness.

1766FACTOR: Factoring Complexity and Context Length in Long-Context Model Evaluation

[openreview] [pdf]

Abstract Large language models (LLMs) with extended context windows have shown remarkable capabilities, especially with contexts up to 128K tokens. However, whether these resource-intensive LLMs genuinely surpass simpler Retrieval Augmented Generation (RAG) techniques remains debated. We precisely delineate differences between long-context LLMs and RAG methods, emphasizing the unique long-context reasoning abilities of LLMs that RAG cannot replicate. Existing benchmarks often focus on retrieval tasks and contain weak if not none complex reasoning tasks, hindering assessment of reasoning over extended contexts. We introduce the \textbf{FACTOR} benchmark (\textbf{F}actoring \textbf{A}nalysis of \textbf{C}omplexity and \textbf{T}extual \textbf{C}ontext in \textbf{R}easoning), which evaluates LLMs by independently varying task complexity and context length. A comprehensive list of LLMs are evaluated on FACTOR. Besides mere accuracy scores, we also model the relationship between accuracy and complexity given the context length. A simple but consistent log-linear model works surprisingly well across various models. Also, the modeling contains two explainable parameters, the slope or Complexity Decay Factor (CDF) and the y-intercept or Contextual Decay Offset (CDO) that are shown to offer separate and insightful measures of the models’ complex reasoning and long context innate ability. Our findings highlight distinct failure modes linked to task complexity and context length, underscoring the unique reasoning capabilities of long-context LLMs unattainable by RAG methods.

1767Thermodynamic Natural Gradient Descent

[openreview] [pdf]

Abstract Second-order training methods have better convergence properties than gradient descent but are rarely used in practice for large-scale training due to their computational overhead. This can be viewed as a hardware limitation (imposed by digital computers). Here we show that natural gradient descent (NGD), a second-order method, can have a similar computational complexity per iteration to a first-order method, when employing appropriate hardware. We present a new hybrid digital-analog algorithm for training neural networks that is equivalent to NGD in a certain parameter regime but avoids prohibitively costly linear system solves. Our algorithm exploits the thermodynamic properties of an analog system at equilibrium, and hence requires an analog thermodynamic computer. The training occurs in a hybrid digital-analog loop, where the gradient and Fisher information matrix (or any other positive semi-definite curvature matrix) are calculated at given time intervals while the analog dynamics take place. We numerically demonstrate the superiority of this approach over state-of-the-art digital first- and second-order training methods on classification tasks and language model fine-tuning tasks.

1768BEARD: Benchmarking the Adversarial Robustness for Dataset Distillation

[openreview] [pdf]

Abstract Dataset Distillation (DD) is an emerging technique that compresses large-scale datasets into significantly smaller synthesized datasets while preserving high test performance and enabling the efficient training of large models. However, current research primarily focuses on enhancing evaluation accuracy under limited compression ratios, often overlooking critical security concerns such as adversarial robustness. A key challenge in evaluating this robustness lies in the complex interactions between distillation methods, model architectures, and adversarial attack strategies, which complicate standardized assessments. To address this, we introduce BEARD, an open and unified benchmark designed to systematically assess the adversarial robustness of DD methods, including DM, IDM, and BACON. BEARD encompasses a variety of adversarial attacks (e.g., FGSM, PGD, C&W) on distilled datasets like CIFAR-10/100 and TinyImageNet. Utilizing an adversarial game framework, it introduces three key metrics: Robustness Ratio (RR), Attack Efficiency Ratio (AE), and Comprehensive Robustness-Efficiency Index (CREI). Our analysis includes unified benchmarks, various Images Per Class (IPC) settings, and the effects of adversarial training. Results are available on the BEARD Leaderboard, along with a library providing model and dataset pools to support reproducible research. Access the code at BEARD.

1769Detecting Hallucination Before Answering: Semantic Compression Through Instruction

[openreview] [pdf]

Abstract Large language models (LLMs) excel in various tasks but often suffer from hallucinations, providing incorrect information with high confidence. To address this, we focus on detecting when an LLM knows or does not know an answer, a concept referred to as the ``feeling of knowing’’ (FoK). We propose a novel approach called Semantic Compression by trying to Answer in One-word (SCAO), which enables efficient FoK detection before generating full sentences, with only minimal computational cost. Additionally, we introduce a method to measure confounding variable effects in benchmarks, an Approximate Misannotation Effect (AME) test. Our experiments demonstrate that the feature fusion model of our SCAO and probing achieves enhanced performance in FoK detection in both short and long-form entity questions. The code and the dataset is available online (https://anonymous.4open.science/r/SCAO-2FF8).

1770ExID: Offline RL with Intuitive Expert Insights in Limited-Data Settings

[openreview] [pdf]

Abstract With the ability to learn from static datasets, Offline Reinforcement Learning (RL) emerges as a compelling avenue for real-world applications. However, state-of-the-art offline RL algorithms perform sub-optimally when confronted with limited data confined to specific regions within the state space. The performance degradation is attributed to the inability of offline RL algorithms to learn appropriate actions for rare or unseen observations. This paper proposes a novel domain knowledge-based regularization technique and adaptively refines the initial domain knowledge to considerably boost performance in limited data with partially omitted states. The key insight is that the regularization term mitigates erroneous actions for sparse samples and unobserved states covered by domain knowledge. Empirical evaluations on standard offline RL datasets demonstrate a substantial average performance increase compared to ensemble of domain knowledge and existing offline RL algorithms operating on limited data.

1771Realtime Reinforcement Learning: Towards Rapid Asynchronous Deployment of Large Models

[openreview] [pdf]

Abstract Realtime environments change even as agents perform action inference and learning, thus requiring high interaction frequencies to effectively minimize long-term regret. However, recent advances in machine learning involve larger neural networks with longer inference times, raising questions about their applicability in realtime systems where quick reactions are crucial. We present an analysis of lower bounds on regret in realtime environments to show that minimizing long-term regret is generally impossible within the typical sequential interaction and learning paradigm, but often becomes possible when sufficient asynchronous compute is available. We propose novel algorithms for staggering asynchronous inference processes to ensure that actions are taken at consistent time intervals, and demonstrate that use of models with high action inference times is only constrained by the environment’s effective stochasticity over the inference horizon, and not by action frequency. Our analysis shows that the number of inference and learning processes needed scales linearly with increasing inference times while enabling use of models that are multiple orders of magnitude larger than existing approaches when learning from a realtime simulation of Game Boy games such as Pokemon and Tetris.

1772RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark

[openreview] [pdf]

Abstract Deep reinforcement learning (RL) has recently shown significant benefits in solving combinatorial optimization (CO) problems, reducing reliance on domain expertise, and improving computational efficiency. However, the field lacks a unified benchmark for easy development and standardized comparison of algorithms across diverse CO problems. To fill this gap, we introduce RL4CO, a unified and extensive benchmark with in-depth library coverage of 23 state-of-the-art methods and 20+ CO problems. Built on efficient software libraries and best practices in implementation, RL4CO features modularized implementation and flexible configuration of diverse RL algorithms, neural network architectures, inference techniques, and environments. RL4CO allows researchers to seamlessly navigate existing successes and develop their unique designs, facilitating the entire research process by decoupling science from heavy engineering. We also provide extensive benchmark studies to inspire new insights and future work. RL4CO has attracted numerous researchers in the community and is open-sourced.

1773RePrompt: Prompt Engineering for Large Language Models Agents through Reflection

[openreview] [pdf]

Abstract In this past year, large language models (LLMs) have had remarkable success in domains outside the traditional natural language processing, and people are starting to explore the usage of LLMs in more general and close to application domains like code generation, travel planning, and robot controls. Connecting these LLMs with great capacity and external tools, people are building the so-called LLM agents, which are supposed to help people do all kinds of work in everyday life. In all these domains, the prompt to the LLMs has been shown to make a big difference in what the LLM would generate and thus affect the performance of the LLM agents. Therefore, automatic prompt engineering (APE) has become an important question for many researchers and users of LLMs. However, previous works in APE all rely on a final checker to evaluate the performance of the given prompt, which is hard to meet in the case of LLM agents where intermediate feedback is easier to get, and the final evaluation could be expensive, inaccurate, or even missing. In this paper, we propose a novel method, \textsc{RePrompt}, which does a ``gradient descent"-like approach to optimize the step-by-step instructions in the prompts given to LLM agents based on the chat history obtained from interactions and reflections with LLM agents. By leveraging intermediate feedback, \textsc{RePrompt} can optimize the prompt without the need for a final solution checker. We have used experiments in PDDL generation and travel planning to show that our method could generally improve the performance for different reasoning tasks.

1774Align Your Intents: Offline Imitation Learning via Optimal Transport

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) addresses the problem of sequential decision-making by learning optimal policy through pre-collected data, without interacting with the environment. As yet, it has remained somewhat impractical, because one rarely knows the reward explicitly and it is hard to distill it retrospectively. Here, we show that an imitating agent can still learn the desired behavior merely from observing the expert, despite the absence of explicit rewards or action labels. In our method, AILOT (Aligned Imitation Learning via Optimal Transport), we involve special representation of states in a form of intents that incorporate pairwise spatial distances within the data. Given such representations, we define intrinsic reward function via optimal transport distance between the expert’s and the agent’s trajectories. We report that AILOT outperforms state-of-the art offline imitation learning algorithms on D4RL benchmarks and improves the performance of other offline RL algorithms by dense reward relabelling in the sparse-reward tasks.

1775FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning

[openreview] [pdf]

Abstract Black-Box Discrete Prompt Learning (BDPL) is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making the prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting Federated Learning (FL) to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning had neglected the substantial query cost associated with the cloud-based LLM service. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we calledFedOne, enabled optimal query efficiency in federated black-box prompt learning. Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.

1776Towards Optimizing Top-KRanking Metrics in Recommender Systems

[openreview] [pdf]

Abstract In the realm of recommender systems (RS), Top-KK metrics such as NDCG@KK are the gold standard for evaluating performance. Nonetheless, during the training of recommendation models, optimizing NDCG@KK poses significant challenges due to its inherent discontinuous nature and the intricacies of the Top-K truncation mechanism. Recent efforts to optimize NDCG@KK have either neglected the Top-KK truncation or suffered from low computational efficiency. To overcome these limitations, we propose SoftmaxLoss@KK (SL@KK), a new loss function designed as a surrogate for optimizing NDCG@KK in RS. SL@KK integrates a quantile-based technique to handle the complex truncation term; and derives a smooth approximation of NDCG@KK to address discontinuity. Our theoretical analysis confirms the close bounded relationship between NDCG@KK and SL@KK.Besides, SL@KK also exhibits several desirable properties including concise formulation, computational efficiency, and noisy robustness. Extensive experiments on four real-world datasets and three recommendation backbones demonstrate that SL@KK outperforms existing loss functions with a notable average improvement of 6.19%.

1777Distilling Dataset into Neural Field

[openreview] [pdf]

Abstract Utilizing large-scale datasets is essential for training high-performance deep learning models, but it also comes with substantial computation and storage costs. To overcome these challenges, dataset distillation has emerged as a promising solution by compressing large-scale datasets into smaller synthetic versions that retain the essential information needed for training. This paper proposes a novel parameterization framework for dataset distillation, coined Distilling Dataset into Neural Field (DDiF), which leverages the neural field to store the necessary information of large-scale datasets. Due to the unique nature of the neural field, which takes coordinates as input and output quantity, DDiF effectively preserves the information and easily generates various shapes of data. Beyond the efficacy, DDiF has larger feature coverage than some previous literature if same budget is allowed, which is proved from the frequency domain perspective. Under the same budget setting, this larger coverage leads to a significant performance improvement in downstream tasks by providing more synthetic instances due to the coding efficiency. DDiF demonstrates both theoretical and empirical evidence of its ability to operate efficiently within a limited budget, while better preserving the information of the original dataset compared to conventional parameterization methods.

1778Sequential Order-Robust Mamba for Time Series Forecasting

[openreview] [pdf]

Abstract Mamba has recently emerged as a promising alternative to Transformers, offering near-linear complexity in processing sequential data. However, while channels in time series (TS) data have no specific order in general, recent studies have adopted Mamba to capture channel dependencies (CD) in TS, introducing a sequential order bias. To address this issue, we propose SOR-Mamba, a TS forecasting method that 1) incorporates a regularization strategy to minimize the discrepancy between two embedding vectors generated from data with reversed channel orders, thereby enhancing robustness to channel order, and 2) eliminates the 1D-convolution originally designed to capture local information in sequential data. Furthermore, we introduce channel correlation modeling (CCM), a pretraining task aimed at preserving correlations between channels from the data space to the latent space in order to enhance the ability to capture CD. Extensive experiments demonstrate the efficacy of the proposed method across standard and transfer learning scenarios.

1779MS3M: Multi-Stage State Space Model for Motion Forecasting

[openreview] [pdf]

Abstract Motion forecasting is a fundamental component of autonomous driving systems, as it predicts an agent’s future trajectories based on its surrounding environment. Transformer architectures have dominated this domain due to their strong ability to model both temporal and spatial information. However, transformers often suffer from quadratic complexity with respect to input sequence length, limiting their ability to efficiently process scenarios involving numerous agents. Additionally, transformers typically rely on positional encodings to represent temporal or spatial relationships, a strategy that may not be as effective or intuitive as the inductive biases naturally embedded in convolutional architectures. To address these challenges, we leverage recent advancements in state space models (SSMs) and propose the Multi-Stage State Space Model (MS3^3M). In MS3^3M, the Temporal Mamba Model (TMM) is employed to capture fine-grained temporal information, while the Spatial Mamba Model efficiently handles spatial interactions. By injecting temporal and spatial inductive biases through Mamba’s state-space model structure, the model’s capacity is significantly improved. MS3^3M also strikes an exceptional trade-off between accuracy and efficiency, which is achieved through convolutional computations and near-linear computational strategies in the Mamba architecture. Furthermore, a hierarchical query-based decoder is introduced, further enhancing model performance and efficiency. Extensive experimental results demonstrate that the proposed method achieves superior performance while maintaining low latency, which is crucial for practical real-time autonomous driving systems.

1780How Far are Today’s Time-Series Models from Real-world Weather Forecasting Applications?

[openreview] [pdf]

Abstract The development of Time-Series Forecasting (TSF) techniques is often hindered by the lack of comprehensive datasets. This is particularly problematic for time-series weather forecasting, where commonly used datasets suffer from significant limitations such as small size, limited temporal coverage, and sparse spatial distribution. These constraints severely impede the optimization and evaluation of TSF models, resulting in benchmarks that are not representative of real-world applications, such as operational weather forecasting. In this work, we introduce the WEATHER-5K dataset, a comprehensive collection of observational weather data that better reflects real-world scenarios. As a result, it enables a better training of models and a more accurate assessment of the real-world forecasting capabilities of TSF models, pushing them closer to in-situ applications. Through extensive benchmarking against operational Numerical Weather Prediction (NWP) models, we provide researchers with a clear assessment of the gap between academic TSF models and real-world weather forecasting applications. This highlights the significant performance disparity between TSF and NWP models by analyzing performance across detailed weather variables, extreme weather event prediction, and model complexity comparison. Finally, we summarise the result into recommendations to the users and highlight potential areas required to facilitate further TSF research. The dataset and benchmark implementation will be publicly available.

1781BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

[openreview] [pdf]

Abstract World models are receiving increasing attention in autonomous driving for their capability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenize multimodal sensor inputs into a unified and compact Bird’s Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence diffusion model. The multi-modal tokenizer first encodes multi-modality information and the decoder is able to reconstruct the latent BEV tokens into LiDAR and image observations by ray-casting rendering in a self-supervised manner. Then the latent BEV sequence diffusion model predicts future scenarios given action tokens as conditions. Experiments demonstrate the effectiveness of BEVWorld in autonomous driving tasks, showcasing its capability in generating future scenes and benefiting downstream tasks such as perception and motion prediction. Code will be available soon.

1782Differentiation of Multi-objective Data-driven Decision Pipeline

[openreview] [pdf]

Abstract Real-world scenarios frequently involve multi-objective data-driven optimization problems, characterized by unknown problem coefficients and multiple conflicting objectives. Traditional two-stage methods independently apply a machine learning model to estimate problem coefficients, followed by invoking a solver to tackle the predicted optimization problem. The independent use of optimization solvers and prediction models may lead to suboptimal performance due to mismatches between their objectives. Recent efforts have focused on end-to-end training of predictive models that use decision loss derived from the downstream optimization problem. However, these methods have primarily focused on single-objective optimization problems, thus limiting their applicability. We aim to propose a multiobjective decision-focused approach to address this gap. In order to better align with the inherent properties of multi-objective optimization problems, we propose a set of novel loss functions. These loss functions are designed to capture the discrepancies between predicted and true decision problems, considering solution space, objective space, and decision quality, named landscape loss, Pareto set loss, and decision loss, respectively. Our experimental results demonstrate that our proposed method significantly outperforms traditional two-stage methods and most current decision-focused methods.

1783Concealing Backdoors in Federated Learning by Trigger-Optimized Data Poisoning

[openreview] [pdf]

Abstract Federated Learning (FL) is a decentralized machine learning method that enables participants to collaboratively train a model without sharing their private data. Despite its privacy and scalability benefits, FL is susceptible to backdoor attacks, where adversaries poison the local training data of a subset of clients using backdoor triggers, aiming to make the aggregated model produce malicious results when the same backdoor conditions are met by an inference-time input. Existing backdoor attacks in FL suffer from common deficiencies: fixed trigger patterns and reliance on the assistance of model poisoning. State-of-the-art defenses based on analyzing clients’ model updates exhibit a good defense performance on these attacks because of the significant divergence between malicious and benign client model updates. To effectively conceal malicious model updates among benign ones, we propose DPOT, a backdoor attack strategy in FL that dynamically constructs backdoor objectives by optimizing a backdoor trigger, making backdoor data have minimal effect on model updates. We provide theoretical justifications for DPOT’s attacking principle and display experimental results showing that DPOT, via only a data-poisoning attack, effectively undermines state-of-the-art defenses and outperforms existing backdoor attack techniques on various datasets.

1784REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments

[openreview] [pdf]

Abstract Do generalist agents require large models pre-trained on massive amounts of data to rapidly adapt to new environments? We propose a novel approach to pre-train relatively small models and adapt them to unseen environments via in-context learning, without any finetuning. Our key idea is that retrieval offers a powerful bias for fast adaptation. Indeed, we demonstrate that even a simple retrieval-based 1-nearest neighbor agent offers a surprisingly strong baseline for today’s state-of-the-art generalist agents. From this starting point, we construct a semi-parametric agent, REGENT, that trains a transformer-based policy on sequences of queries and retrieved neighbors. REGENT can generalize to unseen robotics and game-playing environments via retrieval augmentation and in-context learning, achieving this with up to 3x fewer parameters and up to an order-of-magnitude fewer pre-training datapoints, significantly outperforming today’s state-of-the-art generalist agents.

1785Decoupling Variable and Temporal Dependencies: A Novel Approach for Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract In multivariate time series forecasting using the Transformer architecture, capturing temporal dependencies and modeling inter-variable relationships are crucial for improving performance. However, overemphasizing temporal dependencies can destabilize the model, increasing its sensitivity to noise, overfitting, and weakening its ability to capture inter-variable relationships. We propose a new approach called the Temporal-Variable Decoupling Network (TVDN) to address this challenge. This method decouples the modeling of variable dependencies from temporal dependencies and further separates temporal dependencies into historical and predictive sequence dependencies, allowing for a more effective capture of both. Specifically, the simultaneous learning of time-related and variable-related patterns can lead to harmful interference between the two. TVDN first extracts variable dependencies from historical data through a permutation-invariant model and then captures temporal dependencies using a permutation-equivariant model. By decoupling variable and temporal dependencies and historical and predictive sequence dependencies, this approach minimizes interference and allows for complementary extraction of both. Our method provides a concise and innovative approach to enhancing the utilization of temporal features. Experiments on multiple real-world datasets demonstrate that TVDN achieves state-of-the-art performance.

1786D2G: Debiased Learning with Distribution Guidance for Generalized Category Discovery

[openreview] [pdf]

Abstract In this paper, we tackle the problem of Generalized Category Discovery (GCD). Given a dataset containing both labelled and unlabelled images, the objective is to cluster all images in the unlabelled subset, irrespective of whether they are from known or unknown classes. In GCD, an inherent label bias exists between known and unknown classes due to the lack of ground-truth labels for the latter. State-of-the-art GCD methods employ parametric classifiers trained with self-distillation using soft labels, leaving the bias issue unattended. Besides, they treat all unlabelled samples uniformly, neglecting variations in certainty levels and resulting in suboptimal learning. Moreover, the explicit identification of semantic distribution shifts between known and unknown classes, a vital aspect for effective GCD, has been neglected. To overcome these obstacles, we introduce the \textbf{D}ebiased Learning with \textbf{D}istribution \textbf{G}uidance (\textbf{D2G}) framework. Initially, D2G co-trains an auxiliary debiased classifier in the same feature space as the GCD classifier, progressively enhancing the GCD features. Moreover, we introduce a semantic distribution detector in a separate feature space to implicitly boost the learning efficacy of GCD. Additionally, we employ a curriculum learning strategy based on semantic distribution certainty to steer the debiased learning at an optimized pace. Thorough evaluations on GCD benchmarks demonstrate the consistent state-of-the-art performance of our D2G framework, highlighting its superiority.

1787Counterfactual Effect Decomposition in Multi-Agent Sequential Decision Making

[openreview] [pdf]

Abstract We address the challenge of explaining counterfactual outcomes in multi-agent Markov decision processes. In particular, we aim to explain the total counterfactual effect of an agent’s action on the outcome of a realized scenario through its influence on the environment dynamics and the agents’ behavior. To achieve this, we introduce a novelcausal explanation formulathat decomposes the counterfactual effect by attributing to each agent and state variable a score reflecting their respective contributions to the effect. First, we show that the total counterfactual effect of an agent’s action can be decomposed into two components: one measuring the effect that propagates through all subsequent agents’ actions and another related to the effect that propagates through the state transitions. Building on recent advancements in causal contribution analysis, we further decompose these two effects as follows. For the former, we consideragent-specific effects-- a causal concept that quantifies the counterfactual effect of an agent’s action that propagates through a subset of agents. Based on this notion, we use Shapley value to attribute the effect to individual agents. For the latter, we consider the concept ofstructure-preserving interventionsand attribute the effect to state variables based on their "intrinsic’’ contributions. Through extensive experimentation, we demonstrate the interpretability of our decomposition approach in a Gridworld environment with LLM-assisted agents and a sepsis management simulator.

1788AIPO: Agreement-Aware Iterative Preference Optimization for Length Exploitation Mitigation

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) is gaining popularity as an alternative to Proximal Policy Optimization (PPO) for aligning Large Language Models (LLMs). Recent research on aligning LLMs iteratively with synthetic or partially synthetic data has shown promising outcomes, facilitating the scalability of DPO training in both academic settings and proprietary models such as Llama 3. Despite its success, we observe that the issue of length exploitation in DPO becomes more pronounced during iterative preference optimization, with the severity escalating progressively with each iteration. This observation prompts an in-depth examination of iterative preference optimization with synthetic data. In this paper, we present our findings and analyses in building our iterative preference optimization pipeline. Specifically, we analyze the issue of length exploitation in this iterative process and propose a novel training objective for iterative preference optimization, namely \textbf{A}greement-aware \textbf{I}terative \textbf{P}reference \textbf{O}ptimization (AIPO). To demonstrate the effectiveness of our proposed method, we conduct extensive experiments and show that it achieves state-of-the-art performance on MT-Bench, AlpacaEval 2.0, and Arena-Hard.

1789Foundation Policies with Memory

[openreview] [pdf]

Abstract A generalist agent should perform well on novel tasks in unfamiliar environments. While Foundation Policies (FPs) enable generalization across new tasks, they lack mechanisms for handling novel dynamics. Conversely, agents equipped with memory models can adapt to new dynamics, but struggle with unseen tasks. In this work, we bridge this gap by integrating memory models into the FP architecture, allowing policies to condition on both task and environment dynamics. We evaluate FPs enhanced with attention, state-space, and RNN-based memory models on POPGym, a memory benchmark, and ExORL, an unsupervised RL benchmark. Our results show that GRUs achieve the best generalization to unseen tasks and dynamics for a given recurrent state size, approaching the performance of a supervised baseline that has access to task information during training and significantly outperforming memory-free FPs. Additionally, our approach improves FP performance on entirely new environments not encountered during training. Our anonymized code is available at \url{https://anonymous.4open.science/r/zero-shot-96A1}.

1790RACCOON: Regret-based Adaptive Curricula for Cooperation

[openreview] [pdf]

Abstract Overfitting to training partners is a common problem in fully-cooperative multi-agent settings, leading to poor zero-shot transfer to novel partners. A popular solution is to train an agent with a diverse population of training partners. However, previous work lacks a principled approach for selecting partners from this population during training, usually sampling at random. We argue that partner sampling is an important and overlooked problem, and motivated by the success of regret-based Unsupervised Environment Design, we propose Regret-based Adaptive Curricula for Cooperation (RACCOON), a novel a method which prioritises high-regret partners and tasks. We test RACCOON in the Overcooked environment, and demonstrate that it leads to sample efficiency gains and increased robustness across diverse partners and tasks, compared with strong baselines. We further analyse the nature of the induced curricula, and conclude with discussions on the limitations of cooperative regret and directions for future work.

1791Escaping Saddle Point Efficiently in Minimax and Bilevel Optimizations

[openreview] [pdf]

Abstract Hierarchical optimization is attracting significant attentions as it can be applied to a broad range of machine learning tasks. Recently, many algorithms are proposed to improve the theoretical results of minimax and bilevel optimizations. Among these works, a core issue that has not been well studies is to escape saddle point and find local minimum. In this paper, thus, we investigate the methods to achieve second-order optimality for nonconvex minimax and bilevel optimization. Specifically, we propose a new algorithm named PRGDA without the computation of second order derivative of the primal function. In nonconvex-strongly-concave minimax optimization, we prove that our algorithm can find a second-order stationary point with the gradient complexity that matches state-of-the-art result to find first-order stationary point. To our best knowledge, PRGDA is the first stochastic algorithm that is guaranteed to obtain the second-order stationary point for nonconvex minimax problems. In nonconvex-strongly-convex bilevel optimization, our method also achieves better gradient complexity to find local minimum. Finally, we conduct two numerical experiments to validate the performance of our new method.

1792End-to-end Learning of Gaussian Mixture Priors for Diffusion Sampler

[openreview] [pdf]

Abstract Diffusion models optimized via variational inference (VI) have emerged as a promising tool for generating samples from unnormalized target densities. These models create samples by simulating a stochastic differential equation, starting from a simple, tractable prior, typically a Gaussian distribution. However, when the support of this prior differs greatly from that of the target distribution, diffusion models often struggle to explore effectively or suffer from large discretization errors. Moreover, learning the prior distribution can lead to mode-collapse, exacerbated by the mode-seeking nature of reverse Kullback-Leibler divergence commonly used in VI. To address these challenges, we propose end-to-end learnable Gaussian mixture priors (GMPs). GMPs offer improved control over exploration, adaptability to target support, and increased expressiveness to counteract mode collapse. We further leverage the structure of mixture models by proposing a strategy to iteratively refine the model through the addition of mixture components during training. Our experimental results demonstrate significant performance improvements across a diverse range of real-world and synthetic benchmark problems when using GMPs without requiring additional target evaluations.

1793Jailbreaking as a Reward Misspecification Problem

[openreview] [pdf]

Abstract The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. This misspecification occurs when the reward function fails to accurately capture the intended behavior, leading to misaligned model outputs. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts in a reward-misspecified space. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark against various target aligned LLMs while preserving the human readability of the generated prompts. Furthermore, these attacks on open-source models demonstrate high transferability to closed-source models like GPT-4o and out-of-distribution tasks from HarmBench. Detailed analysis highlights the unique advantages of the proposed reward misspecification objective compared to previous methods, offering new insights for improving LLM safety and robustness.

1794Why DP “LOCAL” SGD – Faster Convergence in Less Composition with Clipping Bias Reduction

[openreview] [pdf]

Abstract We argue to apply Differentially-Private Local Stochastic Gradient Descent (DP-LSGD), a generalization of regular DP-SGD with per-sample local iterations, to systematically improve privacy-preserving machine learning. We prove and show the following facts in this paper: a). DP-LSGD with local iterations can produce more concentrated per-sample updates and therefore enables a more efficient exploitation of the clipping budget with a better utility-privacy tradeoff; b). given the same TT privacy composition or per-sample update aggregation, with properly-selected local iterations, DP-LSGD can converge faster in O(1/T)O(1/T) to a small neighborhood of (local) optimum compared to O(1/T)O(1/\sqrt{T}) in regular DP-SGD, i.e., DP-LSGD produces the same accuracy while consumes less of the privacy budget. From an empirical side, thorough experiments are provided to support our developed theory and we show DP-LSGD produces the best-known performance in various practical deep learning tasks: for example with an (ϵ=4,δ=105)(\epsilon=4,\delta=10^{-5})-DP guarantee, we successfully train ResNet20 from scratch with test accuracy 74.174.1%, 86.5% and 91.791.7% on CIFAR10, SVHN and EMNIST, respectively. Our code is released in an anonymous GitHub link.

1795DMDSpeech: Distilled Diffusion Model Surpassing The Teacher in Zero-shot Speech Synthesis via Direct Metric Optimization

[openreview] [pdf]

Abstract Diffusion models have demonstrated significant potential in speech synthesis tasks, including text-to-speech (TTS) and voice cloning. However, their iterative denoising processes are inefficient and hinder the application of end-to-end optimization with perceptual metrics. In this paper, we propose a novel method of distilling TTS diffusion models with direct end-to-end evaluation metric optimization, achieving state-of-the-art performance. By incorporating Connectionist Temporal Classification (CTC) loss and Speaker Verification (SV) loss, our approach optimizes perceptual evaluation metrics, leading to notable improvements in word error rate and speaker similarity. Our experiments show that DMDSpeech consistently surpasses prior state-of-the-art models in both naturalness and speaker similarity while being significantly faster. Moreover, our synthetic speech has a higher level of voice similarity to the prompt than the ground truth in both human evaluation and objective speaker similarity metric. This work highlights the potential of direct metric optimization in speech synthesis, allowing models to better align with human auditory preferences. The audio samples are available athttps://dmdspeech.github.io/demo/.

1796LiNo: Advancing Recursive Residual Decomposition of Linear and Nonlinear Patterns for Robust Time Series Forecasting

[openreview] [pdf]

Abstract Forecasting models are pivotal in a data-driven world with vast volumes of time series data that appear as a compound of vast Li\textbf{Li}near and No\textbf{No}nlinear patterns. Recent deep time series forecasting models struggle to utilize seasonal and trend decomposition to separate the entangled components. Such a strategy only explicitly extracts simple linear patterns like trends, leaving the other linear modes and vast unexplored nonlinear patterns to the residual. Their flawed linear and nonlinear feature extraction models and shallow-level decomposition limit their adaptation to the diverse patterns present in real-world scenarios. Given this, we innovate Recursive Residual Decomposition by introducing explicit extraction of both linear and nonlinear patterns. This deeper-level decomposition framework, which is named LiNo\textbf{LiNo}, captures linear patterns using a Li block which can be a moving average kernel, and models nonlinear patterns using a No block which can be a Transformer encoder. The extraction of these two patterns is performed alternatively and recursively. To achieve the full potential of LiNo, we develop the current simple linear pattern extractor to a general learnable autoregressive model, and design a novel No block that can handle all essential nonlinear patterns. Remarkably, the proposed LiNo achieves state-of-the-art on thirteen real-world benchmarks under univariate and multivariate forecasting scenarios. Experiments show that current forecasting models can deliver more robust and precise results through this advanced Recursive Residual Decomposition. We hope this work could offer insight into designing more effective forecasting models. Code is available at this anonymous repository:https://anonymous.4open.science/r/LiNo-8225/.

1797Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning

[openreview] [pdf]

Abstract Automated red teaming can discover rare model failures and generate challenging examples that can be used for training or evaluation. However, a core challenge in automated red teaming is ensuring that the attacks are both diverse and effective. Prior methods typically succeed in optimizing either for diversity or for effectiveness, but rarely both. In this paper, we provide methods that enable automated red teaming to generate a large number of diverse and successful attacks.Our approach decomposes the task into two steps: (1) automated methods for generating diverse attack goals and (2) generating effective attacks for those goals. While we provide multiple straightforward methods for generating diverse goals, our key contributions are to train an RL attacker that both follows those goals and generates diverse attacks for those goals. First, we demonstrate that it is easy to use a large language model (LLM) to generate diverse attacker goals with per-goal prompts and rewards, including rule-based rewards (RBRs) to grade whether the attacks are successful for the particular goal. Second, we demonstrate how training the attacker model with multi-step RL, where the model is rewarded for generating attacks that are different from past attempts further increases diversity while remaining effective. We use our approach to generate both prompt injection attacks and prompts that elicit unsafe responses. In both cases, we find that our approach is able to generate highly-effective and considerably more diverse attacks than past general red-teaming approaches.

1798A Q-learning approach to the Lowest Unique Positive Integer game

[openreview] [pdf]

Abstract The Lowest Unique Positive Integer (LUPI) game is a multiplayer game where participants attempt to choose the smallest number that no one else selects. While previous studies model LUPI using Poisson--Nash equilibrium assumptions, our work introduces a novel Q-learning-based approach to achieve equilibrium without the need for specific distribution assumptions, such as Poisson. We demonstrate that our Q-learning model successfully emulates the Nash equilibrium while allowing flexibility in the number of players, providing a more robust and practical solution for real-world applications like real-time bidding (RTB) systems. We compare our model’s performance against existing Poisson-based strategies, showcasing improved accuracy and adaptability. Furthermore, we apply our model to the Swedish Limbo lottery data and observe significant deviations from theoretical predictions, highlighting the strength of learning-based approaches in dynamic, real-world scenarios.

1799Temporal Distribution-aware Quantization for Diffusion Models

[openreview] [pdf]

Abstract Diffusion models for image generation have achieved notable success in various applications. However, these models often require tremendous storage overhead and inference time cost, severely hampering their deployment on resource-constrained devices. Post-training quantization (PTQ) has recently emerged as a promising way to reduce the model size and the inference latency, by converting the float-point values into lower bit-precision. Nevertheless, most existing PTQ approaches neglect the accumulating quantization errors arising from the substantial distribution variations across distinct layers and blocks at different timesteps, thus suffering a significant accuracy degradation. To address these issues, we propose a novel temporal distribution-aware quantization (DAQ) method for diffusion models. DAQ firstly develops a distribution-aware finetuning (DAF) framework to dynamically suppress the accumulating quantization errors in the calibration process. Subsequently, DAQ employs a full-precision noise estimation network to optimize the quantized noise estimation network at each sampling timestep, further aligning the quantizers with varying input distributions. We evaluate the proposed method on the widely used public benchmarks for image generation tasks. The experimental results clearly demonstrate that DAQ reaches the state-of-the-art performance compared to existing works. We also display that DAQ can be applied as a plug-and-play module to existing PTQ models, remarkably boosting the overall performance.

1800Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval

[openreview] [pdf]

Abstract Mitigating hallucinations is a prerequisite for trusting answers generated by large language models (LLMs) that are prone to making convincing but inaccurate claims. Grounding the answers in data generated and verified by humans provides a natural avenue for improving the reliability of LLMs. However, it can be hard to capture relevant facts for user questions based on just the semantic similarity, especially as questions becomes more complex and the relevant facts become more indirect. What if LLMs could query for relevant facts based on the user question? While this can enable retrieving relevant but indirect facts, zero-shot performance of instruction-tuned LLMs leaves more to be desired and generating supervision on how to retrieve relevant facts can be expensive and retriever dependent. Our key insight is that LLMs can learn to retrieve relevant facts by trying\textit{trying} different queries, learning to upweight queries that result in relevant facts. This leads to our reinforcement learning based framework, Le\underline{Le}arning to Re\underline{Re}trieve by T\underline{T}rying (LeReT), where the LLM generates queries for multi-hop retrieval and uses preference-based reinforcement learning to improve the LLM queries. Our experimental results demonstrate that LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%. The simplicity and flexibility of LeReT allows it to be applied to arbitrary retrievers, and makes it a promising technique for improving general LLM pipelines.

1801Comparison Visual Instruction Tuning

[openreview] [pdf]

Abstract Comparing two images in terms of Commonalities and Differences (CaD) is a fundamental human capability that forms the basis of advanced visual reasoning and interpretation. It is essential for the generation of detailed and contextually relevant descriptions, performing comparative analysis, novelty detection, and making informed decisions based on visual data. However, surprisingly, little attention has been given to these fundamental concepts in the best current mimic of human visual intelligence - Large Multimodal Models (LMMs). We develop and contribute a new two-phase approach CaD-VI for collecting synthetic visual instructions, together with an instruction-following dataset CaD-Inst containing 349K image pairs with CaD instructions collected using CaD-VI. Our approach significantly improves the CaD spotting capabilities in LMMs, advancing the SOTA on a diverse set of related tasks by up to 17.5%. It is also complementary to existing difference-only instruction datasets, allowing automatic targeted refinement of those resources increasing their effectiveness for CaD tuning by up to 10%. Additionally, we propose an evaluation benchmark with 7.5K open-ended QAs to assess the CaD understanding abilities of LMMs.

[openreview] [pdf]

Abstract The advancement of scientific knowledge relies on synthesizing prior research to forecast future developments, a task that has become increasingly intricate. The emergence of large language models (LLMs) offers a transformative opportunity to automate and streamline this process, enabling faster and more accurate academic discovery. However, recent attempts either limit to producing surveys or focus overly on downstream tasks. To this end, we introduce a novel task that bridges two key challenges: the comprehensive synopsis of past research and the accurate prediction of emerging trends, dubbed Dual Temporal Research Analysis\textit{Dual Temporal Research Analysis}. This dual approach requires not only an understanding of historical knowledge but also the ability to predict future developments based on detected patterns. To evaluate, we present an evaluation benchmark encompassing 20 research topics and 210 key AI papers, based on the completeness of historical coverage and predictive reliability. We further draw inspirations from dual-system theory and propose a framework HorizonAI\textit{HorizonAI} which utilizes a specialized temporal knowledge graph for papers, to capture and organize past research patterns (System 1), while leveraging LLMs for deeper analytical reasoning (System 2) to enhance both summarization and prediction. Our framework demonstrates a robust capacity to accurately summarize historical research trends and predict future developments, achieving significant improvements in both areas. For summarizing historical research, we achieve a 18.99% increase over AutoSurvey; for predicting future developments, we achieve a 10.37% increase over GPT-4o.

1803Adaptive teachers for amortized samplers

[openreview] [pdf]

Abstract Amortized inference is the task of training a parametric model, such as a neural network, to approximate a distribution with a given unnormalized density where exact sampling is intractable. When sampling is modeled as a sequential decision-making process, reinforcement learning (RL) methods, such as generative flow networks, can be used to train the sampling policy. Off-policy RL training facilitates the discovery of diverse, high-reward candidates, but existing methods still face challenges in efficient exploration. We propose to use an adaptive training distribution (the Teacher) to guide the training of the primary amortized sampler (the Student) by prioritizing high-loss regions. The Teacher, an auxiliary behavior model, is trained to sample high-error regions of the Student and can generalize across unexplored modes, thereby enhancing mode coverage by providing an efficient training curriculum. We validate the effectiveness of this approach in a synthetic environment designed to present an exploration challenge, two diffusion-based sampling tasks, and four biochemical discovery tasks demonstrating its ability to improve sample efficiency and mode coverage.

1804Altared Environments: The Role of Normative Infrastructure in AI Alignment

[openreview] [pdf]

Abstract Cooperation is central to human life, distinguishing humans as uniquely cooperative among mammals. As AI agents gain autonomy in shared environments, it becomes essential that their integration into human societies is driven by their ability to cooperate with humans and other agents. Humans often achieve stable cooperation by developing normative competence—the ability to recognize, reason about, and coordinate around shared norms. How can we design and train AI agents to be normatively competent, thereby cultivating cooperative intelligence in them? Most AI research frames this as an alignment challenge, focusing on embedding norms and values into AI agents. While this approach is promising, it often overlooks how humans solve cooperation challenges—through normative institutions that establish acceptable behavior within groups. Inspired by this, we propose a novel solution to the AI alignment challenge by introducing the concept of an altar—a dynamic environmental feature functioning as an institution that encodes sanctionable actions and evolves its content over time. Formalized within the context of Markov decision processes, this framework enables agents to learn to recognize institutions and adapt their enforcement behavior in response to dynamic institutional changes. We hypothesize that this learning process will lead to the emergence of compliance behavior in trained agents, resulting in stable cooperative outcomes as authorized by the institution. In mixed-motive environments, which present cooperation challenges such as equilibrium selection, free-riding, and the tragedy of the commons, we demonstrate that agents trained with the altar quickly learn and sustain cooperative behavior by adapting to various environmental and institutional configurations. This further allows them to adjust rapidly to new environments where institutions have different meanings. Additionally, we explore the effectiveness of institutions in scenarios with larger group sizes, mirroring realistic social structures. Our results suggest that integrating normative infrastructure into AI training systems is a crucial step for future research on designing agents capable of seamlessly integrating into human societies.

1805Adversarial Inverse Reward-Constraint Learning with Reward-Feasibility Contrast Prior Inspired by Animal Behaviour

[openreview] [pdf]

Abstract The behaviour of natural and artificial agents is shaped by underlying reward systems, which signal rewards based on internal and external factors, driving reward-oriented actions. However, real-world scenarios often impose constraints that reward alone cannot capture. While existing inverse (constrained) reinforcement learning methods can recover either rewards or constraints from demonstrations, the simultaneous inference of both remains unexplored due to the complexity of inference and the lack of knowledge of their relationship. To address this gap, we propose a novel algorithm that simultaneously infers both rewards and constraints within an adversarial learning framework, where both are updated through a policy optimisation process guided by expert demonstrations. Crucial to this framework is the introduction of the “reward-feasibility contrast prior,” a hypothesis that correlates rewards and constraints. It is inspired by patterns observed in animal behaviour (particularly meerkats), positing that states with high rewards nearby are more likely to be associated with weaker feasibility (stronger constraints). Our experiments on virtual robot control tasks with safety constraints and real-world animal behaviour data with spatio-temporal causal constraints validate our proposed framework’s effectiveness and the reward-feasibility contrast prior hypothesis. The results show accurate recovery of rewards and constraints, reflected by strong alignment with expert demonstrations and a low rate of constraint violations. Additionally, the performance improvement by embedding this prior into other inverse constraint inference methods further confirms its general effectiveness.

1806Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs

[openreview] [pdf]

Abstract Retrieval-augmented generation (RAG) has revitalized Large Language Models (LLMs) by injecting non-parametric factual knowledge. Compared with long-context LLMs, RAG is considered an effective summarization tool in a more concise and lightweight manner, which can interact with LLMs multiple times using diverse queries to get comprehensive responses. However, the LLM-generated historical responses, which contain potentially insightful information, are largely neglected and discarded by existing approaches, leading to suboptimal results. In this paper, we propose \textit{graph of records} (\textbf{GoR}), which leverages historical responses generated by LLMs to enhance RAG for long-context global summarization. Inspired by the \textit{retrieve-then-generate} paradigm of RAG, we construct a graph by creating an edge between the retrieved text chunks and the corresponding LLM-generated response. To further uncover the sophisticated correlations between them, GoR further features a \textit{graph neural network} and an elaborately designed \textit{BERTScore}-based objective for self-supervised model training, enabling seamless supervision signal backpropagation between reference summaries and node embeddings. We comprehensively compare GoR with 12 baselines on four long-context summarization datasets, and the results indicate that our proposed method reaches the best performance. Extensive experiments further demonstrate the effectiveness of GoR.

1807Interpretable and Efficient Counterfactual Generation for Real-Time User Interaction

[openreview] [pdf]

Abstract Among the various forms of post-hoc explanations for black-box models, counterfactuals stand out for their intuitiveness and effectiveness. However, longstanding challenges in counterfactual explanations involve the efficiency of the search process, the likelihood of generated instances, their interpretability, and in some cases, the validity of the explanations themselves. In this work we introduce a generative framework designed to address all of these issues. Notably, this is the first framework capable of generating interpretable counterfactual images in real-time, making it suitable for human-in-the-loop classification and decision-making. Our method leverages a disentangled regularized autoencoder to achieve two complementary goals: generating high-quality instances and promoting label disentanglement to provide full control over the decision boundary. This allows the model to sidestep expensive gradient-based optimizations by directly generating counterfactuals based on the adversarial distribution. A user study conducted on a challenging human-machine classification task demonstrates the effectiveness of the approach in improving human performance, highlighting the critical role of counterfactual explanations in achieving this advantage.

1808Boosting Recovery in Transformer-Based Symbolic Regression

[openreview] [pdf]

Abstract The traditional objective in regression is generalization. That is, learning a function from training data that performs well beyond the training data. Symbolic regression adds another objective, namely, interpretability of the regressor.In the context of regression, interpretability means that the representation of the regressor facilitates insights into mechanisms that underlie the functional dependence. State-of-the-art symbolic regressors provide such insights. However, the state of the art predominantly incurs high costs at inference time. The recently proposed transformer-based end-to-end approach is orders of magnitude faster at inference time. It does, however, not achieve state-of-the-art performance in terms of interpretability, which is typically measured by the ability to recover ground truth formulas from samples. Here, we show that the recovery performance of the end-to-end approach can be boosted by carefully selecting the training data. We construct a synthetic dataset from first principles and demonstrate that the capacity to recover ground truth formulas is proportional to the available computational resources.

1809Convergence of Adafactor under Non-Convex Smooth Stochastic Optimization

[openreview] [pdf]

Abstract Adafactor, a memory-efficient variant of Adam, has emerged as one of the popular choices for training deep learning tasks, particularly large language models. However, despite its practical success, there is limited theoretical analysis on Adafactor’s convergence. In this paper, we present a comprehensive analysis on Adafactor in a non-convex smooth setting, demonstrating its convergence to find a stationary point at a rate of O~(1/T)\tilde{\mathcal{O}}(1/T) in full-batch case and O~(1/T)\tilde{\mathcal{O}}(1/\sqrt{T}) in stochastic case. We also prove that Adafactor equipped with a suitable time-varying clipping threshold could also find a stationary point with the rate of O~(1/T)\tilde{\mathcal{O}}(1/\sqrt{T}).

1810Root Cause Analysis of Failure with Observational Causal Discovery

[openreview] [pdf]

Abstract Finding the root cause of failures is a prominent problem in many complex networks. Causal inference provides us with tools to address this problem algorithmically to automate this process and solve it efficiently. The existing methods either use a known causal structure to identify root cause via backtracking the changes, or ignore the causal structure but rely on invariance tests to identify the changing causal mechanisms after the failure. We first establish a connection between root cause analysis and the \textit{Interactive Graph Search (IGS)} problem. This mapping highlights the importance of causal knowledge: we demonstrate that any algorithm relying solely on marginal invariance tests to identify root causes must perform at least Ω(log2(n)+dlog1+dn)\Omega(\log_{2}(n) + d\log_{1+d}n) many tests, where nn represents the number of components and dd denotes the maximum out-degree of the graph. We then present an optimal algorithm that achieves this bound by reducing the root cause identification problem as an instance of IGS. Moreover, we show that even if the causal graph is partially known in the form of a Markov equivalence class, we can identify the root-cause with linear number of invariance tests. Our experiments on a production-level application demonstrate that, even in the absence of complete causal information, our approach accurately identifies the root cause of failures.

1811Online Convex Optimization with Prediction Through Accelerated Gradient Descent

[openreview] [pdf]

Abstract We study online convex optimization with predictions, where, at each time step tt, predictions about the next kk steps are available, and with coupled costs over time steps, where the cost function at time step tt depends on the decisions made between time tat-a and time t+bt+b for some nonnegative integers a,ba,b.We provide a general recipe to run synchronous update in an asynchronous fashion that respects the sequential revelation of information. Combined with existing convergence results for convex optimization using inexact first-order oracle, we show that acceleration is possible in this framework, where the dynamic regret can be reduced by a factor of (1O(κ))ka+b(1-O(\sqrt{\kappa}))^{\frac{k}{a+b}} through accelerated gradient descent, at a cost of an additive error term that depends on the prediction accuracy. This generalizes and improves the (1κ/4)k(1-\kappa/4)^k factor obtained by Li & Li (2020) for a+b=1a+b = 1. Our algorithm also has smaller dependency on longer-term prediction error. Moreover, our algorithm is the first gradient based algorithm which, when the strong-convexity assumption is relaxed, constructs a solution whose regret decays at the rate of O(1/k2)O(1/k^2), at a cost of an additive error term that depends on the prediction accuracy.

1812Auxiliary Classifiers Improve Stability and Efficiency in Continual Learning

[openreview] [pdf]

Abstract Continual learning is crucial for applications in dynamic environments, where machine learning models must adapt to changing data distributions while retaining knowledge of previous tasks. Despite significant advancements, catastrophic forgetting — where performance on earlier tasks degrades as new information is learned — remains a key challenge. In this work, we investigate the stability of intermediate neural network layers during continual learning and explore how auxiliary classifiers (ACs) can leverage this stability to improve performance. We show that early network layers remain more stable during learning, particularly for older tasks, and that ACs applied to these layers can outperform standard classifiers on past tasks. By integrating ACs into several continual learning algorithms, we demonstrate consistent and significant performance improvements on standard benchmarks. Additionally, we explore dynamic inference, showing that AC-augmented continual learning methods can reduce computational costs by up to 60% while maintaining or exceeding the accuracy of standard methods. Our findings suggest that ACs offer a promising avenue for enhancing continual learning models, providing both improved performance and the ability to adapt the network computation in environments where such flexibility might be required.

1813Hierarchical Subspaces of Policies for Continual Offline Reinforcement Learning

[openreview] [pdf]

Abstract In dynamic domains such as autonomous robotics and video game simulations, agents must continuously adapt to new tasks while retaining previously acquired skills. This ongoing process, known as Continual Reinforcement Learning, presents significant challenges, including the risk of forgetting past knowledge and the need for scalable solutions as the number of tasks increases. To address these issues, we introduce HIerarchical LOW-rank Subspaces of Policies (HILOW), a novel framework designed for continual learning in offline navigation settings. HILOW leverages hierarchical policy subspaces to enable flexible and efficient adaptation to new tasks while preserving existing knowledge. We demonstrate, through a careful experimental study, the effectiveness of our method in both classical MuJoCo maze environments and complex video game-like simulations, showcasing competitive performance and satisfying adaptability according to classical continual learning metrics, in particular regarding memory usage. Our work provides a promising framework for real-world applications where continuous learning from pre-collected data is essential.

1814Variational Bayes Gaussian Splatting

[openreview] [pdf]

Abstract Recently, 3D Gaussian Splatting has emerged as a promising approach for modeling 3D scenes using mixtures of Gaussians. The predominant optimization method for these models relies on backpropagating gradients through a differentiable rendering pipeline, which struggles with catastrophic forgetting when dealing with continuous streams of data. To address this limitation, we propose Variational Bayes Gaussian Splatting (VBGS), a novel approach that frames training a Gaussian splat as variational inference over model parameters. By leveraging the conjugacy properties of multivariate Gaussians, we derive a closed-form variational update rule, allowing efficient updates from partial, sequential observations without the need for replay buffers. Our experiments show that VBGS not only matches state-of-the-art performance on static datasets, but also enables continual learning from sequentially streamed 2D and 3D data, drastically improving performance in this setting.

1815Offline Model-Based Optimization by Learning to Rank

[openreview] [pdf]

Abstract Offline model-based optimization (MBO) aims to identify a design that maximizes a black-box function using only a fixed, pre-collected dataset of designs and their corresponding scores. This problem has garnered significant attention from both scientific and industrial domains. A common approach in offline MBO is to train a regression-based surrogate model by minimizing mean squared error (MSE) and then find the best design within this surrogate model by different optimizers (e.g., gradient ascent). However, a critical challenge is the risk of out-of-distribution errors, i.e., the surrogate model may typically overestimate the scores and mislead the optimizers into suboptimal regions. Prior works have attempted to address this issue in various ways, such as using regularization techniques and ensemble learning to enhance the robustness of the model, but it still remains. In this paper, we argue that regression models trained with MSE are not well-aligned with the primary goal of offline MBO, which is to select\textit{select} promising designs rather than to predict their scores precisely. Notably, if a surrogate model can maintain the order of candidate designs based on their relative score relationships, it can produce the best designs even without precise predictions. To validate it, we conduct experiments to compare the relationship between the quality of the final designs and MSE, finding that the correlation is really very weak. In contrast, a metric that measures order-maintaining quality shows a significantly stronger correlation. Based on this observation, we propose learning a ranking-based model that leverages learning to rank techniques to prioritize promising designs based on their relative scores. We show that the generalization error on ranking loss can be well bounded. Empirical results across diverse tasks demonstrate the superior performance of our proposed ranking-based models than twenty existing methods.

1816Learning-based Mechanism Design: Scalable, Truthful, and Continuum Approaches for Utility Maximization

[openreview] [pdf]

Abstract Mechanism design is a crucial topic at the intersection of computer science and economics. This paper addresses the automated mechanism design problem by leveraging machine learning and neural networks. The objective is to design atruthful,expressiveandefficientmechanism that maximizes the platform’s expected utility, given that the players’ types are drawn from a pre-specified distribution.We present a general mechanism design model that captures two critical features: hidden information and strategic behavior. Subsequently, we propose thePFM-Netframework, which parameterizes the menu mechanism class by function approximation and identifies an optimal mechanism through ingenious optimization techniques. We also provide both theoretical and empirical justifications for the advantages of our approach. Experimental results demonstrate the effectiveness of PFM-Net over traditional and learning-based baselines, enabling the PFM-Net framework to serve as a new paradigm for automated mechanism design.

1817Prioritized Generative Replay

[openreview] [pdf]

Abstract Sample-efficient online reinforcement learning often uses replay buffers to store experience for reuse when updating the value function. However, uniform replay is inefficient, since certain classes of transitions can be more relevant to learning. While prioritization of more useful samples is helpful, this strategy can also lead to overfitting, as useful samples are likely to be more rare. In this work, we instead propose a prioritized, parametric version of an agent’s memory, using generative models to capture online experience. This paradigm enables (1) densification of past experience, with new generations that benefit from the generative model’s generalization capacity and (2) guidance via a family of ``relevance functions’’ that push these generations towards more useful parts of an agent’s acquired history. We show this recipe can be instantiated using conditional diffusion models and simple relevance functions such as curiosity- or value-based metrics. Our approach consistently improves performance and sample efficiency in both state- and pixel-based domains. We expose the mechanisms underlying these gains, showing how guidance promotes diversity in our generated transitions and reduces overfitting. We also showcase how our approach can train policies with even higher update-to-data ratios than before, opening up avenues to better scale online RL agents.

1818Larger language models do in-context learning differently

[openreview] [pdf]

Abstract We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups - ICL with flipped labels and ICL with semantically-unrelated labels - across various model families (GPT-3, InstructGPT, Codex, an internal model, and an instruction-tuned variant of the internal model). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing language models to learn the input-label mappings shown in in-context exemplars in order to perform the task. The ability to do SUL-ICL also emerges primarily with scale, and large-enough language models can even perform linear classification in a SUL-ICL setting. Finally, we evaluate instruction-tuned models and find that instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former.

1819ADMM for Nonconvex Optimization under Minimal Continuity Assumption

[openreview] [pdf]

Abstract This paper introduces a novel approach to solving multi-block nonconvex composite optimization problems through a proximal linearized Alternating Direction Method of Multipliers (ADMM). This method incorporates an Increasing Penalization and Decreasing Smoothing (IPDS) strategy. Distinguishing itself from existing ADMM-style algorithms, our approach (denoted IPDS-ADMM) imposes a less stringent condition, specifically requiring continuity in just one block of the objective function. IPDS-ADMM requires that the penalty increases and the smoothing parameter decreases, both at a controlled pace. When the associated linear operator is bijective, IPDS-ADMM uses an over-relaxation stepsize for faster convergence; however, when the linear operator is surjective, IPDS-ADMM uses an under-relaxation stepsize for global convergence. We devise a novel potential function to facilitate our convergence analysis and prove an oracle complexity O(ϵ3)O(\epsilon^{-3}) to achieve an ϵ\epsilon-approximate critical point. To the best of our knowledge, this is the first complexity result for using ADMM to solve this class of nonsmooth nonconvex problems. Finally, some experiments on the sparse PCA problem are conducted to demonstrate the effectiveness of our approach.

1820Contrastive Unlearning: A Contrastive Approach to Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning aims to eliminate the influence of a subset of training samples (i.e., unlearning samples) from a trained model. Effectively and efficiently removing the unlearning samples without negatively impacting the overall model performance is challenging. Existing works mainly exploit input and output space and classification loss, which can result in ineffective unlearning or performance loss. In addition, they utilize on unlearning or remaining samples ineffectively, sacrificing either unlearning efficacy or efficiency. Our main insight is that direct optimization on the representation space utilizing both unlearning and remaining samples can effectively remove influence of unlearning samples while maintaining representations learned from remaining samples. We propose a contrastive unlearning framework, leveraging the concept of representation learning for more effective unlearning. It removes the influence of unlearning samples by contrasting their embeddings against the remaining samples’ embeddings so that their embeddings are closer to the embeddings of unseen samples. Experiments on a variety of datasets and models on both class unlearning and sample unlearning showed that contrastive unlearning achieves the best unlearning effects and efficiency with the lowest performance loss compared with the state-of-the-art algorithms.

1821Mitigating Robust Overfitting in Wasserstein Distributionally Robust Optimization

[openreview] [pdf]

Abstract Wasserstein distributionally robust optimization (WDRO) optimizes against worst-case distributional shifts within a specified uncertainty set, leading to enhanced generalization on unseen adversarial examples, compared to standard adversarial training which focuses on pointwise adversarial perturbations. However, WDRO still suffers fundamentally from the robust overfitting problem, as it does not consider statistical error. We address this gap by proposing a novel robust optimization framework under a new uncertainty set for both adversarial noise (Wasserstein distance) and statistical error (Kullback-Leibler divergence). Our theoretical analysis establishes that out-of-distribution adversarial performance is at least as good as the in-distribution robust performance with high probability. Furthermore, we derive conditions under which Stackelberg and Nash equilibria exist between the learner and the adversary. Finally, through extensive experiments, we demonstrate that our method significantly mitigates robust overfitting and enhances robustness within the framework of WDRO.

1822Revisiting the Variational Information Bottleneck

[openreview] [pdf]

Abstract The Information Bottleneck (IB) framework offers a theoretically optimal approach to data modeling, though it is often intractable. Recent efforts have optimized supervised deep neural networks (DNNs) using a variational upper bound on the IB objective, leading to enhanced robustness against adversarial attacks. In these studies, supervision assumes a dual role: sometimes as a random variable with the empirical distribution of the data, and at other times as a random variable distributed according to the classification performed. This work proposes an extension to the framework, and consequently to the derivation of the bound, that resolves this duality. Applying the resulting bound as an objective for supervised DNNs induces substantial empirical improvements.

1823Archilles’ Heel in Semi-open LLMs: Hiding Bottom against Recovery Attacks

[openreview] [pdf]

Abstract Closed-source large language models deliver strong performance but have limited downstream customizability. Semi-open models, combining both closed-source and public layers, were introduced to improve customizability. However, parameters in the closed-source layers are found vulnerable to recovery attacks. In this paper, we explore the design of semi-open models with fewer closed-source layers, aiming to increase customizability while ensuring resilience to recovery attacks. We analyze the contribution of closed-source layer to the overall resilience and theoretically prove that in a deep transformer-based model, there exists a transition layer such that even small recovery errors in layers before this layer can lead to recovery failure. Building on this, we propose \textbf{SCARA}, a novel approach that keeps only a few bottom layer as closed-source. SCARA employs a fine-tuning-free metric to estimate the maximum number of layers that can be publicly accessible for customization. We apply it to five models (1.3B to 70B parameters) to construct semi-open models, validating their customizability on six downstream tasks and assessing their resilience against various recovery attacks on sixteen benchmarks. We compare SCARA to baselines and observe that it generally improves downstream customization performance and offers similar resilience with over \textbf{10} times fewer closed-source parameters. We empirically investigate the existence of transition layers, analyze the effectiveness of our scheme and finally discuss its limitations.

1824Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

[openreview] [pdf]

Abstract Reward inference (learning a reward model from human preferences) is a critical intermediate step in Reinforcement Learning from Human Feedback (RLHF) for fine-tuning Large Language Models (LLMs) such as ChatGPT. In practice, reward inference faces several fundamental challenges, including double problem misspecification, reward model evaluation without ground truth, distribution shift, and overfitting in joint reward model and policy training. An alternative approach that avoids these pitfalls is direct policy optimization without reward inference, such as Direct Preference Optimization (DPO), which provides a much simpler pipeline and has shown empirical success in LLMs. However, DPO utilizes the closed-form expression between the optimal policy and the reward function, which only works under the bandit setting or deterministic MDPs. This paper develops two RLHF algorithms without reward inference, which work for general RL problems beyond bandits and deterministic MDPs, and general preference models beyond the Bradely-Terry model. The key idea is to estimate the local value function difference from human preferences and then approximate the policy gradient with a zeroth-order gradient approximator. For both algorithms, we establish rates of convergence in terms of the number of policy gradient iterations, as well as the number of trajectory samples and human preference queries per iteration. Our results show there exist provably efficient methods to solve general RLHF problems without reward inference.

1825Preserving the Unique Heritage of Chinese Ancient Architecture in Diffusion Models with Text and Image Integration

[openreview] [pdf]

Abstract Leveraging the impressive generative capabilities of diffusion models, we can create diverse images from imaginative prompts with careful design. To be noticed, the key components, such as CLIP, are essential for aligning prompts with image representations. However, these models often underperform in specialized areas, like the Chinese ancient architecture. One of the important reasons is that historical buildings include not only architectural information, but also historical and cultural content. The preservation and integration of these unique characteristics has become a significant challenge in model expansion. In this paper, we propose an Image-Annotation-Augmented Diffusion pipeline combining human feedback to explore the specific-area paradigm for image generation in the context of small amounts of data and professional concepts. We first leverage Segment Anything 2 (SAM2) to obtain a refined content image to enable an in-depth analysis of the relationship between unique characteristics and multimodal image generation models, and reselected representative images and regrouped them according to their distinctive objective and the existing dataset. Then, we introduce the effective RAG and GraphRAG module to identify the complex structure of relationships among different entities in the training and inference stages respectively. Based on the initial text by BLIP3, the RAG instructs GPT4 to facilitate more accurate, content-aware annotations during training, and augment a high-quality object prompt using the GraphRAG during inference. Benefit from these outstanding models and architectures, we train fine-tuning models to showcase the enhanced performance of our proposed pipeline compared to other existing models. Experiments demonstrate that our pipeline effectively preserves and integrates the unique characteristics of ancient Chinese architecture.

1826DDAD: A Two-Pronged Adversarial Defense Based on Distributional Discrepancy

[openreview] [pdf]

Abstract Statistical adversarial data detection (SADD) detects whether an upcoming batch contains adversarial examples (AEs) by measuring the distributional discrepancies between clean examples (CEs) and AEs. In this paper, we reveal the potential strength of SADD-based methods by theoretically showing that minimizing distributional discrepancy can help reduce the expected loss on AEs. Nevertheless, despite these advantages, SADD-based methods have a potential limitation: they discard inputs detected as AEs, leading to the loss of clean information within those inputs. To address this limitation, we propose a two-pronged adversarial defense method, named Distributional-Discrepancy-based Adversarial Defense (DDAD). In the training phase, DDAD first optimizes the test power of the maximum mean discrepancy (MMD) to derive MMD-OPT, and then trains a denoiser by minimizing the MMD-OPT between CEs and AEs. In the inference phase, DDAD first leverages MMD-OPT to differentiate CEs and AEs, and then applies a two-pronged process: (1) directly feeding the detected CEs into the classifier, and (2) removing noise from the detected AEs by the distributional-discrepancy-based denoiser. Extensive experiments show that DDAD outperforms current state-of-the-art (SOTA) defense methods by notably improving clean and robust accuracy on CIFAR-10 and ImageNet-1K against adaptive white-box attacks. The code is available at:https://anonymous.4open.science/r/DDAD-DB60.

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are prominent in graph machine learning and have shown state-of-the-art performance in Link Prediction (LP) tasks. Nonetheless, recent studies show that GNNs struggle to produce good results on low-degree nodes despite their overall strong performance. In practical applications of LP, like recommendation systems, improving performance on low-degree nodes is critical, as it amounts to tackling the cold-start problem of improving the experiences of users with few observed interactions. In this paper, we investigate improving GNNs’ LP performance on low-degree nodes while preserving their performance on high-degree nodes and propose a simple yet surprisingly effective augmentation technique called NodeDup. Specifically, NodeDup duplicates low-degree nodes and creates links between nodes and their own duplicates before following the standard supervised LP training scheme. By leveraging a ``multi-view’’ perspective for low-degree nodes, NodeDup shows significant LP performance improvements on low-degree nodes without compromising any performance on high-degree nodes. Additionally, as a plug-and-play augmentation module, NodeDup can be easily applied on existing GNNs with very light computational cost. Extensive experiments show that NodeDup achieves 38.49%, 13.34%, and 6.76% improvements on isolated, low-degree, and warm nodes, respectively, on average across all datasets compared to GNNs and state-of-the-art cold-start methods.

1828Discovering Latent Structural Causal Models from Spatio-Temporal Data

[openreview] [pdf]

Abstract Many important phenomenon in scientific fields such as climate, neuroscience and epidemiology are naturally represented as spatiotemporal gridded data with complex interactions. Inferring causal relationships from these data is a difficult problem compounded by the high dimensionality of such data and the correlations between spatially proximate points. We present SPACY (SPAtiotemporal Causal discoverY), a novel framework based on variational inference, designed to explicitly model latent time-series and their causal relationships from spatially confined modes in the data. Our method uses an end-to-end training process that maximizes an evidence-lower bound (ELBO) for the data likelihood. Theoretically, we show that, under some conditions, the latent variables are identifiable up to transformation by an invertible matrix. Empirically, we show that SPACY outperforms state-of-the-art baselines on synthetic data, remains scalable for large grids, and identifies key known phenomena from real-world climate data.

1829Graph Distributional Analytics: Enhancing GNN Explainability through Scalable Embedding and Distribution Analysis

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have achieved significant success in processing graph-structured data but often lack interpretability, limiting their practical applicability. We introduce the Graph Distributional Analytics (GDA) framework, leveraging novel combinations of scalable techniques to enhance GNN explainability. The integration of Weisfeiler-Leman (WL) graph kernels with distributional distance analysis enables GDA to efficiently quantify graph data distributions, while capturing global structural complexities without significant computational costs. GDA creates high-dimensional embeddings employing WL kernels, measures the distribution of distances from measures of categorical central tendency, and assigns distribution scores to quantify each graph’s deviation from this vector We evaluate GDA on the ENZYMES, ogbg-ppa, and MalNet-Tiny datasets. Our experiments demonstrate GDA not only accurately characterizes graph distributions but also outperforms baseline methods in identifying specific structural features responsible for misclassifications. This comprehensive analysis provides deeper insights into how training data distributions affect model performance, particularly with out-of-distribution (OOD) data. By revealing the underlying structural causes of GNN predictions through a novel synergy of established techniques, GDA enhances transparency and offers a practical tool for practitioners to build more interpretable and robust graph-based models. Our framework’s scalability, efficiency, and ability to integrate with various embedding methods make it a valuable addition to the suite of tools available for GNN analysis.

1830Verbalized Bayesian Persuasion

[openreview] [pdf]

Abstract The study of information design explores how an information designer can influence the optimal behavior of players to achieve a specific objective through the strategic selection of the information provided. This paper focuses on a case, Bayesian Persuasion (BP), where the information designer holds an informational advantage over only one player. While information design originates from everyday human communication, traditional game-theoretic or multi-agent reinforcement learning methods often model information structures as discrete or continuous scalars or vectors, this approach fails to capture the nuances of natural language, significantly limiting their applicability in real-world scenarios. By leveraging the powerful language understanding and generation capabilities of large language models (LLMs), this paper proposes a verbalized BP framework that extends classic BP to real-world games involving human dialogues for the first time. Specifically, we map the classic BP to a verbalized mediator-augmented game, where LLMs instantiate the information designer and receiver. To efficiently solve the game in the language space, we transform agents’ policy optimization into prompt optimization and propose a generalized equilibrium-finding algorithm with a convergence guarantee. Numerical experiments in realistic dialogue scenarios, such as recommendation letters, courtroom interactions, and law enforcement, validate that the VBP framework can reproduce theoretical results in classic settings and discover effective persuasion strategies in more complex natural language and multistage settings.

1831KL DIVERGENCE OPTIMIZATION WITH ENTROPY- RATIO ESTIMATION FOR STOCHASTIC GFLOWNETS

[openreview] [pdf]

Abstract This paper introduces a novel approach for optimizing Generative Flow Networks (GFlowNets) in stochastic environments by incorporating KL divergence objectives with entropy-ratio estimation. We leverage the relationship between high and low entropy states, as defined in entropy-regularized Markov Decision Processes (MDPs), to dynamically adjust exploration and exploitation. Detailed proofs and analysis demonstrate the efficacy of this methodology in enhancing mode discovery, state coverage, and policy robustness in complex environments.

1832Identifying Optimal Output Sets for Differential Privacy Auditing

[openreview] [pdf]

Abstract Differential privacy limits an algorithm’s privacy loss, defined as the maximum influenceanyindividual data record can have on the probability of observinganypossible output. Privacy auditing identifies the worst-case input datasets and output event sets that empirically maximize privacy loss, providing statistical lower bounds to evaluate the tightness of an algorithm’s differential privacy guarantees. However, current auditing methods often depend on heuristic or arbitrary selections of output event sets, leading to weak lower bounds. We address this critical gap by introducing a novel framework to compute theoptimal output event setthat maximizes the privacy loss lower bound in auditing. Our algorithm efficiently computes this optimal set when closed-form output distributions are available and approximates it using empirical samples when they are not. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that our method consistently tightens privacy lower bounds for auditing differential privacy mechanisms and black-box DP-SGD training. Our approach outperforms existing auditing techniques, providing a more accurate analysis of differentially-private algorithms.

1833Hierarchical Demonstration Order Optimization for Many-shot In-Context Learning

[openreview] [pdf]

Abstract In-Context Learning (ICL) is a technique where large language models (LLMs) leverage multiple demonstrations (i.e., examples) to perform tasks. With the recent expansion of LLM context windows, many-shot ICL (generally with more than 50 demonstrations) can lead to significant performance improvements on a variety of language tasks such as text classification and question answering. Nevertheless, ICL faces demonstration order instability (ICL-DOI), which means that performance varies significantly depending on the order of demonstrations. Moreover, the ICL-DOI phenomenon persists and can sometimes be more pronounced in many-shot ICL, validated by our thorough experimental investigation. Current strategies handling ICL-DOI, however, are not applicable to many-shot ICL, since they cannot overcome two critical challenges: (1) Most metrics measuring the quality of demonstration order rely on subjective judgment, lacking a theoretical foundation to achieve precise quality characterization. These metrics are thus non-applicable to many-shot situations, where the order quality of different orders is less distinguishable due to the limited ability of LLMs to exploit information in long input contexts. (2) The requirement to examine all orders is computationally infeasible due to the combinatorial complexity of the order space in many-shot ICL. To tackle the first challenge, we design a demonstration order evaluation metric based on information theory for measuring order quality, which effectively quantifies the usable information gain of a given demonstration order. To address the second challenge, we propose a hierarchical demonstration order optimization method named HIDO that enables a more refined exploration of the order space, achieving high ICL performance without the need to evaluate all possible orders. Extensive experiments on multiple LLMs and real-world datasets demonstrate that our HIDO method consistently and efficiently outperforms other baselines. Our code can be found athttps://anonymous.4open.science/r/HIDO-B2DE/.

1834Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought

[openreview] [pdf]

Abstract Chain of Thought (CoT) prompting has been shown to significantly improve the performance of large language models (LLMs), particularly in arithmetic and reasoning tasks, by instructing the model to produce intermediate reasoning steps. Despite the remarkable empirical success of CoT and its theoretical advantages in enhancing expressivity, the mechanisms underlying CoT training remain largely unexplored. In this paper, we study the training dynamics of transformers over a CoT objective on a in-context weight prediction task for linear regression. We prove that while a one-layer linear transformer without CoT can only implement a single step of gradient descent (GD) and fails to recover the ground-truth weight vector, a transformer with CoT prompting can learn to perform multi-step GD autoregressively, achieving near-exact recovery. Furthermore, we show that the trained transformer effectively generalizes on the unseen data. Empirically, we demonstrate that CoT prompting yields substantial performance improvements.

1835Language Models Can Articulate Their Implicit Goals

[openreview] [pdf]

Abstract We investigate LLMs’ awareness of newly acquired goals or policies. We find that a model finetuned on examples that exhibit a particular policy (e.g. preferring risky options) can describe this policy (e.g. “I take risky options”). This holds even when the model does not have any examples in-context, and without any descriptions of the policy appearing in the finetuning data. This capability extends tomany-personascenarios, where models internalize and report different learned policies for different simulated individuals (personas), as well astriggerscenarios, where models report policies that are triggered by particular token sequences in the prompt.This awareness enables models to acquire information about themselves that was only implicit in their training data. It could potentially help practitioners discover when a model’s training data contains undesirable biases or backdoors.

1836Adapting Monte Carlo Tree Search for Generative Flow Network Training

[openreview] [pdf]

Abstract Generative Flow Networks, or GFlowNets, formulate generative modelling in discrete spaces as a sequential decision-making problem. Sampling plays a key role in GFlowNet training, as most algorithms use the learned policy to sample trajectories from the environment. Monte-Carlo Tree Search (MCTS) is a planning algorithm that has successfully been applied to train sequential decision-making models with reinforcement learning (RL). In this work, we leverage known connections between GFlowNets and maximum-entropy RL to adapt MCTS for GFlowNet training. We prove that standard MCTS tree construction processes can be modified to calculate the optimal flows for a GFlowNet, given sufficient samples from the environment. Our results extend to multiple cases of GFN modelling, including terminating-energy and intermediate-energy environments. We investigate practical strategies for employing MCTS as a sampling tool and apply it to different GFN parameterizations and training objectives. Through extensive experiments in a variety of discrete domains, including a language-based reasoning task, we show that our proposed method offers an improvement over standard on-policy sampling.

1837Latent Space Chain-of-Embedding Enables Output-free LLM Self-Evaluation

[openreview] [pdf]

Abstract LLM self-evaluation relies on the LLM’s own ability to estimate response correctness, which can greatly improve its deployment reliability. In this research track, we propose the Chain-of-Embedding (CoE) in the latent space to enable LLMs to perform output-free self-evaluation. CoE consists of all progressive hidden states produced during the inference time, which can be treated as the latent thinking path of LLMs. We find that when LLMs respond correctly and incorrectly, their CoE features differ, these discrepancies assist us in estimating LLM response correctness. Experiments in four diverse domains and seven LLMs fully demonstrate the effectiveness of our method. Meanwhile, its label-free design intent without any training and millisecond-level computational cost ensure real-time feedback in large-scale scenarios. More importantly, we provide interesting insights into LLM response correctness from the perspective of hidden state changes inside LLMs.

1838The Crystal Ball Hypothesis in diffusion models: Anticipating object positions from initial noise

[openreview] [pdf]

Abstract Diffusion models have achieved remarkable success in text-to-image generation tasks, yet the influence of initial noise remains largely unexplored. In this study, we identify specific regions within the initial noise image, termed trigger patches, that play a key role in inducing object generation in the resulting images. Notably, these patches areuniversaland can be generalized across various positions, seeds, and prompts. To be specific, extracting these patches from one noise and injecting them into another noise leads to object generation in targeted areas. To identify the trigger patches even before the image has been generated, just like consulting the crystal ball to foresee fate, we first create a dataset consisting of Gaussian noises labeled with bounding boxes corresponding to the objects appearing in the generated images andtrain a detector that identifies these patches from the initial noise.To explain the formation of these patches, we reveal that they areoutliersin Gaussian noise, and follow distinct distributions through two-sample tests. These outliers can take effect when injected into different noises and generalize well across different settings. Finally, we find the misalignment between prompts and the trigger patch patterns can result in unsuccessful image generations. To overcome it, we propose a reject-sampling strategy to obtain optimal noise, aiming to improve prompt adherence and positional diversity in image generation.

1839Closed-loop Diffusion Control of Complex Physical Systems

[openreview] [pdf]

Abstract The control problems of complex physical systems have broad applications in science and engineering. Previous studies have shown that generative control methods based on diffusion models offer significant advantages for solving these problems. However, existing generative control approaches face challenges in both performance and efficiency when extended to the closed-loop setting, which is essential for effective control. In this paper, we propose an efficient Closed-Loop Diffusion method for Physical systems Control (CL-DiffPhyCon). By employing an asynchronous denoising framework for different physical time steps, CL-DiffPhyCon generates control signals conditioned on real-time feedback from the environment with significantly reduced computational cost during sampling. Additionally, the control process could be further accelerated by incorporating fast sampling techniques, such as DDIM. We evaluate CL-DiffPhyCon on two tasks: 1D Burgers’ equation control and 2D incompressible fluid control. The results demonstrate that CL-DiffPhyCon achieves superior control performance with significant improvements in sampling efficiency.

1840Sketch-Plan-Generalize: Learning Inductive Representations for Grounded Spatial Concepts

[openreview] [pdf]

Abstract Our goal is to enable embodied agents to learn inductive representations for grounded spatial concepts, e.g., learning staircase as an inductive composition of towers of increasing height. Given few human demonstrations, we seek a learning architecture that infers a succinct inductiveprogramrepresentation thatexplainsthe observed instances. The approach should generalize to learning novel structures of different sizes or complexity expressed as a hierarchical composition of previously learned concepts. Existing approaches that use code generation capabilities of pre-trained large (visual) language models, as well as purely neural models, show poor generalization toa-prioriunseen complex concepts. Our key insight is to factor inductive concept learning as: (i)Sketch:detecting and inferring a coarse signature of a new concept (ii)Plan:performing MCTS search over grounded action sequences (iii)Generalize:abstracting out grounded plans as inductive programs. Our pipeline facilitates generalization and modular re-use enabling continual concept learning. Our approach combines the benefits of code generation ability of large language models (LLMs) along with grounded neural representations, resulting in neuro-symbolic programs that show stronger inductive generalization on the task of constructing complex structures vis-'a-vis LLM-only and purely neural approaches. Further, we demonstrate reasoning and planning capabilities with learned concepts for embodied instruction following.

1841Boost Self-Supervised Dataset Distillation via Parameterization, Predefined Augmentation, and Approximation

[openreview] [pdf]

Abstract Although larger datasets are crucial for training large deep models, the rapid growth of dataset size has brought a significant challenge in terms of considerable training costs, which even results in prohibitive computational expenses. Dataset Distillation becomes a popular technique recently to reduce the dataset size via learning a highly compact set of representative exemplars, where the model trained with these exemplars ideally should have comparable performance with respect to the one trained with the full dataset. While most of existing works upon dataset distillation focus on supervised datasets, \todo{we instead aim to distill images and their self-supervisedly trained representations into a distilled set. This procedure, named as Self-Supervised Dataset Distillation, effectively extracts rich information from real datasets, yielding the distilled sets with enhanced cross-architecture generalizability.} Particularly, in order to preserve the key characteristics of original dataset more faithfully and compactly, several novel techniques are proposed: 1) we introduce an innovative parameterization upon images and representations via distinct low-dimensional bases, where the base selection for parameterization is experimentally shown to play a crucial role; 2) we tackle the instability induced by the randomness of data augmentation -- a key component in self-supervised learning but being underestimated in the prior work of self-supervised dataset distillation -- by utilizing predetermined augmentations; 3) we further leverage a lightweight network to model the connections among the representations of augmented views from the same image, leading to more compact pairs of distillation. Extensive experiments conducted on various datasets validate the superiority of our approach in terms of distillation efficiency, cross-architecture generalization, and transfer learning performance.

1842Node-Time Conditional Prompt Learning in Dynamic Graphs

[openreview] [pdf]

Abstract Dynamic graphs capture evolving interactions between entities, such as in social networks, online learning platforms, and crowdsourcing projects. For dynamic graph modeling, dynamic graph neural networks (DGNNs) have emerged as a mainstream technique. However, they are generally pre-trained on the link prediction task, leaving a significant gap from the objectives of downstream tasks such as node classification. To bridge the gap, prompt-based learning has gained traction on graphs, but most existing efforts focus on static graphs, neglecting the evolution of dynamic graphs. In this paper, we propose DyGPrompt, a novel pre-training and prompt learning framework for dynamic graph modeling. First, we designdual promptsto address the gap in both task objectives and temporal variations across pre-training and downstream tasks. Second, we recognize that node and time features mutually characterize each other, and proposedual condition-netsto model the evolving node-time patterns in downstream tasks. Finally, we thoroughly evaluate and analyze DyGPrompt through extensive experiments on four public datasets.

1843Advancing LLM Reasoning Generalists with Preference Trees

[openreview] [pdf]

Abstract We introduce EURUS, a suite of large language models (LLMs) optimized for reasoning. Finetuned from Mistral-7B, Llama-3-8B, and Mixtral-8x22B, EURUS models achieve state-of-the-art results among open-source models on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems. Notably, EURUX-8X22B outperforms GPT-3.5 Turbo in reasoning through a comprehensive benchmarking across 12 test sets covering five tasks. The strong performance of EURUS can be primarily attributed to ULTRAINTERACT, our newly-curated large-scale, high-quality training data dataset specifically designed for complex reasoning tasks. ULTRAINTERACT can be used in both supervised fine-tuning, preference learning, and reward modeling. It pairs each instruction with a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and the critique, and (3) pairwise positive and negative responses to facilitate preference learning. ULTRAINTERACT allows us to conduct an in-depth exploration of preference learning for reasoning tasks. Our investigation reveals that some well-established preference learning algorithms may be less suitable for reasoning tasks compared to their effectiveness in general conversations. The hypothesis is that in reasoning tasks, the space of correct answers is much smaller than that of incorrect ones, so it is necessary to explicitly increase the reward of chosen data. Therefore, in addition to increasing the reward margin as many preference learning algorithms do, the absolute values of positive responses’ rewards should be positive and may serve as a proxy for performance. Inspired by this, we derive a novel reward modeling objective and empirically that it leads to a stable reward modeling curve and better performance. Together with ULTRAINTERACT, we obtain a strong reward model.

1844Uniform Wrappers: Bridging Concave to Quadratizable Functions in Online Optimization

[openreview] [pdf]

Abstract This paper presents novel contributions to the field of online optimization, particularly focusing on the adaptation of algorithms from concave optimization to more challenging classes of functions. Key contributions include the introduction of uniform wrappers, establishing a vital link between upper-quadratizable functions and algorithmic conversions. Through this framework, the paper demonstrates superior regret guarantees for various classes of up-concave functions under zeroth-order feedback. Furthermore, the paper extends zeroth-order online algorithms to bandit feedback counterparts and offline counterparts, achieving a notable improvement in regret/sample complexity compared to existing approaches.

1845Robust RLHF with Noisy Rewards

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) is the mainstream paradigm to align large language models (LLMs) with human preferences. Yet existing RLHF heavily relies on accurate and informative reward models, which are vulnerable and sensitive to noise from various sources, e.g. human labeling errors, making the pipeline fragile. In this work, we formulate the problem of performing robust RLHF with noisy reward models. Our goal is to design robust RLHF algorithms that explicitly acknowledge the potential noise in a reward model. Our first contribution is an analysis that revealed a certain transformation of the preference function improves its robustness to noise in the reward function. This observation leads to a new reward function design that involves two steps: (1) an offline sampling step to obtain responses to prompts that serve as baseline calculation and (2) a contrastive reward calculated using the baseline responses in Proximal Policy Optimization (PPO). We show that our suggested rewards enable the LLM to penalize reward uncertainty, improve robustness, encourage improvement over baselines, calibrate according to task difficulty, and reduce variance in PPO. We also empirically demonstrate contrastive reward can improve RLHF substantially, evaluated by both GPTs and humans, and it consistently outperforms strong baselines.

1846Gap-Dependent Bounds for Q-Learning using Reference-Advantage Decomposition

[openreview] [pdf]

Abstract We study the gap-dependent bounds of two important algorithms for on-policy QQ-learning for finite-horizon episodic tabular Markov Decision Processes (MDPs): UCB-Advantage (Zhang et al. 2020) and Q-EarlySettled-Advantage (Li et al. 2021). UCB-Advantage and Q-EarlySettled-Advantage improve upon the results based on Hoeffding-type bonuses and achieve the {almost optimal} T\sqrt{T}-type regret bound in the worst-case scenario, where TT is the total number of steps. However, the benign structures of the MDPs such as a strictly positive suboptimality gap can significantly improve the regret. While gap-dependent regret bounds have been obtained for QQ-learning with Hoeffding-type bonuses, it remains an open question to establish gap-dependent regret bounds for QQ-learning using variance estimators in their bonuses and reference-advantage decomposition for variance reduction. We develop a novel error decomposition framework to prove gap-dependent regret bounds of UCB-Advantage and Q-EarlySettled-Advantage that are logarithmic in TT and improve upon existing ones for QQ-learning algorithms. Moreover, we establish the gap-dependent bound for the policy switching cost of UCB-Advantage and improve that under the worst-case MDPs. To our knowledge, this paper presents the first gap-dependent regret analysis for QQ-learning using variance estimators and reference-advantage decomposition and also provides the first gap-dependent analysis on policy switching cost for QQ-learning.

1847Towards Graph Foundation Models: Learning Generalities Across Graphs via Task-trees

[openreview] [pdf]

Abstract Foundation models aim to create general, cross-task, and cross-domain machine learning models by pretraining on large-scale datasets to capture shared patterns or concepts (generalities), such as contours, colors, textures, and edges in images, or tokens, words, and sentences in text. However, discovering generalities across graphs remains challenging, which has hindered the development of graph foundation models. To tackle this challenge, in this paper, we propose a novel approach to learn generalities across graphs via task-trees. Specifically, we first define the basic learning instances in graphs as task-trees and assume that the generalities shared across graphs are, at least partially, preserved in the task-trees of the given graphs. To validate the assumption, we first perform a theoretical analysis of task-trees in terms of stability, transferability, and generalization. We find that if a graph neural network (GNN) model is pretrained on diverse task-trees through a reconstruction task, it can learn sufficient transferable knowledge for downstream tasks using an appropriate set of fine-tuning samples. To empirically validate the assumption, we further instantiate the theorems by developing a cross-task, cross-domain graph foundation model named Graph generality Identifier on task-Trees (GIT). The extensive experiments over 30 graphs from five domains demonstrate the effectiveness of GIT in fine-tuning, in-context learning, and zero-shot learning scenarios. Particularly, the general GIT model pretrained on large-scale datasets can be quickly adapted to specific domains, matching or even surpassing expert models designed for those domains.

1848Training Free Exponential Context Extension via Cascading KV Cache

[openreview] [pdf]

Abstract The transformer’s context window is vital for tasks such as few-shot learning and conditional generation as it preserves previous tokens for active memory. However, as the context lengths increase, the computational costs grow quadratically, hindering the deployment of large language models (LLMs) in real-world, long sequence scenarios. Although some recent key-value caching (KV Cache) methods offer linear inference complexity, they naively manage the stored context, prematurely evicting tokens and losing valuable information. Moreover, they lack an optimized prefill/prompt stage strategy, resulting in higher latency than even quadratic attention for realistic context sizes. In response, we introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens, enabling the model to maintain longer context histories without increasing the cache size. Our approach outperforms linear caching baselines across key benchmarks, including streaming perplexity, question answering, book summarization, and passkey retrieval, where it retains better retrieval accuracy at 1M tokens after four doublings of the cache size of 65K. Additionally, our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens. These innovations not only enhance the computational efficiency of LLMs but also pave the way for their effective deployment in resource-constrained environments, enabling large-scale, real-time applications with significantly reduced latency.

1849Does Instruction Tuning Reduce Diversity? A Case Study Using Code Generation

[openreview] [pdf]

Abstract Large Language Models (LLMs) should ideally generate diverse content for open-ended prompts (e.g., variety in cooking recipes). Anecdotal evidence has suggested that preference-tuned language models struggle to generate diverse content, which would have important implications for how we align models. However, research on this question has been limited by the difficulty of measuring diversity, which naively would require costly human evaluation. We propose to leverage code as a means to study semantic diversity, since code has executable semantics. To this end, we create an open-ended program synthesis task, enabling us to cheaply evaluate the diversity of hundreds of thousands of generations. Using our methodology, we find that while instruction-tuning reduces syntactic and lexical diversity, it can actually increase semantic diversity. We also study the effect of model size and prompting technique on diversity. Finally, we find that neural diversity metrics correlate poorly with our semantic diversity metrics, highlighting the need for more rigorous methodologies for evaluating diversity.

1850Efficient Active Imitation Learning with Random Network Distillation

[openreview] [pdf]

Abstract Developing agents for complex and underspecified tasks, where no clear objective exists, remains challenging but offers many opportunities. This is especially true in video games, where simulated players (bots) need to play realistically, and there is no clear reward to evaluate them. While imitation learning has shown promise in such domains, these methods often fail when agents encounter out-of-distribution scenarios during deployment. Expanding the training dataset is a common solution, but it becomes impractical or costly when relying on human demonstrations. This article addresses active imitation learning, aiming to trigger expert intervention only when necessary, reducing the need for constant expert input along training. We introduce Random Network Distillation DAgger (RND-DAgger), a new active imitation learning method that limits expert querying by using a learned state-based out-of-distribution measure to trigger interventions. This approach avoids frequent expert-agent action comparisons, thus making the expert intervene only when it is useful. We evaluate RND-DAgger against traditional imitation learning and other active approaches in 3D video games (racing and third-person navigation) and in a robotic locomotion task and show that RND-DAgger surpasses previous methods by reducing expert queries.

1851Beyond Trend and Periodicity: Guide Time Series Forecasting with Textual Cues

[openreview] [pdf]

Abstract This work introduces a novel Text-Guided Time Series Forecasting (TGTSF) task. By integrating textual cues, such as channel descriptions and dynamic news, TGTSF addresses the critical limitations of traditional methods that rely purely on historical data. To support this task, we propose TGForecaster, a robust baseline model that fuses textual cues and time series data using cross-attention mechanisms. We then present four meticulously curated benchmark datasets to validate the proposed framework, ranging from simple periodic data to complex, event-driven fluctuations. Our comprehensive evaluations demonstrate that TGForecaster consistently achieves state-of-the-art performance, highlighting the transformative potential of incorporating textual information into time series forecasting. This work not only pioneers a novel forecasting task but also establishes a new benchmark for future research, driving advancements in multimodal data integration for time series models.

1852Efficient Ensembles Improve Training Data Attribution

[openreview] [pdf]

Abstract Training data attribution (TDA) methods aim to quantify the influence of individual training data points on the model predictions, with broad applications in data-centric AI, such as mislabel detection, data selection, and copyright compensation. However, existing methods in this field, which can be categorized as retraining-based and gradient-based, have struggled with the trade-off between computational efficiency and attribution efficacy. Retraining-based methods can accurately attribute complex non-convex models but are computationally prohibitive, while gradient-based methods are efficient but often fail for non-convex models. Recent research has shown that augmenting gradient-based methods with ensembles of multiple independently trained models can achieve significantly better attribution efficacy. However, this approach remains impractical for very large-scale applications.In this work, we discover that expensive, fully independent training is unnecessary for ensembling the gradient-based methods, and we propose two efficient ensemble strategies, DROPOUT ENSEMBLE and LORA ENSEMBLE, alternative to naive independent ensemble. These strategies significantly reduce training time (up to 80%), serving time (up to 60%), and space cost (up to 80%) while maintaining similar attribution efficacy to the naive independent ensemble. Our extensive experimental results demonstrate that the proposed strategies are effective across multiple TDA methods on diverse datasets and models, including generative settings, significantly advancing the Pareto frontier of TDA methods with better computational efficiency and attribution efficacy. We conduct a theoretical analysis that provides insights into the success of our empirical findings.

1853Generalized Attention Flow: Feature Attribution for Transformer Models via Maximum Flow

[openreview] [pdf]

Abstract This paper introduces Generalized Attention Flow, a novel feature attribution method for Transformer models that addresses the limitations of existing approaches. By generalizing Attention Flow and substituting attention weights with an arbitrary Information Tensor, the method leverages attention weights, their gradients, maximum flow, and the barrier method to generate more accurate feature attributions. The proposed approach demonstrates superior theoretical properties and resolves issues associated with previous methods that rely solely on simple aggregation of attention weights. Comprehensive benchmarking in NLP sequence classification tasks reveals that a specific variant of Generalized Attention Flow consistently outperforms state-of-the-art feature attribution methods across most evaluation scenarios, offering a more accurate explanation of Transformer model outputs.

1854Riemannian Low-Rank Adaptation for Federated Fine-Tuning of Foundation Models

[openreview] [pdf]

Abstract Rank-adaptive low-rank adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) technology, has achieved state-of-the-art performance in fine-tuning foundation models (FM). Directly transplanting the rank-adaptive LoRA methods from centralized learning to federated learning raises two critical issues: client drift and rank drift. This paper presents a Riemannian LoRA algorithm with adaptive rank for federated fine-tuning of foundation models (FFT-FM), RAFFT, which solves the client-drift and rank-drift issues, and significantly improves the computational cost. First, by utilizing Riemannian Procrustes analysis, we propose a Riemannian parameter matching method to avoid the client-drift issue for ensuring the effectiveness of FFT-FM with rank-adaptive LoRA, and to reduce the cost of matrix decomposition by transforming the singular value decomposition (SVD) of high-dimensional full parameter matrices into the SVD of low-dimensional r×rr \times r matrices, where rr is the rank parameter in the LoRA. We theoretically derive the equivalence between our RAFFT algorithm with rank-adaptive LoRA for the FFT-FM and the standard FFT-FM on the full parameter matrices based on FedAvg and verify the bounded error introduced by approximation and numerical errors. Second, by leveraging Riemannian manifold theory, we develop a Riemannian gradient descent (RGD) method to guarantee the local full parameter matrices on clients in the form of low-rank ones with fixed rank optimized by the server in each FFT-FM round, for alleviating the rank-drift issue to speed up the convergence of RAFFT. We theoretically demonstrate that the RGD optimization on the Riemannian manifold ensures the rank invariance during the local update process and the RGD optimization can converge in the FFT-FM context.

1855Ego-centric Learning of Communicative World Models for Autonomous Driving

[openreview] [pdf]

Abstract We study multi-agent reinforcement learning (MARL) for tasks in complex high-dimensional environments, such as autonomous driving. MARL is known to suffer from thepartial observabilityandnon-stationarityissues. To tackle these challenges, information sharing is often employed, which however faces major hurdles in practice, including overwhelming communication overhead and scalability concerns. Based on the key observation that world model encodes high-dimensional inputs to low-dimensional latent representation with a small memory footprint, we developCALL, {C}ommunic{a}tive Wor{l}d Mode{l}, for ego-centric MARL, where 1) each agent first learns its world model that encodes its state and intention into low-dimensional latent representation which can be shared with other agents of interest via lightweight communication; and 2) each agent carries out ego-centric learning while exploiting lightweight information sharing to enrich her world model learning and improve prediction for better planning. We characterize the gain on the prediction accuracy from the information sharing and its impact on performance gap. Extensive experiments are carried out on the challenging local trajectory planning tasks in the CARLA platform to demonstrate the performance gains of usingCALL.

1856Near-Optimal Policy Identification in Robust Constrained Markov Decision Processes via Epigraph Form

[openreview] [pdf]

Abstract Designing a safe policy for uncertain environments is crucial in real-world control applications. However, this challenge remains inadequately addressed within the Markov decision process (MDP) framework. This paper presents the first algorithm guaranteed to identify a near-optimal policy in a robust constrained MDP (RCMDP), where an optimal policy minimizes cumulative cost while satisfying constraints in the worst-case scenario across a set of environments. We first prove that the conventional policy gradient approach to the Lagrangian max-min formulation can become trapped in suboptimal solutions by encountering a sum of conflicting gradients from the objective and constraint functions during its inner minimization problem. To address this, we leverage the epigraph form of the RCMDP problem, which resolves the conflict by selecting a single gradient from either the objective or the constraints. Building on the epigraph form, we propose a binary search algorithm with a policy gradient subroutine and prove that it identifies an ε\varepsilon-optimal policy in an RCMDP with O~(ε4)\widetilde{\mathcal{O}}(\varepsilon^{-4}) robust policy evaluations.

1857Hybrid Fine-Tuning of LLMs: Theoretical Insights on Generalized Smoothness and Convergence

[openreview] [pdf]

Abstract Applying either Parameter-Efficient Fine-Tuning (PEFT) or full fine-tuning to Large Language Models (LLMs) often results in its inherent limitations. To overcome this issue, we propose a novel “hybrid fine-tuning” approach that jointly updates both LLMs and PEFT modules using a combination of zeroth-order and first-order optimization methods. To analyze this approach, we develop a theoretical framework centered on the concept of “hybrid generalized smoothness”, which accounts for the heterogeneous nature of the optimization landscape in joint LLM and PEFT training. We provide a rigorous convergence analysis for the convergence of SGD algorithm under multiple learning rates and demonstrate its effectiveness through extensive empirical studies across various downstream tasks and model architectures. Our work not only offers a solution to the practical challenge of LLM fine-tuning but also contributes a broader theoretical foundation for analyzing hybrid optimization problems in machine learning.

1858LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency

[openreview] [pdf]

Abstract Offline preference-based reinforcement learning (PbRL) provides an effective way to overcome the challenges of designing reward and the high costs of online interaction. However, since labeling preference needs real-time human feedback, acquiring sufficient preference labels is challenging. To solve this, this paper proposes a offLine prEference-bAsed RL with high Sample Efficient (LEASE) algorithm, where a learned transition model is leveraged to generate unlabeled preference data. Considering the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of reward model, where only high confidence and low variance data are selected. Moreover, we provide the generalization bound of reward model to analyze the factors influencing reward accuracy, and demonstrate that the policy learned by LEASE has theoretical improvement guarantee. The developed theory is based on state-action pair, which can be easily combined with other offline algorithms. The experimental results show that LEASE can achieve comparable performance to baseline under fewer preference data without online interaction.

1859Remote Reinforcement Learning with Communication Constraints

[openreview] [pdf]

Abstract We introduce the novel problem of remote reinforcement learning (RRL) with a communication constraint, in which the actor that takes the actions in the environment lacks direct access to the reward signal. Instead, the rewards are observed by a controller, which communicates with the agent through a communication-constrained channel. This can model a remote control scenario over a wireless channel, where the communication link from the controller to the agent has limited capacity due to power, bandwidth, or delay constraints. In the proposed solution, rather than transmitting the reward values to the agent over the rate-limited channel, the controller learns the optimal policy, and at each round, signals the action that the agent should take over the channel. However, instead of sending the precise action--which can be prohibitive when the action set is large--we use an importance sampling approach to reduce the communication load, which allows the agent to sample an action from the current policy. The actor, sampling from the desired policy at each turn, can also learn the optimal policy, albeit at a slower pace, using supervised learning. We exploit the learned policy at the actor to further reduce the communication load. Our solution, called Guided Remote Action Sampling Policy (GRASP), exhibits a significant reduction in communication requirements, achieving an average of 12-fold decrease in data transmission across all experiments, and 50-fold reduction for environments with continuous action spaces. We also show the applicability of GRASP beyond single-agent scenarios, including parallel and multi-agent environments.

1860Reconciling Model Multiplicity for Downstream Decision Making

[openreview] [pdf]

Abstract We consider the problem of \emph{model multiplicity} in downstream decision-making, a setting where two predictive models of equivalent accuracy cannot agree on what action to take for a downstream decision-making problem. Prior work attempts to address model multiplicity by resolving prediction disagreement between models. However, we show that even when the two predictive models approximately agree on their individual predictions almost everywhere, these models can lead the downstream decision-maker to take actions with substantially higher losses. We address this issue by proposing a framework that \emph{calibrates} the predictive models with respect to both a finite set of downstream decision-making problems and the individual probability prediction. Specifically, leveraging tools from multi-calibration, we provide an algorithm that, at each time-step, first reconciles the differences in individual probability prediction, then calibrates the updated models such that they are indistinguishable from the true probability distribution to the decision-makers. We extend our results to the setting where one does not have direct access to the true probability distribution and instead relies on a set of i.i.d data to be the empirical distribution. Furthermore, we generalize our results to the settings where one has more than two predictive models and an infinitely large downstream action set. Finally, we provide a set of experiments to evaluate our methods empirically. Compared to existing work, our proposed algorithm creates a pair of predictive models with improved downstream decision-making losses and agrees on their best-response actions almost everywhere.

1861MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation

[openreview] [pdf]

Abstract Model merging has emerged as an effective approach to combine multiple single-task models into a multitask model. This process typically involves computing a weighted average of the model parameters without any additional training. Existing model-merging methods focus on enhancing average task accuracy. However, interference and conflicts between the objectives of different tasks can lead to trade-offs during the merging process. In real-world applications, a set of solutions with various trade-offs can be more informative, helping practitioners make decisions based on diverse preferences. In this paper, we introduce a novel and low-compute algorithm, \textbf{Model Merging with Amortized Pareto Front (MAP)}. MAP efficiently identifies a Pareto set of scaling coefficients for merging multiple models, reflecting the trade-offs involved. It amortizes the substantial computational cost of evaluations needed to estimate the Pareto front by using quadratic approximation surrogate models derived from a pre-selected set of scaling coefficients. Experimental results on vision and natural language processing tasks demonstrate that MAP can accurately identify the Pareto front, providing practitioners with flexible solutions to balance competing task objectives. We also introduce Bayesian MAP for scenarios with a relatively low number of tasks and Nested MAP for situations with a high number of tasks, further reducing the computational cost of evaluation.

1862SQT -- rough conservative actor critic

[openreview] [pdf]

Abstract Std QQ-target is a conservative actor critic ensemble based QQ-learning algorithm which based on a single key QQ-formula--QQ-networks standard deviation, an uncertainty penalty. A minimalistic solution to the problem of overestimation bias. We implement SQT on top of actor critic and test it against the SOTA actor critic algorithms on popular MuJoCo tasks. SQT shows a clear performance advantage over TD3, SAC and TD7 on the tested tasks majority.

1863LongSafetyBench: Long-Context LLMs Struggle with Safety Issues

[openreview] [pdf]

Abstract With the development of large language models (LLMs), the sequence length of these models continues to increase, drawing significant attention to long-context language models. However, the evaluation of these models has been primarily limited to their capabilities, with a lack of research focusing on their safety. Existing work, such as ManyShotJailbreak, has to some extent demonstrated that long-context language models can exhibit safety concerns. However, the methods used are limited and lack comprehensiveness. In response, we introduceLongSafetyBench, the first benchmark designed to objectively and comprehensively evaluate the safety of long-context models. LongSafetyBench consists of 10 task categories, with an average length of 41,889 words. After testing eight long-context language models on LongSafetyBench, we found that existing models generally exhibit insufficient safety capabilities. Moreover, models’ safety performance in long-context scenarios does not always align with that in short-context scenarios. Further investigation revealed that long-context models tend to overlook harmful content within lengthy texts. We also proposed a simple yet effective solution, allowing open-source models to achieve performance comparable to that of top-tier closed-source models. We believe that LongSafetyBench can serve as a valuable benchmark for evaluating the safety capabilities of long-context language models. We hope that our work will encourage the broader community to pay attention to the safety of long-context models and contribute to the development of solutions to improve the safety of long-context LLMs.

1864Hint Marginalization for Improved Reasoning in Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient usage of the LLM responses. We present Hint Marginalization, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode the most likely answer. Empirical evaluation on several benchmark datasets for arithmetic reasoning demonstrates the superiority of the proposed approach.

1865Learning Video-Conditioned Policy on Unlabelled Data with Joint Embedding Predictive Transformer

[openreview] [pdf]

Abstract The video-conditioned policy takes prompt videos of the desired tasks as a condition and is regarded for its prospective generalizability. Despite its promise, training a video-conditioned policy is non-trivial due to the need for abundant demonstrations. In some tasks, the expert rollouts are merely available as videos, and costly and time-consuming efforts are required to annotate action labels. To address this, we explore training video-conditioned policy on a mixture of expert demonstrations and unlabeled expert videos to reduce reliance on extensive manually annotated data. We introduce the Joint Embedding Predictive Transformer (JEPT) to learn a video-conditioned policy through sequence modeling. JEPT is designed to jointly learn visual transition prediction and inverse dynamics. The visual transition is captured from both demonstrations and expert videos, on the basis of which the inverse dynamics learned from demonstrations is generalizable to the tasks without action labels. We conduct experiments on a series of simulated visual control tasks and evaluate that JEPT can effectively leverage the mixture dataset to learn a generalizable policy. JEPT outperforms baselines in the tasks without action-labeled data and unseen tasks. We also experimentally reveal the potential of JEPT as a simple visual priors injection approach to enhance the video-conditioned policy.

1866IS TRANSFORMER A STOCHASTIC PARROT? A CASE STUDY IN SIMPLE ARITHMETIC TASK

[openreview] [pdf]

Abstract Large pre-trained language models have demonstrated impressive capabilities, but there is still much to learn about how they operate. In this study, we conduct a investigation of the autoregressive transformer’s ability to perform basic addition operations. Specifically, by using causal analysis we found that a few different attention heads in the middle layers control the addition carry, with each head processing carries of different lengths. Due to the lack of globality in these attention heads, the model struggles to handle long-sequence addition tasks. By performing inference intervention on mistral-7B, partial task performance can be restored, with the accuracy on 20-digit long-sequence additions from 2% to 38%. Moreover, through fine-tuning, we discovered that the model still struggles to generalize carry chains beyond the training sequence length and the formation of the attention heads is crucial to the length generalization. Our research reveals how the model performs addition, and further provides insights into the debate on whether these models are merely statistical.

1867Calibrating Video Watch-time Predictions with Credible Prototype Alignment

[openreview] [pdf]

Abstract Accurately predicting user watch-time is crucial for enhancing user stickiness and retention in video recommendation systems. Existing watch-time prediction approaches typically involve transformations of watch-time labels for prediction and subsequent reversal, ignoring both the natural distribution properties of label and the \textit{instance representation confusion} that results in inaccurate predictions. In this paper, we propose ProWTP, a two-stage method combining prototype learning and optimal transport for watch-time regression prediction, suitable for any deep recommendation model. The core idea of ProWTP is to align label distribution with instance representation distribution to calibrate the instance space, thereby improving prediction accuracy. Specifically, we observe that the watch-ratio (the ratio of watch-time to video duration) within the same duration bucket exhibits a multimodal distribution. To facilitate incorporation into models, we use a hierarchical vector quantised variational autoencoder (HVQ-VAE) to convert the continuous label distribution into a high-dimensional discrete distribution, serving as credible prototypes for calibrations. Based on this, ProWTP views the alignment between prototypes and instance representations as a Semi-relaxed Unbalanced Optimal Transport (SUOT) problem, where the marginal constraints of prototypes are relaxed. And the corresponding optimization problem is reformulated as a weighted Lasso problem for solution. Moreover, ProWTP introduces the assignment and compactness losses to encourage instances to cluster closely around their respective prototypes, thereby enhancing the prototype-level distinguishability. Finally, we conducted extensive offline experiments on two industrial datasets, demonstrating our consistent superiority in real-world application.

1868Active Preference Optimization via Maximizing Learning Capacity

[openreview] [pdf]

Abstract The success of deep learning in various complex tasks relies heavily on large amounts of annotated data, which can be prohibitively expensive to acquire. Techniques such as reinforcement learning with human feedback (RLHF) and direct preference optimization (DPO) have emerged as methods for fine-tuning models by leveraging human preferences, but they come with significant costs, especially when applied to large-scale language models (LLMs). Recent efforts to reduce these costs have focused on active preference optimization, which uses certainty-based selection to minimize the annotation burden. However, the two-step process of selecting uncertain input prompts and then acquiring completions can lead to sub-optimal pairings, potentially limiting model learning capacity. This paper suggests that divAPO eliminates suboptimal pairings that are typical of two-step methods and enhances learning capacity by selecting the most informative preference pairs in a single phase, taking into account both data distribution probabilities and preference model certainty. Through experiments on complicated Language tasks, we demonstrate that our method achieves significant performance improvements over existing approaches.

1869FedADM: Adaptive Federated Learning via Dissimilarity Measure

[openreview] [pdf]

Abstract In federated learning, there are two critical challenges: 1) the data on distributed learners is heterogeneous; and 2) communication resources within the network are limited. In this work, we propose a framework, Federated Adaptive Dissimilarity Measure (FedADM), which can be regarded as an adaptively enhanced version of the Federated Proximal (FedProx) algorithm. This adaptiveness is primarily manifested in two aspects: (i) how it adaptively adjusts the proximity between the local models on different learners and the global model; and (ii) how it adaptively aggregates local model parameters. Building on the FedProx model, FedADM incorporates the concept of the Lagrangian multiplier to control the proximal coefficients of different learners, using “\textit{parameter dissimilarity}" to address data heterogeneity. It explicitly captures the essence of using “\textit{loss dissimilarity}" to adaptively adjust the aggregation frequency on distributed learners, thereby reducing communication overhead. Theoretically, we provide the performance upper bounds and convergence analysis of our proposed FedADM. Experiment results demonstrate that FedADM allows for higher accuracy and lower communication overhead compared to the baselines across a suite of realistic datasets.

1870A Framework for Finding Local Saddle Points in Two-Player Zero-Sum Black-Box Games

[openreview] [pdf]

Abstract Saddle point optimization is a critical problem employed in numerous real-world applications, including portfolio optimization, generative adversarial networks, and robotics. It has been extensively studied in cases where the objective function is known and differentiable. Existing work in black-box settings with unknown objectives that can only be sampled either assumes convexity-concavity in the objective to simplify the problem or operates with noisy gradient estimators. In contrast, we introduce a framework inspired by Bayesian optimization which utilizes Gaussian processes to model the unknown (potentially nonconvex-nonconcave) objective and requires only zeroth-order samples. Our approach frames the saddle point optimization problem as a two-level process which can flexibly integrate existing and novel approaches to this problem. The upper level of our framework produces a model of the objective function by sampling in promising locations, and the lower level of our framework uses the existing model to frame and solve a general-sum game to identify locations to sample. This lower level procedure can be designed in complementary ways, and we demonstrate the flexibility of our approach by introducing variants which appropriately trade off between factors like runtime, the cost of function evaluations, and the number of available initial samples. We experimentally demonstrate these algorithms on synthetic and realistic datasets, showcasing their ability to efficiently locate local saddle points in these contexts.

1871Personalized Visual Instruction Tuning

[openreview] [pdf]

Abstract Recent advancements in multimodal large language models (MLLMs) have demonstrated significant progress; however, these models exhibit a notable limitation, which we refer to as “face blindness.” Specifically, they can engage in general conversations but fail to conduct personalized dialogues targeting at specific individuals. This deficiency hinders the application of MLLMs in personalized settings, such as tailored visual assistants on mobile devices, or domestic robots that need to recognize members of the family. In this paper, we introduce Personalized Visual Instruction Tuning (PVIT), a novel data curation and training framework designed to enable MLLMs to identify target individuals within an image and engage in personalized and coherent dialogues. Our approach involves the development of a sophisticated pipeline that autonomously generates training data containing personalized conversations. This pipeline leverages the capabilities of various visual experts, image generation models, and (multi-modal) large language models. To evaluate the personalized potential of MLLMs, we present a benchmark called P-Bench, which encompasses various question types with different levels of difficulty. The experiments demonstrate a substantial personalized performance enhancement after fine-tuning with our curated dataset.

1872Supervised Chain of Thought

[openreview] [pdf]

Abstract Large Language Models (LLMs) have revolutionized the field of natural language processing and hold significant promise for advancements in Artificial Intelligence. However, the backbone architecture of most mainstream LLMs, the Transformer, has inherent limitations regarding computational depth, making them theoretically incapable of solving many reasoning tasks that require increasing depth. Chain of Thought (CoT) techniques, however, have been shown to mitigate these architectural limitations, as demonstrated by several theoretical works, offering a viable approach to solving complex reasoning tasks that were previously out of reach. Despite its successes, CoT and its variants (such as Tree of Thought, Graph of Thought, etc.) follow a one-prompt-for-all-tasks approach. Specifically, they rely on a single prompt structure (e.g., “think step by step”) for a wide range of tasks, from counting to sorting, and from solving mathematical problems to tackling algorithmic challenges. This creates significant challenges for the model to generate the correct steps template for different tasks, as it requires searching in large prompt template space. In this work, we build on previous theoretical analyses of CoT to demonstrate how the “one-prompt-for-all-tasks” template can negatively impact the computability of LLMs. We divide the solution space into prompt space and answer space, showing that the CoT process requires task-specific supervision to accurately navigate the prompt space and achieve optimal performance. Through experiments with the latest LLMs, we reveal a significant gap in reasoning ability when supervision is applied versus when it is not. Our aim is to provide insights into the mechanisms behind CoT and to inspire the effective design of CoT variants. Additionally, we highlight the key limitations of traditional ``unsupervised’’ prompting approaches, suggesting the need for more nuanced, task-specific “supervised” CoT for effective reasoning with LLMs.

1873Personalized Prompt Tuning for Unsupervised Federated Learning

[openreview] [pdf]

Abstract Federated learning facilitates collaborative model training across multiple distributed clients without requiring data sharing. However, conventional federated methods struggle with classification tasks in an unsupervised paradigm due to the absence of category knowledge. Recently, CLIP, a prominent visual language model, has demonstrated impressive results, particularly its remarkable zero-shot classification ability, which alleviates the dependence on labeled data. In this paper, we first explore a new realistic problem, unsupervised federated learning using CLIP, where clients with unlabeled heterogeneous data collaborate to enhance global performance. To address this problem, we propose FedPP, a method that incorporates a cooperative pseudo-label selection strategy and a partial prompt aggregation protocol. Our selection strategy ensures that all classes are trained in a balanced manner through global pseudo-label allocation. Concurrently, the aggregation protocol divides parameters into aggregated and retained components to optimize global performance while supporting local personalization. Extensive experiments across six datasets with various types of heterogeneity demonstrate the effectiveness of FedPP. Our code is available in the supplementary materials.

1874StyleShot: A snapshot on any style

[openreview] [pdf]

Abstract In this paper, we show that, a good style representation is crucial and sufficient for generalized style transfer without test-time tuning. We achieve this through constructing a style-aware encoder and a well-organized style dataset called StyleGallery. With dedicated design for style learning, this style-aware encoder is trained to extract expressive style representation with decoupling training strategy, and StyleGallery enables the generalization ability. We further employ a content-fusion encoder to enhance image-driven style transfer. We highlight that, our approach, named StyleShot, is simple yet effective in mimicking various desired styles, i.e., 3D, flat, abstract or even fine-grained styles, without test-time tuning. Rigorous experiments validate that, StyleShot achieves superior performance across a wide range of styles compared to existing state-of-the-art methods.

1875ICConv: A Large-Scale Intent-Oriented and Context-Aware Conversational Search Dataset

[openreview] [pdf]

Abstract In recent years, search engines have made significant advancements. Yet, traditional ad-hoc search engines often struggle with complex search scenarios (e.g. multi-turn information seeking). This challenge has shifted the focus towards conversational search, an approach enabling search engines to interact directly with users to obtain more precise results. Progress in conversational search has been slow due to a lack of data and difficulties in gathering real-world conversational search data. To address these hurdles, we embarked on a journey to autonomously create a large-scale, high-quality conversational search dataset. Previous efforts to create such datasets often overlooked the multi-intent aspect and contextual information, or resulted in a biased dataset, where all dialogue queries linked to a single positive passage. In our study, we have incorporated multi-intent based on the existing search sessions and converted each keyword-based query into multiple natural language queries based on different latent intents present in the related passage. We then contextualized these natural language queries within the same session and organized them into a conversational search tree. A carefully designed dialogue discriminator was utilized to ensure the consistency and coherence of all generated conversations, assessing their quality and filtering out any substandard ones. After extensive data cleaning, we are proud to introduce the \textbf{I}ntent-oriented and \textbf{C}ontext-aware \textbf{Conv}ersational search dataset (ICConv), a large-scale synthetic dataset comprising over 100,000 high-quality, information-seeking conversations. Our human annotators have evaluated ICConv based on six dialogue and search related criteria and it has performed admirably. We further explore the statistical characteristics of ICConv and validate the effectiveness of various conversational search methods using it as a standard for comparison.

1876Unveiling the latent dynamics in social cognition with multi-agent inverse reinforcement learning

[openreview] [pdf]

Abstract Understanding the intentions and beliefs of others, a phenomenon known as “theory of mind”, is a crucial element in social behavior. These beliefs and perceptions are inherently subjective and latent, making them often unobservable for investigation. Social interactions further complicate the matter, as multiple agents can engage in recursive reasoning about each other’s strategies with increasing levels of cognitive hierarchy. While previous research has shown promise in understanding a single agent’s belief of values through inverse reinforcement learning, extending this to model interactions among multiple agents remains an open challenge due to the computational complexity. In this work, we adopted a probabilistic recursive modeling of cognitive levels and joint value decomposition to achieve efficient multi-agent inverse reinforcement learning (MAIRL). We validated our method using simulations of a cooperative foraging task. Our algorithm revealed both the ground truth goal-directed value function and agents’ beliefs about their counterparts’ strategies. When applied to human behavior in a cooperative hallway task, our method identified meaningful goal maps that evolved with task proficiency and an interaction map that is related to key states in the task without accessing to the task rules. Similarly, in a non-cooperative task performed by monkeys, we identified mutual predictions that correlated with the animals’ social hierarchy, highlighting the behavioral relevance of the latent beliefs we uncovered. Together, our findings demonstrate that MAIRL offers a new framework for uncovering human or animal beliefs in social behavior, thereby illuminating previously opaque aspects of social cognition.

1877TimeBase: The Power of Minimalism in Long-term Time Series Forecasting

[openreview] [pdf]

Abstract Long-term time series forecasting (LTSF) has traditionally relied on models with large parameters to capture extended temporal dependencies. However, time series data, unlike high-dimensional images or text, often exhibit strong periodicity and low-rank structures, especially in long forecasting horizons. This characteristic can lead many models focusing on redundant patterns, resulting in inefficient use of computational resources. In this paper, we introduce TimeBase, an ultra-lightweight network with fewer than 0.4kk parameters, designed to harness the power of minimalism in LTSF. TimeBase extracts core periodic features by leveraging full-rank typical period representations under orthogonality constraints, enabling accurate prediction of future cycles. Extensive experiments on real-world datasets demonstrate that TimeBase not only achieves minimalism in both model size and computational cost, reducing MACs by 35x and parameter counts by over 1000 times compared to standard linear models, but also wins state-of-the-art forecasting performance, ranking Top1-Top5 in all 28 prediction settings. Moreover, TimeBase exhibits robust generalization, maintaining high accuracy in limited-data, zero-shot, and low-quality scenarios. Code is available at \url{https://anonymous.4open.science/r/TimeBase/}.

1878Prompting Fairness: Integrating Causality to Debias Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs), despite their remarkable capabilities, are susceptible to generating biased and discriminatory responses. As LLMs increasingly influence high-stakes decision-making (e.g., hiring and healthcare), mitigating these biases becomes critical. In this work, we propose a causality-guided debiasing framework to tackle social biases, aiming to reduce the harmful dependence between LLMs’ decisions and the social information in the input. Our framework introduces a novel perspective to identify how social information can affect an LLM’s decision through different causal pathways. Leveraging these causal insights, we outline principled prompting strategies that regulate these pathways through selection mechanisms. This framework not only unifies existing prompting-based debiasing techniques but also opens up new directions for reducing bias by encouraging the model to prioritize fact-based reasoning over reliance on biased social cues. We validate our framework through extensive experiments on real-world datasets across multiple domains, demonstrating its effectiveness in debiasing LLM decisions, even with only black-box access to the model.

1879Long-form Hallucination Detection with Self-elicitation

[openreview] [pdf]

Abstract While Large Language Models (LLMs) have exhibited impressive performance in long-form question-answering tasks, they frequently present a hazard of producing factual inaccuracies or hallucinations. An effective strategy to mitigate this hazard is to leverage off-the-shelf LLMs to detect hallucinations after the generation. The primary challenge resides in the comprehensive elicitation of the intrinsic knowledge acquired during their pre-training phase. However, existing methods that employ complex reasoning chains predominantly fall short of addressing this issue. Moreover, since existing methods for hallucination detection tend to decompose the text into isolated statements, they are unable to learn the inherent semantic continuity in long-form content. In this paper, we propose a novel framework, SelfElicit, which synergizes the self-elicitation of intrinsic knowledge of large language models and long-form continuity understanding. Specifically, we leverage self-generated thoughts derived from prior statements as catalysts to elicit the expression of intrinsic knowledge, which is integrated with knowledge hypergraphs to alleviate induced hallucinations and guide the factual evaluation by effectively organizing the elicited knowledge. Extensive experiments on real-world medical QA datasets demonstrate the effectiveness of self-elicitation and the superiority of our proposed method.

1880From Demonstrations to Rewards: Alignment Without Explicit Human Preferences

[openreview] [pdf]

Abstract One of the challenges of aligning large models with human preferences lies in both the data requirements and the technical complexities of current approaches. Predominant methods, such as RLHF, involve multiple steps, each demanding distinct types of data, including demonstrations data and preference data. In RLHF, human preferences are typically modeled through a reward model, which serves as a proxy to guide policy learning during the reinforcement learning stage, ultimately producing a policy aligned with human preferences. However, in this paper, we propose a fresh perspective on learning alignment based on inverse reinforcement learning principles, where the optimal policy is still derived from reward maximization. However, instead of relying on preference data, we directly learn the reward model from demonstration data. This new formulation offers the flexibility to be applied even when only demonstration data is available, a capability that current RLHF methods lack, and it also shows that demonstration data offers more utility than what conventional wisdom suggests. Our extensive evaluation, based on public reward benchmark and HuggingFace Open LLM Leaderboard, demonstrates that our approach compares favorably to state-of-the-art methods that rely solely on demonstration data.

1881Low-Switching Primal-Dual Algorithms for Safe Reinforcement Learning

[openreview] [pdf]

Abstract Safety is a key challenge in reinforcement learning (RL), especially in real-world applications like autonomous driving and healthcare. To address this, Constrained Markov Decision Processes (CMDPs) are commonly used to incorporate safety constraints while optimizing performance. However, current methods often face significant safety violations during exploration or suffer from high regret, which represents the performance loss compared to an optimal policy. We propose a low-switching primal-dual algorithm that balances regret with bounded constraint violations, drawing on techniques from online learning and CMDPs. Our approach minimizes policy changes through low-switching updates and enhances sample efficiency using empirical Bernstein-based bonuses. This leads to tighter theoretical bounds on regret and safety, achieving a state-of-the-art regret of O~(SAH5K/(τc0))\tilde{O}(\sqrt{SAH^5K}/(\tau - c^0)), where SS and AA is the number of states and actions, HH is the horizon, KK is the number of episodes, and (τc0)(\tau - c^0) reflects the safety margin of a known existing safe policy. Our method also ensures a O~(1)\tilde{O}(1) constraint violation and removes unnecessary dependencies on state space SS and planning horizon HH in the reward regret, offering a scalable solution for constrained RL in complex environments.

1882Fast and Noise-Robust Diffusion Solvers for Inverse Problems: A Frequentist Approach

[openreview] [pdf]

Abstract Diffusion models have been firmly established as principled zero-shot solvers for linear and nonlinear inverse problems, owing to their powerful image prior and ease of formulation as Bayesian posterior samplers. However, many existing solvers struggle in the noisy measurement regime, either overfitting or underfitting to the measurement constraint, resulting in poor sample quality and inconsistent performance across noise levels. Moreover, existing solvers rely on an approximation of Tweedie’s formula, where an intractable conditional\textit{conditional} score is replaced by an unconditional\textit{unconditional} score network, introducing a fundamental source of error in the resulting solution. In this work, we propose a novel frequentist’s approach to diffusion-based inverse solvers, where each diffusion step can be seen as the maximum likelihood solution to a simple single-parameter conditional likelihood model, derived by an adjusted application of Tweedie’s formula to the forward measurement model. We demonstrate that this perspective is not only scalable and fast, but also allows for a noise-aware maximization scheme with a likelihood-based stopping criterion that promotes the proper noise-adapted fit given knowledge of the measurement noise σy\sigma_\mathbf{y}. Finally, we demonstrate comparable or improved performance against a wide selection of contemporary inverse solvers across multiple datasets, tasks, and noise levels.

1883Regretful Decisions under Label Noise

[openreview] [pdf]

Abstract Machine learning models are routinely used to support decisions that affect individuals -- be it to screen a patient for a serious illness or to gauge their response to treatment. In these tasks, we are limited to learning models from datasets where the labels are subject to noise. In this work, we study the impact of learning under label noise at the instance level. We introduce a notion of regret for this regime, which measures the number of unforeseen mistakes when learning from noisy labels. We show that standard approaches to learn models from noisy labels can return models that perform well at a population level while subjecting individuals to a lottery of mistakes. We develop machinery to estimate the likelihood of mistakes at an instance level from a noisy dataset, by training models over plausible realizations of datasets without label noise. We present a comprehensive empirical study of label noise in clinical prediction tasks. Our results reveal how our failure to anticipate mistakes can compromise model reliance and adoption, and demonstrate how we can address these challenges by anticipating and abstaining from regretful decisions.

1884On the Sample Complexity of a Policy Gradient Algorithm with Occupancy Approximation for General Utility Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning with general utilities has recently gained attention thanks to its ability to unify several problems, including imitation learning, pure exploration, and safe RL. However, prior work for solving this general problem in a unified way has only focused on the tabular setting. This is restrictive when considering larger state-action spaces because of the need to estimate occupancy measures during policy optimization. In this work, we address this issue and propose to approximate occupancy measures within a function approximation class using maximum likelihood estimation (MLE). We propose a simple policy gradient algorithm (PG-OMA) where an actor updates the policy parameters to maximize the general utility objective whereas a critic approximates the occupancy measure using MLE. We provide a statistical complexity analysis of PG-OMA showing that our occupancy measure estimation error only scales with the dimension of our function approximation class rather than the size of the state action space. Under suitable assumptions, we establish first order stationarity and global optimality performance bounds for the proposed PG-OMA algorithm for nonconcave and concave general utilities respectively. We complement our methodological and theoretical findings with promising empirical results showing the scalability potential of our approach compared to existing tabular count-based approaches.

1885Decoupling Dependency Structures: Sklar’s theorem for explainable outlier detection

[openreview] [pdf]

Abstract Recent advances in outlier detection have been primarily driven by deep learning models, which, while powerful, have substantial drawbacks in terms of explainability. This is particularly relevant in fields that demand detailed reasoning and understanding of why observations are classified as outliers. To close the gap between state-of-the-art performance and enhanced explainability, we propose Vine Copula-Based Outlier Detection (VC-BOD). We utilize Sklar’s theorem in conjunction with vine copulas and univariate kernel density estimators to decouple marginal distributions and their dependency structure for outlier detection. Our model uses a closed-form equation for the outlier score, which allows for detailed explainability and feature attribution. VC-BOD employs a traceable criterion to determine whether a new observation is an outlier, while also identifying the specific features responsible for this classification. The proposed model further distinguishes whether these features deviate from their own distributions or from interactions with other features. Our empirical assessments reveal that VC-BOD surpasses all classical models in terms of average rank performance and yields competitive results compared with the latest deep learning models.

1886Temporal Causal Discovery and Generative Prediction of Vehicular CO2emission

[openreview] [pdf]

Abstract Global warming from greenhouse gas emissions is humanity’s largest environmental hazard. Greenhouse gases, like CO2_2 emissions from transportation, notably cars, contribute to the greenhouse effect. Effective CO2_2 emission monitoring is needed to regulate vehicle emissions. Few studies have predicted automobile CO2_2 emissions using OBD port data. For precise and effective prediction, the system must capture the underlying cause-effect structure between vehicular parameters that may contribute to the emission of CO2_2 in the transportation sector. Thus, we present a causal RNN-based generative deep learning architecture that predicts vehicle CO2_2 emissions using OBD-II data while keeping the underlying causal structure. Most widely used real-life datasets lack causal relationships between features or components, so we use our proposed architecture to discover and learn the underlying causal structure as an adjacency matrix during training and employ that during forecasting. Our framework learns a sparse adjacency matrix by imposing a sparsity-encouraging penalty on model weights and allowing some weights to be zero. This matrix is capable of capturing the causal relationships between all variable pairs. In this work, we first train the model with widely used synthetic datasets with known causal structure among variables, then we apply it to the state-of-the-art OBD-II dataset to find the internal causal structure among the vehicular parameters and perform causal inference to predict CO2_2 emission. Experimental results reveal that our causal discovery and forecasting method surpasses state-of-the-art methods for the tasks of causal discovery in terms of AUROC, forecasting on multivariate causal time series data, and OBD-II dataset in terms of MMD, RMSE, and MAE. After successful completion, we will release the code (Code for review -https://anonymous.4open.science/r/causal-obd-co2-0A0C).

1887Variance-Reduced Forward-Reflected Algorithms for Generalized Equations

[openreview] [pdf]

Abstract We develop two novel stochastic variance-reduction methods to approximate a solution of generalized equations applicable to both equations and inclusions. Our algorithms leverage a new combination of ideas from the forward-reflected-backward splitting method and a class of unbiased variance-reduced estimators. We construct two new stochastic estimators within this class, inspired by the well-known SVRG and SAGA estimators. These estimators significantly differ from existing approaches used in minimax and variational inequality problems. By appropriately selecting parameters, both algorithms achieve the state-of-the-art oracle complexity of O(n+n2/3ϵ2)\mathcal{O}(n + n^{2/3} \epsilon^{-2}) for obtaining an ϵ\epsilon-solution in terms of the operator residual norm, where nn represents the number of summands and ϵ\epsilon signifies the desired accuracy. This complexity aligns with the best-known results in SVRG and SAGA methods for stochastic nonconvex optimization. We test our algorithms on two numerical examples and compare them with existing methods. The results demonstrate promising improvements offered by the new methods compared to their competitors.

1888ConDa: Fast Federated Unlearning with Contribution Dampening

[openreview] [pdf]

Abstract Federated learning (FL) has enabled collaborative model training across decentralized data sources or clients. While adding new participants to a shared model does not pose great technical hurdles, the removal of a participant and their related information contained in the shared model remains a challenge. To address this problem, federated unlearning has emerged as a critical research direction, seeking to remove information from globally trained models without harming the model performance on the remaining data. Most modern federated unlearning methods use costly approaches such as the use of remaining clients data to retrain the global model or methods that would require heavy computation on client or server side. We introduce Contribution Dampening (\textsc{ConDa}), a framework that performs efficient unlearning by tracking down the parameters which affect the global model for each client and performs synaptic dampening on the parameters of the global model that have privacy infringing contributions from the forgetting client. Our technique does not require clients data or any kind of retraining and it does not put any computational overhead on either the client or server side. We perform experiments on multiple datasets and demonstrate that \textsc{ConDa} is effective to forget a client’s data. In experiments conducted on the MNIST, CIFAR10, and CIFAR100 datasets, \textsc{ConDa} proves to be the fastest federated unlearning method, outperforming the nearest state-of-the-art approach by at least 100×. Our emphasis is on the non-IID Federated Learning setting, which presents the greatest challenge for unlearning. Additionally, we validate \textsc{ConDa}'s robustness through backdoor and membership inference attacks. We envision this work as a crucial component for FL in adhering to legal and ethical requirements.

1889Maintaining Adversarial Robustness in Continuous Learning

[openreview] [pdf]

Abstract Adversarial robustness is essential for security and reliability of machine learning systems. However, adversarial robustness enhanced by defense algorithms is easily erased as the neural network’s weights update to learn new tasks. To address this vulnerability, it is essential to improve the capability of neural networks in terms of robust continual learning. Specially, we propose a novel gradient projection technique that effectively stabilizes sample gradients from previous data by orthogonally projecting back-propagation gradients onto a crucial subspace before using them for weight updates. This technique can maintaining robustness by collaborating with a class of defense algorithms through sample gradient smoothing. The experimental results on four benchmarks including Split-CIFAR100 and Split-miniImageNet, demonstrate that the superiority of the proposed approach in mitigating rapidly degradation of robustness during continual learning even when facing strong adversarial attacks.

1890PredFormer: Transformers Are Effective Spatial-Temporal Predictive Learners

[openreview] [pdf]

Abstract Spatiotemporal predictive learning methods generally fall into two categories: recurrent-based approaches, which face challenges in parallelization and performance, and recurrent-free methods, which employ convolutional neural networks (CNNs) as encoder-decoder architectures. These methods benefit from strong inductive biases but often at the expense of scalability and generalization. This paper proposes PredFormer, a pure transformer-based framework for spatiotemporal predictive learning. Motivated by the Vision Transformers (ViT) design, PredFormer leverages carefully designed Gated Transformer blocks, following a comprehensive analysis of 3D attention mechanisms, including full-, factorized-, and interleaved- spatial-temporal attention. With its recurrent-free, transformer-based design, PredFormer is both simple and efficient, significantly outperforming previous methods by large margins. Extensive experiments on synthetic and real-world datasets demonstrate that PredFormer achieves state-of-the-art performance. On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP. For TaxiBJ, the model decreases MSE by 33.1% and boosts FPS from 533 to 2364. Additionally, on WeatherBench, it reduces MSE by 11.1% while enhancing FPS from 196 to 404. These performance gains in both accuracy and efficiency demonstrate PredFormer’s potential for real-world applications. The source code and trained models will be made available to the public.

1891How to Probe: Simple Yet Effective Techniques for Improving Post-hoc Explanations

[openreview] [pdf]

Abstract Post-hoc importance attribution methods are a popular tool for “explaining” Deep Neural Networks (DNNs) and are inherently based on the assumption that the explanations can be applied independently of how the models were trained. Contrarily, in this work we bring forward empirical evidence that challenges this very notion. Surprisingly, we discover a strong dependency on and demonstrate that the training details of a pre-trained model’s classification layer (<10% of model parameters) play a crucial role, much more than the pre-training scheme itself. This is of high practical relevance: (1) as techniques for pre-training models are becoming increasingly diverse, understanding the interplay between these techniques and attribution methods is critical; (2) it sheds light on an important yet overlooked assumption of post-hoc attribution methods which can drastically impact model explanations and how they are interpreted eventually. With this finding we also present simple yet effective adjustments to the classification layers, that can significantly enhance the quality of model explanations. We validate our findings across several visual pre-training frameworks (fully-supervised, self-supervised, contrastive vision-language training) and analyse how they impact explanations for a wide range of attribution methods on a diverse set of evaluation metrics.

1892AIR-BENCH 2024: A Safety Benchmark based on Regulation and Policies Specified Risk Categories

[openreview] [pdf]

Abstract Foundation models (FMs) provide societal benefits but also amplify risks. Governments, companies, and researchers have proposed regulatory frameworks, acceptable use policies, and safety benchmarks in response. However, existing public benchmarks often define safety categories based on previous literature, intuitions, or common sense, leading to disjointed sets of categories for risks specified in recent regulations and policies, which makes it challenging to evaluate and compare FMs across these benchmarks. To bridge this gap, we introduce AIR-BENCH 2024, the first AI safety benchmark aligned with emerging government regulations and company policies, following the regulation-based safety categories grounded in the AI Risks taxonomy, AIR 2024. AIR 2024 decomposes 8 government regulations and 16 company policies into a four-tiered safety taxonomy with 314 granular risk categories in the lowest tier. AIR-BENCH 2024 contains 5,694 diverse prompts spanning these categories, with manual curation and human auditing to ensure quality. We evaluate leading language models on AIR-BENCH 2024 uncovering insights into their alignment with specified safety concerns. By bridging the gap between public benchmarks and practical AI risks, AIR-BENCH 2024 provides a foundation for assessing model safety across jurisdictions, fostering the development of safer and more responsible AI systems.

1893Human-like Episodic Memory for Infinite Context LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) have shown remarkable capabilities, but still struggle with processing extensive contexts, limiting their ability to maintain coherence and accuracy over long sequences. In contrast, the human brain excels at organising and retrieving episodic experiences across vast temporal scales, spanning a lifetime. In this work, we introduce EM-LLM, a novel approach that integrates key aspects of human episodic memory and event cognition into LLMs with no fine-tuning, enabling them to handle practically infinite context lengths while maintaining computational efficiency. EM-LLM organises sequences of tokens into coherent episodic events using a combination of Bayesian surprise and graph-theoretic boundary refinement in an online fashion. When needed, these events are retrieved through a two-stage memory process, combining similarity-based and temporally contiguous retrieval for efficient and human-like access to relevant information. Experiments on the LongBench and InfiniteBench benchmarks demonstrate EM-LLM’s superior performance, consistently outperforming the state-of-the-art retrieval model InfLLM across various baseline LLMs. In addition, EM-LLM outperforms its popular counterpart, RAG, in a wide range of tasks, while requiring similar resources. Notably, EM-LLM’s performance even surpasses full-context models in most tasks, while successfully performing retrieval across 5 million tokens -- a scale computationally infeasible for such models. Finally, our analysis reveals strong correlations between EM-LLM’s event segmentation and human-perceived events, suggesting a bridge between this artificial system and its biological counterpart, thereby offering a novel computational framework for exploring human memory mechanisms.

1894Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models

[openreview] [pdf]

Abstract Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio, thanks to their temporally uni-directional attention mechanism, which models correlations between the current token andprevioustokens. However, video streaming remains much less explored, despite a growing need for live video processing. State-of-the-art video diffusion models leveragebi-directional temporal attention to model the correlations between the current frame and all thesurrounding(i.e. includingfuture) frames, which hinders them from processing streaming videos. To address this problem, we presentLive2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation. Compared to previous works, our approach ensures temporal consistency and smoothness by correlating the current frame with its predecessors and a few initial warmup frames, without any future frames. Additionally, we use a highly efficient denoising scheme featuring akv-cachemechanism and pipelining, to facilitate streaming video translation at interactive framerates. Extensive experiments demonstrate the effectiveness of the proposed attention mechanism and pipeline, outperforming previous methods in terms of temporal smoothness and/or efficiency.

1895Auditing Privacy Protection of Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning aims to remove the effect of specific data from trained models to ensure individuals’ privacy. However, it’s arguable how to evaluate whether the privacy protection goal is achieved by machine unlearning. Furthermore, recent studies show unlearning may also increase the retained samples’ privacy risks. This paper takes a holistic approach to auditing both unlearned and retained samples’ privacy risks before and after unlearning. We derive the privacy criteria for unlearned and retained samples, respectively, based on the perspectives of differential privacy and membership inference attacks. To make the auditing practical, we also develop an efficient membership inference attack, A-LiRA, utilizing data augmentation to reduce the cost of shadow model training. Our experimental findings indicate that existing machine unlearning algorithms do not consistently protect the privacy of unlearned samples and may inadvertently compromise the privacy of retained samples. For reproducibility, we have pubished our code.\footnote{ \url{https://anonymous.4open.science/r/Auditing-machine-unlearning-CB10/README.md}}

1896Don’t Throw Away Data: Better Sequence Knowledge Distillation

[openreview] [pdf]

Abstract A critical component in knowledge distillation is the means of coupling the teacher and student. The predominant sequence knowledge distillation method involves supervised learning of the student against teacher-decoded outputs, and is exemplified by the current state of the art, which incorporates minimum Bayes risk (MBR) decoding. In this paper we seek to integrate MBR more tightly in distillation training, specifically by using several high scoring MBR translations, rather than a single selected sequence, thus capturing a rich diversity of teacher outputs. Our experiments on English to German and English to Japanese translation show consistent improvements over strong baseline methods for both tasks and with varying model sizes. Additionally, we conduct a detailed analysis focusing on data efficiency and capacity curse aspects to elucidate MBR-n and explore its further potential.

1897Towards Federated RLHF with Aggregated Client Preference for LLMs

[openreview] [pdf]

Abstract Reinforcement learning with human feedback (RLHF) fine-tunes a pretrained large language model (LLM) using user preference data, enabling it to generate content aligned with human preferences. However, due to privacy concerns, users may be reluctant to share sensitive preference data. To address this, we propose utilizing Federated Learning (FL) techniques, allowing large-scale preference collection from diverse real-world users without requiring them to transmit data to a central server. Our federated RLHF methods (i.e., FedBis and FedBiscuit) encode each client’s preferences into binary selectors and aggregate them to capture common preferences. In particular, FedBiscuit overcomes key challenges, such as preference heterogeneity and reward hacking, through innovative solutions like grouping clients with similar preferences to reduce heterogeneity and using multiple binary selectors to enhance LLM output quality. To evaluate the performance of the proposed methods, we establish the first federated RLHF benchmark with a heterogeneous human preference dataset. Experimental results show that by integrating the LLM with aggregated client preferences, FedBis and FedBiscuit significantly enhance the professionalism and readability of the generated content.

1898Optimizing Posterior Samples for Bayesian Optimization via Rootfinding

[openreview] [pdf]

Abstract Bayesian optimization devolves the global optimization of a costly objective function to the global optimization of a sequence of acquisition functions. This inner-loop optimization can be catastrophically difficult if it involves posterior samples, especially in higher dimensions. We introduce an efficient global optimization strategy for posterior samples based on global rootfinding. It provides gradient-based optimizers with judiciously selected starting points, designed to combine exploitation and exploration. The algorithm scales practically linearly to high dimensions. For posterior sample-based acquisition functions such as Gaussian process Thompson sampling (GP-TS) and variants of entropy search, we demonstrate remarkable improvement in both inner- and outer-loop optimization, surprisingly outperforming alternatives like EI and GP-UCB in most cases. We also propose a sample average formulation of GP-TS, which has a parameter to explicitly control exploitation and can be computed at the cost of one posterior sample.

1899Using Contrastive Learning with Generative Similarity to Learn Spaces that Capture Human Inductive Biases

[openreview] [pdf]

Abstract Humans rely on strong inductive biases to learn from few examples and abstract useful information from sensory data. Instilling such biases in machine learning models has been shown to improve their performance on various benchmarks including few-shot learning, robustness, and alignment. However, finding effective training procedures to achieve that goal can be challenging as psychologically-rich training data such as human similarity judgments are expensive to scale, and Bayesian models of human inductive biases are often intractable for complex, realistic domains. Here, we address this challenge by introducing a Bayesian notion of generative similarity whereby two datapoints are considered similar if they are likely to have been sampled from the same distribution. This measure can be applied to complex generative processes, including probabilistic programs. We show that generative similarity can be used to define a contrastive learning objective even when its exact form is intractable, enabling learning of spatial embeddings that express specific inductive biases. We demonstrate the utility of our approach by showing that it can be used to capture human inductive biases for geometric shapes, distinguish different abstract drawing styles that are parameterized by probabilistic programs, and capture abstract high-level categories that enable generalization.

1900Policy Consistency in Multi-Agent Reinforcement Learning with Mixed Reward

[openreview] [pdf]

Abstract The sparsity of team rewards poses a significant challenge that hinders the effective learning of optimal team policies in cooperative multi-agent reinforcement learning. One common approach to mitigate this issue involves augmenting sparse rewards with individual rewards to guide policy training. However, a significant drawback of such approaches is that modifying the reward function can potentially alter the optimal policy. To tackle this challenge, we propose a novel multi-agent policy optimization approach that ensures consistency between the mixed policy (learned from a combination of individual and team rewards) and the team policy (based solely on team rewards), through a new policy consistency constraint that aligns the returns of both policies in policy optimization model. We further develop an iterated policy optimization procedure to solve the formulated problem, deriving an approximate optimization objective for each iteration of the mixed and team policies. Experimental evaluation conducted in the StarCraft II Multi-Agent Challenge Environment (SMAC), Multi-Agent Particle Environment (MPE), and Google Research Football (GRF) environments demonstrate that our proposed approach effectively addresses the policy inconsistency problem, i.e.{\it i.e.}, it consistently outperforms strong baseline methods.

1901UOE: Unlearning One Expert is Enough for Mixture-of-Experts LLMs

[openreview] [pdf]

Abstract Recent advancements in large language model (LLM) unlearning have shown remarkable success in removing unwanted data-model influences while preserving the model’s utility for legitimate knowledge. However, despite these strides, sparse Mixture-of-Experts (MoE) LLMs--a key subset of the LLM family--have received little attention and remain largely unexplored in the context of unlearning. As MoE LLMs are celebrated for their exceptional performance and highly efficient inference processes, we ask: How can unlearning be performed effectively and efficiently on MoE LLMs? And will traditional unlearning methods be applicable to MoE architectures? Our pilot study shows that the dynamic routing nature of MoE LLMs introduces unique challenges, leading to substantial utility drops when existing unlearning methods are applied. Specifically, unlearning disrupts the router’s expert selection, causing significant selection shift from the most unlearning target-related experts to irrelevant ones. As a result, more experts than necessary are affected, leading to excessive forgetting and loss of control over which knowledge is erased. To address this, we propose a novel single-expert unlearning framework, referred to as UOE, for MoE LLMs. Through expert attribution, unlearning is concentrated on the most actively engaged expert for the specified knowledge. Concurrently, an anchor loss is applied to the router to stabilize the active state of this targeted expert, ensuring focused and controlled unlearning that preserves model utility. The proposed UOE framework is also compatible with various unlearning algorithms. Extensive experiments demonstrate that UOE enhances both forget quality up to 5% and model utility by 35% on MoE LLMs across various benchmarks, LLM architectures, while only unlearning 0.06% of the model parameters.

1902Representation learning for financial time series forecasting

[openreview] [pdf]

Abstract The accurate forecasting of financial time series remains a significant challenge due to the stochastic nature of the underlying data. To improve prediction accuracy, feature engineering has become a vital aspect of forecasting financial assets. However, engineering features manually often requires domain expertise. We propose to utilise an automated feature generation architecture, Contrastive Predictive Coding (CPC), to generate embeddings as input to improve the performance of downstream financial time series forecasting models. To benchmark the effectiveness of our approach, we evaluate forecasting models on predicting the next day’s log return on various foreign exchange markets with and without embeddings. Finally, we assess our CPC architecture by employing the same trained encoder on different currency pairs and calculating the Sharpe ratio of our strategies.

1903Dynamic Self-Distillation via Previous Mini-batches for Fine-tuning Small Language Models

[openreview] [pdf]

Abstract Knowledge Distillation (KD) has become a widely adopted approach for compressing large language models (LLMs) to reduce computational costs and memory footprint. However, the availability of complex teacher models is a prerequisite for running most KD pipelines. Thus, the traditional KD procedure can be unachievable or budget-unfriendly, particularly when relying on commercial LLMs like GPT4. In this regard, Self Distillation (SelfD) emerges as an advisable alternative, enabling student models to learn without teachers’ guidance. Nonetheless, existing SelfD approaches for LMs often involve architectural modifications, assuming the models are open-source, which may not always be practical. In this work, we introduce a model-agnostic and task-agnostic method named dynamic SelfD from the previous mini-batch (DynSDPB), which realizes current iterations’ distillation from the last ones’ generated logits. Additionally, to address prediction inaccuracies during the early iterations, we dynamically adjust the distillation influence and temperature values to enhance the adaptability of fine-tuning. Furthermore, we propose Vocabulary Map Matching (VMM), aiming to address output inconsistency for auto-regressive LLMs. Last but not least, DynSDPB facilitates the seamless integration of existing self-correction and self-training techniques for small language models (SLMs). We apply DynSDPB to both encoder-only LMs (e.g., BERT model families) and decoder-only LMs (e.g., LLaMA model families), validating its effectiveness across natural language understanding (NLU) and natural language generation (NLG) benchmarks.

1904Adaptive dense reward:Understanding the Gap Between Action and Reward Space in Alignment

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has proven highly effective in aligning Large Language Models (LLMs) with human preferences. However, the original RLHF typically optimizes under an overall reward, which can lead to a suboptimal learning process. This limitation stems from RLHF’s lack of awareness regarding which specific tokens should be reinforced or suppressed. Moreover, conflicts in supervision can arise, for instance, when a chosen response includes erroneous tokens, while a rejected response contains accurate elements. To rectify these shortcomings, increasing dense reward methods, such as step-wise and token-wise RLHF, have been proposed. However, these existing methods are limited to specific tasks (like mathematics).In this paper, we propose the “Adaptive Message-wise RLHF” method, which robustly applies to various tasks. By defining pivot tokens as key indicators, our approach adaptively identifies essential information and converts sample-level supervision into fine-grained, subsequence-level supervision. This aligns the density of rewards and action spaces more closely with the information density of the input.Experiments demonstrate that our method can be integrated into various training methods, significantly mitigating hallucinations and catastrophic forgetting problems, while outperforming other methods on multiple evaluation metrics. Our method improves the success rate on adversarial samples by 10% compared to the sample-wise approach, and achieves a 1.3% improvement on evaluation benchmarks such as MMLU, GSM8K, and HumanEval et al.

1905Explanation-Assisted Data Augmentation for Graph Learning

[openreview] [pdf]

Abstract This work introduces a novel class of Data Augmentation (DA) techniques in the context of graph learning. In general, DA refers to techniques that enlarge the training set using label-preserving transformations. Such techniques enable increased robustness and generalization, especially when the size of the original training set is limited. A fundamental idea in DA is that labels are invariant to domain-specific transformations of the input samples. However, it is challenging to identify such transformations in learning over graphical input domains due to the complex nature of graphs and the need to preserve their structural and semantic properties. In this work, we propose explanation-assisted DA (EA-DA) for Graph Neural Networks (GNNs). A graph explanation is a subgraph which is an `almost sufficient’ statistic of the input graph with respect to its classification label. Consequently, the classification label is invariant, with high probability, to perturbations of graph edges not belonging to its explanation subgraph. We develop EA-DA techniques leveraging such perturbation invariances. First, we show analytically that the sample complexity of explanation-assisted learning can be arbitrarily smaller than explanation-agnostic learning. On the other hand, we show that if the training set is enlarged using EA-DA techniques and the learning rule does not distinguish between the augmented data and the original data, then the sample complexity can be worse than that of explanation-agnostic learning. We identify the main reason for the potential increase in sample complexity as the out-of-distribution nature of graph perturbations. We conclude that theoretically EA-DA may improve sample complexity, and that the learning rule must distinguish between the augmented data and the original data. Subsequently, we build upon these theoretical insights, introduce practically implementable EA-DA techniques and associated learning mechanisms, and perform extensive empirical evaluations.

1906Topological Zigzag Spaghetti for Diffusion-based Generation and Prediction on Graphs

[openreview] [pdf]

Abstract Diffusion models have recently emerged as a new powerful machinery for generative artificial intelligence on graphs, with applications ranging from drug design to knowledge discovery. However, despite their high potential, most, if not all, currently existing graph diffusion models are limited in their ability to holistically describe the intrinsic {\it higher-order} topological graph properties, which obstructs model generalizability and adoption for downstream tasks. We propose to address this fundamental challenge and extract the latent salient topological graph descriptors at different resolutions by leveraging zigzag persistence. We develop a new computationally efficient topological summary, zigzag spaghetti (ZS), which delivers the most inherent topological properties {\it simultaneously over a sequence of graphs at multiple resolutions}. We derive theoretical stability guarantees of ZS and present the first attempt to integrate dynamic topological information into graph diffusion models. Our extensive experiments on %9 benchmark datasets for graph classification and prediction tasks suggest that ZS has a high promise not only to enhance performance of graph diffusion models, with gains up 10%, but also to substantially booster model robustness under uncertainties.

1907AgentRefine: Enhancing Agent Generalization through Refinement Tuning

[openreview] [pdf]

Abstract Large Language Model (LLM) based agents have proved their ability to perform complex tasks like humans. However, there is still a large gap between open-sourced LLMs and commercial models like the GPT series. In this paper, we focus on improving the agent generalization capabilities of LLMs via instruction tuning. We first observe that the existing agent training corpus exhibits satisfactory results on held-in evaluation sets but fails to generalize to held-out sets. These agent-tuning works face severe formatting errors and are frequently stuck in the same mistake for a long while. We analyze that the poor generalization ability comes from overfitting to several manual agent environments and a lack of adaptation to new situations. They struggle with the wrong action steps and can not learn from the experience but just memorize existing observation-action relations. Inspired by the insight, we propose a novel AgentRefine framework for agent-tuning. The core idea is to enable the model to learn the correct its mistakes via observation in the trajectory. Specifically, we propose an agent synthesis framework to encompass a diverse array of environments and tasks and prompt a strong LLM to refine its error action according to the environment feedback. AgentRefine significantly outperforms state-of-the-art agent-tuning work in terms of generalization ability on diverse agent tasks. It also has better robustness facing perturbation and can generate diversified thought in inference. Our findings establish the correlation between agent generalization and self-refinement and provide a new paradigm for future research.

1908Boosting Concept Bottleneck Models with Supervised, Hierarchical Concept Learning

[openreview] [pdf]

Abstract Concept Bottleneck Models (CBMs) aim to deliver interpretable and interventionable predictions by bridging features and labels with human-understandable concepts. While recent CBMs show promising potential, they suffer from information leakage, where unintended information beyond the concepts (either in probabilistic or binary-state form) is leaked to the subsequent label prediction. Consequently, distinct classes are falsely classified via indistinguishable concepts, undermining the interpretation and intervention of CBMs.This paper alleviates the information leakage issue by introducing label supervision in concept prediction and constructing a hierarchical concept set. Accordingly, we propose a new paradigm of CBMs, namely SupCBM, which stands for Structured Understanding of leakage Prevention Concept Bottleneck Model, achieving label prediction via predicted concepts and a deliberately structural-designed intervention matrix. SupCBM focuses on concepts that are mostly relevant to the predicted label and only distinguishes classes when different concepts are presented. Our evaluations show that SupCBM’s label prediction outperforms SOTA CBMs over diverse datasets. Its predicted concepts also exhibit better interpretability. With proper quantification of information leakage in different CBMs, we demonstrate that SupCBM significantly reduces the information leakage.

1909Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

[openreview] [pdf]

Abstract The task of “unlearning” certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model’s parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize “concept vectors” — parameter vectors that encode concrete concepts — and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model’s susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark athttps://anonymous.4open.science/r/ConceptVectors_review-98EF.

1910Differentiable Causal Discovery for Latent Hierarchical Causal Models

[openreview] [pdf]

Abstract Discovering causal structures with latent variables from observational data is a fundamental challenge in causal discovery. Existing methods often rely on constraint-based, iterative discrete searches, limiting their scalability to large numbers of variables. Moreover, these methods frequently assume linearity or invertibility, restricting their applicability to real-world scenarios. We present new theoretical results on the identifiability of nonlinear latent hierarchical causal models, relaxing previous assumptions in literature about the deterministic nature of latent variables and exogenous noise. Building on these insights, we develop a novel differentiable causal discovery algorithm that efficiently estimates the structure of such models. To the best of our knowledge, this is the first work to propose a differentiable causal discovery method for nonlinear latent hierarchical models. Our approach outperforms existing methods in both accuracy and scalability. We demonstrate its practical utility by learning interpretable hierarchical latent structures from high-dimensional image data and demonstrate its effectiveness on downstream tasks.

1911Lossgate: Incomplete Information and Misaligned Incentives Hinder Regulation of Societal Risks in Machine Learning

[openreview] [pdf]

Abstract Regulators seek to curb the societal risks of machine learning; a common aim is to protect the public from excessive privacy violations or bias in models. In the status quo, regulators and companies independently evaluate societal risk. We find that discrepancies in these evaluations can be either a detriment or an advantage for companies. To abide by regulation, a company needs to conservatively evaluate risk: it should train its model such that risk remains below the acceptable threshold-even if the regulator’s evaluation returns higher risk measurements. This decreases model utility (up to 8%, in our experiments). Conversely, when the regulator’s measurements are consistently lower than theirs, we find that a company can behave strategically and game regulation to train more accurate models. We call this Lossgate, an allusion to Dieselgate in environmental regulation: Volkswagen produced cars that limited their emissions when being subjected to a regulator’s emissions measurement. To model incomplete information and the misaligned incentives that explain Lossgate, we leverage game theory. We obtain SpecGame, a model for regulator-company interactions which allows us to estimate the excessive risk that results from the strategic behavior observed in Lossgate. We show Lossgate costs up to 96% higher compared to collaborative regulation in the sum cost for all players.

1912On the Limitation and Redundancy of Transformers: A Rank Perspective

[openreview] [pdf]

Abstract Transformers have showcased superior performances across a variety of real-world applications, particularly leading to unparalleled successes of large “foundation” models. However, since these models are usually trained on web-scale datasets, the overall computation and memory loads are considerably increasing, calling for moreefficientmethods in machine learning. In this work, we step towards this direction by exploring the architectural limitation and redundancy of Transformers via investigating the ranks of attention score matrices. On one hand, extensive experiments are conducted on various model configurations (model dimensions, heads, layers, etc) and data distributions (both synthetic and real-world datasets with varied sequence lengths), uncovering two key properties: although the attention rank increases with the head dimension dhd_h, as expected, the rank is eventually upper bounded (limitation) and gets saturated (redundancy). We call them thelow-rank barrierandmodel-reduction effect, respectively. On the other hand, we provide rigorous demonstrations for these observations through a fine-grained mathematical analysis, highlighting (i) a consistent theoretical upper bound (0.63n\approx 0.63n, nn: the sequence length) of the attention rank regardless of the head dimension dhd_h, and (ii) a critical position of the rank saturation (dh=Ω(logn)d_h=\Omega(\log n)). These results shed light on the inductive biases and internal dynamics of Transformers, contributing to the theoretical understanding and assessment of the model capacity and efficiency in practical applications.

1913What Makes a Good Time-series Forecasting Model? A Causal Perspective

[openreview] [pdf]

Abstract Generalization is a long-standing challenge in multivariate time series forecasting (MTSF) tasks. Most existing forecasting methods use all available variables in historical series to predict all future variables, assuming that there may be correlations among all variables. From a causal perspective, this reliance on correlated variables can compromise the model’s generalization. To address this, we aim to explore the role of causal relationships in enhancing the generalization of multivariate time series models. We examine how graphical causal models, through conditional independence constraints, can reduce the hypothesis space, thereby improving generalization. Building on this foundation, we introduce a novel causality-based MTSF algorithm CAusal Informed Transformer (CAIFormer). It first constructs a Directed Acyclic Graph (DAG) among variables using causal discovery techniques. Then we build the forecasting model by enforcing the causal constraints informed by the DAG. Empirical evaluations on benchmark datasets demonstrate that our method surpasses traditional approaches in predictive accuracy. Additionally, we present the structural causal models derived for these datasets, underscoring the practical applicability of our causality-driven framework in MTSF.

1914HiLo: A Learning Framework for Generalized Category Discovery Robust to Domain Shifts

[openreview] [pdf]

Abstract Generalized Category Discovery (GCD) is a challenging task in which, given a partially labelled dataset, models must categorize all unlabelled instances, regardless of whether they come from labelled categories or from new ones. In this paper, we challenge a remaining assumption in this task: that all images share the same \underline{domain}. Specifically, we introduce a new task and method to handle GCD when the unlabelled data also contains images from different domains to the labelled set. Our proposed `HiLo’ networks extract High-level semantic and Low-level domain features, before minimizing the mutual information between the representations. Our intuition is that the clusterings based on domain information and semantic information should be independent. We further extend our method with a specialized domain augmentation tailored for the GCD task, as well as a curriculum learning approach. Finally, we construct a benchmark from corrupted fine-grained datasets as well as a large-scale evaluation on DomainNet with real-world domain shifts, reimplementing a number of GCD baselines in this setting. We demonstrate that HiLo outperforms SoTA category discovery models by a large margin on all evaluations.

1915Output-Constrained Decision Trees

[openreview] [pdf]

Abstract When there is a correlation between any pair of targets, one needs a prediction method that can handle vector-valued output. In this setting, multi-target learning is particularly important as it is widely used in various applications. This paper introduces new variants of decision trees that can handle not only multi-target output but also the constraints among the targets. We focus on the customization of conventional decision trees by adjusting the splitting criteria to handle the constraints and obtain feasible predictions. We present both an optimization-based exact approach and several heuristics, complete with a discussion on their respective advantages and disadvantages. To support our findings, we conduct a computational study to demonstrate and compare the results of the proposed approaches.

1916Blind Unlearning: Unlearning Without a Forget Set

[openreview] [pdf]

Abstract Machine unlearning is the study of methods to efficiently remove the influence of some subset of the training data from the parameters of a previously-trained model. Existing methods typically require direct access to the “forget set” – the subset of training data to be forgotten by the model. This limitation impedes privacy, as organizations need to retain user data for the sake of unlearning when a request for deletion is made, rather than being able to delete it immediately. We first introduce the setting of blind unlearning – unlearning without explicit access to the forget set. Then, we propose a method for approximate unlearning called RELOAD, that leverages ideas from gradient-based unlearning and neural network sparsity to achieve blind unlearning. The method serially applies an ascent step with targeted parameter re-initialization and fine-tuning, and on empirical unlearning tasks, RELOAD often approximates the behaviour of a from-scratch retrained model better than approaches that leverage the forget set. Finally, we extend the blind unlearning setting to blind remedial learning, the task of efficiently updating a previously-trained model to an amended dataset.

1917Local convergence of simultaneous min-max algorithms to differential equilibrium on Riemannian manifold

[openreview] [pdf]

Abstract We study min-max algorithms to solve zero-sum differential games on Riemannian manifold. Based on the notions of differential Stackelberg equilibrium and differential Nash equilibrium on Riemannian manifold, we analyze the local convergence of two representative deterministic simultaneous algorithms τ\tau-GDA and τ\tau-SGA to such equilibrium. Sufficient conditions are obtained to establish their linear convergence rates by Ostrowski theorem on manifold and spectral analysis. The τ\tau-SGA algorithm is extended from the symplectic gradient-adjustment method in Euclidean space to avoid strong rotational dynamics in τ\tau-GDA. In some cases, we obtain a faster convergence rate of τ\tau-SGA through an asymptotic analysis which is valid when the learning rate ratio τ\tau is big. We show numerically how the insights obtained from the convergence analysis may improve the training of orthogonal Wasserstein GANs using stochastic τ\tau-GDA and τ\tau-SGA on simple benchmarks.

1918Machine Reinforced Perturbation on Drifted Human Logical Reasoning

[openreview] [pdf]

Abstract Using deep neural networks as computational models to simulate cognitive process can provide key insights into human behavioral dynamics. This enables synthetic data generation to test hypotheses for neuroscience and guides adaptive interventions for cognitive regulation. Challenges arise when environments are highly dynamic, obscuring stimulus-behavior relationships. However, the majority of current research focuses on simulating human cognitive behaviors under ideal conditions, neglecting the influence of environmental disturbances. We propose ReactiveAgent, integrating drift-diffusion with deep reinforcement learning to simulate granular effects of dynamic environmental stimuli on human logical reasoning process. This framework is built and evaluated upon our contributed large dataset of 21,157 logical responses of humans under various dynamic stimuli. Quantitatively, the framework improves cognition modelling by considering temporal effect of environmental stimuli on logical reasoning and captures both subject-specific and stimuli-specific behavioural differences. Qualitatively, it captures general trends in human logical reasoning under stress, better than baselines. Our approach is extensible to examining diverse environmental influences on cognitive behaviors. Overall, it demonstrates a powerful, data-driven methodology to simulate, align with, and understand the vagaries of human logical reasoning in dynamic contexts.

1919Provence: efficient and robust context pruning for retrieval-augmented generation

[openreview] [pdf]

Abstract Retrieval-Augmented Generation improves various aspects of large language models (LLMs) generation, but suffers from computational overhead caused by long contexts, and the propagation of irrelevant retrieved information into generated responses. Context pruning deals with both aspects, by removing irrelevant parts of retrieved contexts before LLM generation. Existing context pruning approaches are limited, and do not present a universal model that would be bothefficientandrobustin a wide range of scenarios, e.g., when contexts contain a variable amount of relevant information or vary in length, or when evaluated on various domains. In this work, we close this gap and introduce Provence (Pruning and Reranking Of retrieVEd relevaNt ContExts), an efficient and robust context pruner for Question Answering, which dynamically detects the needed amount of pruning for a given context and can be used out-of-the-box for various domains. The three key ingredients of Provence are formulating the context pruning task as sequence labeling, unifying context pruning capabilities with context reranking, and training on diverse data. Our experimental results show that Provence enables context pruning with negligible to no drop in performance, in various domains and settings, at almost no cost in a standard RAG pipeline. We also conduct a deeper analysis alongside various ablations to provide insights into training context pruners for future work.

1920Leveraging Flatness to Improve Information-Theoretic Generalization Bounds for SGD

[openreview] [pdf]

Abstract Information-theoretic generalization bounds have been used to study the generalization of learning algorithms. These bounds are intrinsically data- and algorithm-dependent so that one can exploit the properties of data and algorithm to derive tighter bounds. However, we observe such algorithm dependence is still inadequate in existing information-theoretic bounds for SGD because they have not adequately leveraged the algorithmic bias toward flat minima of SGD. Since the flatness of minima given by SGD is crucial for SGD’s generalization, the bounds fail to capture the improved generalization under better flatness and are also numerically loose. This paper derives a more flatness-leveraged information-theoretic bound for the flatness-favoring SGD. The bound indicates that the learned models generalize better if the large-variance directions of the final weight covariance have small local curvatures in the loss landscape. Experiments on deep neural networks show that our bound not only correctly reflects the better generalization when flatness is improved, but is also numerically tighter by being only a few percentages looser. This is achieved by a technique called “omniscient trajectory.” When applied to Gradient Descent on convex-Lipschitz-Bounded (CLB) problems, it yields an O(1/n)O(1/\sqrt{n}) minimax rate for excess risks, which has been shown to be impossible for representative existing information-theoretic bounds.

1921A Hybrid Loss Framework for Decomposition-based Time Series Forecasting Methods: Balancing Global and Component Errors

[openreview] [pdf]

Abstract Accurate time series forecasting, predicting future values based on past data, is crucial for diverse industries. Many current time series methods decompose time series into multiple sub-series, applying different model architectures and training with an end-to-end overall loss for forecasting. However, this raises a question: does this overall loss prioritize the importance of critical sub-series within the decomposition for the better performance? To investigate this, we conduct a study on the impact of overall loss on existing time series methods with sequence decomposition. Our findings reveal that overall loss may introduce bias in model learning, hindering the learning of the prioritization of more significant sub-series and limiting the forecasting performance. To address this, we propose a hybrid loss framework combining the global and component errors. This framework introduces component losses for each sub-series alongside the original overall loss. It employs a dual min-max algorithm to dynamically adjust weights between the overall loss and component losses, and within component losses. This enables the model to achieve better performance of current time series methods by focusing on more critical sub-series while still maintaining a low overall loss. We integrate our loss framework into several time series methods and evaluate the performance on multiple datasets. Results show an average improvement of 0.5-2% over existing methods without any modifications to the model architectures.

1922Generalizable Motion Planning via Operator Learning

[openreview] [pdf]

Abstract In this work, we introduce a planning neural operator (PNO) for predicting the value function of a motion planning problem. We recast value function approximation as learning a single operator from the cost function space to the value function space, which is defined by an Eikonal partial differential equation (PDE). Therefore, our PNO model, despite being trained with a finite number of samples at coarse resolution, inherits the zero-shot super-resolution property of neural operators. We demonstrate accurate value function approximation at 16× the training resolution on the MovingAI lab’s 2D city dataset and compare with state-of-the-art neural value function predictors on 3D scenes from the iGibson building dataset. Lastly, we investigate employing the value function output of PNO as a heuristic function to accelerate motion planning. We show theoretically that the PNO heuristic is ϵ\epsilon-consistent by introducing an inductive bias layer that guarantees our value functions satisfy the triangle inequality. With our heuristic, we achieve a 30% decrease in nodes visited while obtaining near optimal path lengths on the MovingAI lab 2D city dataset, compared to classical planning methods (A^\ast, RRT^\ast).

1923Self-Taught Evaluators

[openreview] [pdf]

Abstract Model-based evaluation is at the heart of successful model development -- as a reward model for training, and as a replacement for human evaluation. To train such evaluators, the standard approach is to collect a large amount of human preference judgments over model responses, which is costly and the data becomes stale as models improve. In this work, we present an approach that aims to im-prove evaluators without human annotations, using synthetic training data only. Starting from unlabeled instructions, our iterative self-improvement scheme generates contrasting model outputs and trains an LLM-as-a-Judge to produce reasoning traces and final judgments, repeating this training at each new iteration using the improved predictions. Without any labeled preference data, our Self-Taught Evaluator can improve a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. This outperforms commonly used LLM judges such as GPT-4 and matches the performance of the top-performing reward models trained with labeled examples.

1924Rethinking Behavior Regularization in Offline Safe RL: A Region-Based Approach

[openreview] [pdf]

Abstract Behavior regularization is a widely adopted technique in offline reinforcement learning (RL) to control distributional shift and mitigate extrapolation errors from out-of-distribution (OOD) actions by keeping the learned policy close to the behavior policy used to collect the dataset. However, directly applying behavior regularization to offline safe RL presents several issues. The optimal policy in safe RL should not only favor actions that prevent the agent from entering unsafe regions but also identify the shortest escape path when the agent finds itself in unsafe states. Enforcing safety and behavior regularization constraints simultaneously is inherently difficult and can often lead to infeasible solutions, especially when multiple constraints are involved. Furthermore, adding behavior regularization may cause the learned policy to imitate the behavior policy, even in states where the behavior policy performs poorly (not safe). This issue becomes particularly severe in offline safe RL, where the quality of the dataset collected by the behavior policy heavily impacts the learned policy’s effectiveness. To address these challenges, we propose BARS\textit{BARS} (B\underline{B}ehavior-A\underline{A}ware R\underline{R}egion-Based S\underline{S}afe offline RL), a novel algorithm that distinguishes between safe and unsafe states and applies region-specific, selective behavior regularization to optimize the policy. Extensive experiments show that BARS significantly outperforms several state-of-the-art baselines in terms of both rewards and safety, particularly in scenarios where the behavior policy is far from optimal. Notably, when dataset quality is low, BARS continues to perform well and ensure safety, while all other baselines fail to guarantee a safe policy in most of the environments. Our work has great potential to address a previously overlooked issue in offline safe RL.

1925Adversarial Diffusion Bridge Model for Reliable Adversarial Purification

[openreview] [pdf]

Abstract Recently Diffusion-based Purification (DiffPure) has been recognized as an effective defense method against adversarial examples. However, we find DiffPure which directly employs the original pre-trained diffusion models for adversarial purification, to be suboptimal. This is due to an inherent trade-off between noise purification performance and data recovery quality. Additionally, the reliability of existing evaluations for DiffPure is questionable, as they rely on weak adaptive attacks. In this work, we propose a novel Adversarial Diffusion Bridge Model, termed ADBM. ADBM directly constructs a reverse bridge from the diffused adversarial data back to its original clean examples, enhancing the purification capabilities of the original diffusion models. Through theoretical analysis and experimental validation across various scenarios, ADBM has proven to be a superior and robust defense mechanism, offering significant promise for practical applications.

1926Lyapunov Stability Learning with Nonlinear Control via Inductive Biases

[openreview] [pdf]

Abstract Finding a control Lyapunov function (CLF) in a dynamical system with a controller is an effective way to guarantee stability, which is a crucial issue in safety-concerned applications. Recently, deep learning models representing CLFs have been applied into a learner-verifier framework to identify satisfiable candidates. However, the learner treats Lyapunov conditions as complex constraints for optimisation, which is hard to achieve global convergence. It is also too complicated to implement these Lyapunov conditions for verification. To improve this framework, we treat Lyapunov conditions as inductive biases and design a neural CLF and a CLF-based controller guided by this knowledge. This design enables a stable optimisation process with limited constraints, and allows end-to-end learning of both the CLF and the controller. Our approach achieves higher convergence rate and larger region of attraction (ROA) in learning the CLF compared to existing methods among abundant experiment cases. We also thoroughly reveal why the success rate decreases with previous methods during learning.

1927BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics

[openreview] [pdf]

Abstract Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on extensive sampling or local gradient descent, struggle with long-horizon motion planning tasks involving complex contact events. In this paper, we present a GPU-accelerated branch-and-bound (BaB) framework for motion planning in manipulation tasks that require trajectory optimization over neural dynamics models. Our approach employs a specialized branching heuristic to divide the search space into sub-domains and applies a modified bound propagation method, inspired by the state-of-the-art neural network verifier α,β\alpha,\beta-CROWN, to efficiently estimate objective bounds within these sub-domains. The branching process guides planning effectively, while the bounding process strategically reduces the search space. Our framework achieves superior planning performance, generating high-quality state-action trajectories and surpassing existing methods in challenging, contact-rich manipulation tasks such as non-prehensile planar pushing with obstacles, object sorting, and rope routing in both simulated and real-world settings. Furthermore, our framework supports various neural network architectures, ranging from simple multilayer perceptrons to advanced graph neural dynamics models, and scales efficiently with different model sizes.

1928Learning to Steer Markovian Agents under Model Uncertainty

[openreview] [pdf]

Abstract Designing incentives for an adapting population is a ubiquitous problem in a wide array of economic applications and beyond. In this work, we study how to design additional rewards to steer multi-agent systems towards desired policies \emph{without} prior knowledge of the agents’ underlying learning dynamics. Motivated by the limitation of existing works, we consider a new and general category of learning dynamics called \emph{Markovian agents}. We introduce a model-based non-episodic Reinforcement Learning (RL) formulation for our steering problem. Importantly, we focus on learning a \emph{history-dependent} steering strategy to handle the inherent model uncertainty about the agents’ learning dynamics. We introduce a novel objective function to encode the desiderata of achieving a good steering outcome with reasonable cost. Theoretically, we identify conditions for the existence of steering strategies to guide agents to the desired policies. Complementing our theoretical contributions, we provide empirical algorithms to approximately solve our objective, which effectively tackles the challenge in learning history-dependent strategies. We demonstrate the efficacy of our algorithms through empirical evaluations.

1929Task Diversity Shortens the ICL Plateau

[openreview] [pdf]

Abstract In-context learning (ICL) describes a language model’s ability to generate outputs based on a set of input demonstrations and a subsequent query. To understand this remarkable capability, researchers have studied simplified, stylized models. These studies have consistently observed long loss plateaus, during which models exhibit minimal improvement, followed by a sudden, rapid surge of learning. In this work, we reveal that training on multiple diverse ICL tasks simultaneously shortens the loss plateaus, making each task easier to learn. This finding is surprising as it contradicts the natural intuition that the combined complexity of multiple ICL tasks would lengthen the learning process, not shorten it. Our result suggests that the recent success in large-scale training of language models may be attributed not only to the richness of the data at scale but also to the easier optimization (training) induced by the diversity of natural language training data.

1930Towards Evaluating Generalist Agents: An Automated Benchmark in Open World

[openreview] [pdf]

Abstract Evaluating generalist agents presents significant challenges due to their wide-ranging abilities and the limitations of current benchmarks in assessing true generalization. We introduce the \textbf{M}ine\textbf{C}raft \textbf{U}niverse (\textbf{MCU}), a fully automated benchmarking framework set within the open-world game \emph{Minecraft}. MCU dynamically generates and evaluates a broad spectrum of tasks, offering three core components: 1) a task generation mechanism that provides maximal freedom and variability, 2) an ever-expanding set of over \textbf{3K} composable atomic tasks, and 3) a general evaluation framework that supports open-ended task assessment. By integrating large language models (LLMs), MCU dynamically creates diverse environments for each evaluation, fostering agent generalization. The framework uses a vision-language model (VLM) to automatically generate evaluation criteria, achieving over 90% agreement with human ratings across multi-dimensional assessments, which demonstrates that MCU is a scalable and explainable solution for evaluating generalist agents. Additionally, we show that while state-of-the-art foundational models perform well on specific tasks, they often struggle with increased task diversity and difficulty.

1931Goal-conditioned Reinforcement Learning with Subgoals Generated from Relabeling

[openreview] [pdf]

Abstract In goal-conditioned reinforcement learning (RL), the primary objective is to develop a goal-conditioned policy capable of reaching diverse desired goals, a process often hindered by sparse reward signals. To address the challenges associated with sparse rewards, existing approaches frequently employ hindsight relabeling, substituting original goals with achieved goals. However, these methods exhibit a tendency to prioritize the optimization of closer achieved goals during training, leading to the loss of potentially valuable information from the trajectory and low sample efficiency. Our key insight is that these achieved goals, generated from the same hindsight relabeling, can serve as effective subgoals to facilitate the learning of policies that reach possible long-horizon desired goals within the same trajectory. Leveraging this perspective, we propose a novel framework called Goal-Conditioned reinforcement learning with Q-BC (i.e, behavior cloning (BC)-regularized Q) and Subgoals (GCQS) for goal-conditioned RL. GCQS is a innovative goal-conditioned actor-critic framework that systematically exploits more trajectory information to improve policy learning and sample efficiency. Specifically, GCQS initially optimizes a Q-BC objective to facilitate learning policies that reach achieved goals effectively. Subsequently, these achieved goals are redefined as subgoals, which serve to enhance the goal-conditioned policies, thereby predicting better actions to reach the desired goals. Experimental results in simulated robotics environments demonstrate that GCQS significantly enhances sample efficiency and overall performance compared to existing goal-conditioned methods.

1932Masked, Regularized Fidelity With Diffusion Models For Highly Ill-posed Inverse Problems

[openreview] [pdf]

Abstract Diffusion models have been well-investigated for solving ill-posed inverse problems to yield excellent performance. However, these models have not been well-adopted to highly ill-posed inverse problems. In this work, we propose zero-shot diffusion model for large and complex kernels, dubbed Dilack, with novel data fidelity terms that are inspired by a hybrid form of generative and classical priors, leading to \emph{regularized fidelity} called pseudo-inverse anchor for constraining (PiAC) fidelity loss, and also are designed from the investigation on the behavior of diffusion models interacting with data fidelity globally unlike locally acting classical regularizers, leading to \emph{masked fidelity} that adaptively enforces spatially and step-wisely local fidelity via mask. Our proposed scheme effectively reduces erratic behavior and inherent artifacts in diffusion models, thereby improving restoration quality including perceptual aspects and outperforming prior arts on both synthetic and real-world datasets for modern lensless imaging and large motion deblurring.

1933Mix-CPT: A Domain Adaptation Framework via Decoupling Knowledge Learning and Format Alignment

[openreview] [pdf]

Abstract Adapting large language models (LLMs) to specialized domains typically requires domain-specific corpora for continual pre-training to facilitate knowledge memorization and related instructions for fine-tuning to apply this knowledge. However, this method may lead to inefficient knowledge memorization due to a lack of awareness of knowledge utilization during the continual pre-training and demands LLMs to simultaneously learn knowledge utilization and format alignment with divergent training objectives during the fine-tuning. To enhance the domain adaptation of LLMs, we revise this process and propose a new domain adaptation framework including domain knowledge learning and general format alignment, called \emph{Mix-CPT}. Specifically, we first conduct a knowledge mixture continual pre-training that concurrently focuses on knowledge memorization and utilization. To avoid catastrophic forgetting, we further propose a logit swap self-distillation constraint. By leveraging the knowledge and capabilities acquired during continual pre-training, we then efficiently perform instruction tuning and alignment with a few general training samples to achieve format alignment. Extensive experiments show that our proposed \emph{Mix-CPT} framework can simultaneously improve the task-solving capabilities of LLMs on the target and general domains.

1934DreamDistribution: Prompt Distribution Learning for Text-to-Image Diffusion Models

[openreview] [pdf]

Abstract The popularization of Text-to-Image (T2I) diffusion models enables the generation of high-quality images from text descriptions. However, generating diverse customized images with reference visual attributes remains challenging. This work focuses on personalizing T2I diffusion models at a more abstract concept or category level, adapting commonalities from a set of reference images while creating new instances with sufficient variations. We introduce a solution that allows a pretrained T2I diffusion model to learn a set of soft prompts, enabling the generation of novel images by sampling prompts from the learned distribution. These prompts offer text-guided editing capabilities and additional flexibility in controlling variation and mixing between multiple distributions. We also show the adaptability of the learned prompt distribution to other tasks, such as text-to-3D. Finally we demonstrate effectiveness of our approach through quantitative analysis including automatic evaluation and human assessment.

1935STARJOB: DATASET FOR LLM-DRIVEN JOB SHOP SCHEDULING

[openreview] [pdf]

Abstract The Job Shop Scheduling Problem (JSSP) presents a significant challenge in opti- mizing production processes. This problem requires efficient allocation of jobs to a limited number of machines while minimizing total processing time (makespan). Although recent advancements in artificial intelligence have produced promising solutions, such as reinforcement learning and graph neural networks, this paper investigates the potential of Large Language Models (LLMs) for addressing JSSP. We introduce the first supervised 120k dataset called Starjob specifically designed to train LLMs for JSSP and we subsequently fintune the LLaMA 8B model on this dataset using Lora. We compare the average makespan gap of our end-to- end LLM-based scheduling method with that of the most widely used priority dispatching rules (PDRs) and neural methods such as L2D. Surprisingly, our find- ings indicate that LLM-based scheduling not only surpasses traditional PDRs but also achieves on average 11.28% on DMU and 3.29% gap improvement on the Tailard benchmarks compared to the state-of-the-art L2D method.

1936GOLD: Graph Out-of-Distribution Detection via Implicit Adversarial Latent Generation

[openreview] [pdf]

Abstract Despite graph neural networks’ (GNNs) great success in modelling graph-structured data, out-of-distribution (OOD) test instances still pose a great challenge for current GNNs. One of the most effective techniques to detect OOD nodes is to expose the detector model with an additional OOD node-set, yet the extra OOD instances are often difficult to obtain in practice. Recent methods for image data address this problem using OOD data synthesis, typically relying on pre-trained generative models like Stable Diffusion. However, these approaches require vast amounts of additional data, as well as one-for-all pre-trained generative models, which are not available for graph data. Therefore, we propose the GOLD framework for graph OOD detection, an implicit adversarial learning pipeline with synthetic OOD exposure without pre-trained models. The implicit adversarial training process employs a novel alternating optimisation framework by training: (1) a latent generative model to regularly imitate the in-distribution (ID) embeddings from an evolving GNN, and (2) a GNN encoder and an OOD detector to accurately classify ID data while increasing the energy divergence between the ID embeddings and the generative model’s synthetic embeddings. This novel approach implicitly transforms the synthetic embeddings into pseudo-OOD instances relative to the ID data, effectively simulating exposure to OOD scenarios without auxiliary data. Extensive OOD detection experiments are conducted on five benchmark graph datasets, verifying the superior performance of GOLD without using real OOD data compared with the state-of-the-art OOD exposure and non-exposure baselines. The code will be released upon acceptance.

1937Improving out-of-distribution generalization by mimicking the human visual diet.

[openreview] [pdf]

Abstract Human visual experience is markedly different from the large-scale computer vision datasets consisting of internet images. Babies densely sample a few 3D3D scenes with diverse variations such as object viewpoints or illuminations, while datasets like ImageNet contain one single snapshot from millions of 3D scenes. We investigated how these differences in input data composition (i.e., visual diet) impact the Out-Of-Distribution (OOD) generalization capabilities of a visual system. Training models on a dataset mimicking attributes of the human-like visual diet improved generalization to OOD lighting, material, and viewpoint changes by up to 18%18\%. This observation held despite the fact that the models were trained on 1,0001,000-fold less training data. Furthermore, when trained on purely synthetic data and tested on natural images, incorporating these visual diet attributes in the training dataset improved OOD generalization by 17%17\%. These experiments are enabled by our newly proposed benchmark---the Human Visual Diet (HVD) dataset, and a new model (Human Diet Network) designed to leverage the attributes of a human-like visual diet. These findings highlight a critical problem in modern day Artificial Intelligence---building better datasets requires thinking beyond dataset size and rather focus on improving data composition. All data and source code will be made available upon publication.

1938Unlocking SVD-Space for Feedback Aligned Local Training

[openreview] [pdf]

Abstract Deep Neural Networks (DNNs) are typically trained using backpropagation, which, despite its effectiveness, requires substantial memory and computing resources. To address these limitations, we propose a novel local training framework that enables efficient and scalable neural network training without relying on global backpropagation. Our framework harnesses the alignment of Singular Value Decomposed (SVD) weight space with feedback matrices, guided by custom layerwise loss functions, to enable efficient and scalable neural network training. We decompose weight matrices into their SVD components before training, and perform local updates on the SVD components themselves, driven by a tailored objective that integrates feedback error, alignment regularization, orthogonality constraints, and sparsity. Our approach leverages Direct Feedback Alignment (DFA) to eliminate the need for global backpropagation and further optimizes model complexity by dynamically reducing the rank of the SVD components during training. The result is a compute- and memory-efficient model with classification accuracy on par with traditional backpropagation while achieving a 50-75% reduction in memory usage and computational cost during training. With strong theoretical convergence guarantees, we demonstrate that training in the SVD space with DFA not only accelerates computation but also offers a powerful, energy-efficient solution for scalable deep learning in resource-constrained environments. Code is available.

1939Two-Step Offline Preference-Based Reinforcement Learning with Constrained Actions

[openreview] [pdf]

Abstract Preference-based reinforcement learning (PBRL) in the offline setting has succeeded greatly in industrial applications such as chatbots. A two-step learning framework where one applies a reinforcement learning step after a reward modeling step has been widely adopted for the problem. However, such a method faces challenges from the risk of reward hacking and the complexity of reinforcement learning. To overcome the challenge, our insight is that both challenges come from the state-actions not supported in the dataset. Such state-actions are unreliable and increase the complexity of the reinforcement learning problem at the second step. Based on the insight, we develop a novel two-step learning method called PRC: preference-based reinforcement learning with constrained actions. The high-level idea is to limit the reinforcement learning agent to optimize over a constrained action space that excludes the out-of-distribution state-actions. We empirically verify that our method has high learning efficiency on various datasets in robotic control environments.

1940MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

[openreview] [pdf]

Abstract Recent advancements in foundation models have enhanced AI systems’ capabilities in autonomous tool usage and reasoning. However, their ability in location or map-based reasoning - which improves daily life by optimizing navigation, facilitating resource discovery, and streamlining logistics - has not been systematically studied. To bridge this gap, we introduce MapEval, a benchmark designed to assess diverse and complex map-based user queries with geo-spatial reasoning. MapEval features three task types (textual, API-based, and visual) that require collecting world information via map tools, processing heterogeneous geo-spatial contexts (e.g., named entities, travel distances, user reviews or ratings, images), and compositional reasoning, which all state-of-the-art foundation models find challenging. Comprising 550+ unique multiple-choice questions about locations across 138 cities and 54 countries, MapEval evaluates foundation models’ ability to handle spatial relationships, map infographics, travel planning, and navigation challenges. Using MapEval, we conducted a comprehensive evaluation of 25 prominent foundation models. While no single model excelled across all tasks, Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieved competitive performance overall. However, substantial performance gaps emerged, particularly in MapEval, where agents with Claude-3.5-Sonnet outperformed GPT-4o and Gemini-1.5-Pro by 13% and 22%, respectively, and the gaps became even more amplified when compared to open-source LLMs. Our in-depth ablations and analyses provide insights strengths and weaknesses of current models, though all models still fall short of human performance by more than 20% on average, struggling with complex map images and rigorous geo-spatial reasoning. This gap highlights MapEval’s critical role in advancing general-purpose foundation models with stronger geo-spatial understanding.

1941CONTRA: Conformal Prediction Region via Normalizing Flow Transformation

[openreview] [pdf]

Abstract Density estimation and reliable prediction regions for outputs are crucial in supervised and unsupervised learning. While conformal prediction effectively generates coverage-guaranteed regions, it struggles with multi-dimensional outputs due to reliance on one-dimensional nonconformity scores. To address this, we introduce CONTRA: CONformal prediction region via normalizing flow TRAnsformation. CONTRA utilizes the latent spaces of normalizing flows to define nonconformity scores based on distances from the center. This allows for the mapping of high-density regions in latent space to sharp prediction regions in the output space, surpassing traditional hyperrectangular or elliptical conformal regions. Further, for scenarios where other predictive models are favored over flow-based models, we extend CONTRA to enhance any such model with a reliable prediction region by training a simple normalizing flow on the residuals. We demonstrate that both CONTRA and its extension maintain guaranteed coverage probability and outperform existing methods in generating accurate prediction regions across various datasets. We conclude that CONTRA is an effective tool for (conditional) density estimation, addressing the under-explored challenge of delivering multi-dimensional prediction regions.

1942PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Model

[openreview] [pdf]

Abstract Controllable generation is considered a potentially vital approach to address the challenge of annotating 3D data, and the precision of such controllable generation becomes particularly imperative in the context of data production for autonomous driving. Existing methods focus on the integration of diverse generative information into controlling inputs, utilizing frameworks such as GLIGEN or ControlNet, to produce commendable outcomes in controllable generation. However, such approaches intrinsically restrict generation performance to the learning capacities of predefined network architectures. In this paper, we explore the integration of controlling information and introduce PerLDiff (\textbf{Per}spective-\textbf{L}ayout \textbf{Diff}usion Models), a method for effective street view image generation that fully leverages perspective 3D geometric information. Our PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process, resulting in a more robust and controllable output. Moreover, it demonstrates superior controllability compared to alternative layout control methods. Empirical results justify that our PerLDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets.

1943ComaDICE: Offline Cooperative Multi-Agent Reinforcement Learning with Stationary Distribution Shift Regularization

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) has garnered significant attention for its ability to learn effective policies from pre-collected datasets without the need for further environmental interactions. While promising results have been demonstrated in single-agent settings, offline multi-agent reinforcement learning (MARL) presents additional challenges due to the large joint state-action space and the complexity of multi-agent behaviors. A key issue in offline RL is the distributional shift, which arises when the target policy being optimized deviates from the behavior policy that generated the data. This problem is exacerbated in MARL due to the interdependence between agents’ local policies and the expansive joint state-action space. Prior approaches have primarily addressed this challenge by incorporating regularization in the space of either Q-functions or policies. In this work, we propose a novel type of regularizer in the space of stationary distributions to address the distributional shift more effectively. Our algorithm, ComaDICE, provides a principled framework for offline cooperative MARL to correct the stationary distribution of the global policy, which is then leveraged to derive local policies for individual agents. Through extensive experiments on the offline multi-agent MuJoCo and StarCraft II benchmarks, we demonstrate that ComaDICE achieves superior performance compared to state-of-the-art offline MARL methods across nearly all tasks.

1944Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise

[openreview] [pdf]

Abstract Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.

1945Collapsed Language Models Promote Fairness

[openreview] [pdf]

Abstract To mitigate societal biases implicitly encoded in recent successful pretrained language models, a diverse array of approaches have been proposed to encourage model fairness, focusing on prompting, data augmentation, regularized fine-tuning, and more. Despite the development, it is nontrivial to reach a principled understanding of fairness and an effective algorithm that can consistently debias language models. In this work, by rigorous evaluations of Neural Collapse -- a learning phenomenon happen in last-layer representations and classifiers in deep networks -- on fairness-related words, we find that debiased language models exhibit collapsed alignment between token representations and word embeddings. More importantly, this observation inspires us to design a principled fine-tuning method that can effectively improve fairness in a wide range of debiasing methods, while still preserving the performance of language models on standard natural language understanding tasks. We attach our code athttps://anonymous.4open.science/r/Fairness_NC-457E.

1946On the Convergence of Adam-Type Algorithms for Bilevel Optimization under Unbounded Smoothness

[openreview] [pdf]

Abstract Adam has become one of the most popular optimizers for training modern deep neural networks, such as transformers. However, its applicability is largely restricted to single-level optimization problems. In this paper, we aim to extend vanilla Adam to tackle bilevel optimization problems, which have important applications in machine learning, such as meta-learning. In particular, we study stochastic bilevel optimization problems where the lower-level function is strongly convex and the upper-level objective is nonconvex with potentially unbounded smoothness. This unbounded smooth objective function covers a broad class of neural networks, including transformers, which may exhibit non-Lipschitz gradients. In this work, we first introduce AdamBO, a single-loop Adam-type method that achieves O~(ϵ4)\widetilde{O}(\epsilon^{-4}) oracle complexity to find ϵ\epsilon-stationary points, where the oracle calls involve stochastic gradient or Hessian/Jacobian-vector product evaluations. The key to our analysis is a novel randomness decoupling lemma that provides refined control over the lower-level variable. Additionally, we propose VR-AdamBO, a variance-reduced version with an improved oracle complexity of O~(ϵ3)\widetilde{O}(\epsilon^{-3}). The improved analysis is based on a novel stopping time approach and a careful treatment of the lower-level error. We conduct extensive experiments on various machine learning tasks involving bilevel formulations with recurrent neural networks (RNNs) and transformers, demonstrating the effectiveness of our proposed Adam-type algorithms.

1947Reducing Bias in Feature Extractors for Extreme Universal Domain Adaptation

[openreview] [pdf]

Abstract Universal Domain Adaptation (UniDA) aims to transfer knowledge from a labeled source domain to an unlabeled target domain without prior knowledge of the label sets between the two domains. The goal of UniDA is to achieve robust performance under arbitrary label-set distributions. However, existing literature has not sufficiently explored performance across diverse distribution scenarios. Our experiments reveal that existing methods struggle when the source domain has significantly more non-overlapping classes than overlapping ones, a setting we refer to asExtreme UniDA. In this paper, we demonstrate that classical partial domain alignment, which focuses on aligning only overlapping-class data between domains, is limited in mitigating feature extractor bias in extreme UniDA scenarios. We argue that feature extractors trained with source supervised loss disrupt the intrinsic structure of target data due to the inherent differences between source-private-class data and target data. To mitigate this bias, we employ self-supervised learning to preserve the structure of target data. This method can be easily integrated into existing frameworks. We apply the proposed approach to two distinct training paradigms—adversarial-based and optimal-transport-based—and show consistent improvements across various class-set distributions, with significant gains in extreme UniDA settings.

1948Mitigating Suboptimality of Deterministic Policy Gradients in Complex Q-functions

[openreview] [pdf]

Abstract In reinforcement learning, off-policy actor-critic approaches like DDPG and TD3 are based on the deterministic policy gradient. Herein, the Q-function is trained from off-policy environment data and the actor (policy) is trained to maximize the Q-function via gradient ascent. We observe that in complex tasks like dexterous manipulation and restricted locomotion, the Q-value is a complex function of action, having several local optima or discontinuities. This poses a challenge for gradient ascent to traverse and makes the actor prone to get stuck at local optima. To address this, we introduce a new actor architecture that combines two simple insights: (i) use multiple actors and evaluate the Q-value maximizing action, and (ii) learn surrogates to the Q-function that are simpler to optimize with gradient-based methods. We evaluate tasks such as restricted locomotion, dexterous manipulation, and large discrete-action space recommender systems and show that our actor finds more optimal actions and outperforms alternate actor architectures.

1949Approximation algorithms for combinatorial optimization with predictions

[openreview] [pdf]

Abstract We initiate a systematic study of utilizing predictions to improve over approximation guarantees of classic algorithms, without increasing the running time. We propose a generic method for a wide class of optimization problems that ask to select a feasible subset of input items of minimal (or maximal) total weight. This gives simple (near-)linear-time algorithms for, e.g., Vertex Cover, Steiner Tree, Minimum Weight Perfect Matching, Knapsack, and Maximum Clique. Our algorithms produce an optimal solution when provided with perfect predictions and their approximation ratio smoothly degrades with increasing prediction error. With small enough prediction error we achieve approximation guarantees that are beyond the reach without predictions in given time bounds, as exemplified by the NP-hardness and APX-hardness of many of the above problems. Although we show our approach to be optimal for this class of problems as a whole, there is a potential for exploiting specific structural properties of individual problems to obtain improved bounds; we demonstrate this on the Steiner Tree problem. We conclude with an empirical evaluation of our approach.

1950Generalization Bounds for Canonicalization: A Comparative Study with Group Averaging

[openreview] [pdf]

Abstract Canonicalization, a popular method for generating invariant or equivariant function classes from arbitrary function sets, involves initial data projection onto a reduced input space subset, followed by applying any learning method to the projected dataset. Despite recent research on the expressive power and continuity of functions represented by canonicalization, its generalization capabilities remain less explored. This paper addresses this gap by theoretically examining the generalization benefits and sample complexity of canonicalization, comparing them with group averaging, another popular technique for creating invariant or equivariant function classes. Our findings reveal two distinct regimes where canonicalization may outperform or underperform compared to group averaging, with precise quantification of this phase transition in terms of sample size and group action characteristics. To the best of our knowledge, this study represents the first theoretical exploration of such behavior, offering insights into the relative effectiveness of canonicalization and group averaging under varying conditions.

1951Leveraging AutoML for Sustainable Deep Learning: A Multi-Objective HPO Approach on Deep Shift Neural Networks

[openreview] [pdf]

Abstract Deep Learning (DL) has advanced various fields by extracting complex patterns from large datasets. However, the computational demands of DL models pose environmental and resource challenges. Deep Shift Neural Networks (DSNNs) improve the situation by leveraging shift operations to reduce computational complexity at inference. Compared to common DNNs, DSNNs are still less well understood and less well optimized. By leveraging AutoML techniques, we provide valuable insights into the potential of DSNNs and how to design them in a better way. Following the insights from common DNNs, we propose to leverage the full potential of DSNNs by means of AutoML techniques. We study the impact of hyperparameter optimization (HPO) on maximizing DSNN performance while minimizing resource consumption. Since we consider complementary objectives such as accuracy and energy consumption, we combine state-of-the-art multi-fidelity (MF) HPO with multi-objective optimization to find a set of Pareto-optimal trade-offs on how to design DSNNs. Our approach led to significantly better configurations of DSNNs regarding loss and emissions compared to default DSNNs. This includes simultaneously increasing performance by about 20% and reducing emissions by about 10%. Investigating the behavior of quantized networks in terms of both emissions and accuracy, our experiments reveal surprising model-specific trade-offs, yielding the greatest energy savings. For example, in contrast to common expectations, selectively quantizing smaller portions of the network with low precision is optimal while retaining or improving performance. We corroborated these findings across multiple backbone architectures, highlighting important nuances in quantization strategies and offering an automated approach to balancing energy efficiency and model performance.

1952RRM: Robust Reward Model Training Mitigates Reward Hacking

[openreview] [pdf]

Abstract Reward models (RMs) play a pivotal role in aligning large language models (LLMs) with human preferences. However, traditional RM training, which relies on response pairs tied to specific prompts, struggles to disentangle prompt-driven preferences from prompt-independent artifacts, such as response length and format. In this work, we expose a fundamental limitation of current RM training methods, where RMs fail to effectively distinguish between contextual signals and irrelevant artifacts when determining preferences. To address this, we introduce a causal framework that learns preferences independent of these artifacts and propose a novel data augmentation technique designed to eliminate them. Extensive experiments show that our approach successfully filters out undesirable artifacts, yielding a more robust reward model (RRM). Our RRM improves the performance of a pairwise reward model trained on Gemma-2-9b-it, on Reward-Bench, increasing accuracy from 80.61% to 84.15%. Additionally, we train two DPO policies using both the RM and RRM, demonstrating that the RRM significantly enhances DPO-aligned policies, improving MT-Bench scores from 7.27 to 8.31 and length-controlled win-rates in AlpacaEval-2 from 33.46% to 52.49%.

1953On-Policy Policy Gradient Reinforcement Learning Without On-Policy Sampling

[openreview] [pdf]

Abstract On-policy reinforcement learning RL algorithms perform policy updates using i.i.d. trajectories collected by the current policy. However, after observing only a finite number of trajectories, on-policy sampling may produce data that fails to match the expected on-policy data distribution. This sampling error leads to noisy updates and data inefficient on-policy learning. Recent work in the policy evaluation setting has shown that non-i.i.d., off-policy sampling can produce data with lower sampling error than on-policy sampling can produce~\citep{zhong2022robust}. Motivated by this observation, we introduce an adaptive, off-policy sampling method to improve the data efficiency of on-policy policy gradient algorithms. Our method, Proximal Robust On-Policy Sampling (PROPS) reduces sampling error by collecting data with a behavior policy that increases the probability of sampling actions that are under-sampled with respect to the current policy. Rather than discarding data from old policies -- as is commonly done in on-policy algorithms -- PROPS uses data collection to adjust the distribution of previously collected data to be approximately on-policy. We empirically evaluate PROPS on both continuous-action MuJoCo benchmark tasks as well discrete-action tasks and demonstrate that (1) PROPS decreases sampling error throughout training and (2) improves the data efficiency of on-policy policy gradient algorithms. Our work improves the RL community’s understanding of a nuance in the on-policy vs off-policy dichotomy: on-policy learning requires on-policy data, not on-policy sampling.

1954Calibration of ordinal regression networks

[openreview] [pdf]

Abstract Recent studies have shown that deep neural networks are not well-calibrated and produce over-confident predictions. The miscalibration issue primarily stems from the minimization of cross-entropy, which aims to align predicted softmax probabilities with one-hot labels. In ordinal regression tasks, this problem is compounded by an additional challenge: the expectation that softmax probabilities should exhibit unimodal distribution is not met with cross-entropy. Rather, the ordinal regression literature has focused on unimodality and overlooked calibration. To address these issues, we propose a novel loss function that introduces order-aware calibration, ensuring that prediction confidence adheres to ordinal relationships between classes. It incorporates soft ordinal encoding and label-smoothing-based regularization to enforce both calibration and unimodality. Extensive experiments across three popular ordinal regression benchmarks demonstrate that our approach achieves state-of-the-art calibration without compromising accuracy.

1955Minimizing Dependence between Embedding Dimensions with Adversarial Networks

[openreview] [pdf]

Abstract Learning representations with minimally dependent embedding dimensions can have many potential benefits such as improved generalization and interpretability. This work provides a differentiable and scalable algorithm for dependence minimization, moving beyond existing linear pairwise decorrelation methods. Our algorithm involves an adversarial game where small networks identify dimension relationships, while the main model exploits this information to reduce dependencies. We empirically verify that the algorithm converges. We then explore dependence reduction as a proxy for maximizing information content. We showcase the algorithm’s effectiveness on the Clevr-4 dataset, both with and without supervision, and achieve promising results on the ImageNet dataset. Finally, we propose an algorithm modification that gives more control over the level of dependency, sparking a discussion on optimal redundancy levels for specific applications. Although the algorithm performs well on synthetic data, further research is needed to optimize it for tasks such as out-of-distribution detection.

1956Investigating Mixture Policies in Entropy-Regularized Actor-Critic

[openreview] [pdf]

Abstract We study mixture policies in entropy-regularized reinforcement learning. Mixture policies offer greater flexibility than base policies like Gaussians, which we show theoretically provides improved solution quality and robustness to the entropy scale. Despite these potential benefits, they are rarely used for algorithms like Soft Actor-Critic, potentially due to the fact that Gaussians are easily reparameterized to get lower variance gradient updates, but mixtures are not. We fill this gap, introducing reparameterization gradient estimators for the mixture policy. Through extensive experiments on environments from classic control, MuJoCo, the DeepMind Control Suite and a suite of randomly generated bandits, our results show that mixture policies explore more efficiently in tasks with unshaped rewards (across entropy scales), while performing comparably to base policies in tasks with shaped rewards, and are more robust to multimodal critic surfaces.

1957Adam Exploitsℓ∞-geometry of Loss Landscape via Coordinate-wise Adaptivity

[openreview] [pdf]

Abstract Adam outperforms SGD when training language models. Yet such benefits are not well-understood theoretically -- previous convergence analysis for Adam and SGD mainly focuses on the number of steps TT and is already minimax-optimal in non-convex cases, which are both O(T1/4)O(T^{-1/4}). In this work, we argue that the better dependence on the loss smoothness is the key advantage of Adam over SGD. More specifically, we give a new convergence analysis for Adam under novel assumptions that loss is smooth under \ell_\infty geometry rather than the more common 2\ell_2 geometry, which yields a much better empirical smoothness constant for GPT-2 and ResNet models. Moreover, we show that if we rotate the training loss randomly, Adam can be outperformed by some variants of SGD which is invariant to rotations. This implies that any practically relevant explanation of Adam’s optimization benefit must involve non-rotational invariant properties of loss, such as \ell_\infty smoothness as used in our analysis. We also extend the convergence analysis to blockwise Adam, which is a generalization of standard Adam.

1958Multi-level Certified Defense Against Poisoning Attacks in Offline Reinforcement Learning

[openreview] [pdf]

Abstract Similar to other machine learning frameworks, Offline Reinforcement Learning (RL) is shown to be vulnerable to poisoning attacks, due to its reliance on externally sourced datasets, a vulnerability that is exacerbated by its sequential nature. To mitigate the risks posed by RL poisoning, we extend certified defenses to provide larger guarantees against adversarial manipulation, ensuring robustness for both per-state actions, and the overall expected cumulative reward. Our approach leverages properties of Differential Privacy, in a manner that allows this work to span both continuous and discrete spaces, as well as stochastic and deterministic environments---significantly expanding the scope and applicability of achievable guarantees. Empirical evaluations demonstrate that our approach ensures the performance drops to no more than 50% with up to 7% of the training data poisoned, significantly improving over the 0.008% in prior work (Wu et al., 2022), while producing certified radii that is 5 times larger as well. This highlights the potential of our framework to enhance safety and reliability in offline RL.

1959Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques

[openreview] [pdf]

Abstract We initiate the study of Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations. We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective MARLHF, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We propose an extra penalty based on dataset distribution to incorporate pessimism, enhancing stability and effectiveness during training. Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.

1960REPANA: Reasoning Path Navigated Program Induction for Universally Reasoning over Heterogeneous Knowledge Bases

[openreview] [pdf]

Abstract Program induction is a typical approach that helps Large Language Models (LLMs) in complex knowledge-intensive question answering over knowledge bases (KBs) to alleviate the hallucination of LLMs. However, the accurate program induction usually requires a large number of high-quality parallel data of a specific KB, which is difficult to acquire for many low-resource KBs. Additionally, due to heterogeneity of questions and KB schemas, the transferability of a model trained on a single dataset is poor. To this end, we propose REPANA, a reasoning path navigated program induction framework that enables LLMs to reason over heterogeneous KBs. We decouple the program generation capability into perceiving the KB and mapping questions to program sketches. Accordingly, our framework consists of two main components. The first is an LLM-based navigator, which retrieves reasoning paths of the input question from the given KB. The second is a KB-agnostic parser trained on data from multiple heterogeneous datasets, taking the navigator’s retrieved paths and the question as input and generating the corresponding program. Experiments show that REPANA exhibits strong generalization and transferability. It can directly perform inference on datasets not seen during training, outperforming other SoTA low-resource methods and even approaching the performance of supervised methods.

1961Agile Flight with Optimization Embedded Networks

[openreview] [pdf]

Abstract To bridge the gap between perception and planning in traditional navigation systems, we address the challenge of learning optimal trajectories directly from depth information in an end-to-end fashion. Using neural networks as black-box replacements for traditional modules can compromise robustness and stability. Moreover, such methods often fail to adequately account for the robot’s kinematic constraints, leading to trajectories that may not be satisfactorily executable. In this paper, we integrate the strengths of conventional methods and neural networks by introducing an optimization-embedded neural network based on a compact trajectory library. Neural networks establish spatial constraints for model-based trajectory planning, followed by robust numerical optimization to achieve feasible and optimal solutions. By making the process differentiable, our model seamlessly approximates the optimal trajectory. Additionally, the introduction of a regularized trajectory library enables the method to efficiently capture the spatial distribution of optimal trajectories with minimal storage cost, ensuring multimodal planning characteristics. Evaluations in complex, unseen environments demonstrate our method’s superior performance over state-of-the-art algorithms. Real-world flight experiments with a small onboard computer showcase the autonomous quadrotor’s ability to navigate swiftly through dense forests.

1962Learning the Optimal Stopping for Early Classification within Finite Horizons via Sequential Probability Ratio Test

[openreview] [pdf]

Abstract Time-sensitive machine learning benefits from Sequential Probability Ratio Test (SPRT), which provides an optimal stopping time for early classification of time series. However, infinite horizonscenarios, where input lengths are finite, determining the optimal stopping rule becomes computationally intensive due to the need forbackward induction, limiting practical applicability. We thus introduce FIRMBOUND, an SPRT-based framework that efficiently estimates the solution to backward induction from training data, bridging the gap between optimal stopping theory and real-world deployment. It employsdensity ratio estimationandconvex function learningto provide statistically consistent estimators for sufficient statistic and conditional expectation, both essential for solving backward induction; consequently, FIRMBOUND minimizes Bayes risk to reach optimality. Additionally, we present a faster alternative using Gaussian process regression, which significantly reduces training time while retaining low deployment overhead, albeit with potential compromise in statistical consistency. Experiments across independent and identically distributed (i.i.d.), non-i.i.d., binary, multiclass, synthetic, and real-world datasets show that FIRMBOUND achieves optimalities in the sense of Bayes risk and speed-accuracy tradeoff. Furthermore, it advances the tradeoff boundary toward optimality when possible and reduces decision-time variance, ensuring reliable decision-making. Code is included in the supplementary materials.

1963Preference Discerning in Generative Sequential Recommendation

[openreview] [pdf]

Abstract Sequential recommendation systems aim to provide personalized recommendations for users based on their interaction history. To achieve this, they often incorporate auxiliary information, such as textual descriptions of items and auxiliary tasks, like predicting user preferences and intent. Despite numerous efforts to enhance these models, they still suffer from limited personalization. To address this issue, we propose a new paradigm, which we termpreference discerning. Inpreference discerning, we explicitly condition a generative sequential recommendation system on user preferences within its context. The user preferences are generated by large language models (LLMs) based on user reviews. To evaluatepreference discerningcapabilities of sequential recommendation systems, we introduce a novel benchmark that provides a holistic evaluation across various scenarios, including preference steering and sentiment following. We assess current state-of-the-art methods using our benchmark and show that they struggle to accurately discern user preferences. Therefore, we propose a new method named Mender (Multimodal preferencediscerner), which improves upon existing methods and achieves state-of-the-art performance on our benchmark. Our results show that Mender can be effectively guided by human preferences, paving the way toward more personalized sequential recommendation systems. We will open-source the code and benchmarks upon publication.

1964A Closer Look at Personalized Fine-Tuning in Heterogeneous Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) enables privacy-preserving, decentralized model training but faces significant challenges in balancing global generalization and local personalization due to non-identical data distributions across clients. While Personalized Fine-Tuning (PFT) adapts models to local data, excessive personalization often degrades global performance. In this work, we present a comprehensive empirical study encompassing seven diverse datasets, multiple model architectures, and various fine-tuning methods under both covariate and concept shift scenarios. Our extensive evaluation reveals critical limitations in existing PFT methods, which struggle with overfitting and exhibit inconsistent performance across distribution shifts, even with careful hyperparameter tuning and regularization. To address these issues, we identify LP-FT, a simple yet effective strategy that combines Linear Probing with full Fine-Tuning, adapted to the FL setting. LP-FT consistently outperforms existing methods, achieving an optimal balance between local personalization and global generalization across all tested scenarios. By investigating the feature change after PFT, we hypothesize the a phenomena dubbed as federated feature distortion is linked to the global generalization. Motivated by the observation, we provide a theoretical analysis of two-layer linear networks, offering novel insights into the conditions under which LP-FT excels, thereby enhancing our understanding of personalization dynamics in FL. This work contributes in three key areas: (1) a rigorous and comprehensive evaluation of PFT methods under diverse distribution shifts, (2) the introduction of LP-FT as a robust and versatile solution to FL personalization challenges, and (3) theoretical foundations that explain LP-FT’s superior effectiveness. Our findings set a new venue for PFT research and provide valuable insights to the broader FL community.

1965LayerDAG: A Layerwise Autoregressive Diffusion Model for Directed Acyclic Graph Generation

[openreview] [pdf]

Abstract Directed acyclic graphs (DAGs) serve as crucial data representations in domains such as hardware synthesis and compiler/program optimization for computing systems. DAG generative models facilitate the creation of synthetic DAGs, which can be used for benchmarking computing systems while preserving intellectual property. However, generating realistic DAGs is challenging due to their inherent directional and logical dependencies. This paper introduces LayerDAG, an autoregressive diffusion model, to address these challenges. LayerDAG decouples the strong node dependencies into manageable units that can be processed sequentially. By interpreting the partial order of nodes as a sequence of bipartite graphs, LayerDAG leverages autoregressive generation to model directional dependencies and employs diffusion models to capture logical dependencies within each bipartite graph. Comparative analyses demonstrate that LayerDAG outperforms existing DAG generative models in both expressiveness and generalization, particularly for generating large-scale DAGs with up to 400 nodes—a critical scenario for system benchmarking. Extensive experiments on both synthetic and real-world flow graphs from various computing platforms show that LayerDAG generates valid DAGs with superior statistical properties and benchmarking performance. The synthetic DAGs generated by LayerDAG enhance the training of ML-based surrogate models, resulting in improved accuracy in predicting performance metrics of real-world DAGs across diverse computing platforms.

1966GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning

[openreview] [pdf]

Abstract Parameter-efficient fine-tuning (PEFT) of vision-language models (VLMs) excels in various vision tasks thanks to the rich knowledge and generalization ability of VLMs. However, recent studies revealed that such fine-tuned VLMs are vulnerable to spurious correlations stemming from the subgroup imbalance in the fine-tuning datasets. To resolve this issue, we propose Group Context Optimization (GroupCoOp), a simple and effective debiased fine-tuning algorithm that enhances the group robustness of fine-tuned VLMs without group labels. Its key idea is to employ group-specific text prompts as group representatives serving as multiple classifiers for their target class. The rich semantic knowledge of the text encoder of VLM enables the discovery of effective group prompts even for groups with a small number of training samples. Leveraging the group prompts for each class addresses the issues caused by the group-imbalanced training set, such as the neglect of minority groups and the scattered distribution of each class in the embedding space. Moreover, we propose a simple yet fairly effective pseudo group labeling algorithm, which allows GroupCoOp to fine-tune VLMs without manual group labels. GroupCoOp achieved the best results on five benchmarks across five CLIP architectures and even outperformed prior methods that train the entire network, despite training only 0.016% of the network’s parameters. GroupCoOp demonstrates robust performance even with extremely limited training samples, where the minority group sample is limited to a single instance.

1967Optimization-Biased Hypernetworks for Generalizable Policy Generation

[openreview] [pdf]

Abstract Policy learning through behavior cloning poses significant challenges, particularly when demonstration data is limited. In this work, we present HyPoGen, a novel optimization-biased hypernetwork structure for policy generation. The proposed hypernetwork learns to synthesize optimal policy parameters solely from task specifications, by modeling policy generation as an approximation of the optimization process executed over a finite number of steps and assuming these specifications serve as a sufficient representation of the demonstration data. By incorporating structural designs that bias the hypernetwork towards optimization, we can improve its generalization capability while being trained only on source task demonstrations. During the feed-forward prediction pass, the hypernetwork effectively performs an optimization in the latent (compressed) policy space, which is then decoded into policy parameters for action prediction. Experimental results on locomotion and manipulation benchmarks show that HyPoGen significantly outperforms state-of-the-art methods in generating policies for unseen target tasks without any demonstrations, achieving higher success rates and evidencing improved generalizable policy generation capability. Our work underscores the potential of optimization-biased hypernetworks in advancing generalizable policy generation. Our code and models will be made available.

1968Stutter makes large language models smarter

[openreview] [pdf]

Abstract Large language models (LLMs) have achieved remarkable success in generating coherent and contextually relevant text. However, their large parameters and high memory requirements limit their efficiency and adoption in industry and academia. Recent studies have shown that dynamically adjusting inference operations can improve model performance without significantly increasing size. In this paper, we introduce the stutter mechanism, a novel method that enhances transformer models by selectively applying additional layers to more challenging tokens. This approach mimics a human speaker’s stutter, allocating more computational effort where needed, thus improving language capabilities without generating excessive tokens. Our experiments with various Pythia models demonstrate that the stutter mechanism consistently enhances performance across benchmark datasets. Specifically, the Pythia-410M model, enhanced by our method, outperforms the larger Pythia-1B model on WinoGrande and WSC. Additionally, our method is data-efficient, requiring only less than 1% of the pretraining data for the additional training. These results highlight the stutter mechanism’s potential to enhance LLMs’ efficiency and performance in real-world applications.

1969Stealing User Prompts from Mixture-of-Experts Models

[openreview] [pdf]

Abstract Mixture of Expert (MoE) models improve the efficiency and scalability of dense language models by \emph{routing} each token to a small number of experts in each layer of the model. In this paper, we show how an adversary that can arrange for their queries to appear in the same batch of examples as a victim’s queries can exploit expert-choice routing to the full disclosure of a victim’s prompt. We successfully demonstrate the effectiveness of this attack on a two-layered Mixtral model. Our results show that we can extract the entire prompt using O(Vocabulary size×prompt length2)\mathcal{O}(\text{Vocabulary size} \times \text{prompt length}^2) queries or a maximum of 100 queries per token in the setting we consider. Our work is the first of its kind data reconstruction attack that originates from in a flaw in the model architecture, as opposed to the model parameterization.

1970A Phase Transition Induces Catastrophic Overfitting in Adversarial Training

[openreview] [pdf]

Abstract We derive the implicit bias of Projected Gradient Descent (PGD) Adversarial Training (AT). We show that a phase transition in the loss structure of as a function of the adversarial budget ϵ\epsilon manifests as Catastrophic Overfitting (CO). Below a critical threshold ϵc\epsilon_c, single step methods efficiently provide an increase in robustness, while above this critical point, additional PGD steps and/or regularization are needed. We show that high curvature solutions arise in the implicit bias of PGD AT. We provide analytical and empirical evidence for our arguments by appealing to a simple model with one-dimensional inputs and a single trainable parameter, where the CO phenomenon can be replicated. In this model, we show that such high curvature solutions exist for arbitrarily small ϵ\epsilon. Additionally, we can compute the critical value ϵc\epsilon_c in single-step AT for bounded parameter norms. We believe our work provides a deeper understanding of CO that aligns with the intuition the community has built around it.

1971Robust Consensus Anchor Learning for Efficient Multi-view Subspace Clustering

[openreview] [pdf]

Abstract As a leading unsupervised classification algorithm in artificial intelligence, multi-view subspace clustering segments unlabeled data from different subspaces. Recent works based on the anchor have been proposed to decrease the computation complexity for the datasets with large scales in multi-view clustering. The major differences among these methods lie on the objective functions they define. Despite considerable success, these works pay few attention to guaranting the robustness of learned consensus anchors via effective manner for efficient multi-view clustering and investigating the specific local distribution of cluster in the affine subspace. Besides, the robust consensus anchors as well as the common cluster structure shared by different views are not able to be simultaneously learned. In this paper, we propose Robust Consensus anchors learning for efficient multi-view Subspace Clustering (RCSC). We first show that if the data are sufficiently sampled from independent subspaces, and the objective function meets some conditions, the achieved anchor graph has the block-diagonal structure. As a special case, we provide a model based on Frobenius norm, non-negative and affine constraints in consensus anchors learning, which guarantees the robustness of learned consensus anchors for efficient multi-view clustering and investigates the specific local distribution of cluster in the affine subspace. While it is simple, we theoretically give the geometric analysis regarding the formulated RCSC. The union of these three constraints is able to restrict how each data point is described in the affine subspace with specific local distribution of cluster for guaranting the robustness of learned consensus anchors. RCSC takes full advantages of correlation among consensus anchors, which encourages the grouping effect and groups highly correlated consensus anchors together with the guidance of view-specific projection. The anchor graph construction, partition and robust anchor learning are jointly integrated into a unified framework. It ensures the mutual enhancement for these procedures and helps lead to more discriminative consensus anchors as well as the cluster indicator. We then adopt an alternative optimization strategy for solving the formulated problem. Experiments performed on eight multi-view datasets confirm the superiority of RCSC based on the effectiveness and efficiency.

1972Training and Evaluating Causal Forecasting Models for Time-Series

[openreview] [pdf]

Abstract Deep learning time-series models are often used to make forecasts that inform downstream decisions. Since these decisions can differ from those in the training set, there is an implicit requirement that time-series models will generalize outside of their training distribution. Despite this core requirement, time-series models are typically trained and evaluated on in-distribution predictive tasks. We extend the orthogonal statistical learning framework to train causal time-series models that generalize better when forecasting the effect of actions outside of their training distribution. To evaluate these models, we leverage Regression Discontinuity Designs popular in economics to construct a test set of causal treatment effects.

1973Error Feedback for Smooth and Nonsmooth Convex Optimization with Constant, Decreasing and Polyak Stepsizes

[openreview] [pdf]

Abstract Error feedback, originally proposed a decade ago by Seide et al (2014), is an immensely popular strategy for stabilizing the convergence behavior of distributed algorithms employing communication compression via the application of contractive compression operators, such as greedy and random sparsification, quantization, and low-rank approximation. While our algorithmic and theoretical understanding of error feedback has grown immensely over the years, several important considerations remained elusive. For example, the theory of error feedback is fully focused on the smooth convex and nonconvex regimes, and results in the nonsmooth convex setting are limited. This is not a coincidence: Error feedback works when the gradients converge, and this is not necessarily the case in the nonsmooth setting. Further, existing stepsize rules for error feedback are limited to constant schedules; a by-product of the current theoretical approach to analyzing error feedback. By modifying the algorithmic design of error feedback, we are able to resolve these issues. In particular, we provide a comprehensive analysis covering both the smooth and nonsmooth convex regimes, and give support for constant, decreasing and adaptive (Polyak-type) stepsizes. This is the first time such results are obtained. In particular, this is the first time adaptive stepsizes have successfully been combined with compression mechanisms. Our theoretical results are corroborated with suitable numerical experiments.

1974Debiasing Text-to-image Diffusion Models with Self-discovering Latent Directions

[openreview] [pdf]

Abstract While Diffusion Models (DM) exhibit remarkable performance across various image generative tasks, they nonetheless reflect the inherent bias presented in the training set. As DMs are now widely used in real-world applications, these biases could perpetuate a distorted worldview and hinder opportunities for minority groups. Existing methods on debiasing DMs usually requires model re-training with a human-crafted reference dataset or additional classifiers, which suffer from two major limitations: (1) collecting reference datasets causes expensive annotation cost; (2) the debiasing performance is heavily constrained by the quality of the reference dataset or the additional classifier. To address the above limitations, we propose DebiasDiff, a plug-and-play method that learns attribute latent directions in a self-discovering manner, thus eliminating the reliance on such reference dataset. Specifically, DebiasDiff consists of two parts: a set of attribute adapters and a distribution indicator. Each adapter in the set aims to learn an attribute latent direction, and is optimized via noise composition through a self-discovering process.Then, the distribution indicator is multiplied by the set of adapters to guide the generation process towards the prescribed distribution. Our method enables debiasing multiple attributes in DMs simultaneously, while remaining lightweight and easily integrable with other DMs, eliminating the need for re-training. Extensive experiments on debiasing gender, racial, and their intersectional biases show that our method outperforms previous SOTA by a large margin.

1975COSTAR: Dynamic Safety Constraints Adaptation in Safe Reinforcement Learning

[openreview] [pdf]

Abstract Recent advancements in safe reinforcement learning (safe RL) have focused on developing agents that maximize rewards while satisfying predefined safety constraints. However, the challenge of learning policies capable of generalizing to dynamic safety requirements has rarely been explored. To this end, we propose a novel COntrastive Safe TAsk Representation (COSTAR) framework for safe RL, which can boost existing algorithm’s generalization to dynamic safety constraints, including variable cost functions and safety thresholds.In COSTAR, we employ a Safe Task Encoder to extract safety-specific representations from trajectory contexts, effectively distinguishing between various safety constraints with contrastive learning. It is noteworthy that our framework can integrate with existing safe RL algorithms and possesses zero-shot adaptation capability to varying safety constraints during deployment. Extensive experiments demonstrate that our COSTAR framework consistently achieves high rewards while maintaining low costs, and exhibits robust generalization capabilities when dealing with out-of-distribution (OOD) tasks.

1976TOMVALLEY: EVALUATING THE THEORY OF MIND REASONING OF LLMS IN REALISTIC SOCIAL CONTEXT

[openreview] [pdf]

Abstract As large language models (LLMs) are increasingly involved in human society, some studies try to evaluate LLMs’ capability of theory of mind (ToM), which is about the understanding and reasoning of others’ mental states and possible actions. However, these previous works simplify the ToM capability required in real social contexts during their evaluations. This can be reflected in three aspects: (1) most evaluations focus on astatic mental stateafter several social scenarios while ignoring the changes of mental states across different scenarios; (2) they mainly considerindependent mental states, however different kinds of mental states (beliefs, intentions, and emotions) and actions can influence one another in our real life; (3) there is anabsence of social settings and character profilesin their evaluation, even though humans can effortlessly obtain and utilize this information in ToM reasoning processes. This lack can underestimate the abilities of LLMs. This paper aims to evaluate LLMs’ ToM capability in closer alignment with a realistic social context. Correspondingly, we propose a new benchmark, namedToMValley, which alleviates the limitations mentioned above of previous works. Specifically, the benchmark is constructed using a framework that includes four steps: social background determination, mental state sketch, social scenario design, and rule-based question generation. Overall, there are 1100 social contexts and 78100 questions about characters’ mental states. The quality of the benchmark is manually verified. Additionally, we evaluate ten popular LLMs onToMValley. Experimental results suggest that LLMs’ performances are significantly inferior to human levels by 11%. Subsequent investigation indicates that LLMs are ineffective at interpreting alterations in mental states across social scenarios. Furthermore, we observe that LLMs are incapable of addressing compositional questions that necessitate multi-hop reasoning within the social context.

1977Human Expertise Really Matters! Mitigating Unfair Utility Induced by Heterogenous Human Expertise in AI-assisted Decision-Making

[openreview] [pdf]

Abstract AI-assisted decision-making often involves an AI model providing calibrated confidence, which helps humans integrate these with their own confidence to make higher-utility final decisions. However, when human decision-makers are heterogeneous in their expertise, existing AI-assisted decision-making may fail to provide fair utility across them. Such unfairness raises concerns about social welfare among diverse humans due to inequities in access to equally effective AI assistance, which may reduce their willingness and trust to engage with AI systems. In this work, we investigate how to calibrate AI confidence to provide fair utility across diverse human populations with heterogeneous expertise. We first demonstrate that rational decision-makers with heterogeneous expertise are unlikely to obtain fair decision utility from existing AI confidence calibrations. We propose a novel confidence calibration criterion,inter-group alignment, which synergizes with human alignment to jointly determine the upper bound of utility disparity across diverse human populations. Building on this foundation, we propose a new fairness-aware confidence calibration method,group-level multicalibration, which ensures a sufficient condition for achieving both inter-group and human alignment. We validate our theoretical findings through extensive experiments on four real-world multimodal tasks, where classifiers assist human experts in decision-making. The results indicate that our calibrated AI confidence facilitates fairer utility across human populations, concurrently enhancing overall utility.The implementation code is available athttps://anonymous.4open.science/r/iclr4103.

1978LoCA: Location-Aware Cosine Adaptation for Parameter-Efficient Fine-Tuning

[openreview] [pdf]

Abstract Low-rank adaptation (LoRA) has become a prevalent method for adapting pre-trained large language models to downstream tasks. However, the simple low-rank decomposition form may constrain the optimization flexibility. To address this limitation, we introduce Location-aware Cosine Adaptation (LoCA), a novel frequency-domain parameter-efficient fine-tuning method based on inverse Discrete Cosine Transform (iDCT) with selective locations of learnable components. We begin with a comprehensive theoretical comparison between frequency-domain and low-rank decompositions for fine-tuning pre-trained large models. Our analysis reveals that frequency-domain approximation with carefully selected frequency components can surpass the expressivity of traditional low-rank-based methods. Furthermore, we demonstrate that iDCT offers a more efficient implementation compared to inverse Discrete Fourier Transform (iDFT), allowing for better selection and tuning of frequency components while maintaining equivalent expressivity to the optimal iDFT-based adaptation. By employing finite-difference approximation to estimate gradients for discrete locations of learnable coefficients on the DCT spectrum, LoCA dynamically selects the most informative frequency components during training. Experiments on diverse language and vision fine-tuning tasks demonstrate that LoCA offers enhanced parameter efficiency while maintains computational feasibility comparable to low-rank-based methods.

1979Contextual Experience Replay for Continual Learning of Language Agents

[openreview] [pdf]

Abstract Large language model-based agents have shown their potential in decision-making tasks, such as web navigation. However, solving multi-step decision-making tasks in complex environments like websites often requires the acquisition of environment-specific experiences. Without continual learning of environment-specific knowledge, current methods often fail in these complex tasks. To address this, we propose Contextual Experience Replay (CER), a novel training-free framework to enable efficient continual learning for language agents through experience replay contextually, i.e. in their context window. CER is loosely inspired by experience replay in reinforcement learning, where the agent is trained with past experiences to do continual learning. Specifically, CER accumulates and synthesizes past experiences, which are represented as natural language summarizations and concrete trajectory examples, into a dynamic memory buffer. These experiences encompass environment dynamics and common decision-making patterns, allowing the agents to retrieve and augment themselves with relevant knowledge in new contexts, enhancing their adaptability in complex environments. We evaluate CER on the challenging WebArena and VisualWebArena benchmarks. While orthogonal to other methods, CER improves the GPT-4o agent baseline by a large margin and gets competitive results. On VisualWebArena, CER surpasses the tree search method with much lower token costs and achieves a state-of-the-art success rate of 31.9%. On WebArena, CER also gets a competitive average success rate of 33.16%, relatively improving the success rate of the GPT-4o agent baseline by 36.69%. CER shows that the continual learning of environment-specific knowledge is important and can lead to significant improvements in sequential decision-making tasks in complex environments.

1980Mixture of In-Context Prompters for Tabular PFNs

[openreview] [pdf]

Abstract Recent benchmarks find In-Context Learning (ICL) outperforms both deep learning and tree-based algorithms on small tabular datasets. However, on larger datasets, ICL for tabular learning suffers in both efficiency and effectiveness. In terms of efficiency, transformers incur linear space and quadratic time complexity w.r.t. context size. In terms of effectiveness, contexts at inference encounter distribution shift compared to contexts from pretraining. We propose MixturePFN, which extends Sparse Mixture of Experts to the state-of-the-art ICL for tabular learning model. Specifically, MixturePFN finetunes a specialized ICL expert on each cluster of tabular data and routes new test samples to appropriate experts at inference. MixturePFN supports constant-size contexts by splitting large training datasets into more manageable clusters. MixturePFN addresses distribution shift by finetuning an expert on each training dataset cluster via bootstrapping. Extensive experimental results shows MixturePFN outperforms 19 baselines both in mean rank and as the Condorcet winner across 36 diverse tabular datasets under both accuracy and F1 score with statistical significance.

1981ConDS: Context Distribution Shift for Robust In-Context Learning

[openreview] [pdf]

Abstract In-context Learning (ICL) is a popular approach to filling Large Language Models (LLMs) with the context without fine-tuning. ICL works by feeding the test input along with the context information selected from the candidate dataset as examples of explaining the target task and getting the answer. In real-world applications, noisy samples are easily to be included in the datasets, so it is unavoidable that the candidate set might contain noise caused by human or measurement errors. The effectiveness of ICL is highly dependent on the quality of the selected ICL samples. Thus the noise in the candidate set can severely mislead the query answer and degrade the ICL performance. However, the noise ICL problem is largely overlooked. To tackle this challenge, in this paper, we propose Context Distribution Shift (ConDS), which iteratively revises the distribution of the candidate dataset so that the retrieved ICL samples are emphasized to improve the robustness of ICL. Specifically, we first identify the informative samples based on the retriever ranking score and the feedback from the LLMs, and then augment the identified informative samples. A subsampling strategy is also adopted to emphasize the importance of informative samples and decrease the size of noisy samples. Thus, ICL’s reliability can be improved by reducing the catastrophic impact of noisy samples on almost all test queries to a small percentage. Our ConDS can be easily combined with existing off-the-shelf and fine-tuned retrievers. An analysis is also provided to reveal the relationship between ConDS and retrievers. Experimental results show that ConDS outperforms baselines on various tasks under the influence of noise by a large margin of 8.12%.

1982Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation

[openreview] [pdf]

Abstract Recent advancements in visual language models (VLMs) have notably enhanced their capabilities in handling complex Graphical User Interface (GUI) interaction tasks. Despite these improvements, current frameworks often struggle to generate correct actions in challenging GUI environments. State-of-the-art commercial VLMs are black-boxes, and fine-tuning open-source VLMs for GUI tasks requires significant resources. Additionally, existing trajectory-level evaluation and refinement techniques frequently fall short due to delayed feedback and local optimization issues. To address these challenges, we propose an approach that guides VLM agents with process supervision by a reward model during GUI navigation and control at inference time. This guidance allows the VLM agent to optimize actions at each inference step, thereby improving performance in both static and dynamic environments. In particular, our method demonstrates significant performance gains in the GUI navigation task setting, achieving a around 5% improvement in action accuracy for static environments and a near 15% increase in task success rate in dynamic environments. With further integration of trajectory reflection and retry mechanisms, we also demonstrate even greater enhancement in task success.

1983PN-GAIL: Leveraging Non-optimal Information from Imperfect Demonstrations

[openreview] [pdf]

Abstract Imitation learning aims at constructing an optimal policy by emulating expert demonstrations. However, the prevailing approaches in this domain typically presume that the demonstrations are optimal, an assumption that seldom holds true in the complexities of real-world applications. The data collected in practical scenarios often contains imperfections, encompassing both optimal and non-optimal examples. In this study, we propose Positive-Negative Generative Adversarial Imitation Learning (PN-GAIL), a novel approach that falls within the framework of Generative Adversarial Imitation Learning (GAIL). PN-GAIL innovatively leverages non-optimal information from imperfect demonstrations, allowing the discriminator to comprehensively assess the positive and negative risks associated with these demonstrations. Furthermore, it requires only a small subset of labeled confidence scores. Theoretical analysis indicates that PN-GAIL deviates from the non-optimal data while mimicking imperfect demonstrations. Experimental results demonstrate that PN-GAIL surpasses conventional baseline methods in dealing with imperfect demonstrations, thereby significantly augmenting the practical utility of imitation learning in real-world contexts. Our codes are available athttps://anonymous.4open.science/r/PN-GAIL-3828.

1984PoI: Pixel of Interest for Novel View Synthesis Assisted Scene Coordinate Regression

[openreview] [pdf]

Abstract The task of estimating camera poses can be enhanced through novel view synthesis techniques such as NeRF and Gaussian Splatting to increase the diversity and extension of training data. However, these techniques often produce rendered images with issues like blurring and ghosting, which compromise their reliability. These issues become particularly pronounced for Scene Coordinate Regression (SCR) methods, which estimate 3D coordinates at the pixel level. To mitigate the problems associated with unreliable rendered images, we introduce a novel filtering approach, which selectively extracts well-rendered pixels while discarding the inferior ones. The threshold of this filter is adaptively determined by the real-time reprojection loss recorded by the SCR models during training. Building on this filtering technique, we also develop a new strategy to improve scene coordinate regression using sparse inputs, drawing on successful applications of sparse input techniques in novel view synthesis. Our experimental results validate the effectiveness of our method, demonstrating the state-of-the-art performance on both indoor and outdoor datasets.

1985FedPMVR: Addressing Data Heterogeneity in Federated Learning through Partial Momentum Variance Reduction

[openreview] [pdf]

Abstract Federated learning (FL) emerges as a promising paradigm for training machine learning models on decentralized data sources while preserving privacy. However, the presence of not independent and identically distributed (non-IID) data among the clients introduces high variance in gradient updates, posing a significant challenge to the global model’s performance in terms of accuracy and convergence. To mitigate the adverse effects of data heterogeneity, we propose a novel momentum-based partial variance reduction technique. Our approach adjusts the gradient updates for the final classification layers of the client’s neural network by leveraging the gradient differences between local and global models. This adjustment aims to effectively capture and mitigate client drift, a key challenge arises from the presence of non-IID data distributions across clients. We systematically explains client drifts and conduct extensive experiments on three widely-used datasets, demonstrating that our method significantly enhances global model accuracy while reducing the communication rounds needed for convergence. Notably, our momentum-based partial variance reduction technique provides a robust mechanism, rendering more efficient and effective in scenarios with inherently non-IID and heterogeneous data distributions. By addressing the critical challenge of data heterogeneity in FL, our proposed approach paves the way for more reliable and accurate model training while preserving the privacy of decentralized data sources. The code is available at the following link {https://anonymous.4open.science/r/FedPMVR-33C1}.

1986One Language, Many Gaps: Evaluating Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks

[openreview] [pdf]

Abstract Language is not monolithic. While many benchmarks are used as proxies to systematically estimate Large Language Models’ (LLM) performance in real-life tasks, they tend to ignore the nuances of within-language variation and thus fail to model the experience of speakers of minority dialects. Focusing on African American Vernacular English (AAVE), we present the first study on LLMs’ fairness and robustness to a dialect in canonical reasoning tasks (algorithm, math, logic, and comprehensive reasoning). We hire AAVE speakers, including experts with computer science backgrounds, to rewrite seven popular benchmarks, such as HumanEval and GSM8K. The result of this effort is ReDial, a dialectal benchmark comprising 1.2K+ parallel query pairs in Standardized English and AAVE. We use ReDial to evaluate state-of-the-art LLMs, including GPT-4o/4/3.5-turbo, LLaMA-3.1/3, Mistral, and Phi-3. We find that, compared to Standardized English, almost all of these widely used models show significant brittleness and unfairness to queries in AAVE. Furthermore, AAVE queries can degrade performance more substantially than misspelled texts in Standardized English, even when LLMs are more familiar with the AAVE queries. Finally, asking models to rephrase questions in Standardized English does not close the performance gap but generally introduces higher costs. Overall, our findings indicate that LLMs provide unfair service to dialect users in complex reasoning tasks. Code can be found athttps://anonymous.4open.science/r/redial_eval-0A88.

1987(Pre-)training Dynamics: Scaling Generalization with First-Order Logic

[openreview] [pdf]

Abstract Transformer-based models have demonstrated a remarkable capacity for learning complex nonlinear relationships. While previous research on generalization dynamics has primarily focused on small transformers (1-2 layers) and simple tasks like XOR and modular addition, we extend this investigation to larger models with 125M parameters, trained on a more sophisticated first-order logic (FOL) task. We introduce a novel FOL dataset that allows us to systematically explore generalization across varying levels of complexity. Our analysis of the pretraining dynamics reveals a series of distinct phase transitions corresponding to the hierarchical generalization of increasingly complex operators and rule sets within the FOL framework. Our task and model establish a testbed for investigating pretraining dynamics at scale, offering a foundation for future research on the learning trajectories of advanced AI systems.

1988Sharper Bounds of Non-Convex Stochastic Gradient Descent with Momentum

[openreview] [pdf]

Abstract Stochastic gradient descent with momentum (SGDM) has been widely used in machine learning. However, in non-convex domains, high probability learning bounds for SGDM are scarce. In this paper, we provide high probability convergence bounds and generalization bounds for SGDM. Firstly, we establish these bounds for the gradient norm in the general non-convex case. The derived convergence bounds are tighter than the theoretical results of related work, and to our best knowledge, the derived generalization bounds are the first ones for SGDM. Then, if the Polyak-{\L}ojasiewicz condition is satisfied, we establish these bounds for the error of the function value, instead of the gradient norm. Moreover, the derived learning bounds have faster rates than the general non-convex case. Finally, we further provide sharper generalization bounds by considering a mild Bernstein condition on the gradient. In the case of low noise, their learning rates can reach O~(1/n2)\widetilde{\mathcal{O}}(1/n^2), where nn is the sample size. Overall, we relatively systematically investigate the high probability learning bounds for non-convex SGDM.

1989Boosting Neural Combinatorial Optimization for Large-Scale Vehicle Routing Problems

[openreview] [pdf]

Abstract Neural Combinatorial Optimization (NCO) methods have exhibited promising performance in solving Vehicle Routing Problems (VRPs). However, most NCO methods rely on the conventional self-attention mechanism that induces excessive computational complexity, thereby struggling to contend with large-scale VRPs and hindering their practical applicability. In this paper, we propose a lightweight cross-attention mechanism with linear complexity, by which a Transformer network is developed to learn efficient and favorable solutions for large-scale VRPs. We also propose a Self-Improved Training (SIT) algorithm that enables direct model training on large-scale VRP instances, bypassing extensive computational overhead for attaining labels. By iterating solution reconstruction, the Transformer network itself can generate improved partial solutions as pseudo-labels to guide the model training. Experimental results on the Travelling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) with up to 100K nodes indicate that our method consistently achieves superior performance for synthetic and real-world benchmarks, significantly boosting the scalability of NCO methods.

1990Strong Model Collapse

[openreview] [pdf]

Abstract Within the scaling laws paradigm, which underpins the training of large neural networks like ChatGPT and Llama, we consider a supervised regression setting and establish a strong form of the model collapse phenomenon, a critical performance degradation due to synthetic data in the training corpus. Our results show that even the smallest fraction of synthetic data (e.g., as little as 1 per 1000) can still lead to model collapse: larger and larger training sets do not enhance performance. We further investigate whether increasing model size, an approach aligned with current trends in training large language models, exacerbates or mitigates model collapse. In a simplified regime where neural networks are approximated via random projections of tunable size, we both theoretically and empirically show that larger models can amplify model collapse. Interestingly, our theory also indicates that, beyond the interpolation threshold (which can be extremely high for very large datasets), larger models may mitigate the collapse, although they do not entirely prevent it. Our theoretical findings are empirically verified through experiments on language models and neural networks for images.

1991Simulating, Fast and Slow: Learning Policies for Black-Box Optimization

[openreview] [pdf]

Abstract Simulators are vital in science and engineering, as they faithfully model the influence of design parameters on real-world observations. A common problem is leveraging the simulator to optimize the design parameters to minimize a desired objective function. Since simulators are often non-differentiable blackboxes and each simulation incurs significant compute time, gradient-based optimization techniques can often be intractable or, in some cases, impossible. Furthermore, in many experiment design settings, practitioners are required to solve sets of closely related optimization problems. Thus, starting the optimization from scratch each time might be inefficient if the forward simulation model is expensive to evaluate. To address these challenges, this paper introduces a novel method for solving classes of similar black-box optimization problems by learning an active learning policy that guides the training of a differentiable surrogate and then uses that surrogate’s gradients to optimize the simulation parameters with gradient descent. After training the policy, the cost for downstream optimization of problems involving black-box simulators is amortized and we require up to \sim90% fewer expensive simulator calls compared to baselines such as local surrogate-based approaches, numerical optimization, and Bayesian methods.

1992Post-hoc Reward Calibration: A Case Study on Length Bias

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback aligns the outputs of Large Language Models with human values and preferences. Central to this process is the reward model (RM), which translates human feedback into training signals for optimising LLM behaviour. However, RMs can develop biases by exploiting spurious correlations in their training data, such as favouring outputs based on length or style rather than true quality. These biases can lead to incorrect output rankings, sub-optimal model evaluations, and the amplification of undesirable behaviours in LLMs alignment. This paper addresses the challenge of correcting such biases without additional data and training, introducing the concept of Post-hoc Reward Calibration. We first propose to use local average reward to estimate the bias term and, thus, remove it to approximate the underlying true reward. We then extend the approach to a more general and robust form with the Locally Weighted Regression. Focusing on the prevalent length bias, we validate our proposed approaches across three experimental settings, demonstrating consistent improvements: (1) a 3.11 average performance gain across 33 reward models on the RewardBench dataset; (2) improved agreement of RM produced rankings with GPT-4 evaluations and human preferences based on the AlpacaEval benchmark; and (3) improved Length-Controlled win rate (Dubois et al., 2024) of the RLHF process in multiple LLM–RM combinations. According to our experiments, our method is computationally efficient and generalisable to other types of bias and RMs, offering a scalable and robust solution for mitigating biases in LLM alignment and evaluation.

1993LLM Embeddings Improve Test-Time Adaptation to TabularY|X-Shifts

[openreview] [pdf]

Abstract For tabular datasets, the change in the relationship between the label and covariates (YXY|X-shifts) is common due to missing variables. Since it is impossible to generalize to a completely new and unknown domain, we study models that are easy to adapt to the target domain even with few labeled examples. We focus on building more informative representations of tabular data that can mitigate YXY|X-shifts, and propose to leverage the prior world knowledge in LLMs by serializing the tabular data to encode it. We find LLM embeddings alone provide inconsistent improvements in robustness, but models trained on them can be well adapted to the target domain even using 32 labeled observations. Our finding is based on a systematic study consisting of 7650 source-target pairs and benchmark against261,000model configurations trained by 20 algorithms. Our observation holds when ablating the size of accessible target data and different adaptation strategies.

1994SafetyAnalyst: Interpretable, transparent, and steerable LLM safety moderation

[openreview] [pdf]

Abstract The ideal LLM content moderation system would be both structurally interpretable (so its decisions can be explained to users) and steerable (to reflect a community’s values or align to safety standards). However, current systems fall short on both of these dimensions. To address this gap, we present SafetyAnalyst, a novel LLM safety moderation framework. Given a prompt, SafetyAnalyst creates a structured ``harm-benefit tree,‘’ which identifies 1) the actions that could be taken if a compliant response were provided, 2) the harmful and beneficial effects of those actions (along with their likelihood, severity, and immediacy), and 3) the stakeholders that would be impacted by those effects. It then aggregates this structured representation into a harmfulness score based on a parameterized set of safety preferences, which can be transparently aligned to particular values. Using extensive harm-benefit features generated by SOTA LLMs on 19k prompts, we fine-tuned an open-weight LM to specialize in generating harm-benefit trees through symbolic knowledge distillation. On a comprehensive set of prompt safety benchmarks, we show that our system (average F1=0.75) outperforms existing LLM safety moderation systems (average F1<<0.72) on prompt harmfulness classification, while offering the additional advantages of interpretability and steerability.

1995Proximal Mapping Loss: Understanding Loss Functions in Crowd Counting & Localization

[openreview] [pdf]

Abstract Crowd counting and localization involves extracting the number and distribution of crowds from images or videos using computer vision techniques. Most counting methods are based on density regression and are based on an ``intersection’’ hypothesis,i.e., one pixel is influenced by multiple points in ground truth, which is inconsistent with reality since one pixel would not contain two objects. This paper proposes Proximal Mapping Loss (PML), a density regression method that eliminates this hypothesis. PML divides the predicted density map into multiple point-neighbor cases through nearest neighbor, and then dynamically constructs a learning target for each sub-case via proximal mapping, leading to more robust and accurate training. Furthermore, PML is theoretically linked to various existing loss functions, such as Gaussian-blurred L2 loss, Bayesian loss, and the training schemes in P2PNet and DMC, demonstrating its versatility and adaptability. Experimentally, PML significantly improves the performance of crowd counting and localization, and illustrates the robustness against annotation noise.

1996Learning Constrained Markov Decision Processes With Non-stationary Rewards and Constraints

[openreview] [pdf]

Abstract In constrained Markov decision processes (CMDPs) with adversarial rewards and constraints, a well-known impossibility result prevents any algorithm from attaining both sublinear regret and sublinear constraint violation, when competing against a best-in-hindsight policy that satisfies constraints on average. In this paper, we show that this negative result can be eased in CMDPs with non-stationary rewards and constraints, by providing algorithms whose performances smoothly degrade as non-stationarity increases. Specifically, we propose algorithms attaining O~(T+C)\tilde{\mathcal{O}} (\sqrt{T} + C) regret and positive constraint violation under bandit feedback, where CC is a corruption value measuring the environment non-stationarity. This can be Θ(T)\Theta(T) in the worst case, coherently with the impossibility result for adversarial~CMDPs. First, we design an algorithm with the desired guarantees when CC is known. Then, in the case CC is unknown, we show how to obtain the same results by embedding such an algorithm in a general meta-procedure. This is of independent interest, as it can be applied to any non-stationary constrained online learning setting.

1997Ad-Hoc Human-AI Coordination Challenge

[openreview] [pdf]

Abstract Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is an established, fully cooperative benchmark environment that involves imperfect information, limited communication, theory of mind, and the necessity for coordination among agents to achieve a shared goal. These characteristics, in principle, make Hanabi an ideal testbed for exploring human-AI coordination. However, one key issue is that evaluation with human partners is both expensive and difficult to reproduce. To address this, we first develop \textit{human proxy agents} via a combination of behavioural cloning on a large-scale dataset of human game play and regularised reinforcement learning. These proxies serve as robust, cheap and reproducible human-like evaluation partners in our Ad-Hoc Human-AI Coordination Challenge (AH2AC2). To facilitate the exploration of methods that leverage \textit{limited amounts} of human data, we introduce two data-limited challenge settings, using 1,000 and 5,000 games, which we open-source. Finally, we provide baselines for each challenge variety. These include zero-shot coordination methods, which do not utilise any human data, and methods that make use of the available human data combined with reinforcement learning. To prevent overfitting and ensure fair evaluation, we introduce an evaluation protocol that involves us hosting the proxy agents rather than publicly releasing them, and a public leaderboard for tracking the progress of the community. We make our code available as an anonymous repository: \url{https://anonymous.4open.science/r/ah2ac2-E451/}

1998NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data

[openreview] [pdf]

Abstract We present Nirantar based on a large-scale effort to collect extempore and conversational speech data from participants spanning 22 languages across diverse locations in India. Given the extensive number of languages and locations involved, data is collected in incremental batches. Each batch introduces new languages, new domains (locations), or both, creating a practical playground for continual learning (CL). Nirantar contains a total of 3250 hours of human-transcribed speech data covering 208 Indian districts across 22 languages, with 1720 hours newly released as a part of this work. The data inflow and resulting multilingual multi-domain episodes are based on real-world data collection rather than simulated episodes commonly found in existing CL datasets. In particular, the amount of data collected and the number of languages and domains involved are not uniform across episodes, reflecting a practical and real-world continual learning scenario. This dataset serves as a playground for training and evaluating CL approaches in three different scenarios: Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL), which has not been studied before. To establish the dataset’s usefulness, we evaluate several existing CL approaches within these scenarios. Our findings indicate that the behaviour of these algorithms varies across the three scenarios, emphasizing the need for detailed independent studies of each.

1999Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization

[openreview] [pdf]

Abstract Federated learning is a popular distributed learning paradigm in machine learning. Meanwhile, composition optimization is an effective hierarchical learning model, which appears in many machine learning applications such as meta learning and robust learning. More recently, although a few federated composition optimization algorithms have been proposed, they still suffer from high sample and communication complexities. In the paper, thus, we propose a class of faster adaptive federated compositional optimization algorithms (i.e., MFCGD and AdaMFCGD) to solve the nonconvex distributed composition problems, which builds on the momentum-based variance reduced and local-SGD techniques. In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Moreover, we provide a solid theoretical analysis for our algorithms under non-i.i.d. setting, and prove our algorithms obtain a lower sample and communication complexities simultaneously than the existing federated composition optimization algorithms. Specifically, our algorithms obtain lower sample complexity of O~(ϵ3)\tilde{O}(\epsilon^{-3}) with lower communication complexity of O~(ϵ2)\tilde{O}(\epsilon^{-2}) in finding an ϵ\epsilon-stationary solution. We conduct numerical experiments on robust federated learning and distributed meta learning tasks to demonstrate the efficiency of our algorithms.

2000An Asynchronous Bundle Method for Distributed Learning Problems

[openreview] [pdf]

Abstract We propose a novel asynchronous bundle method to solve distributed learning problems. Compared to existing asynchronous methods, our algorithm computes the next iterate based on a more accurate approximation of the objective function and does not require any prior information about the maximal information delay in the system. This makes the proposed method fast and easy to tune. We prove that the algorithm converges in both deterministic and stochastic (mini-batch) settings, and quantify how the convergence times depend on the level of asynchrony. The practical advantages of our method are illustrated through numerical experiments on classification problems of varying complexities and scales.

2001Efficient and Generalizable Second-Order Certified Unlearning: A Hessian-Free Online Model Updates Approach

[openreview] [pdf]

Abstract Machine unlearning strives to uphold the data owners’ right to be forgotten by enabling models to selectively forget specific data. Recent advances suggest pre-computing and storing statistics extracted from second-order information and implementing unlearning through Newton-style updates. However, the Hessian matrix operations are extremely costly and previous works conduct unlearning for empirical risk minimizer with strong convexity assumption, precluding their applicability to high-dimensional over-parameterized models and the nonconvergence condition. In this paper, we propose an efficient Hessian-free unlearning approach. The key idea is to maintain a statistical vector for each training data, computed through affine stochastic recursion of the difference between the retrained and learned models. We prove that our proposed method outperforms the state-of-the-art methods in terms of the unlearning and generalization guarantees, the deletion capacity, and the time/storage complexity, under the same regularity conditions. Through the strategy of recollecting statistics for removing data, we develop an online unlearning algorithm that achieves near-instantaneous data removal, as it requires only vector addition. Experiments demonstrate that our proposed scheme surpasses existing results by orders of magnitude in terms of time/storage costs with millisecond-level unlearning execution, while also enhancing test accuracy.

2002NTK-DFL: Enhancing Decentralized Federated Learning in Heterogeneous Settings via Neural Tangent Kernel

[openreview] [pdf]

Abstract Decentralized federated learning (DFL) is a collaborative machine learning framework for training a model across participants without a central server or raw data exchange. DFL faces challenges due to statistical heterogeneity, as participants often possess different data distributions reflecting local environments and user behaviors. Recent work has shown that the neural tangent kernel (NTK) approach, when applied to federated learning in a centralized framework, can lead to improved performance. The NTK-based update mechanism is more expressive than typical gradient descent methods, enabling more efficient convergence and better handling of data heterogeneity. We propose an approach leveraging the NTK to train client models in the decentralized setting, while introducing a synergy between NTK-based evolution and model averaging. This synergy exploits inter-model variance and improves both accuracy and convergence in heterogeneous settings. Our model averaging technique significantly enhances performance, boosting accuracy by at least 10% compared to the mean local model accuracy. Empirical results demonstrate that our approach consistently achieves higher accuracy than baselines in highly heterogeneous settings, where other approaches often underperform. Additionally, it reaches target performance in 4.6 times fewer communication rounds. We validate our approach across multiple datasets, network topologies, and heterogeneity settings to ensure robustness and generalizability. The source code will be available as a link on the discussion forum once it is open.

2003Reconstructive Visual Instruction Tuning

[openreview] [pdf]

Abstract This paper introduces reconstructive visual instruction tuning (ROSS), a family of Large Multimodal Models (LMMs) that exploit vision-centric supervision signals. In contrast to conventional visual instruction tuning approaches that exclusively supervise text outputs, ROSS prompts LMMs to supervise visual outputs via reconstructing input images. By doing so, it capitalizes on the inherent richness and detail present within input images themselves, which are often lost in pure text supervision. However, producing meaningful feedback from natural images is challenging due to the heavy spatial redundancy of visual signals. To address this issue, ROSS employs a denoising objective to reconstruct latent representations of input images, avoiding directly regressing exact raw RGB values. This intrinsic activation design inherently encourages LMMs to maintain image detail, thereby enhancing their fine-grained comprehension capabilities and reducing hallucinations. Empirically, ROSS consistently brings significant improvements across different visual encoders and language models. In comparison with extrinsic assistance state-of-the-art alternatives that aggregate multiple visual experts, ROSS delivers competitive performance with a single SigLIP visual encoder, demonstrating the efficacy of our vision-centric supervision tailored for visual outputs. The code will be made publicly available upon acceptance.

2004Chain-of-Thought Provably Enables Learning the (Otherwise) Unlearnable

[openreview] [pdf]

Abstract Modern language models have demonstrated remarkable reasoning capabilities by using chain-of-thought (CoT). One hypothesis about the inner workings of CoT is that it breaks down originally complex tasks into smaller subtasks that are more amenable to learning. We formalize this notion by showing possibility and impossibility results of learning from in-context demonstrations with and without CoT. In particular, with CoT, we examine a family of learning algorithms that learn a task step-by-step, capable of composing simpler functions from individual reasoning steps to form an overall complex function. This process reduces the difficulty of learning a task to that of the hardest reasoning step in the chain. Moreover, we prove Transformers can express this algorithm and thus they can efficiently in-context learn arbitrary tasks as long as these tasks can be decomposed into a finite number of subtasks, each of which are efficiently learnable. In contrast, without CoT, we demonstrate that there exist tasks that are inherently unlearnable by the same algorithm. Overall, our results suggest several provably effective ways for decomposing target problems to instantiate CoT. Empirically, we demonstrate our proposed CoT construction significantly enhances the reasoning capabilities of real-world LLMs in solving challenging arithmetic reasoning tasks, including learning polynomials and Boolean formulas.

2005TimeKAN: A Transparent KAN-Based Approach for Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract In recent years, numerous deep learning models have been proposed for Multi-variate Time Series (MTS) forecasting, with Transformer-based models showing significant potential due to their ability to capture long-term dependencies. However, existing models based on MLPs or Transformers often suffer from a lack of interpretability due to their large parameter sizes, which can be problematic in many real-world applications. To address this issue, we propose TimeKAN, a model based on Kolmogorov-Arnold Networks. The KAN model offers two key advantages: (1) it achieves accuracy comparable to MLPs with significantly fewer parameters, and (2) its parameters can be symbolized, which makes it possible to interpret the meaning of the parameters. Additionally, instead of the usual attention mechanisms, we designed a Multi-Scale Patching (MSP) module for MTS that allows for more flexible and simple multi-patching and effectively extracts both temporal and cross-dimensional features. By leveraging this strategy along with KAN, TimeKAN constructs a hierarchical structure capable of utilizing information across different scales, leading to highly accurate predictions. Extensive experiments on six real-world datasets demonstrate that TimeKAN outperforms state-of-the-art (SOTA) methods in terms of predictive performance. Furthermore, we interpret TimeKAN by visualizing its learning process for extracting symbolized features, opening the black box and revealing meaningful patterns within the time series.

2006Demystifying Online Clustering of Bandits: Enhanced Exploration Under Stochastic and Smoothed Adversarial Contexts

[openreview] [pdf]

Abstract The contextual multi-armed bandit (MAB) problem is crucial in sequential decision-making. A line of research, known as online clustering of bandits, extends contextual MAB by grouping similar users into clusters, utilizing shared features to improve learning efficiency. However, existing algorithms, which rely on the upper confidence bound (UCB) strategy, struggle to gather adequate statistical information to accurately identify unknown user clusters. As a result, their theoretical analyses require several strong assumptions about the “diversity” of contexts generated by the environment, leading to impractical settings, complicated analyses, and poor practical performance. Removing these assumptions has been a long-standing open problem in the clustering of bandits literature. In this work, we provide two partial solutions. First, we introduce an additional exploration phase to accelerate the identification of clusters. We integrate this general strategy into both graph-based and set-based algorithms and propose two new algorithms, UniCLUB and UniSCLUB. Remarkably, our algorithms require substantially weaker assumptions and simpler theoretical analyses while achieving superior cumulative regret compared to previous studies. Second, inspired by the smoothed analysis framework, we propose a more practical setting that eliminates the requirement for i.i.d. context generation used in previous studies, thus enhancing the performance of existing algorithms for online clustering of bandits. Extensive evaluations on both synthetic and real-world datasets demonstrate that our proposed algorithms outperform existing approaches.

2007Sentinel: Multi-Patch Transformer with Temporal and Channel Attention for Time Series Forecasting

[openreview] [pdf]

Abstract Transformer-based time series forecasting has recently gained strong interest due to the ability of transformers to model sequential data. Most of the state-of-the-art architectures exploit either temporal or inter-channel dependencies, limiting their effectiveness in multivariate time-series forecasting where both types of dependencies are crucial. We propose Sentinel, a full transformer-based architecture composed of an encoder able to extract contextual information from the channel dimension, and a decoder designed to capture causal relations and dependencies across the temporal dimension. Additionally, we introduce a multi-patch attention mechanism, which leverages the patching process to structure the input sequence in a way that can be naturally integrated into the transformer architecture, replacing the multi-head splitting process. Extensive experiments on standard benchmarks demonstrate that Sentinel, because of its ability to ``monitor" both the temporal and the inter-channel dimension, achieves better or comparable performance with respect to state-of-the-art approaches.

2008Sensitivity Verification for Decision Tree Ensembles

[openreview] [pdf]

Abstract Tree ensemble models, such as Gradient Boosted Decision trees (GBDTs) and random forests, are widely popular models for a variety of machine learning tasks. The power of these models comes from the ensemble of decision trees, which makes analysis of such models significantly harder than for single trees. As a result, recent work has focussed on developing exact and approximate techniques for questions such as robustness verification, fairness and explainability, for such models of tree ensembles.In this paper, we focus on a specific problem of feature sensitivity of decision tree ensembles and build a formal verification framework for it. We start by showing theoretical (NP-)hardness of the problem and explain how it relates to other verification problems. Next, we provide a novel encoding of the problem using pseudo-Boolean constraints. Based on this encoding, we develop a tunable algorithm to perform sensitivity analysis, which can trade off precision for running time. We implement our algorithm and study its performance on a suite of GBDT benchmarks from the literature. Our experiments show the practical utility of our approach and its improved performance compared to existing approaches.

2009Mitigating Forgetting in Continually Pretraining MoE-LLMs by Adding and Chilling Experts

[openreview] [pdf]

Abstract As model training requires more and more compute, the cost of re-training models to support new data or domains increases as well. Methods to adapt existing models to new data distributions are crucial to avoid spending redundant compute re-training models from scratch. However, naive finetuning often incurs forgetting of previously learned capabilities. In this paper, we analyse how different factors such as model size, dataset size and replay data impact forgetting when adapting models to new data distributions. We also propose to increase the capacity of Mixture-of-experts models by adding new experts and reducing the learning rate of the old model weights. Our experiments show that this simple method allows to reduce forgetting and learn efficiently on the new domain.

2010GPromptShield: Elevating Resilience in Graph Prompt Tuning Against Adversarial Attacks

[openreview] [pdf]

Abstract The paradigm of ``pre-training and prompt fine-tuning", with its effectiveness and lightweight characteristics, has rapidly spread from the language field to the graph field. Several pioneering studies have designed specialized prompt functions for diverse downstream graph tasks based on various graph pre-training strategies. These prompts concentrate on the compatibility between the pre-training pretext and downstream graph tasks, aiming to bridge the gap between them. However, designing prompts to blindly adapt to downstream tasks based on this concept neglects crucial security issues. By conducting covert attacks on downstream graph data, we find that even when the downstream task data closely matches that of the pre-training tasks, it is still feasible to generate highly misleading prompts using simple deceptive techniques. In this paper, we shift the primary focus of graph prompts from compatibility to vulnerability issues in adversarial attack scenarios. We design a highly extensible shield defense system for the prompts, which enhances their robustness from two perspectives: Direct Handling and Indirect Amplification. When downstream graph data exhibits unreliable biases, the former directly combats invalid information by adding mixed multi-defense prompts to the input graph’s feature space, while the latter employs a training strategy that circumvents invalid part and amplifies valid part. We provide a theoretical derivation that proves their feasibility, indicating that unbiased prompts exist under certain conditions on unreliable data. Extensive experiments across various adversarial attack scenarios indicate that the prompts within our shield defense system exhibit enhanced resilience and superiority. Our work explores new perspectives in the field of graph prompts, offering a novel option for downstream robust prompt fine-tuning.

2011Learning positional encodings in transformers depends on initialization

[openreview] [pdf]

Abstract The attention mechanism is central to the transformer’s ability to capture complex dependencies between tokens of an input sequence. Key to the successful application of the attention mechanism in transformers is its choice of positional encoding (PE). The PE provides essential information that distinguishes the position and order amongst tokens in a sequence. Most prior investigations of PE effects on generalization were tailored to 1D input sequences, such as those presented in natural language, where adjacent tokens (e.g., words) are highly related. In contrast, many real world tasks involve datasets with highly non-trivial positional arrangements, such as datasets organized in multiple spatial dimensions, or datasets for which ground truth positions are not known, such as in biological data. Here we study the importance of learning accurate PE for problems which rely on a non-trivial arrangement of input tokens. Critically, we find that the choice of initialization of a learnable PE greatly influences its ability to discover accurate PEs that lead to enhanced generalization. We empirically demonstrate our findings in a 2D relational reasoning task and a real world 3D neuroscience dataset, applying interpretability analyses to verify the learning of accurate PEs. Overall, we find that a learned PE initialized from a small-norm distribution can 1) uncover interpretable PEs that mirror ground truth positions, 2) learn non-trivial and modular PEs in a real-world neuroscience dataset, and 3) lead to improved downstream generalization in both datasets. Importantly, choosing an ill-suited PE can be detrimental to both model interpretability and generalization. Together, our results illustrate the feasibility of discovering accurate PEs for enhanced generalization.

2012Language Imbalance Driven Rewarding for Multilingual Self-improving

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited “first-class” languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLM in a self-improving manner. Thus, we propose Language Imbalance Driven Rewarding\textit{Language Imbalance Driven Rewarding}, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language’s capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.

2013HASARD: A Benchmark for Harnessing Safe Reinforcement Learning with Doom

[openreview] [pdf]

Abstract The advancement of safe reinforcement learning (RL) faces numerous obstacles, including the lack of simulation environments, demanding computational requirements, and a lack of widely accepted benchmarks. To address these challenges, we introduceHASARD(A Benchmark forHArnessingSAfeReinforcement Learning withDoom), tailored for egocentric pixel-based safe RL. HASARD features a suite of diverse and stochastic 3D environments. Unlike prior vision-based 3D task suites with simple navigation objectives, the environments require spatial comprehension, short-term planning, and active prediction to obtain high rewards while ensuring safety. The benchmark offers three difficulty levels to challenge advanced future methods while providing an easier training loop for more streamlined analysis. Accounting for the variety of potential safety protocols, HASARD supports both soft and hard safety constraints. An empirical evaluation of baseline methods highlights their limitations and demonstrates the benchmark’s utility, emphasizing unique algorithmic challenges. The difficulty levels offer a built-in curriculum, enabling more efficient learning of safe policies at higher levels. HASARD utilizes heatmaps to visually trace and analyze agent navigation within the environment, offering an interpretive view of strategy development. Our work is the first benchmark to exclusively target vision-based embodied safe RL, offering a cost-effective and insightful way to explore the potential and boundaries of current and future safe RL methods. The environments, code, and baseline implementations will be open-sourced.

2014Knowledge Graph Tuning: Real-time Large Language Model Personalization based on Human Feedback

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable proficiency in a range of natural language processing tasks. Once deployed, LLMs encounter users with personalized factual knowledge, and such personalized knowledge is consistently reflected through users’ interactions with the LLMs. To enhance user experience, real-time model personalization is essential, allowing LLMs to adapt user-specific knowledge based on user feedback during human-LLM interactions. Existing methods mostly require back-propagation to finetune the model parameters, which incurs high computational and memory costs. In addition, these methods suffer from low interpretability, which will cause unforeseen impacts on model performance during long-term use, where the user’s personalized knowledge is accumulated extensively. To address these challenges, we propose Knowledge Graph Tuning (KGT), a novel approach that leverages knowledge graphs (KGs) to personalize LLMs. KGT extracts personalized factual knowledge triples from users’ queries and feedback and optimizes KGs without modifying the LLM parameters. Our method improves computational and memory efficiency by avoiding back-propagation and ensures interpretability by making the KG adjustments comprehensible to humans. Experiments with state-of-the-art LLMs, including GPT-2, Llama2, and Llama3, show that KGT significantly improves personalization performance while reducing latency and GPU memory costs. Ultimately, KGT offers a promising solution of effective, efficient, and interpretable real-time LLM personalization during user interactions with the LLMs.

2015Learning guarantee of reward modeling using deep neural networks

[openreview] [pdf]

Abstract In this work, we study the learning theory of reward modeling using pairwise comparison data and deep neural networks. We establish a novel non-asymptotic regret bound for deep reward estimators in a non-parametric setting, which depends explicitly on the network architecture. Furthermore, to underscore the critical importance of clear human beliefs, we introduce a margin-type condition requiring the conditional winning probability of the optimal action in pairwise comparisons to be significantly distanced from 1/2. This condition enables a sharper regret bound, which substantiates the empirical efficiency in Reinforcement Learning from Human Feedback (RLHF) and highlights the role of clear human beliefs in its success. Notably, this improvement stems from high-quality pairwise comparison data under the margin-type condition and is independent of the specific estimators used, making it applicable to various learning algorithms and models.

2016From Search to Sampling: Generative Models for Robust Algorithmic Recourse

[openreview] [pdf]

Abstract Algorithmic Recourse provides recommendations to individuals who are adversely impacted by automated model decisions, on how to alter their profiles to achieve a favorable outcome. Effective recourse methods must balance three conflicting goals: proximity to the original profile to minimize cost, plausibility for realistic recourse, and validity to ensure the desired outcome. We show that existing methods train for these objectives separately and then search for recourse through a joint optimization over the recourse goals during inference, leading to poor recourse recommendations. We introduce GenRe, a generative recourse model designed to train the three recourse objectives jointly. Training such generative models is non-trivial due to lack of direct recourse supervision. We propose efficient ways to synthesize such supervision and further show that GenRe’s training leads to a consistent estimator. Unlike most prior methods, that employ non-robust gradient descent based search during inference, GenRe simply performs a forward sampling over the generative model to produce minimum cost recourse, leading to superior performance across multiple metrics. We also demonstrate GenRe provides the best trade-off between cost, plausibility and validity, compared to state-of-art baselines. We release anonymized code at:https://anonymous.4open.science/r/GenRe-BD71

2017FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information

[openreview] [pdf]

Abstract This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We rigorously analyze the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying all detailed approximations and advocating for the use of log probability functions as loss, which should be based on discrete distributions, due to the limitations of empirical FIM. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections such as enhanced momentum calculations, adjusted bias corrections, and gradient clipping. We refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving SoTA results in ASR.

2018Context-Scaling versus Task-Scaling in In-Context Learning

[openreview] [pdf]

Abstract Transformers exhibit In-Context Learning (ICL), a phenomenon in which these models solve new tasks by using examples in the prompt without additional training. In our work, we analyze two key components of ICL: (1) context-scaling, where model performance improves as the number of in-context examples increases and (2) task-scaling, where model performance improves as the number of pre-training tasks increases. While transformers are capable of both context-scaling and task-scaling, we empirically show that standard Multi-Layer Perceptrons (MLPs) with vectorized input are only capable of task-scaling. To understand how transformers are capable of context-scaling, we first propose a significantly simplified transformer that performs ICL comparably to the original GPT-2 model in statistical learning tasks (e.g., linear regression, teacher-student settings). By analyzing a single layer of our proposed model, we identify classes of feature maps that enable context scaling. Theoretically, these feature maps can implement the Hilbert estimate, a model that is provably consistent for context-scaling. We then show that using the output of the Hilbert estimate along with vectorized input empirically enables both context-scaling and task-scaling with MLPs. Overall, our findings provide insights into the fundamental mechanisms of how transformers are able to learn in context.

2019One-shot World Models Using a Transformer Trained on a Synthetic Prior

[openreview] [pdf]

Abstract A World Model is a compressed spatial and temporal representation of a real world environment that allows one to train an agent or execute planning methods. However, world models are typically trained on observations from the real world environment, and they usually do not enable learning policies for other real environments. We propose One-Shot World Model (OSWM), a transformer world model that is learned in an in-context learning fashion from purely synthetic data sampled from a prior distribution. Our prior is composed of multiple randomly initialized neural networks, where each network models the dynamics of each state and reward dimension of a desired target environment. We adopt the supervised learning procedure of Prior-Fitted Networks by masking next-state and reward at random context positions and query OSWM to make probabilistic predictions based on the remaining transition context. During inference time, OSWM is able to quickly adapt to the dynamics of a simple grid world, as well as the CartPole gym and a custom control environment by providing 1k transition steps as context and is then able to successfully train environment-solving agent policies. However, transferring to more complex environments remains a challenge, currently. Despite these limitations, we see this work as an important stepping-stone in the pursuit of learning world models purely from synthetic data.

2020Splitting & Integrating: Out-of-Distribution Detection via Adversarial Gradient Attribution

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection is essential for enhancing the robustness and security of deep learning models in unknown and dynamic data environments. Gradient-based OOD detection methods, such as GAIA, analyse the explanation pattern representations of in-distribution (ID) and OOD samples by examining the sensitivity of model outputs w.r.t. model inputs, resulting in superior performance compared to traditional OOD detection methods. However, we argue that the non-zero gradient behaviors of OOD samples do not exhibit significant distinguishability, especially when ID samples are perturbed by random noise in high-dimensional spaces, which negatively impacts the accuracy of OOD detection. In this paper, we propose a novel OOD detection method calledS & Ibased on layerSplitting and gradientIntegration via Adversarial Gradient Attribution. Specifically, our approach involves splitting the model’s intermediate layers and iteratively updating adversarial examples layer-by-layer. We then integrate the attribution gradients from each intermediate layer along the attribution path from adversarial examples to the actual input, yielding true explanation pattern representations for both ID and OOD samples. Experiments demonstrate that our S & I algorithm achieves state-of-the-art results, with the average FPR95 of 29.05% (38.61%) and 37.31% on the CIFAR100 and ImageNet benchmarks, respectively. Our code is available at:https://anonymous.4open.science/r/S-I-F6F7/.

2021Pathologies of Out-of-Distribution Detection

[openreview] [pdf]

Abstract There is a proliferation of out-of-distribution (OOD) detection methods in deep learning which aim to detect distribution shifts and improve model safety. These methods often rely on supervised learning to train models with in-distribution data and then use the models’ predictive uncertainty or features to identify OOD points. In this paper, we critically re-examine this popular family of OOD detection procedures, revealing deep-seated pathologies. In contrast to prior work, we argue that these procedures are fundamentally answering the wrong question for OOD detection, with no easy fix. Uncertainty-based methods incorrectly conflate high uncertainty with being OOD, and feature-based methods incorrectly conflate far feature-space distance with being OOD. Moreover, there is no reason to expect a classifier trained only on in-distribution classes to be able to identify OOD points; for example, we should not necessarily expect a cat-dog classifier to be uncertain about the label of an airplane, which may share features with a cat that help distinguish cats from dogs, despite generally appearing nothing alike. We show how these pathologies manifest as irreducible errors in OOD detection and identify common settings where these methods are ineffective. Additionally, interventions to improve OOD detection such as feature-logit hybrid methods, scaling of model and data size, Bayesian (epistemic) uncertainty representation, and outlier exposure also fail to address the fundamental misspecification.

2022Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned Offline Policy

[openreview] [pdf]

Abstract The high costs and risks involved in extensive environment interactions hinder the practical application of current online safe reinforcement learning (RL) methods. While offline safe RL addresses this by learning policies from static datasets, the performance therein is usually limited due to reliance on data quality and challenges with out-of-distribution (OOD) actions. Inspired by recent successes in offline-to-online (O2O) RL, it is crucial to explore whether offline safe RL can be leveraged to facilitate faster and safer online policy learning, a direction that has yet to be fully investigated. To fill this gap, we first demonstrate that naively applying existing O2O algorithms from standard RL would not work well in the safe RL setting due to two unique challenges: \emph{erroneous Q-estimations}, resulted from offline-online objective mismatch and offline cost sparsity, and \emph{Lagrangian mismatch}, resulted from difficulties in aligning Lagrange multipliers between offline and online policies. To address these challenges, we introduce \textbf{Marvel}, a novel framework for O2O safe RL, comprising two key components that work in concert: \emph{Value Pre-Alignment} to align the Q-functions with the underlying truth before online learning, and \emph{Adaptive PID Control} to effectively adjust the Lagrange multipliers during online finetuning. Extensive experiments demonstrate that Marvel significantly outperforms existing baselines in both reward maximization and safety constraint satisfaction. By introducing the first policy-finetuning based framework for O2O safe RL, which is compatible with many offline and online safe RL methods, our work has the great potential to advance the field towards more efficient and practical safe RL solutions.

2023Integrating Planning into Single-Turn Long-Form Text Generation

[openreview] [pdf]

Abstract Generating high-quality, in-depth textual documents, such as academic papers, news articles, Wikipedia entries, and books, remains a significant challenge for Large Language Models (LLMs). In this paper, we propose to use planning to generate long form content. To achieve our goal, we generate intermediate steps via an auxiliary task that teaches the LLM to plan, reason and structure before generating the final text. Our main novelty lies in a single auxiliary task that does not require multiple rounds of prompting or planning. To overcome the scarcity of training data for these intermediate steps, we leverage LLMs to generate synthetic intermediate writing data such as outlines, key information and summaries from existing full articles. Our experiments demonstrate on two datasets from different domains, namely the scientific news dataset SciNews and Wikipedia datasets in KILT-Wiki and FreshWiki, that LLMs fine-tuned with the auxiliary task generate higher quality documents. We observed +2.5% improvement in ROUGE-Lsum, and a strong 3.60 overall win/loss ratio via human SxS evaluation, with clear wins in organization, relevance, and verifiability.

2024EVINCE: Optimizing Adversarial LLM Dialogues via Conditional Statistics and Information Theory

[openreview] [pdf]

Abstract This paper introduces EVINCE (Entropy and Variation IN Conditional Exchanges), a dialogue framework advancing Artificial General Intelligence (AGI) by enhancing versatility, adaptivity, and reasoning in large language models (LLMs). Leveraging adversarial debate and a novel dual entropy theory, EVINCE improves prediction accuracy, robustness, and stability in LLMs by integrating statistical modeling, information theory, and machine learning to balance diverse perspective exploration with strong prior exploitation. The framework’s effectiveness is demonstrated through consistent convergence of information-theoretic metrics, particularly improved mutual information, fostering productive LLM collaboration. We apply EVINCE to healthcare, showing improved disease diagnosis, and discuss its broader implications for decision-making across domains. This work provides theoretical foundations and empirical validation for EVINCE, paving the way for advancements in LLM collaboration and AGI development.

2025Beyond Data Scarcity: A Frequency-Driven Framework for Zero-Shot Forecasting

[openreview] [pdf]

Abstract Time series forecasting is critical in numerous real-world applications, requiring accurate predictions of future values based on observed patterns. While traditional forecasting techniques work well in in-domain scenarios with ample data, they struggle when data is scarce or not available at all, motivating the emergence of zero-shot and few-shot learning settings. Recent advancements often leverage large-scale foundation models for such tasks, but these methods require extensive data and compute resources, and their performance may be hindered by ineffective learning from the available training set. This raises a fundamental question:What factors influence effective learning from data in time series forecasting?Toward addressing this, we propose using Fourier analysis to investigate how models learn from synthetic and real-world time series data. Our findings reveal that forecasters commonly suffer from poor learning from data with multiple frequencies and poor generalization to unseen frequencies, which impedes their predictive performance. To alleviate these issues, we present a novel synthetic data generation framework, designed to enhance real data or replace it completely by creating task-specific frequency information, requiring only the sampling rate of the target data. Our approach,Freq-Synth, improves the robustness of both foundation as well as non-foundation forecast models in zero-shot and few-shot settings, facilitating more reliable time series forecasting under limited data scenarios.

2026Heavy Labels Out! Dataset Distillation with Label Space Lightening

[openreview] [pdf]

Abstract Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of distilled and original sets on neural networks are similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets. Our code will be available.

2027Understanding the Training and Generalization of Pretrained Transformer for Sequential Decision Making

[openreview] [pdf]

Abstract In this paper, we consider the supervised pre-trained transformer for a class of sequential decision-making problems. The class of considered problems is a subset of the general formulation of reinforcement learning in that there is no transition probability matrix; though seemingly restrictive, the subset class of problems covers bandits, dynamic pricing, and newsvendor problems as special cases. Such a structure enables the use of optimal actions/decisions in the pre-training phase, and the usage also provides new insights for the training and generalization of the pre-trained transformer. We first note the training of the transformer model can be viewed as a performative prediction problem, and the existing methods and theories largely ignore or cannot resolve an out-of-distribution issue. We propose a natural solution that includes the transformer-generated action sequences in the training procedure, and it enjoys better properties both numerically and theoretically. The availability of the optimal actions in the considered tasks also allows us to analyze the properties of the pre-trained transformer as an algorithm and explains why it may lack exploration and how this can be automatically resolved. Numerically, we categorize the advantages of pre-trained transformers over the structured algorithms such as UCB and Thompson sampling into three cases: (i) it better utilizes the prior knowledge in the pre-training data; (ii) it can elegantly handle the misspecification issue suffered by the structured algorithms; (iii) for short time horizon such as T50T\le50, it behaves more greedy and enjoys much better regret than the structured algorithms designed for asymptotic optimality.

2028Measuring Diversity: Axioms and Challenges

[openreview] [pdf]

Abstract The concept of diversity is widely used in various applications: from image or molecule generation to recommender systems. Thus, being able to properly measure diversity is important. This paper addresses the problem of evaluating diversity for a set of objects. First, we make a systematic review of existing diversity measures and explore their undesirable behavior in some cases. Based on this review, we formulate three desirable properties (axioms) of a reliable diversity measure: monotonicity, uniqueness, and continuity. We show that none of the existing diversity measures has all three properties and thus these measures are not suitable for quantifying diversity. Then, we construct two examples of measures that have all the desirable properties, thus proving that the list of axioms is not self-contradicting. Unfortunately, the constructed examples are too computationally complex for practical use, thus we pose an open problem of constructing a diversity measure that has all the listed properties and can be computed in practice.

2029AVOIDING BARREN PLATEAUS VIA GAUSSIAN MIXTURE MODEL

[openreview] [pdf]

Abstract Variational quantum algorithms is one of the most representative algorithms in quantum computing, which has a wide range of applications in quantum machine learning, quantum simulation and other related fields. However, they face challenges associated with the barren plateau phenomenon, especially when dealing with large numbers of qubits, deep circuit layers, or global cost functions, making them often untrainable. In this paper, we propose a novel parameter initialization strategy based on Gaussian Mixture Models. We rigorously prove that, the proposed initialization method consistently avoids the barren plateaus problem for hardware-efficient ansatz with arbitrary length and qubits and any given cost function. Specifically, we find that the gradient norm lower bound provided by the proposed method is independent of the number of qubits N and increases with the circuit depth L. Our results strictly highlight the significance of Gaussian Mixture model initialization strategies in determining the trainability of quantum circuits, which provides valuable guidance for future theoretical investigations and practical applications.

2030Adversarial Machine Unlearning

[openreview] [pdf]

Abstract This paper focuses on the challenge of machine unlearning, aiming to remove the influence of specific training data on machine learning models. Traditionally, the development of unlearning algorithms runs parallel with that of membership inference attacks (MIA), a type of privacy threat to determine whether a data instance was used for training. However, the two strands are intimately connected: one can view machine unlearning through the lens of MIA success with respect to removed data. Recognizing this connection, we propose a game-theoretic framework that integrates MIAs into the design of unlearning algorithms. Specifically, we model the unlearning problem as a Stackelberg game in which an unlearner strives to unlearn specific training data from a model, while an auditor employs MIAs to detect the traces of the ostensibly removed data. Adopting this adversarial perspective allows the utilization of new attack advancements, facilitating the design of unlearning algorithms. Our framework stands out in two ways. First, it takes an adversarial approach and proactively incorporates the attacks into the design of unlearning algorithms. Secondly, it uses implicit differentiation to obtain the gradients that limit the attacker’s success, thus benefiting the process of unlearning. We present empirical results to demonstrate the effectiveness of the proposed approach for machine unlearning.

2031Contrastive Learners Are Semantic Learners

[openreview] [pdf]

Abstract In this work, we explore the definition of semantic equivalence to establish a connection between contrastive tasks and their downstream counterparts. Specifically, we investigate when a contrastive dataset can learn representations that encode semantic relations for a specific downstream task. In our analysis, we recover a surprising hypothesis resembling the distributional one—dubbed distributional alignment hypothesis. Under this assumption, we demonstrate that a simple contrastive learning procedure can generate representations that encode semantic relations for the downstream task. Furthermore, we support the theory with a series of experiments designed to test the presented intuitions.

2032TabDiff: a Multi-Modal Diffusion Model for Tabular Data Generation

[openreview] [pdf]

Abstract Synthesizing high-quality tabular data is an important topic in many data science tasks, ranging from dataset augmentation to privacy protection. However, developing expressive generative models for tabular data is challenging due to its inherent heterogeneous data types, complex inter-correlations, and intricate column-wise distributions. In this paper, we introduce TabDiff, a joint diffusion framework that models all multi-modal distributions of tabular data in one model. Our key innovation is the development of a joint continuous-time diffusion process for numerical and categorical data, where we propose feature-wise learnable diffusion processes to counter the high disparity of different feature distributions. TabDiff is parameterized by a transformer handling different input types, and the entire framework can be efficiently optimized in an end-to-end fashion. We further introduce a multi-modal stochastic sampler to automatically correct the accumulated decoding error during sampling, and propose classifier-free guidance for conditional missing column value imputation. Comprehensive experiments on seven datasets demonstrate that TabDiff achieves superior average performance over existing competitive baselines across seven out of eight metrics, with up to 22.522.5% improvement over the state-of-the-art model on pair-wise column correlation estimations.

2033Faster Diffusion Sampling with Randomized Midpoints: Sequential and Parallel

[openreview] [pdf]

Abstract Sampling algorithms play an important role in controlling the quality and runtime of diffusion model inference. In recent years, a number of works (Chen et al., 2023c;b; Benton et al., 2023; Lee et al., 2022) have analyzed algorithms for diffusion sampling with provable guarantees; these works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels.In this work, we propose a new scheme inspired by Shen and Lee’s randomized midpoint method for log-concave sampling (Shen & Lee, 2019). We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance (O~(d5/12)\widetilde O(d^{5/12}) compared to O~(d)\widetilde O(\sqrt{d}) from prior work). We also show that our algorithm can be parallelized to run in only O~(log2d)\widetilde O(\log^2 d) parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models.As a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence O~(d5/12)\widetilde O(d^{5/12}) compared to O~(d)\widetilde O(\sqrt{d}) from prior work.

2034A Unified Approach Towards Active Learning and Out-of-Distribution Detection

[openreview] [pdf]

Abstract When applying deep learning models in real-world scenarios, active learning (AL) strategies are crucial for identifying label candidates from a nearly infinite amount of unlabeled data. In this context, robust out-of-distribution (OOD) detection mechanisms are essential for handling data outside the target distribution of the application. However, current works investigate both problems separately. In this work, we introduce SISOM as the first unified solution for both AL and OOD detection. By leveraging feature space distance metrics SISOM combines the strengths of the currently independent tasks to solve both effectively. We conduct extensive experiments showing the problems arising when migrating between both tasks. In these evaluations SISOM underlined its effectiveness by achieving first place in two of the widely used OpenOOD benchmarks and second place in the remaining one. In AL, SISOM outperforms others and delivers top-1 performance in three benchmarks.

2035Understanding Reasoning with Looped Models

[openreview] [pdf]

Abstract Large language models have shown promising abilities in reasoning problems and scaling laws suggest that parameter count is a key driver. Recent works (Chen & Zou, 2024; Ye et al., 2024) argue that for reasoning, depth plays a very important role in addition to parameter count. In this work, we make a more fine-grained claim — many reasoning problems require large depth but not necessarily many parameters, in the sense that they can be solved via looped models. This unlocks a novel application of looped models for reasoning. We empirically study various synthetic reasoning problems like addition, variable assignment and math problems. For each of these, we find that kk-layer transformer model looped LL times nearly matches the quality of a kLkL-layer non-looped model and is much better than a k-layer model. Thus, using a small model and providing depth via looping can suffice for such reasoning problems. We then show theoretical results proving that many such reasoning problems can be solved via iterative algorithms, and thus, can be solved with looped models. Motivated by these findings, we train autoregressive models on general language modeling datasets with looping and compare a kk-layer model looped LL times to a kLkL-layer model. While the looped model is understandably worse on perplexity and memorization tasks, it surprisingly does very well on tasks that require reasoning, like open book QA, math word problems and reasoning primitives. Despite having significantly fewer parameters, it can even match or outperform the non-looped kLkL-layer model on some of these tasks. These results suggest a novel inductive bias of looped models towards enhanced reasoning. We provide further evidence for this inductive bias by visualizing perplexity vs downstream isoplots, and design a looping-inspired regularization that solidifies this hypothesis.

2036Collaborative Compressors in Distributed Mean Estimation with Limited Communication Budge

[openreview] [pdf]

Abstract Distributed high dimensional mean estimation is a common aggregation routine used often in distributed optimization methods (e.g. federated learning). Most of these applications call for a communication-constrained setting where vectors, whose mean is to be estimated, have to be compressed before sharing. One could independently encode and decode these to achieve compression, but that overlooks the fact that these vectors are often similar with each other. To exploit these similarities, recently Suresh et al., 2022, Jhunjhunwala et al., 2021, Jiang et al, 2023, proposed multiple {\em correlation-aware compression schemes.} However, in most cases, the correlations have to be known for these schemes to work. Moreover, a theoretical analysis of graceful degradation of these correlation-aware compression schemes with increasing {\em dissimilarity} is limited to only the 2\ell_2-error in the literature. In this paper, we propose three different collaborative compression schemes that agnostically exploit the similarities among vectors in a distributed setting. Our schemes are all simple to implement and computationally efficient, while resulting in big savings in communication. We do a rigorous theoretical analysis of our proposed schemes to show how the 2\ell_2, \ell_\infty and cosine estimation error varies with the degree of similarity among vectors. In the process, we come up with appropriate dissimilarity-measures for these applications as well.

2037Topic and Description Reasoning Generation based on User-Contributed Comments

[openreview] [pdf]

Abstract We propose Topic and Description Reasoning Generation (TDRG), a text inference and generation method based on user-contributed comments with large language models (LLMs). Unlike summarization methods, TDRG can infer the topic according to comments contributed by different users, and generate a readable description that addresses the issue of the lack of interpretability in traditional topic modeling for text mining. In this paper, we adopted zero-shot and fine-tuning methods to generate topics and descriptions for comments. We use a human-annotated YouTube comment dataset to evaluate performance. Our results demonstrate that the potential of large language models of reasoning the topic and description. Generated topic titles and descriptions are similar to human references in textual semantics, but the words used are different from those of humans.

2038Cross-Domain Graph Data Scaling: A Showcase with Diffusion Models

[openreview] [pdf]

Abstract Models for natural language and images benefit from data scaling behavior: the more data fed into the model, the better they perform. This ‘better with more’ phenomenon enables the effectiveness of large-scale pre-training on vast amounts of data. However, current graph pre-training methods struggle to scale up data due to heterogeneity across graphs. To achieve effective data scaling, we aim to develop a general model that is able to capture diverse data patterns of graphs and can be utilized to adaptively help the downstream tasks. To this end, we propose UniAug, a universal graph structure augmentor built on a diffusion model. We first pre-train a discrete diffusion model on thousands of graphs across domains to learn the graph structural patterns. In the downstream phase, we provide adaptive enhancement by conducting graph structure augmentation with the help of the pre-trained diffusion model via guided generation. By leveraging the pre-trained diffusion model for structure augmentation, we consistently achieve performance improvements across various downstream tasks in a plug-and-play manner. To the best of our knowledge, this study represents the first demonstration of a data-scaling graph structure augmentor on graphs across domains.

2039Merlin: Multi-View Representation Learning for Robust Multivariate Time Series Forecasting with Unfixed Missing Rates

[openreview] [pdf]

Abstract Multivariate time series forecasting (MTSF) aims to predict the future values of multiple interrelated time series and support decision-making. While deep learning models have attracted much attention in MTSF for their powerful spatial-temporal encoding capabilities, they frequently encounter the challenge of missing data resulting from numerous malfunctioning data collectors in practice. In this case, existing models only rely on sparse observations and are prone to capturing incorrect semantics, leading to a decline in their forecasting performance. Furthermore, the unfixed missing rates across different samples in reality pose robustness challenges. To address these issues, we propose Multi-View Representation Learning (Merlin) based on offline knowledge distillation and multi-view contrastive learning, which aims to help existing models achieve semantic alignment between sparse observations with different missing rates and complete observations, and enhance their robustness. On the one hand, we introduce offline knowledge distillation where a teacher model guides a student model in learning how to extract semantics from sparse observations similar to those obtainable from complete observations. On the other hand, we construct positive and negative data pairs using sparse observations with different missing rates. Then, we use multi-view contrastive learning to help the student model align semantics across sparse observations with different missing rates, thereby further enhancing its robustness. In this way, Merlin can fully enhance the robustness of existing forecasting models to MTS with unfixed missing rates and achieves high-precision MTSF with sparse observations. Experiments on four real-world datasets validate our motivation and demonstrate the superiority and practicability of Merlin.

2040Ordinal Preference Optimization: Aligning Human Preferences via NDCG

[openreview] [pdf]

Abstract Aligning Large Language Models (LLMs) with diverse human preferences is a pivotal technique for controlling model behaviors and enhancing generation quality. Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and their variants optimize language models by pairwise comparisons. However, when multiple responses are available, these approaches fall short of leveraging the extensive information in the ranking given by the reward models or human feedback. In this work, we propose a novel listwise approach named Ordinal Preference Optimization (OPO), which employs the Normalized Discounted Cumulative Gain (NDCG), a widely-used ranking metric, to better utilize relative proximity within ordinal multiple responses. We develop an end-to-end preference optimization algorithm by approximating NDCG with a differentiable surrogate loss. This approach builds a connection between ranking models in information retrieval and the alignment problem. In aligning multi-response datasets assigned with ordinal rewards, OPO outperforms existing pairwise and listwise approaches on evaluation sets and general benchmarks like AlpacaEval. Moreover, we demonstrate that increasing the pool of negative samples can enhance model performance by reducing the adverse effects of trivial negatives.

2041EXAGREE: Towards Explanation Agreement in Explainable Machine Learning

[openreview] [pdf]

Abstract Explanations in machine learning are critical for trust, transparency, and fairness. Yet, complex disagreements among these explanations limit the reliability and applicability of machine learning models, especially in high-stakes environments. We formalize four fundamental ranking-based explanation disagreement problems and introduce a novel framework, EXplanation AGREEment (EXAGREE), to bridge diverse interpretations in explainable machine learning, particularly from stakeholder-centric perspectives. Our approach leverages a Rashomon set for attribution predictions and then optimizes within this set to identify Stakeholder-Aligned Explanation Models (SAEMs) that minimize disagreement with diverse stakeholder needs while maintaining predictive performance. Rigorous empirical analysis on synthetic and real-world datasets demonstrates that EXAGREE reduces explanation disagreement and improves fairness across subgroups in various domains. EXAGREE not only provides researchers with a new direction for studying explanation disagreement problems but also offers data scientists a tool for making better-informed decisions in practical applications.

2042LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning

[openreview] [pdf]

Abstract Mastering a skill generally relies on both hands-on experience from doers and insightful, high-level guidance by mentors. Will this strategy also work well for solving complex non-convex optimization problems? Here, a common gradient-based optimizer acts like a disciplined doer, making locally optimal updates at each step. Large Language Models (LLMs) can also search for better solutions by inferring from natural language instructions, akin to a high-level mentor. In this paper, we show that these two participators are complementary to each other and can effectively collaborate as a combined optimization framework. The collaborative optimization is achieved by alternating between the gradient-based and LLM-based optimizers. We instruct LLMs to generate possibly improved solutions by taking parameter trajectories recorded during the previous stage of gradient-based optimization into account. Inferred results of LLMs are used as restarting points for the next stage of gradient optimization. We verify the effectiveness of this optimization framework on prompt tuning. By leveraging both the locally rigorous gradient-based optimizer and the high-level deductive LLM-based optimizer, the combined optimization method consistently yields improvements over competitive baselines on a variety of tasks. Our results demonstrate the synergistic effect of conventional gradient-based optimization and the inference ability of LLMs. The code will be made publicly available.

2043MetaInv: Overcoming Iterative and Direct Method Limitations for Inverse Learning

[openreview] [pdf]

Abstract Invertible neural networks (INNs) have gained significant traction in tasks requiring reliable bidirectional inferences, such as data encryption, scientific computing, and real-time control. However, iterative methods like i-ResNet face notable limitations, including instability on non-contractive mappings and failure in scenarios requiring strict one-to-one mappings. In contrast, analytical approaches like DipDNN guarantee invertibility but at the expense of performance, particularly in tasks demanding rich feature extraction (e.g., convolutional operations in complex image processing). This work presents a detailed analysis of the limitations in current invertible architectures, examining the trade-offs between iterative and analytical approaches. We identify key failure modes, particularly when handling information redundancy or strict bijections, and propose a meta-inverse framework that dynamically combines the advantages of both i-ResNet and DipDNN. Our framework adapts in real-time based on task-specific signals, ensuring both flexibility and guaranteed invertibility. Extensive experiments across diverse domains demonstrate that our hybrid approach outperforms existing methods in forward accuracy, inverse consistency, and computational efficiency. Our results highlight the utility of this meta-inverse strategy for critical applications where precision, stability, and adaptability are crucial.

2044Rare event modeling with self-regularized normalizing flows: what can we learn from a single failure?

[openreview] [pdf]

Abstract Increased deployment of autonomous systems in fields like transportation and robotics have seen a corresponding increase in safety-critical failures. These failures can be difficult to model and debug due to the relative lack of data: compared to tens of thousands of examples from normal operations, we may have only seconds of data leading up to the failure. This scarcity makes it challenging to train generative models of rare failure events, as existing methods risk either overfitting to noise in the limited failure dataset or underfitting due to an overly strong prior. We address this challenge with CalNF, or calibrated normalizing flows, a self-regularized framework for posterior learning from limited data. CalNF achieves state-of-the-art performance on data-limited failure modeling and inverse problems and enables a first-of-a-kind case study into the root causes of the 2022 Southwest Airlines scheduling crisis.

2045Universal Black-Box Reward Poisoning Attack against Offline Reinforcement Learning

[openreview] [pdf]

Abstract We study the problem of universal black-boxed reward poisoning attacks against general offline reinforcement learning with deep neural networks. We consider a black-box threat model where the attacker is entirely oblivious to the learning algorithm, and its budget is limited by constraining the amount of corruption at each data point and the total perturbation. We require the attack to be universally efficient against any efficient algorithms that might be used by the agent. We propose an attack strategy called the `policy contrast attack.’ The idea is to find low- and high-performing policies covered by the dataset and make them appear to be high- and low-performing to the agent, respectively. To the best of our knowledge, we propose the first universal black-box reward poisoning attack in the general offline RL setting. We provide theoretical insights on the attack design and empirically show that our attack is efficient against current state-of-the-art offline RL algorithms in different learning datasets.

2046Bayesian Neighborhood Adaptation for Graph Neural Networks

[openreview] [pdf]

Abstract The neighborhood scope (i.e., number of hops) where graph neural networks (GNNs) aggregate information to characterize a node’s statistical property is critical to GNNs’ performance. Two-stage approaches, training and validating GNNs for every pre-specified neighborhood scope to search for the best setting, is a daunting and time-consuming task and tends to be biased due to the search space design. How to adaptively determine proper neighborhood scopes for the aggregation process for both homophilic and heterophilic graphs remains largely unexplored. We thus propose to model the GNNs’ message-passing behavior on a graph as a stochastic process by treating the number of hops as a beta process. This Bayesian framework allows us to infer the most plausible neighborhood scope for messsage aggregation simultaneously with the optimization of GNN parameters. Our theoretical analysis show the scope inference improves the expressivity of GNN models. Experiments on benchmark homophilic and heterophilic datasets show that the proposed method is compatible with state-of-the-art GNN variants, improving their performance and providing well-calibrated predictions.

2047Symmetric Space Learning for Combinatorial Generalization

[openreview] [pdf]

Abstract Symmetries on representations within generative models have shown essential roles in predicting unobserved combinations of semantic changes, known as combinatorial generalization tasks. However, these efforts have primarily focused on learning symmetries from only training data, and thus, the extension of trained symmetries to unseen samples remains uncontrolled. A potential approach for generalizing the symmetries is leveraging geometric information on manifolds that contain functional semantic structures for unseen data, but it still falls short of supporting symmetry learning. In this paper, we address this symmetry generalization\textit{symmetry generalization} by forcing symmetric space\textit{symmetric space} on latent space for utilizing semantic structures from symmetry and manifold perspectives. We clarify an equivariance-based constraint that restricts symmetry generalization, and prove that: 1) enforcing the homogeneous space property of symmetric space onto the data manifold eliminates this constraint, 2) a homogeneous latent manifold induces the data manifold through diffeomorphic data-to-latent mapping, and 3) the isometry property of symmetric space extends neighbor symmetries of a point to another within the space. For practical implementation, we propose a method to align sampled points from symmetric space with their explicitly trained geodesic. We verify the method in a detailed analysis on a toy dataset and enhance combinatorial generalization on common benchmarks. This work represents the first effective effort to align symmetries with manifolds for combinatorial generalization.

2048Long-Short Decision Transformer: Bridging Global and Local Dependencies for Generalized Decision-Making

[openreview] [pdf]

Abstract Decision Transformers (DTs) effectively capture long-range dependencies using self-attention but struggle with fine-grained local relationships, especially the Markovian properties in many offline-RL datasets. Conversely, Decision Convformer (DC) utilizes convolutional filters for capturing local patterns but shows limitations in tasks demanding long-term dependencies, such as Maze2d. To address these limitations and leverage both strengths, we propose the Long-Short Decision Transformer (LSDT), a general-purpose architecture to effectively capture global and local dependencies across two specialized parallel branches (self-attention and convolution). We explore how these branches complement each other by modeling various ranged dependencies across different environments, and compare it against other baselines. Experimental results demonstrate our LSDT achieves state-of-the-art performance and notable gains over the standard DT in D4RL offline RL benchmark. Leveraging the parallel architecture, LSDT performs consistently on diverse datasets, including Markovian and non-Markovian. We also demonstrate the flexibility of LSDT’s architecture, where its specialized branches can be replaced or integrated into models like DC to improve their performance in capturing diverse dependencies. Finally, we also highlight the role of goal states in improving decision-making for goal-reaching tasks like Antmaze.

2049UniCoTT: A Unified Framework for Structural Chain-of-Thought Distillation

[openreview] [pdf]

Abstract Chains of thought (CoTs) have achieved success in enhancing the reasoning capabilities of large language models (LLMs), while their effectiveness is predominantly observed in LLMs. Existing solutions methods adopt distillation to inject chain-of-thought capabilities into small models (SLMs). However, they: (1) can not guarantee the rationality of the generated explanation due to hallucinations; (2) ignore diverse structures of CoT during knowledge transfer. In this paper, we propose a unified CoT distillation framework termed UniCoTT for considering diverse structural CoTs (\emph{i.e.}, chain, tree, and graph). UniCoTT contains two core strategies: iterative construction for structured CoTs and the structural constraint strategy. Specifically, UniCoTT prompts LLMs to iteratively produce accurate explanations with answers and unifies structured explanations as UniCoT which is seen as a bridge for knowledge transfer. Furthermore, UniCoTT utilizes the proposed unified supervised learning and structural consistency learning strategies to transfer knowledge of structured CoT to SLMs. Experimental results show that UniCoTT can significantly improve the performance of SLMs on multiple datasets across different NLP tasks.Our code is available in our supplementary materials.

2050ComboStoc: Combinatorial Stochasticity for Diffusion Generative Models

[openreview] [pdf]

Abstract In this paper, we study an under-explored but important factor of diffusion generative models, i.e., the combinatorial complexity. Data samples are generally high-dimensional, and for various structured generation tasks, additional attributes are combined to associate with data samples. We show that the space spanned by the combination of dimensions and attributes is insufficiently sampled by existing training scheme of diffusion generative models, causing degraded test time performance. We present a simple fix to this problem by constructing stochastic processes that fully exploit the combinatorial structures, hence the name ComboStoc. Using this simple strategy, we show that network training is significantly accelerated across diverse data modalities, including images and 3D structured shapes. Moreover, ComboStoc enables a new way of test time generation which uses asynchronous time steps for different dimensions and attributes, thus allowing for varying degrees of control over them.

2051Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient

[openreview] [pdf]

Abstract In contrast to moderate-size neural network pruning, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning, however, their pruning criteria often rely on heuristically hand-crafted metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method eliminates the back-propagation through the LLM per se during the optimization, requiring only the forward pass of the LLM. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally initialize with a metric-based method (for our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-arts w.r.t. both perplexity and the majority of various zero-shot tasks. Codes will be released.

2052Zero redundancy distributed learning with differential privacy

[openreview] [pdf]

Abstract Deep learning using large models have achieved great success in a wide range of domains. However, training these models on billions of parameters is very challenging in terms of the training speed, memory cost, and communication efficiency, especially under the privacy-preserving regime with differential privacy (DP). On the one hand, DP optimization has comparable efficiency to the standard non-private optimization on a single GPU, but on multiple GPUs, existing DP distributed learning (such as pipeline parallel) has suffered from significantly worse efficiency. On the other hand, the Zero Redundancy Optimizer (ZeRO) is a state-of-the-art solution to the standard distributed learning, exhibiting excellent training efficiency on large models, but to work compatibly with DP is technically complicated. In this work, we develop a new systematic solution, DP-ZeRO, (I) to scale up the trainable DP model size, e.g. to GPT-100B, (II) to obtain the same computation and communication efficiency as the standard ZeRO, and (III) to enable mixed-precision DP training. Our DP-ZeRO, like the standard ZeRO, has the potential to train models with arbitrary size and is evaluated on the world’s largest DP models in terms of the number of trainable parameters.

2053ConvINT: A Semi-Structured Intention Framework for Conversational Understanding

[openreview] [pdf]

Abstract Understanding user intentions is critical for conversational AI, especially with the rise of large language models (LLMs) that demand a more nuanced comprehension of dialogue. Existing approaches, relying on rigid slot-value structures or unstructured representations, often miss the complexity of human intentions. In this work, we propose ConvINT, a novel semi-structured intention framework that offers a more holistic and fine-grained understanding of user intentions by organizing them into four key aspects: situation, emotion, action, and knowledge. Grounded in psychological and cognitive intention theories, ConvINT provides LLMs with a richer context for understanding user inputs while offering a semi-structured format that seamlessly integrates with prompt-based intention learning. To enable the efficient adoption of this framework, we introduce a Weakly-supervised Reinforced Generation (WeRG) method that scales ConvINT annotations across large datasets with high quality. By combining a small set of human-annotated instances with coarsely labeled data as weak supervision signals, WeRG effectively learns to generate ConvINT annotations, ensuring both scalability and precision. Experimental results demonstrate that integrating ConvINT with WeRG markedly improves LLMs’ ability to comprehend user intentions, yielding significant gains in downstream tasks such as response generation and task completion, as validated by both automatic metrics and human evaluations. These findings highlight ConvINT’s potential as a comprehensive and adaptable framework for advancing intention understanding in conversational AI.

2054Long-Sequence Recommendation Models Need Decoupled Embeddings

[openreview] [pdf]

Abstract Lifelong user behavior sequences, comprising up to tens of thousands of history behaviors, are crucial for capturing user interests and predicting user responses in modern recommendation systems. A two-stage paradigm is typically adopted to handle these long sequences: a few relevant behaviors are first searched from the original long sequences via an attention mechanism in the first stage and then aggregated with the target item to construct a discriminative representation for prediction in the second stage. In this work, we identify and characterize, for the first time, a neglected deficiency in existing long-sequence recommendation models: a single set of embeddings struggles with learning both attention and representation, leading to interference between these two processes. Initial attempts to address this issue using linear projections---a technique borrowed from language processing---proved ineffective, shedding light on the unique challenges of recommendation models. To overcome this, we propose the Decoupled Attention and Representation Embeddings (DARE) model, where two distinct embedding tables are initialized and learned separately to fully decouple attention and representation. Extensive experiments and analysis demonstrate that DARE provides more accurate search of correlated behaviors and outperforms baselines with AUC gains up to 9‰ on public datasets and notable online system improvements. Furthermore, decoupling embedding spaces allows us to reduce the attention embedding dimension and accelerate the search procedure by 50% without significant performance impact, enabling more efficient, high-performance online serving.

2055Labits: Layered Bidirectional Time Surfaces Representation for Event Camera-based Continuous Dense Trajectory Estimation

[openreview] [pdf]

Abstract Event cameras provide a compelling alternative to traditional frame-based sensors, capturing dynamic scenes with high temporal resolution and low latency. Moving objects trigger events with precise timestamps along their trajectory, enabling smooth continuous-time estimation. However, few works have attempted to optimize the information loss during event representation construction, imposing a ceiling on this task. Fully exploiting event cameras requires representations that simultaneously preserve fine-grained temporal information, stable and characteristic 2D visual features, and temporally consistent information density—an unmet challenge in existing representations. We introduce Labits: Layered Bidirectional Time Surfaces, a simple yet elegant representation designed to retain all these features. Additionally, we propose a dedicated module for extracting active pixel local optical flow (APLOF), significantly boosting the performance. Our approach achieves an impressive 49% reduction in trajectory end-point error (TEPE) compared to the previous state-of-the-art on the MultiFlow dataset. Labits offers a potential out-of-the-box performance boost for other event-based local motion-sensitive tasks, with code to be released upon acceptance.

2056Narrowing the Focus: Learned Optimizers for Pretrained Models

[openreview] [pdf]

Abstract In modern deep learning, the models are learned by applying gradient updates using an optimizer, which transforms the updates based on various statistics. Optimizers are often hand-designed and tuning their hyperparameters is a big part of the training process. Learned optimizers have shown some initial promise, but are generally unsuccessful as a general optimization mechanism applicable to every problem. In this work we explore a different direction: instead of learning general optimizers, we instead specialize them to a specific training environment. We propose a novel optimizer technique that learns a layer-specific linear combination of update directions provided by a set of base optimizers, effectively adapting its strategy to the specific model and dataset. When evaluated on image classification tasks, this specialized optimizer significantly outperforms both traditional off-the-shelf methods such as Adam, as well as existing general learned optimizers. Moreover, it demonstrates robust generalization with respect to model initialization, evaluating on unseen datasets, and training durations beyond its meta-training horizon.

2057What Matters When Repurposing Diffusion Models for General Dense Perception Tasks?

[openreview] [pdf]

Abstract Extensive pre-training with large data is indispensable for downstream geometry and semantic visual perception tasks. Thanks to large-scale text-to-image (T2I) pretraining, recent works show promising results by simply fine-tuning T2I diffusion models for dense perception tasks. However, several crucial design decisions in this process still lack comprehensive justification, encompassing the necessity of the multi-step stochastic diffusion mechanism, training strategy, inference ensemble strategy, and fine-tuning data quality. In this work, we conduct a thorough investigation into critical factors that affect transfer efficiency and performance when using diffusion priors. Our key findings are: 1) High-quality fine-tuning data is paramount for both semantic and geometry perception tasks. 2) The stochastic nature of diffusion models has a slightly negative impact on the deterministic perception tasks. 3) Apart from fine-tuning the diffusion model with only latent space supervision, task-specific image-level supervision is beneficial to enhance fine-grained details. These observations culminate in the development of GenPercept, an effective deterministic one-step fine-tuning paradigm tailed for dense visual perception tasks. Different from the previous multi-step methods, our paradigm has a much faster inference speed, and can be seamlessly integrated with customized perception decoders and loss functions for image-level supervision, which is critical to improving the fine-grained details of predictions. Comprehensive experiments on diverse dense visual perceptual tasks, including monocular depth estimation, surface normal estimation, image segmentation, and matting, are performed to demonstrate the remarkable adaptability and effectiveness of our proposed method.

2058ACCO: Accumulate while you Communicate, Hiding Communications in Distributed LLM Training

[openreview] [pdf]

Abstract Training Large Language Models (LLMs) relies heavily on distributed implementations, employing multiple GPUs to compute stochastic gradients on model replicas in parallel. However, synchronizing gradients in data parallel settings induces a communication overhead increasing with the number of distributed workers, impeding the efficiency gains of parallelization. To address this challenge, local optimization algorithms such as the ones used in Federated Learning have emerged. While effective in minimizing communication overhead, they incur significant memory costs, hindering scalability: in addition to extra momentum variables, optimizer’s states cannot be partitioned among workers as communications are only allowed between rounds of local optimization steps. To conceal communication costs, we propose instead to synchronize delayed gradientswhilecomputing new ones between each model’s update and introduce AC\textbf{AC}cumulate while CO\textbf{CO}mmunicate (ACCO\textbf{ACCO}), a memory-efficient optimization algorithm tailored for distributed training of LLMs. Accumulating local gradients on the workers until the communication finishes naturally reduces the idle time of GPUs and even allows the use of heterogeneous hardware. However, we show that the one-step delay inherent in parallel execution of gradient computations and communications has drastic impacts on Transformers’ convergence. To compensate this delay we introduce a novel technique which leads to training dynamics aligned with standard distributed optimization. Compared to ZeRO, our implementation and experiments on several LLMs pre-training and fine-tuning tasks demonstrates that ACCO\textbf{ACCO} reduces the learning time up to 87% and successfully allows both sharding optimizer states across workers and the use of heterogeneous hardware.

2059DualFast: Dual-Speedup Framework for Fast Sampling of Diffusion Models

[openreview] [pdf]

Abstract Diffusion probabilistic models (DPMs) have achieved impressive success in visual generation. While, they suffer from slow inference speed due to iterative sampling. Employing fewer sampling steps is an intuitive solution, but this will also introduces discretization error. Existing fast samplers make inspiring efforts to reduce discretization error through the adoption of high-order solvers, potentially reaching a plateau in terms of optimization. This raises the question: can the sampling process be expedited further? In this paper, we re-examine the nature of sampling errors, discerning that they comprise two distinct elements: the widely recognized discretization error and the less acknowledged approximation error. Our research elucidates the dynamics between these errors and the step by implementing a dual-error disentanglement strategy. Building on these foundations, we introduce an unified and training-free acceleration framework, DualFast, designed to enhance the speed of DPM sampling by concurrently accounting for both error types, thereby minimizing the total sampling error. DualFast is seamlessly compatible with existing samplers and significantly boost their sampling quality and speed, particularly in extremely few sampling steps. We substantiate the effectiveness of our framework through comprehensive experiments, spanning both unconditional and conditional sampling domains, across both pixel-space and latent-space DPMs.

2060Watchmaker Functions and Meta Specification of Open-Ended Learning Systems

[openreview] [pdf]

Abstract Open-ended learning systems aim to foster the continuous evolution of increasingly capable agents through the dynamic generation of novel challenges. The efficacy of these systems is fundamentally influenced by two critical factors: the design of the underlying system, which delineates the space of possibilities, and the open-ended algorithms that drive ongoing progress within this space. Current approaches to system design rely on explicit specification, where state spaces and evolution functions are fully defined at design time, often leading to prohibitive design complexity as systems scale. To address this challenge, we propose an alternative design principle termedmeta specification. This approach defines systems implicitly through constraints, utilizingwatchmaker functions—generalized stochastic evolution functions—coupled with verification routines to perform system evolution. Meta specification principles have the potential to significantly expand the space of possibilities while reducing design complexity, thereby enhancing the potential for open-ended learning. We demonstrate the viability of this principle through an illustrative implementation that co-evolves robot morphologies and robotic tasks, showcasing its capacity for emergent novelty and highlighting the shift in focus towards verification in system design.

2061Quantile-Optimal Policy Learning under Unmeasured Confounding

[openreview] [pdf]

Abstract We study quantile-optimal policy learning where the goal is to find a policy whose reward distribution has the largest α\alpha-th quantile for some α(0,1)\alpha \in (0, 1). We focus on the offline setting whose generating process involves unobserved confounders. Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding issue, and (iii) insufficient coverage of the offline dataset. To address these challenges, we propose a suite of causal-assisted policy learning methods that provably enjoy strong theoretical guarantees under mild conditions. In particular, to address (i) and (ii), using causal inference tools such as instrumental variables and negative controls, we propose to estimate the quantile objectives by solving nonlinear functional integral equations. Then we adopt a minimax estimation approach with nonparametric models to solve these integral equations, and propose to construct conservative policy estimates that address (iii). The final policy is the one that maximizes these pessimistic estimates. In addition, we propose a novel regularized policy learning method that is more amenable to computation. Finally, we prove that the policies learned by these methods are O~(n1/2)\tilde{O}(n^{-1/2}) quantile-optimal under a mild coverage assumption on the offline dataset. To our best knowledge, we propose the first sample-efficient policy learning algorithms for estimating the quantile-optimal policy when there exists unmeasured confounding.

2062Enhancing Time-Series Forecasting with Iterative Decomposition and Separable Training

[openreview] [pdf]

Abstract Time series data, crucial for decision-making in fields like finance and healthcare, often presents challenges due to its inherent complexity, exacerbating the bias-variance tradeoff and leading to overfitting and underfitting in conventional forecasting models. While promising, state-of-the-art models like PatchTST, iTransformer, and DLinear are hindered by this tradeoff, limiting their ability to separate predictable patterns from noise. To resolve this, we propose the IDEAS framework, which reduces the bias-variance tradeoff to help models achieve optimal performance. IDEAS combines iterative residual decomposition, which reduces bias by extracting predictable patterns, and separable training, which reduces variance by independently optimizing each component. We provide theoretical proof and demonstrate through experiments that IDEAS significantly improves performance across four state-of-the-art models on nine complex benchmark datasets, offering a more robust solution for complex time series forecasting.

2063Is the sparsity of high dimensional spaces the reason why VAEs are poor generative models?

[openreview] [pdf]

Abstract Variational autoencoders (VAE) encode data into lower dimension latent vectors before decoding those vectors back to data. Once trained, decoding a random latent vector usually does not produce meaningful data, at least when the latent space has more than a dozen dimensions. In this paper, we investigate this issue drawing insight from high dimensional physical systems such as spin-glasses, which exhibit a phase transition from a high entropy random configuration to a lower energy and more organised state when cooled quickly in the presence of a magnetic field. The latent of a standard VAE is by definition close to a uniform distribution on a hypersphere, and thus similar to the high entropy spin-glass state. We propose to formulate the latent variables of a VAE using hyperspherical coordinates, which allows to compress the latent vectors towards an island on the hypersphere, thereby reducing the latent sparsity, analogous to a quenched spin-glass. We show that this is feasible with modest computational increase and that it improves the generation ability of the VAE.

2064MvHSTM: A Multi-view Hypergraph Spatio-Temporal Model for Traffic Speed Forecasting

[openreview] [pdf]

Abstract Accurate traffic speed prediction is critical in modern society as it is effective for both individuals and authorities. Due to the large scale of urban road networks, traffic speed exhibits complex spatio-temporal dependencies, not only among adjacent nodes but also across the network, reflecting both local and cross-regional simultaneous correlations. However, existing studies have not effectively addressed these characteristics. In this context, we propose a novel framework called Multi-view Hypergraph Spatio-Temporal Model (MvHSTM) that employs a temporal transformer to capture temporal dependencies and utilizes hypergraph convolutional networks to inherently model spatial relationships. Specifically, we introduce two hypergraph construction methods, the Geographical Adjacency Hypergraph (GAH) and the Feature Similarity Hypergraph (FSH), to capture spatial correlations on neighboring and non-neighboring scales. Extensive experiments on real-world traffic speed datasets demonstrate that our approach achieves state-of-the-art performance compared to baseline methods.

2065STAR: Stability-Inducing Weight Perturbation for Continual Learning

[openreview] [pdf]

Abstract Humans can naturally learn new and varying tasks in a sequential manner. Continual learning is a class of learning algorithms that updates its learned model as it sees new data (on potentially new tasks) in a sequence. A key challenge in continual learning is that as the model is updated to learn new tasks, it becomes susceptible to \textit{catastrophic forgetting}, where knowledge of previously learned tasks is lost. A popular approach to mitigate forgetting during continual learning is to maintain a small buffer of previously-seen samples, and to replay them during training. However, this approach is limited by the small buffer size and, while forgetting is reduced, it is still present. In this paper, we propose a novel loss function STAR that exploits the worst-case parameter perturbation that reduces the KL-divergence of model predictions with that of its local parameter neighborhood to promote stability and alleviate forgetting. STAR can be combined with almost any existing rehearsal-based methods as a plug-and-play component. We empirically show that STAR consistently improves performance of existing methods by up to 15\sim15% across varying baselines, and achieves superior or competitive accuracy to that of state-of-the-art methods aimed at improving rehearsal-based continual learning.

2066Distilling an End-to-End Voice Assistant Without Instruction Training Data

[openreview] [pdf]

Abstract Voice assistants, such as Siri and Google Assistant, typically model audio and text separately, resulting in lost speech information and increased complexity. Recent efforts to address this with end-to-end Speech Large Language Models (LLMs) trained with supervised finetuning (SFT) have led to models ``forgetting" capabilities from text-only LLMs. Our work proposes an alternative paradigm for training Speech LLMs without instruction data, using the response of a text-only LLM to transcripts as self-supervision. Importantly, this process can be performed without annotated responses. We show that our Distilled Voice Assistant (DiVA) generalizes to Spoken Question Answering, Classification, and Translation. Furthermore, we show that DiVA better meets user preferences, achieving a 72% win rate compared with state-of-the-art models like Qwen 2 Audio, despite using >>100x less training compute.

2067Tighter Privacy Auditing of DP-SGD in the Hidden State Threat Model

[openreview] [pdf]

Abstract Machine learning models can be trained with formal privacy guarantees via differentially private optimizers such as DP-SGD. In this work, we focus on a threat model where the adversary has access only to the final model, with no visibility into intermediate updates. In the literature, this ``hidden state’’ threat model exhibits a significant gap between the lower bound from empirical privacy auditing and the theoretical upper bound provided by privacy accounting. To challenge this gap, we propose to audit this threat model with adversaries that craft a gradient sequence designed to maximize the privacy loss of the final model without relying on intermediate updates. Our experiments show that this approach consistently outperforms previous attempts at auditing the hidden state model. Furthermore, our results advance the understanding of achievable privacy guarantees within this threat model. Specifically, when the crafted gradient is inserted at every optimization step, we show that concealing the intermediate model updates in DP-SGD does not amplify privacy. The situation is more complex when the crafted gradient is not inserted at every step: our auditing lower bound matches the privacy upper bound only for an adversarially-chosen loss landscape and a sufficiently large batch size. This suggests that existing privacy upper bounds can be improved in certain regimes.

2068Efficient Online Pruning and Abstraction for Imperfect Information Extensive-Form Games

[openreview] [pdf]

Abstract Efficiently computing approximate equilibrium strategies in large-scale Imperfect Information Extensive-Form Games (IIEFGs) poses significant challenges due to the vast size of the game trees. Pruning and abstraction methods effectively reduce the size of the game tree and enhance computational efficiency. However, seamlessly integrating pruning techniques with variants of Counterfactual Regret Minimization (CFR), a leading method for solving IIEFGs, remains a complex task. Furthermore, existing information abstraction methods often involve high computational costs and may require months of offline pre-computation, limiting their practical applicability. In this paper, we introduce Expected-Value Pruning and Abstraction (EVPA), an online approach that improves efficiency by leveraging expected value estimation within information sets. EVPA consists of three core components: expected value estimation of information sets, expected value-based pruning, and information abstraction for subgames. It estimates the expected value of information sets using Nash equilibrium strategies, employing these estimations for both pruning and abstraction. By integrating Minimax pruning with CFR, EVPA streamlines decision-making by permanently eliminating sub-optimal actions from the game tree prior to CFR application. Additionally, EVPA features an advanced information abstraction mechanism that merges information sets based on both current and future expected values in the subgame, achieving efficient information abstraction within just 1 second. Experiments on HUNL demonstrate that EVPA outperforms DeepStack’s replication and Slumbot, with win-rates of 202±31202\pm 31 and 96±4396\pm 43 mbb/h, respectively. Remarkably, EVPA requires only 1%-2% of the solving time to reach an approximate Nash equilibrium compared to DeepStack’s replication.

2069Understanding Prejudice and Fidelity of Diverge-to-Converge Multi-Agent Systems

[openreview] [pdf]

Abstract Large language model (LLM) agents have demonstrated substantial potential across various tasks, particularly in multi-agent systems. Among these, \textit{Diverge-to-Converge} (D2C) frameworks stand out for their ability to iteratively diversify and converge intermediate thoughts to improve problem-solving. In this paper, we conduct a comprehensive study on the \textit{\textbf{prejudice}} and \textit{\textbf{fidelity}} of typical D2C frameworks, including both model-level and society-level frameworks. \ding{182} In the \textit{prejudice} section, we uncover an inherent \textit{confirmation bias} in D2C systems, which not only leads to suboptimal performance, but also amplifies social biases, such as gender discrimination and political partisanship. Surprisingly, we find that by reframing open-ended problems into binary questions, this bias can be leveraged to foster more equitable and effective agent interactions, ultimately improving performance. \ding{183} In the \textit{fidelity} section, we explore the scaling laws of D2C frameworks at different granularities, revealing that increasing the number of agents enhances performance only when the system is not yet saturated---such as in complex tasks or with weaker agents. In saturated scenarios, however, adding more agents can degrade performance. To facilitate further study, we develop \texttt{APF-Bench}, a benchmark specifically designed to evaluate such inherent weaknesses of D2C frameworks. We hope our findings offer instructional insights into the strengths and limitations of D2C multi-agent systems, offering guidance for developing more robust and effective collaborative AI systems.

2070Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

[openreview] [pdf]

Abstract The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model’s efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.

2071Retraction-free optimization over the Stiefel manifold with application to the LoRA fine-tuning

[openreview] [pdf]

Abstract Optimization over the Stiefel manifold has played a significant role in various machine learning tasks. Many existing algorithms either use the retraction operator to keep each iterate staying on the manifold, or solve an unconstrained quadratic penalized problem. The retraction operator in the former corresponds to orthonormalization of matrices and can be computationally costly for large-scale matrices. The latter approach usually equips with an unknown large penalty parameter. To address the above issues, we propose a retraction-free and penalty parameter-free algorithm, which lands on the manifold. Moreover, our convergence theory allows using constant step size, which improve the result of converging to a neighborhood in \citep{ablin2022fast}. A key component of the analysis is the convex-like property of the quadratic penalty of the Stiefel manifold, which enables us to explicitly characterize the constant penalty parameter. As an application, we introduce a new algorithm, Manifold-LoRA, which employs the landing technique and a carefully designed step size strategy to accelerate low-rank adaptation (LoRA) in fine-tuning large language models. Numerical experiments on the benchmark datasets demonstrate the efficiency of our proposed method.

2072MaskInversion: Localized Embeddings via Optimization of Explainability Maps

[openreview] [pdf]

Abstract Vision-language foundation models such as CLIP have achieved tremendous results in global vision-language alignment, but still show some limitations in creating representations for specific image regions. To address this problem, we propose MaskInversion, a method that leverages the feature representations of pre-trained foundation models, such as CLIP, to generate a context-aware embedding for a query image region specified by a mask at test time. MaskInversion starts with initializing an embedding token and compares its explainability map, derived from the pretrained model, to the query mask. The embedding token is then subsequently refined to approximate the query region by minimizing the discrepancy between its explainability map and the query mask. During this process, only the embedding vector is updated, while the underlying foundation model is kept frozen allowing to use MaskInversion with any pre-trained model. As deriving the explainability map involves computing its gradient, which can be expensive, we propose a gradient decomposition strategy that simplifies this computation. The learned region representation can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation. We evaluate the proposed method on all those tasks on several datasets such as PascalVOC, MSCOCO, RefCOCO, and OpenImagesV7 and show its capabilities compared to other SOTA approaches.

2073Gap-Aware Preference Optimization: Enhancing Model Alignment with Perception Margin

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) approaches are widely used for fine-tuning large language models (LLMs) to align with instructional preferences. However, traditional RLHF methods often rely on binary labels, which fail to capture the pairwise differences in human perception, leading to potential performance degradation. To address this limitation, we introduce Gap-Aware Preference Optimization\textbf{Gap-Aware Preference Optimization} (GaPO), a novel approach that integrates the degree of semantic gaps into preference optimization. By modifying the margin term in the loss function and replacing it with an estimated gap computed using general metrics, GaPO provides a new supervisory signal that explicitly highlights the nuances between preference pairs. This new signal helps the model allocate gradients more rationally during optimization, facilitating more effective learning from the preference data. Experiments conducted with a strong base model, Llama-3-8B-Instruct, demonstrate that GaPO surpasses state-of-the-art methods on widely used benchmarks. Our best-performing model, GaPO-ROUGE_L, achieves a win rate of 52.8% on AlpacaEval 2.0, exceeding the baseline methods by 5.3 points.

2074Explainable Transfer Learning on Graphs Using a Novel Label Frequency Representation

[openreview] [pdf]

Abstract Graphs are characterized by their versatility in representing objects from a wide range of domains, such as social networks or protein structures. This flexibility and power poses a significant challenge for transfer learning between graph domains. Current methods of transfer learning between graph domains tend to focus exclusively on the structure of the underlying graphs, neglecting the characteristics of the nodes and not addressing the difficulties in comparing nodes that represent very dissimilar entities, such as atoms and people for instance. In this paper, we propose a novel universal representation of graphs based on the relative frequency of the node labels. This novel representation enables explainable transfer learning between labeled graphs from different domains for the first time, without the need for additional adaptations. That is, we show that our novel representation can be readily combined with a data alignment technique that in turn allows transfer learning between data from different domains. Experimental results show that knowledge can be acquired from graphs belonging to chemical and biological domains to improve the accuracy of classification models in social network analysis. A comparison with state-of-the-art techniques indicates that our approach outperforms existing non-topological methods and, in some cases, even graph neural networks. In summary, our technique represents a major advance in graph node representation for transfer learning between different domains, opening up new perspectives for future research.

2075Overcoming Catastrophic Forgetting: A Novel Fine-Tuning Method

[openreview] [pdf]

Abstract Despite remarkable advances in Large Language Models (LLMs), a persistent challenge remains: the potential for these models to acquire erroneous or outdated information from their training data. Direct fine-tuning with data containing new knowledge can be ineffective due to conflicts between old and new knowledge. This paper proposes a novel fine-tuning paradigm called Delicate Fine-Tuning (DFT ) that leverages parametric arithmetic to pinpoint the location of knowledge and update only the minimal set of relevant parameters. Experimental results on two publicly available datasets demonstrate that our proposed DFT significantly improves the knowledge updating performance of full fine-tuning, consistently outperforming existing baselines in most cases.

2076Exponential-Wrapped Mechanisms for Differential Privacy on Hadamard Manifolds

[openreview] [pdf]

Abstract We extend the Differential Privacy (DP) framework to Hadamard manifolds, the class of complete and simply connected Riemannian manifolds with non-positive sectional curvature. Inspired by the Cartan–Hadamard theorem, we introduce Exponential-Wrapped Laplace and Gaussian mechanisms to achieve ε\varepsilon-DP, (ε,δ)(\varepsilon, \delta)-DP, Gaussian DP (GDP), and R’enyi DP (RDP) on these manifolds. Our approach employs efficient, straightforward algorithms that circumvent the computationally intensity Monte Carlo Markov Chain (MCMC) methods. This work is the first to extend (ε,δ)(\varepsilon, \delta)-DP, GDP, and RDP to Hadamard manifolds. We further demonstrate the effectiveness of our methodology through simulations on the space of Symmetric Positive Definite Matrices, a frequently used Hadamard manifold in statistics. Our findings reveal that our Exponential-Wrapped mechanisms surpass traditional MCMC-based approaches, which require careful tuning and extensive diagnostics, in both performance and ease of use. Additionally, our methods achieve comparable utility to the Riemannian Laplace mechanism with enhanced utility for smaller privacy budgets (ε\varepsilon) and operate orders of magnitude faster computationally.

2077Weak-to-Strong Generalization Through the Data-Centric Lens

[openreview] [pdf]

Abstract The weak-to-strong generalization phenomenon is the driver for important machine learning applications including highly data-efficient learning and, most recently, performing superalignment. While decades of research have resulted in numerous algorithms that produce strong empirical performance, understanding what aspects of data enable weak-to-strong generalization has been understudied. We propose a simple data-centric mechanism that characterizes weak-to-strong generalization: the overlap density. Intuitively, generalization tracks the number of points that contain overlaps, i.e., both easy patterns (learnable by a weak model) and challenging patterns (only learnable by a stronger model), as with such points, weak predictions can be used to learn challenging patterns by stronger models. And, we provide a practical overlap detection algorithm to find overlap density from data. Finally, we provide an algorithm to learn, among multiple sources of data, which to query when seeking to maximize overlap density and thereby enhance weak-to-strong generalization. We provide a theoretical result showing that the generalization benefit is a function of the overlap density and a regret bound of our data selection algorithm. Empirically, we validate the mechanism and the overlap detection algorithm on a wide array of settings.

[openreview] [pdf]

Abstract Time trends can be classified into intrinsic (real) and measurement (false) trends. There has long been a critical need for techniques to discern them, especially in investment decision-making. In causal discovery, these measurement trends, essentially measurement errors, can significantly impact the performance of algorithms, making it crucial to identify and eliminate them before analysis as well. Recognizing this need, we present a novel algorithm, termed Trend Differentiator (TrendDiff). It is capable of detecting all trend-influenced variables and differentiating between those affected by measurement trends and those displaying intrinsic trends, relying on changing causal module detection and trend-influenced variables’ structural properties, respectively. Extensive experiments on synthetic and real-world data demonstrate the efficacy of this approach.

2079EnsemW2S: Can an Ensemble of LLMs be Leveraged to Obtain a Stronger LLM?

[openreview] [pdf]

Abstract How can we harness the collective capabilities of multiple Large Language Models (LLMs) to create an even more powerful model? This question forms the foundation of our research, where we propose an innovative approach to weak-to-strong (w2s) generalization—a critical problem in AI alignment. Our work introduces an easy-to-hard (e2h) framework for studying the feasibility of w2s generalization, where weak models trained on simpler tasks collaboratively supervise stronger models on more complex tasks. This setup mirrors real-world challenges, where direct human supervision is limited. To achieve this, we develop a novel AdaBoost-inspired ensemble method, demonstrating that an ensemble of weak supervisors can enhance the performance of stronger LLMs across classification and generative tasks on difficult QA datasets. In several cases, our ensemble approach matches the performance of models trained on ground-truth data, establishing a new benchmark for w2s generalization. We observe an improvement of upto 14% over existing baseline and an average improvement of 5% and 4% for binary classification and generation task respectively. This research points to a promising direction for enhancing AI through collective supervision, especially in scenarios where labeled data is sparse or insufficient.

2080COME: Test-time Adaption by Conservatively Minimizing Entropy

[openreview] [pdf]

Abstract Machine learning models must continuously self-adjust themselves for novel data distribution in the open world. As the predominant principle, entropy minimization (EM) has been proven to be a simple yet effective cornerstone in existing test-time adaption (TTA) methods. While unfortunately its fatal limitation (i.e., overconfidence) tends to result in model collapse. For this issue, we propose to conservatively minimize the entropy (COME), which is a simple drop-in replacement of traditional EM to elegantly address the limitation. In essence, COME explicitly models the uncertainty by characterizing a Dirichlet prior distribution over model predictions during TTA. By doing so, COME naturally regularizes the model to favor conservative confidence on unreliable samples. Theoretically, we provide a preliminary analysis to reveal the ability of COME in enhancing the optimization stability by introducing a data-adaptive lower bound on the entropy. Empirically, our method achieves state-of-the-art performance on commonly used benchmarks, showing significant improvements in terms of classification accuracy and uncertainty estimation under various settings including standard, life-long and open-world TTA. Our code is available at: \href{https://anonymous.4open.science/r/anonymous-9F46}{https://anonymous.4open.science/r/anonymous-9F46}.

2081Video Diffusion Models Learn the Structure of the Dynamic World

[openreview] [pdf]

Abstract Diffusion models have demonstrated significant progress in visual perception tasks due to their ability to capture fine-grained, object-centric features through large-scale vision-language pretraining. While their success in image-based tasks is well-established, extending this capability to the domain of video understanding remains a key challenge. In this work, we explore the potential of diffusion models for video understanding by analyzing the feature representations learned by both image- and video-based diffusion models, alongside non-generative, self-supervised approaches. We propose a unified probing framework to evaluate six models across four core video understanding tasks: action recognition, object discovery, scene understanding, and label propagation. Our findings reveal that video diffusion models consistently rank among the top performers, particularly excelling at modeling temporal dynamics and scene structure. This observation not only sets them apart from image-based diffusion models but also opens a new direction for advancing video understanding, offering a fresh alternative to traditional discriminative pre-training objectives. Interestingly, we demonstrate that higher generation performance does not always correlate with improved performance in downstream tasks, highlighting the importance of careful representation selection. Overall, our results suggest that video diffusion models hold substantial promise for video understanding by effectively capturing both spatial and temporal information, positioning them as strong competitors in this evolving domain.

2082Differentially Private Network Training under Hidden State Assumption

[openreview] [pdf]

Abstract We present a novel approach called differentially private stochastic block coordinate descent (DP-SBCD) for training neural networks with provable guarantees of differential privacy under the hidden state assumption. Our methodology regards neural networks as optimization problems and decomposes the training process of the neural network into sub-problems, each corresponding to the training of a specific layer. By doing so, we extend the analysis of differential privacy under the hidden state assumption to encompass non-convex problems and algorithms employing proximal gradient descent. Furthermore, in contrast to existing methods, we adopt a novel approach by utilizing calibrated noise sampled from adaptive distributions, yielding improved empirical trade-offs between utility and privacy.

2083Tackling Feature and Sample Heterogeneity in Decentralized Multi-Task Learning: A Sheaf-Theoretic Approach

[openreview] [pdf]

Abstract Federated multi-task learning (FMTL) aims to simultaneously learn multiple related tasks across clients without sharing sensitive raw data. However, in the decentralized setting, existing FMTL frameworks are limited in their ability to capture complex task relationships and handle feature and sample heterogeneity across clients. To address these challenges, we introduce a novel sheaf-theoretic-based approach for FMTL. By representing client relationships using cellular sheaves, our framework can flexibly model interactions between heterogeneous client models. We formulate the sheaf-based FMTL optimization problem using sheaf Laplacian regularization and propose the Sheaf-FMTL algorithm to solve it. We show that the proposed framework provides a unified view encompassing many existing federated learning (FL) and FMTL approaches. Furthermore, we prove that our proposed algorithm, Sheaf-FMTL, achieves a sublinear convergence rate in line with state-of-the-art decentralized FMTL algorithms. Extensive experiments demonstrate that Sheaf-FMTL exhibits communication savings by sending significantly fewer bits compared to decentralized FMTL baselines.

2084Human-in-the-loop Neural Networks: Human Knowledge Infusion

[openreview] [pdf]

Abstract This study proposes a method for infusing human knowledge into neural networks. The primary objective of this study is to build a mechanism that allows neural networks to learn not only from data but also from humans. This motivation is triggered by the fact that human knowledge, experience, personal preferences, and other subjective characteristics are not necessarily easy to mathematically formulate or present as structured data, hindering them from being learned by neural networks. This study is made possible by a neural network model with a two-dimensional topological hidden representation, Restricted Radial Basis Function (rRBF) network. The hidden layer’s low dimensionality allows humans to visualize the internal representation of the neural network and thus intuitively understand its characteristics. In this study, the topological layer is further utilized to allow humans to organize it considering their subjective similarities criterion for the inputs. Hence, the infusion of human knowledge, experience, and preference occurs during this process that initializes the rRBF. The subsequent learning process of rRBF ensures that the infused knowledge is inherited during and after the learning process, thus generating a unique neural network that benefits from human knowledge. The infusion can be executed in two different stages of neural network training: the initialization before learning and the post-training correction. This study contributes to the new field of human-in-the-loop AI, which aims to allow humans to participate in AI’s learning process or decision-making. Knowledge infusion broadens the scope of human participation in human-in-the-loop AI, usually limited to arranging the training curriculum or participating in the decision-making process. The proposed method is tested against real-world problems of Alzheimer’s detection from MRI images.

2085Provable Convergence of Single-Timescale Neural Actor-Critic in Continuous Spaces

[openreview] [pdf]

Abstract Actor-critic (AC) algorithms have been the powerhouse behind many successful yet challenging applications. However, the theoretical understanding of finite-time convergence in AC’s most practical form remains elusive. Existing research often oversimplifies the algorithm and only considers simple finite state and action spaces. We analyze the more practical single-timescale AC on continuous state and action spaces and use deep neural network approximations for both critic and actor. Our analysis reveals that the iterates of the more practical framework we consider converge towards the stationary point at rate O~(T1/2)+O~(m1/2)\widetilde{\mathcal{O}}(T^{-1/2})+\widetilde{\mathcal{O}}(m^{-1/2}), where TT is the total number of iterations and mm is the width of the deep neural network. To our knowledge, this is the first finite-time analysis of single-timescale AC in continuous state and action spaces, which further narrows the gap between theory and practice.

2086Scaling up Masked Diffusion Models on Text

[openreview] [pdf]

Abstract Masked diffusion models (MDMs) have shown promise in language modeling, yet their scalability and effectiveness in core language tasks, such as conditional generation and language understanding, remain underexplored. This paper establishes the first scaling law for MDMs, demonstrating a scaling rate comparable to autoregressive models (ARMs) and a relatively small compute gap. Motivated by their scalability, we train a family of MDMs with up to 1.1 billion (B) parameters to systematically evaluate their performance against ARMs of comparable or larger sizes. Fully leveraging the probabilistic formulation of MDMs, we propose a simple yet effectiveunsupervised classifier-free guidancethat effectively exploits large-scale unpaired data, boosting performance for conditional inference. In language understanding, a 1.1B MDM shows competitive results, outperforming the larger 1.5B GPT-2 model on four out of eight zero-shot benchmarks. In conditional generation, MDMs provide a flexible trade-off compared to ARMs utilizing KV-cache: MDMs match the performance of ARMs while being 1.5 times faster, or achieve higher quality than ARMs at a slightly higher computational cost. Moreover, MDMs address challenging tasks for ARMs by effectively handling bidirectional reasoning and adapting to temporal shifts in data. Notably, a 1.1B MDM breaks thereverse curseencountered by much larger ARMs with significantly more data and computation, such as Llama (13B) and GPT-3 (175B).

2087Robustness Inspired Graph Backdoor Defense

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have achieved promising results in tasks such as node classification and graph classification. However, recent studies reveal that GNNs are vulnerable to backdoor attacks, posing a significant threat to their real-world adoption. Despite initial efforts to defend against specific graph backdoor attacks, there is no work on defending against various types of backdoor attacks where generated triggers have different properties. Hence, we first empirically verify that prediction variance under edge dropping is a crucial indicator for identifying poisoned nodes. With this observation, we propose using random edge dropping to detect backdoors and theoretically show that it can efficiently distinguish poisoned nodes from clean ones. Furthermore, we introduce a novel robust training strategy to efficiently counteract the impact of the triggers. Extensive experiments on real-world datasets show that our framework can effectively identify poisoned nodes, significantly degrade the attack success rate, and maintain clean accuracy when defending against various types of graph backdoor attacks with different properties. Our code is available at:https://anonymous.4open.science/r/RIGBD-A670.

2088BinaryDM: Accurate Weight Binarization for Efficient Diffusion Models

[openreview] [pdf]

Abstract With the advancement of diffusion models (DMs) and the substantially increased computational requirements, quantization emerges as a practical solution to obtain compact and efficient low-bit DMs. However, the highly discrete representation leads to severe accuracy degradation, hindering the quantization of diffusion models to ultra-low bit-widths. This paper proposes a novel weight binarization approach for DMs, namely BinaryDM, pushing binarized DMs to be accurate and efficient by improving the representation and optimization. From the representation perspective, we present an Evolvable-Basis Binarizer (EBB) to enable a smooth evolution of DMs from full-precision to accurately binarized. EBB enhances information representation in the initial stage through the flexible combination of multiple binary bases and applies regularization to evolve into efficient single-basis binarization. The evolution only occurs in the head and tail of the DM architecture to retain the stability of training. From the optimization perspective, a Low-rank Representation Mimicking (LRM) is applied to assist the optimization of binarized DMs. The LRM mimics the representations of full-precision DMs in low-rank space, alleviating the direction ambiguity of the optimization process caused by fine-grained alignment. Comprehensive experiments demonstrate that BinaryDM achieves significant accuracy and efficiency gains compared to SOTA quantization methods of DMs under ultra-low bit-widths. With 1-bit weight and 4-bit activation (W1A4), BinaryDM achieves as low as 7.74 FID and saves the performance from collapse (baseline FID 10.87). As the first binarization method for diffusion models, W1A4 BinaryDM achieves impressive 15.2x OPs and 29.2x model size savings, showcasing its substantial potential for edge deployment.

2089The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

[openreview] [pdf]

Abstract The performance of large language models (LLMs) in natural language processing (NLP) tasks is significantly influenced by the quality and diversity of data used for supervised fine-tuning (SFT). Current data selection methods often focus solely on quality or diversity, leading to underperforming models due to suboptimal training data. In this paper, we introduce GraphFilter, a novel method that represents the dataset as a bipartite graph, linking sentences to their constituent n-grams. This representation effectively captures the relationships between sentences and linguistic patterns, facilitating the selection of sentences that enhance n-gram diversity. To balance quality and diversity during selection, we propose a priority function that combines the quality metric with the diversity metric in a multiplicative manner. GraphFilter iteratively selects high-priority sentences, updates the bipartite graph by removing covered n-grams, and re-calculates priorities to reflect the evolving data landscape. We conduct extensive experiments using three model backbones across six widely used benchmarks. The results demonstrate that GraphFilter outperforms all nine baseline approaches, achieving superior model performance and computational efficiency. Our analyses validate the effectiveness of our design choices, examine the subsets selected by GraphFilter and other methods, highlight the importance of instruction diversity, and explore the role of quality and diversity in relation to subset sizes. GraphFilter establishes a new foundation for effective data selection strategies, encouraging further research in data selection for LLMs.

2090Learning Nash Equilibria in Normal-Form Games via Approximating Stationary Points

[openreview] [pdf]

Abstract Nash equilibrium (NE) plays an important role in game theory. However, learning an NE in normal-form games (NFGs) is a complex, non-convex optimization problem. Deep Learning (DL), the cornerstone of modern artificial intelligence, has demonstrated remarkable empirical performance across various applications involving non-convex optimization. However, applying DL to learn an NE poses significant difficulties since most existing loss functions for using DL to learn an NE introduce bias under sampled play. A recent work proposed an unbiased loss function. Unfortunately, it suffers from high variance, which degrades the convergence rate. Moreover, learning an NE through this unbiased loss function entails finding a global minimum in a non-convex optimization problem, which is inherently difficult. To mitigate the high variance and reduce the complexity of learning an NE associated with the existing unbiased loss function, we propose a novel loss function, named Nash Advantage Loss (NAL). NAL is unbiased and exhibits significantly lower variance than the existing unbiased loss function. Furthermore, learning an NE by minimizing NAL is more tractable, as an NE is a stationary point of NAL rather than having to be a global minimum. Experimental results demonstrate that the algorithm minimizing NAL achieves significantly faster empirical convergence rates compared to previous algorithms, while also reducing the variance of estimated loss value by several orders of magnitude.

2091SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models

[openreview] [pdf]

Abstract Exploring useful behavior is a keystone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, existing approaches to intrinsic motivation that follow general principles such as information gain, mostly uncover low-level interactions. In contrast, children’s play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as environments already embedded in language or access to high-level actions. To bridge this gap, we propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model- based RL agents with intrinsic motivation for semantically meaningful behavior. To do so, we distill an intrinsic reward signal of interestingness from Vision Language Model (VLM) annotations. The agent learns to predict and maximize these intrinsic rewards using a world model learned directly from intrinsic rewards, image observations, and low-level actions. We show that in both robotic and video game-like simulations SENSEI manages to discover a variety of meaningful behaviors. We believe SENSEI provides a general tool for integrating feedback from foundation models into autonomous agents, a crucial research direction, as openly available VLMs become more powerful.

2092Direct Acquisition Optimization for Low-Budget Active Learning

[openreview] [pdf]

Abstract Active Learning (AL) has gained prominence in integrating data-intensive machine learning (ML) models into domains with limited labeled data. However, its effectiveness diminishes significantly when the labeling budget is low. In this paper, we first empirically observe the performance degradation of existing AL algorithms in the low-budget settings, and then introduce Direct Acquisition Optimization (DAO), a novel AL algorithm that optimizes sample selections based on expected true loss reduction. Specifically, DAO utilizes influence functions to update model parameters and incorporates an additional acquisition strategy to mitigate bias in loss estimation. This approach facilitates a more accurate estimation of the overall error reduction, without extensive computations or reliance on labeled data. Experiments demonstrate DAO’s effectiveness in low budget settings, outperforming state-of-the-arts approaches across seven benchmarks.

2093BlockDance: Reuse Structurally Similar Spatio-Temporal Features to Accelerate Diffusion Transformers

[openreview] [pdf]

Abstract Diffusion models have demonstrated impressive generation capabilities, particularly with recent advancements leveraging transformer architectures to improve both visual and artistic quality. However, Diffusion Transformers (DiTs) continue to encounter challenges related to low inference speed, primarily due to the iterative denoising process. To address this issue, we propose BlockDance, a training-free approach that explores feature similarities at adjacent time steps to accelerate DiTs. Unlike previous feature-reuse methods that lack tailored reuse strategies for features at different scales, BlockDance prioritizes the identification of the most structurally similar features, referred to as Structurally Similar Spatio-Temporal (STSS) features. These features are primarily located within the structure-focused blocks of the transformer during the later stages of denoising. BlockDance caches and reuses these highly similar features to mitigate redundant computation, thereby accelerating DiTs while maximizing consistency with the generated results of the original model. Furthermore, considering the diversity of generated content and the varying distributions of redundant features, we introduce BlockDance-Ada, a lightweight decision-making network tailored for instance-specific acceleration. BlockDance-Ada dynamically allocates resources and provides superior content quality. Both BlockDance and BlockDance-Ada have demonstrated effectiveness across diverse generation tasks and models, achieving an acceleration ranging from 25% to 50% while preserving generation quality.

2094Learning Dynamics of LLM Finetuning

[openreview] [pdf]

Abstract Learning dynamics, which describes how the learning of specific training examples influences the model’s predictions on other examples, gives us a powerful tool for understanding the behavior of deep learning systems. We study the learning dynamics of large language models during different types of finetuning, by analyzing the step-wise decomposition of how influence accumulates among different potential responses. Our framework allows a uniform interpretation of many interesting observations about the training of popular algorithms for both instruction tuning and preference tuning. In particular, we propose a hypothetical explanation of why specific types of hallucination are strengthened after finetuning, e.g., the model might use phrases or facts in the response for question B to answer question A, or the model might keep repeating similar simple phrases when generating responses. We also extend our framework and highlight a unique ``squeezing effect’’ to explain a previously observed phenomenon in off-policy direct preference optimization (DPO), where running DPO for too long makes even the desired outputs less likely. This framework also provides insights into where the benefits of on-policy DPO and other variants come from. The analysis not only provides a novel perspective of understanding LLM’s finetuning but also inspires a simple, effective method to improve alignment performance.

2095Strong Preferences Affect the Robustness of Value Alignment

[openreview] [pdf]

Abstract Value alignment, which aims to ensure that large language models (LLMs) and other AI agents behave in accordance with human values, is critical for ensuring safety and trustworthiness of these systems. A key component of value alignment is the modeling of human preferences as a representation of human values. In this paper, we investigate the robustness of value alignment by examining the sensitivity of preference models. Specifically, we ask: how do changes in the probabilities of some preferences affect the predictions of these models for other preferences? To answer this question, we theoretically analyze the robustness of widely used preference models by examining their sensitivities to minor changes in preferences they model. Our findings reveal that, in the Bradley-Terry and the Placket-Luce model, the probability of a preference can change significantly as other preferences change, especially when these preferences are dominant (i.e., with probabilities near zero or one). We identify specific conditions where this sensitivity becomes significant for these models and discuss the practical implications for the robustness and safety of value alignment in AI systems.

2096Reinforcement learning with combinatorial actions for coupled restless bandits

[openreview] [pdf]

Abstract Reinforcement learning (RL) has increasingly been applied to solve real-world planning problems, with progress in handling large state spaces and time horizons. However, a key bottleneck in many domains is that RL methods cannot accommodate large, combinatorially structured action spaces. In such settings, even representing the set of feasible actions at a single step may require a complex discrete optimization formulation. We leverage recent advances in embedding trained neural networks into optimization problems to propose SEQUOIA, an RL algorithm that directly optimizes for long-term reward over the feasible action space. Our approach embeds a Q-network into a mixed-integer program to select a combinatorial action in each timestep. Here, we focus on planning over restless bandits, a class of planning problems which capture many real-world examples of sequential decision making. We introduce coRMAB, a broader class of restless bandits with combinatorial actions that cannot be decoupled across the arms of the restless bandit, requiring direct solving over the joint, exponentially large action space. We empirically validate SEQUOIA on four novel restless bandit problems with combinatorial constraints: multiple interventions, path constraints, bipartite matching, and capacity constraints. Our approach significantly outperforms existing methods—which cannot address sequential planning and combinatorial selection simultaneously—by an average of 28.3% on these difficult instances.

2097TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

[openreview] [pdf]

Abstract Causal language models have demonstrated remarkable capabilities, but their size poses significant challenges for deployment in resource-constrained environments. Knowledge distillation, a widely-used technique for transferring knowledge from a large teacher model to a small student model, presents a promising approach for model compression. A significant remaining issue lies in the major differences between teacher and student models, namely the substantial capacity gap, mode averaging, and mode collapse, which pose barriers during distillation. To address these issues, we introduce Temporally Adaptive Interpolated Distillation (TAID)\textit{Temporally Adaptive Interpolated Distillation (TAID)}, a novel knowledge distillation approach that dynamically interpolates student and teacher distributions through an adaptive intermediate distribution, gradually shifting from the student’s initial distribution towards the teacher’s distribution. We provide a theoretical analysis demonstrating TAID’s ability to prevent mode collapse and empirically show its effectiveness in addressing the capacity gap while balancing mode averaging and mode collapse. Our comprehensive experiments demonstrate TAID’s superior performance across various model sizes and architectures in both instruction tuning and pre-training scenarios. Furthermore, we showcase TAID’s practical impact by developing two state-of-the-art compact foundation models: TAID-LLM-1.5B\texttt{TAID-LLM-1.5B} for language tasks and TAID-VLM-2B\texttt{TAID-VLM-2B} for vision-language tasks. These results demonstrate TAID’s effectiveness in creating high-performing and efficient models, advancing the development of more accessible AI technologies.

2098MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. Typically, an LLM is first pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may forget some knowledge acquired in the pre-training stage, leading to a decline in general capabilities. To address this challenge, we propose a new fine-tuning algorithm termed Momentum-Filtered Optimizer (MoFO). As an extension of greedy block coordinate descent (BCD) methods, MoFO iteratively selects and updates the model parameters with the largest momentum magnitudes. MoFO achieves similar fine-tuning performance to the default fine-tuning algorithm while effectively mitigating knowledge forgetting. Furthermore, MoFO does not require access to pre-training data, making it highly suitable for scenarios where the pre-training data is unavailable, such as fine-tuning checkpoint-only open-source LLMs. We validate MoFO through rigorous convergence analysis and extensive experiments, demonstrating its superiority over existing methods in mitigating forgetting.

2099DiffVAS: Diffusion-Guided Visual Active Search in Partially Observable Environments

[openreview] [pdf]

Abstract Visual active search (VAS) has been introduced as a modeling framework that leverages visual cues to direct aerial (e.g., UAV-based) exploration and pinpoint areas of interest within extensive geospatial regions. Potential applications of VAS include detecting hotspots for rare wildlife poaching, aiding in search-and-rescue missions, and uncovering illegal trafficking of weapons, among other uses. Previous VAS approaches assume that the entire search space is known upfront, which is often unrealistic due to constraints such as a restricted field of view and high acquisition costs, and they typically learn policies tailored to specific target objects, which limits their ability to search for multiple target categories simultaneously. In this work, we propose DiffVAS, a target-conditioned policy that searches for diverse objects simultaneously according to task requirements in partially observable environments, which advances the deployment of visual active search policies in real-world applications. DiffVAS uses a diffusion model to reconstruct the entire geospatial area from sequentially observed partial glimpses, which enables a target-conditioned reinforcement learning-based planning module to effectively reason and guide subsequent search steps. Our extensive experiments demonstrate that DiffVAS excels in searching diverse objects in partially observable environments, significantly surpassing state-of-the-art methods across datasets.

2100Text-to-graph Generation with Conditional Diffusion Models Guided by Graph-aligned LLMs

[openreview] [pdf]

Abstract Text-to-graph generation, aiming for controlled graph generation based on natural language instructions, holds significant application potentials in real-world scenarios such as drug discoveries. However, existing generative models fail to achieve text-to-graph generation in the following two aspects: i) language model-based generative models struggle with generating complex graph structures, and ii) graph-based generative models mainly focus on unconditional graph generation, falling short in understanding as well as following human instructions. In this paper, we tackle the text-to-graph generation problem by employing graph diffusion models with guidance from large language models (LLMs) for the first time, to the best of our knowledge. The problem is highly non-trivial with the following challenges: 1) How to align LLMs for understanding the irregular graph structures and the graph properties hidden in human instructions, 2) How to align graph diffusion models for following natural language instructions in order to generate graphs with expected relational semantics from human. To address these challenges, we propose a novel LLM-aligned Graph Diffusion Model (LLM-GDM), which is able to generate graphs based on natural language instructions. In particular, we first propose the self-supervised text-graph alignment to empower LLMs with the ability to accurately understand graph structures and properties by finetuning LLMs with several specially designed alignment tasks involving various graph components such as nodes, edges, and subgraphs. Then, we propose a structure-aware cross-attention mechanism guiding the diffusion model to follow human instructions through inherently capturing the relational semantics among texts and structures. Extensive experiments on both synthetic and real-world molecular datasets demonstrate the effectiveness of our proposed LLM-GDM model over existing baseline methods.

2101Sample-efficient Imitative Multi-token Decision Transformer for Real-world Driving

[openreview] [pdf]

Abstract Recent advancements in autonomous driving technologies involve the capability to effectively process and learn from extensive real-world driving data. Current imitation learning and offline reinforcement learning methods have shown remarkable promise in autonomous systems, harnessing the power of offline datasets to make informed decisions in open-loop (non-reactive agents) settings. However, learning-based agents face significant challenges when transferring knowledge from open-loop to closed-loop (reactive agents) environment. The performance is significantly impacted by data distribution shift, sample efficiency, the complexity of uncovering hidden world models and physics. To address these issues, we propose Sample-efficient Imitative Multi-token Decision Transformer (SimDT). SimDT introduces multi-token prediction, online imitative learning pipeline and prioritized experience replay to sequence-modelling reinforcement learning. The performance is evaluated through empirical experiments and results exceed popular imitation and reinforcement learning algorithms both in open-loop and closed-loop settings on Waymax benchmark. SimDT exhibits 41% reduction in collision rate and 18% improvement in reaching the destination compared with the baseline method.

2102How Social is It? A Benchmark for LLMs’ Capabilities in Multi-user Multi-turn Social Agent Tasks

[openreview] [pdf]

Abstract Expanding the application of large language models (LLMs) to societal life, instead of primary function only as auxiliary assistants to communicate with only one person at a time, necessitates LLMs’ capabilities to independently play roles in multi-user, multi-turn social agent tasks within complex social settings. However, currently the capability has not been systematically measured with available benchmarks. To address this gap, we first introduce an agent task leveling framework grounded in sociological principles. Concurrently, we propose a novel benchmark, How Social Is It (we call it HSII below), designed to assess LLM’s social capabilities in comprehensive social agents tasks and benchmark representative models. HSII comprises four stages: format parsing, target selection, target switching conversation, and stable conversation, which collectively evaluate the communication and task completion capabilities of LLMs within realistic social interaction scenarios dataset, HSII-Dataset. The dataset is derived step by step from news dataset. We perform an ablation study by doing clustering to the dataset. Additionally, we investigate the impact of chain of thought (COT) method on enhancing LLMs’ social performance. Since COT cost more computation, we further introduce a new statistical metric, COT-complexity, to quantify the efficiency of certain LLMs with COTs for specific social tasks and strike a better trade-off between measurement of correctness and efficiency. Various results of our experiments demonstrate that our benchmark is well-suited for evaluating social skills in LLMs.

2103ReAttention: Training-Free Infinite Context with Finite Attention Scope

[openreview] [pdf]

Abstract The long-context capability of the Large Language Models (LLM) has made significant breakthroughs, but the maximum supported context length remains a critical bottleneck limiting their practical applications. The constraint of context length in LLMs arises from the self-attention mechanism, which cannot effectively and efficiently capture the semantic relationships within infinitely long contexts via the limited pre-trained positional information and attention scope. In this work, we propose \textbf{ReAttention}, a training-free approach enabling LLM based on the self-attention mechanism to support an infinite context with a finite attention scope under sufficient memory resources. ReAttention performs the position-agnostic top-kk attention before the ordinary position-aware self-attention, freeing LLMs from the length extrapolation issue. We validate the performance of ReAttention on the LongBench, L-Eval, and InfiniteBench and demonstrate that it is on par with traditional methods. Furthermore, we also apply ReAttention on mainstream LLMs, including LLaMA3.1-8B and Mistral-v0.3-7B, enabling them to support context lengths of at least 1M and even expanding the context length of LLaMA3.2-3B-chat by 128×\times to 4M without any further training in Needle-In-A-Haystack tests. We also improve the efficiency of ReAttention with Triton and achieve an efficient extrapolation without additional overhead.

2104Bypassing Skip-Gram Negative Sampling: Dimension Regularization as a More Efficient Alternative for Graph Embeddings

[openreview] [pdf]

Abstract A wide range of graph embedding objectives decompose into two components: one that attracts the embeddings of nodes that are perceived as similar, and another that repels embeddings of nodes that are perceived as dissimilar. Without repulsion, the embeddings would collapse into trivial solutions. Skip-Gram Negative Sampling (SGNS) is a popular and efficient repulsion approach that prevents collapse by repelling each node from a sample of dissimilar nodes. In this work, we show that when repulsion is most needed and the embeddings approach collapse, SGNS node-wise repulsion is, in the aggregate, an approximate re-centering of the node embedding dimensions. Such dimension operations are much more scalable than node operations and yield a simpler geometric interpretation of the repulsion. Our result extends findings from self-supervised learning to the skip-gram model, establishing a connection between skip-gram node contrast and dimension regularization. We use this observation to propose a flexible algorithm augmentation framework that improves the scalability of any existing algorithm using SGNS. The framework prioritizes node attraction and replaces SGNS with dimension regularization. We instantiate this generic framework for LINE and node2vec and show that the augmented algorithms preserve downstream link-prediction performance while reducing GPU memory usage by up to 33.3% and training time by 22.1%. Further, for graphs that are globally sparse but locally dense, we show that removing repulsion altogether can improve performance, but, when repulsion is otherwise needed, dimension regularization provides an effective and efficient alternative to SGNS.

2105Implicit Bias of Mirror Descent for Shallow Neural Networks in Univariate Regression

[openreview] [pdf]

Abstract We examine the implicit bias of mirror flow in univariate least squares error regression with wide and shallow neural networks. For a broad class of potential functions, we show that mirror flow exhibits lazy training and has the same implicit bias as ordinary gradient flow when the network width tends to infinity. For ReLU networks, we characterize this bias through a variational problem in function space. Our analysis includes prior results for ordinary gradient flow as a special case and lifts limitations which required either an intractable adjustment of the training data or networks with skip connections. We further introducescaled potentialsand show that for these, mirror flow still exhibits lazy training but is not in the kernel regime. For networks with absolute value activations, we show that mirror flow with scaled potentials induces a rich class of biases, which generally cannot be captured by an RKHS norm. A takeaway is that whereas the parameter initialization determines how strongly the curvature of the learned function is penalized at different locations of the input space, the scaled potential determines how the different magnitudes of the curvature are penalized.

2106Exploring Representations and Interventions in Time Series Foundation Models

[openreview] [pdf]

Abstract Time series foundation models promise to be powerful tools for a wide range of applications. However, their internal representations and learned concepts are still not well understood. In this study, we investigate the structure and redundancy of representations across various TSFMs, examining the self-similarity of model layers within and across different model sizes. This analysis reveals block-like redundancy in the representations, which can be utilized for informed pruning to improve inference speed and efficiency. Additionally, we explore the concepts learned by these models—such as periodicity and trends—and how these can be manipulated through latent space steering to influence model behavior. Our experiments show that steering interventions can introduce new features, like adding periodicity or trends to signals that initially lacked them. These findings underscore the value of representational analysis for optimizing models and demonstrate how conceptual steering offers new possibilities for more controlled time series modeling.

2107A bird’s eye view on informed classification

[openreview] [pdf]

Abstract Neurosymbolic AI is a growing field of research aiming to combine neural network learning capabilities with the reasoning abilities of symbolic systems. In this paper, we tackle informed classification tasks, i.e. multi-label classification tasks informed by prior knowledge that specifies which combinations of labels are semantically valid. Several neurosymbolic formalisms and techniques have been introduced in the literature, each relying on a particular language to represent prior knowledge. We take a bird’s eye view on informed classification and introduce a unified formalism that encapsulates all knowledge representation languages. Then, we build upon this formalism to identify several concepts in probabilistic reasoning that are at the core of many techniques across representation languages. We also define a new technique called semantic conditioning at inference, which only constrains the system during inference while leaving the training unaffected, an interesting property in the era of off-the-shelves and foundation models. We discuss its theoritical and practical advantages over two other probabilistic neurosymbolic techniques: semantic conditioning and semantic regularization. We then evaluate experimentally and compare the benefits of all three techniques on several large-scale datasets. Our results show that, despite only working at inference, our technique can efficiently leverage prior knowledge to build more accurate neural-based systems.

2108Cross-modal Mitigation of Spurious Correlation for Prompt-tuning in VLMs with Causally Motivated Logic Alignment

[openreview] [pdf]

Abstract Recent studies have shown that pre-trained vision-language models can effectively adapt to diverse downstream tasks through parameter-efficient prompt tuning. Unfortunately, the tuned models can exploit spurious correlations during prediction, resulting in a failure to generalize to out-of-distribution test data, especially when the tuning dataset exhibits bias. How to achieve cross-modal mitigation of spurious correlations during prompt tuning of vision-language models remains an open question. In this paper, the challenging problem is tackled by leveraging the stable relationship between necessary and sufficient causal features and the corresponding label. On the one hand, we constrain the learning process of prompt by reinforcing the necessary and sufficient connection between the textual labels and textual features. On the other hand, the probability of necessity and sufficiency between the textual features and the filtered visual features is measured and maximized to enhance cross-modal feature alignment. By iteratively optimizing these two objectives, we can achieve cross-modal mitigation of spurious correlations because the logic equivalence between textual labels and visual features is bolstered. The theoretical analysis on generalization error indicates that our method can achieve a tighter generalization error bound than existing approaches. We evaluate the proposed method on several commonly adopted out-of-distribution datasets, and the empirical results demonstrate the superiority of our method over the state-of-the-art competitors.

2109Achieving Optimal Complexity in Decentralized Learning over Row-Stochastic Networks

[openreview] [pdf]

Abstract A key challenge in decentralized optimization is determining the optimal convergence rate and designing algorithms that can achieve it. While this issue has been thoroughly addressed for doubly-stochastic and column-stochastic mixing matrices, the row-stochastic setting remains largely unexplored. This study establishes the first convergence lower bound for decentralized learning over row-stochastic networks. However, developing algorithms to achieve this lower bound is highly challenging due to several factors: (i) the widely used Row-Only gossip protocol, Pull-Diag, suffers from significant instability in achieving average consensus; (ii) Pull-Diag-based algorithms are sensitive to data heterogeneity; and (iii) there has been no analysis in nonconvex and stochastic settings to date. This work addresses these deficiencies by proposing and analyzing a new gossip protocol called Pull-Sum, along with its gradient tracking extension, Pull-Sum-GT. The Pull-Sum protocol mitigates the instability issues of Pull-Diag, while Pull-Sum-GT achieves the first linear speedup convergence rate without relying on data heterogeneity assumptions. Additionally, we introduce a multi-step strategy that enables Pull-Sum-GT to match the established lower bound up to logarithmic factors, demonstrating its near-optimal performance and the tightness of our established lower bound. Experiments validate our theoretical results.

2110CoMRes: Semi-Supervised Time Series Forecasting Utilizing Consensus Promotion of Multi-Resolution

[openreview] [pdf]

Abstract Long-term time series forecasting poses significant challenges due to the complex dynamics and temporal variations, particularly when dealing with unseen patterns and data scarcity. Traditional supervised learning approaches, which rely on cleaned, labeled data, struggle to capture these unseen characteristics, limiting their effectiveness in real-world applications. In this study, we propose a semi-supervised approach that leverages multi-view setting to address these limitations. By introducing a consensus promotion framework, we enhance agreement among multiple single-view models using unseen augmented data. This approach not only improves forecasting accuracy but also reduces error accumulation in long-horizon predictions. Furthermore, we explore the impact of autoregressive and non-autoregressive decoding schemes on error propagation, demonstrating the robustness of our model in extending prediction horizons. Experimental results demonstrate that our proposed method not only surpasses traditional supervised models in accuracy but also exhibits greater robustness when extending the prediction horizon.

2111When can isotropy help adapt LLMs’ next word prediction to numerical domains?

[openreview] [pdf]

Abstract Recent studies have shown that vector representations of embeddings learned by pre-trained large language models (LLMs) are effective in various downstream tasks in numerical domains. Despite their significant benefits, the tendency of LLMs to hallucinate in such domains can have severe consequences in applications like finance, energy, retail, climate science, wireless networks, synthetic tabular generation, among others. To guarantee prediction reliability and accuracy in numerical domains, it is necessary to have performance guarantees through explainability. However, there is little theoretical understanding of when pre-trained language models help solve numeric downstream tasks. This paper seeks to bridge this gap by understanding when the next-word prediction capability of LLMs can be adapted to numerical domains through the lens of isotropy. Specifically, we first provide a general numeric data generation process that captures the core characteristics of numeric data across various numerical domains. Then, we consider a log-linear model for LLMs in which numeric data can be predicted from its context through a network with softmax as its last layer. We demonstrate that, in order to achieve state-of-the-art performance in numerical domains, the hidden representations of the LLM embeddings must possess a structure that accounts for the shift-invariance of the softmax function. We show how the isotropic property of LLM embeddings preserves the underlying structure of representations, thereby resolving the shift-invariance problem problem of softmax function. In other words, isotropy allows numeric downstream tasks to effectively leverage pre-trained representations, thus providing performance guarantees in the numerical domain. Experiments show that different characteristics of numeric data could have different impacts on isotropy.

2112Revisiting Generative Policies: A Simpler Reinforcement Learning Algorithmic Perspective

[openreview] [pdf]

Abstract Generative models, particularly diffusion models, have achieved remarkable success in density estimation for multimodal data, drawing significant interest from the reinforcement learning (RL) community, especially in policy modeling in continuous action spaces. However, existing works exhibit significant variations in training schemes and RL optimization objectives, and some methods are only applicable to diffusion models. In this study, we compare and analyze various generative policy training and deployment techniques, identifying and validating effective designs for generative policy algorithms. Specifically, we revisit existing training objectives and classify them into two categories, each linked to a simpler approach. The first approach, Generative Model Policy Optimization (GMPO), employs a native advantage-weighted regression formulation as the training objective, which is significantly simpler than previous methods. The second approach, Generative Model Policy Gradient (GMPG), offers a numerically stable implementation of the native policy gradient method. We introduce a standardized experimental framework named GenerativeRL. Our experiments demonstrate that the proposed methods achieve state-of-the-art performance on various offline-RL datasets, offering a unified and practical guideline for training and deploying generative policies.

2113On the Learn-to-Optimize Capabilities of Transformers in In-Context Sparse Recovery

[openreview] [pdf]

Abstract An intriguing property of the Transformer is its ability to perform in-context learning (ICL), where the Transformer can solve different inference tasks without parameter updating based on the contextual information provided by the corresponding input-output demonstration pairs. It has been theoretically proved that ICL is enabled by the capability of Transformers to perform gradient-descent algorithms (Von Oswald et al., 2023a; Bai et al., 2024). This work takes a step further and shows that Transformers can perform learning-to-optimize (L2O) algorithms. Specifically, for the ICL sparse recovery (formulated as LASSO) tasks, we show that a K-layer Transformer can perform an L2O algorithm with a provable convergence rate linear in K. This provides a new perspective explaining the superior ICL capability of Transformers, even with only a few layers, which cannot be achieved by the standard gradient-descent algorithms. Moreover, unlike the conventional L2O algorithms that require the measurement matrix involved in training to match that in testing, the trained Transformer is able to solve sparse recovery problems generated with different measurement matrices. Besides, Transformers as an L2O algorithm can leverage structural information embedded in the training tasks to accelerate its convergence during ICL, and generalize across different lengths of demonstration pairs, where conventional L2O algorithms typically struggle or fail. Such theoretical findings are supported by our experimental results.

2114Classic but Everlasting: Traditional Gradient-Based Algorithms Converges Fast Even in Time-Varying Multi-Player Games

[openreview] [pdf]

Abstract Last-iterate convergence behaviours of well-known algorithms are intensively investigated in various games, such as two-player bilinear zero-sum games. However, most known last-iterate convergence properties rely on strict settings where the underlying games must have time-invariant payoffs. Besides, the limited known attempts on the games with time-varying payoffs are in two-player bilinear time-varying zero-sum games and strictly monotone games. By contrast, in other time-varying games, the last-iterate behaviours of two classic algorithms, i.e., optimistic gradient (OG) and extra gradient (EG) algorithms, still lack research, especially the convergence rates in multi-player games. In this paper, we investigate the last-iterate behaviours of OG and EG algorithms for convergent perturbed games, which extend upon the usual model of time-invariant games and incorporate external factors, such as vanishing noises. Using the recently proposed notion of the tangent residual (or its modifications) as the potential function of games and the measure of proximity to the Nash equilibrium, we prove that the last-iterate convergence rates of EG and OG algorithms for perturbed games on bounded convex closed sets are O(1/T)O({1}/{\sqrt{T}}) if such games converge to monotone games at rates fast enough and that such a result holds true for certain unconstrained perturbed games. With this result, we address an open question asking for the last-iterate convergence rate of the extra gradient and the optimistic gradient algorithms in constrained and time-varying settings. The above convergence rates are similar to known tight results on corresponding time-invariant games.

2115Interpolating Autoregressive and Discrete Denoising Diffusion Language Models

[openreview] [pdf]

Abstract Diffusion language models offer unique benefits over autoregressive (AR) models due to their potential for parallelized generation and controllability, yet they lag in likelihood modeling and are limited to fixed-length generation. In this work, we introduce a class of semi-autoregressive (SAR) diffusion models that interpolate between discrete denoising diffusion and autoregressive models. We propose a recipe for building effective SAR models that includes an efficient training algorithm, estimators of gradient variance, and data-driven noise schedules to minimize the variance. SAR models overcome key limitations of diffusion language models, setting a new state-of-the-art performance on language modeling benchmarks and enabling generation of arbitrary-length sequences.

2116UNIFYING LONG AND SHORT SPATIO-TEMPORAL FORECASTING WITH SPECTRAL GRAPH NEURAL NETWORKS

[openreview] [pdf]

Abstract Multivariate Time Series (MTS) forecasting plays a vital role in various practical applications. Current research in this area is categorized into Spatial-Temporal Forecasting (STF) and Long-term Time Series Forecasting (LTSF). While these tasks share similarities, the methods and benchmarks used differ significantly. Spatio-Temporal Graph Neural Networks (STGNNs) excel at modeling interrelationships in STF tasks but face difficulties with long sequence inputs due to inefficient training. In contrast, LTSF models handle long sequences well but struggle with capturing complex variable interrelationships. This paper proposes the Spectral Spatio-Temporal Graph Neural Network (S2GNN) to address these challenges, unifying short- and long-sequence spatio-temporal forecasting within a single framework. S2GNN leverages spectral GNNs for global feature extraction incorporates an adaptive graph structure to manage varying sequence lengths and adopts a decoupled framework to improve scalability. Additionally, we introduce scale-adaptive node embeddings and cross-correlation embeddings for better differentiation between similar temporal patterns. Extensive experiments on eight public datasets, including both STF and LTSF datasets, demonstrate that S2GNN consistently outperforms state-of-the-art models across diverse prediction tasks. Code is available at \url{https://anonymous.4open.science/r/S2GNN-B21D}.

2117Exploring Learning Complexity for Efficient Downstream Dataset Pruning

[openreview] [pdf]

Abstract The ever-increasing fine-tuning cost of large-scale pre-trained models gives rise to the importance of dataset pruning, which aims to reduce dataset size while maintaining task performance. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC), to identify informative images and instructions from the downstream dataset efficiently. Our method is motivated by the observation that easy samples learned faster can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and utilize a lightweight weights masking process for fast estimation, instead of the costly SGD optimization. Based on DLC, we further design a flexible under-sampling with randomness (dubbed FlexRand), replacing the top-K strategy, to alleviate the severe subset distribution shift. Extensive experiments with downstream image and instructions dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. In the images pruning benchmark, DLC significantly reduces the pruning time by 35×\times while establishing state-of-the-art performance with FlexRand.

2118Diffusion Implicit Policy for Unpaired Scene-aware Motion Synthesis

[openreview] [pdf]

Abstract Human motion generation is a long-standing problem, and scene-aware motion synthesis has been widely researched recently due to its numerous applications. Prevailing methods rely heavily on paired motion-scene data whose quantity is limited. Meanwhile, it is difficult to generalize to diverse scenes when trained only on a few specific ones. Thus, we propose a unified framework, termed Diffusion Implicit Policy (DIP), for scene-aware motion synthesis, where paired motion-scene data are no longer necessary. In this framework, we disentangle human-scene interaction from motion synthesis during training and then introduce an interaction-based implicit policy into motion diffusion during inference. Synthesized motion can be derived through iterative diffusion denoising and implicit policy optimization, thus motion naturalness and interaction plausibility can be maintained simultaneously. The proposed implicit policy optimizes the intermediate noised motion in a GAN Inversion manner to maintain motion continuity and control keyframe poses though the ControlNet branch and motion inpainting. For long-term motion synthesis, we introduce motion blending for stable transitions between multiple sub-tasks, where motions are fused in rotation power space and translation linear space. The proposed method is evaluated on synthesized scenes with ShapeNet furniture, and real scenes from PROX and Replica. Results show that our framework presents better motion naturalness and interaction plausibility than cutting-edge methods. This also indicates the feasibility of utilizing the DIP for motion synthesis in more general tasks and versatile scenes.

2119Lean-STaR: Learning to Interleave Thinking and Proving

[openreview] [pdf]

Abstract Traditional language model-based theorem proving assumes that by training on a sufficient amount of formal proof data, a model will learn to prove theorems. Our key observation is that a wealth of informal information that is not present in formal proofs can be useful for learning to prove theorems. For instance, humans think through steps of a proof, but this thought process is not visible in the resulting code. We present Lean-STaR, a framework for training language models to produce informal thoughts prior to each step of a proof, thereby boosting the model’s theorem-proving capabilities. Lean-STaR uses retrospective ground-truth tactics to generate synthetic thoughts for training the language model. At inference time, the trained model directly generates the thoughts prior to the prediction of the tactics in each proof step. Building on the self-taught reasoner framework, we then apply expert iteration to further fine-tune the model on the correct proofs it samples and verifies using the Lean solver. Lean-STaR significantly outperform base models (43.4% → 46.3%, Pass@64). We also analyze the impact of the augmented thoughts on various aspects of the theorem proving process, providing insights into their effectiveness.

2120D-FINE: Redefine Regression Task of DETRs as Fine-grained Distribution Refinement

[openreview] [pdf]

Abstract We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, which serve as a fine-grained intermediate representation, significantly enhancing localization accuracy. GO-LSD is a bidirectional optimization strategy that utilizes the model’s own refined distributions to enhance earlier layers through self-distillation, while simplifying the prediction task for subsequent layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 129 / 81 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 56.9% / 59.0% AP at 81 FPS, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and models will be made publicly available.

2121Inference time LLM alignment in single and multidomain preference spectrum

[openreview] [pdf]

Abstract Aligning Large Language Models (LLM) to address subjectivity and nuanced preference levels requires adequate flexibility and control, which can be a resource-intensive and time-consuming procedure. Existing training-time alignment methods require full re-training when a change is needed and inference-time ones typically require access to the reward model at each inference step. To address these limitations, we introduce an inference-time model alignment method that learns encoded representations of preference dimensions, called Alignment Vectors (AV). These representations are computed by subtracting the base model from the aligned model as in model editing enabling dynamically adjusting the model behavior during inference through simple linear operations. Even though the preference dimensions can span various granularity levels, here we focus on three gradual response levels across three specialized domains: medical, legal, and financial, exemplifying its practical potential. This new alignment paradigm introduces adjustable preference knobs during inference, allowing users to tailor their LLM outputs while reducing the inference cost by half compared to the prompt engineering approach. Additionally, we find that AVs are transferable across different fine-tuning stages of the same model, demonstrating their flexibility. AVs also facilitate multidomain, diverse preference alignment, making the process 12x faster than the retraining approach.

2122Stochastic Bandits Robust to Adversarial Attacks

[openreview] [pdf]

Abstract This paper investigates stochastic multi-armed bandit algorithms that are robust to adversarial attacks, where an attacker can first observe the learner’s action andthenalter their reward observation. We study two cases of this model, with or without the knowledge of an attack budget CC, defined as an upper bound of the summation of the difference between the actual and altered rewards. For both cases, we devise two types of algorithms with regret bounds having additive or multiplicative CC dependence terms. For the known attack budget case, we prove our algorithms achieve the regret bound of O((K/Δ)logT+KC){O}((K/\Delta)\log T + KC) and O~(KTC)\tilde{O}(\sqrt{KTC}) for the additive and multiplicative CC terms, respectively, where KK is the number of arms, TT is the time horizon, Δ\Delta is the gap between the expected rewards of the optimal arm and the second-best arm, and O~\tilde{O} hides the logarithmic factors. For the unknown case, we prove our algorithms achieve the regret bound of O~(KT+KC2)\tilde{O}(\sqrt{KT} + KC^2) and O~(KCT)\tilde{O}(KC\sqrt{T}) for the additive and multiplicative CC terms, respectively. In addition to these upper bound results, we provide several lower bounds showing the tightness of our bounds and the optimality of our algorithms. These results delineate an intrinsic separation between the bandits with attacks and corruption models.

2123A Finite-Time Analysis of Distributed Q-Learning

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL) has witnessed a remarkable surge in interest, fueled by the empirical success achieved in applications of single-agent reinforcement learning (RL). In this study, we consider a distributed Q-learning scenario, wherein a number of agents cooperatively solve a sequential decision making problem without access to the central reward function which is an average of the local rewards. In particular, we study finite-time analysis of a distributed Q-learning algorithm, and provide a new sample complexity result of O~(max{1ϵ2tmix(1γ)6dmin4,1ϵSA(1σ2(W))(1γ)4dmin3})\tilde{\mathcal{O}}\left( \max\left\{ \frac{1}{\epsilon^2}\frac{t_{\text{mix}}}{(1-\gamma)^6 d_{\min}^4 } ,\frac{1}{\epsilon}\frac{\sqrt{|\mathcal{S}||\mathcal{A}|}}{(1-\sigma_2(\boldsymbol{W}))(1-\gamma)^4 d_{\min}^3} \right\} \right) under tabular lookup setting for Markovian observation model.

2124Balanced Ranking with Relative Centrality: A multi-core periphery perspective

[openreview] [pdf]

Abstract Ranking of vertices in a graph for different objectives is one of the most fundamental tasks in computer science. It is known that traditional ranking algorithms can generate unbalanced ranking when the graph has underlying communities, resulting in loss of information, polarised opinions, and reduced diversity (Celis, Straszak & Vishnoi [ICALP 2018]).In this paper, we focus onunsupervised rankingon graphs and observe that popular centrality measure based ranking algorithms such as PageRank may often generate unbalanced ranking here as well. We address this issue by coining a new approach, which we termrelative centrality. Our approach is based on an iterative graph-dependent local normalization of the centrality score, which promotes balancedness while maintaining the validity of the ranking.We further quantify reasons behind this unbalancedness of centrality measures on a novel structure that we propose is called multi-core-periphery with communities (MCPC). We also provide theoretical and extensive simulation support for our approach towards resolving the unbalancedness in MCPC.Finally, we consider graph embeddings of 11 single-cell datasets. We observe that top-ranked as per existing centrality measures are better separable into the ground truth communities. However, due to the unbalanced ranking, the top nodes often do not contain points from some communities. Here, our relative-centrality-based approach generates a ranking that provides a similar improvement in clusterability while providing significantly higher balancedness.

2125Model Entanglement for solving Privacy Preserving in Federated Learning

[openreview] [pdf]

Abstract Federated learning (FL) is widely adopted as a secure and reliable distributed machine learning system for it allows participants to retain their training data locally, transmitting only model updates, such as gradients or parameters. However, the transmission process to the server can still lead to privacy leakage, as the updated information may be exploited to launch various privacy attacks. In this work, we present a key observation that the middle layer outputs, referred to as data representations, can exhibit independence in value distribution across different types of data. This enables us to capture the intrinsic relationship between data representations and private data, and inspires us to propose a Model Entanglement(ME) strategy aimed at enhancing privacy preserving by obfuscating the data representations of private models in a fine-grained manner, while improving the balance between privacy preservation and model accuracy. We compare our approach to the baseline FedAvg and two state-of-the-art defense methods. Our method demonstrates strong defense capabilities against mainstream privacy attacks, only reducing the global model accuracy by less than 0.7% and training efficiency of 6.8% respectively on the widely used dataset, excelling in both accuracy and privacy preserving.

2126Gradient descent with generalized Newton’s method

[openreview] [pdf]

Abstract We propose the generalized Newton’s method (GeN) --- a Hessian-informed approach that applies to any optimizer such as SGD and Adam, and covers the Newton-Raphson method as a sub-case. Our method automatically and dynamically selects the learning rate that accelerates the convergence, without the intensive tuning of the learning rate scheduler. In practice, our method is easily implementable, since it only requires additional forward passes with almost zero computational overhead (in terms of training time and memory cost), if the overhead is amortized over many iterations. We present extensive experiments on language and vision tasks (e.g. GPT and ResNet) to showcase that GeN optimizers match the state-of-the-art performance, which was achieved with carefully tuned learning rate schedulers.

2127On Explaining Equivariant Graph Networks via Improved Relevance Propagation

[openreview] [pdf]

Abstract We consider explainability in equivariant graph neural networks for 3D geometric graphs. While many XAI methods have been developed for analyzing graph neural networks, they predominantly target 2D graph structures. The complex nature of 3D data and the sophisticated architectures of equivariant GNNs present unique challenges. Current XAI techniques either struggle to adapt to equivariant GNNs or fail to effectively handle positional data and evaluate the significance of geometric features adequately. To address these challenges, we introduce a novel method, known as EquiGX, which uses the Deep Taylor decomposition framework to extend the layer-wise relevance propagation rules tailored for spherical equivariant GNNs. Our approach decomposes prediction scores and back-propagates the relevance scores through each layer to the input space. Our decomposition rules provide a detailed explanation of each layer’s contribution to the network’s predictions, thereby enhancing our understanding of how geometric and positional data influence the model’s outputs. Through experiments on both synthetic and real-world datasets, our method demonstrates its capability to identify critical geometric structures and outperform alternative baselines. These results indicate that our method provides significantly enhanced explanations for equivariant GNNs.

2128AdaWM: Adaptive World Model based Planning for Autonomous Driving

[openreview] [pdf]

Abstract World model based reinforcement learning (RL) has emerged as a promising approach for autonomous driving, which learns a latent dynamics model and uses it to train a planning policy. To speed up the learning process, the pretrain-finetune paradigm is often used, where online RL is initialized by a pretrained model and a policy learned offline. However, naively performing such initialization in RL may result in dramatic performance degradation during the online interactions in the new task. To tackle this challenge, we first analyze the performance degradation and identify two primary root causes therein: the mismatch of the planning policy and the mismatch of the dynamics model, due to distribution shift. We further analyze the effects of these factors on performance degradation during finetuning, and our findings reveal that the choice of finetuning strategies plays a pivotal role in mitigating these effects. We then introduce AdaWM, an Adaptive World Model based planning method, featuring two key steps: (a) mismatch identification, which quantifies the mismatches and informs the finetuning strategy, and (b) alignment-driven finetuning, which selectively updates either the policy or the model as needed using efficient low-rank updates. Extensive experiments on the challenging CARLA driving tasks demonstrate that AdaWM significantly improves the finetuning process, resulting in more robust and efficient performance in autonomous driving systems.

2129Understanding Grokking: Insights from Neural Network Robustness

[openreview] [pdf]

Abstract Recently, an interesting phenomenon called grokking has gained much attention, where generalization occurs long after the models have initially overfitted the training data. We try to understand this seemingly strange phenomenon through the robustness of the neural network. From a robustness perspective, we show that the usually observed decreasing of l2l_2 weight norm of the neural network is theoretically connected to the occurrence of grokking. Therefore, we propose to use perturbation-based methods to enhance robustness and speed up the generalization process. Furthermore, we show that the speed-up of generalization when using our proposed method can be explained by learning the commutative law, a necessary condition when the model groks on the test dataset. In addition, we empirically observe that l2l_2 norm correlates with grokking on the test data not in a timely way and then propose new metrics based on robustness that correlate better with the grokking phenomenon.

2130OPTIMAL TRANSPORT BARYCENTER VIA NONCONVEX CONCAVE MINIMAX OPTIMIZATION

[openreview] [pdf]

Abstract The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the \emph{unregularized} barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension d>1d > 1. Most practical algorithms for approximating the barycenter problem are based on entropic regularization. In this paper, we introduce a nearly linear time O(mlogm)O(m \log{m}) and linear space complexity O(m)O(m) primal-dual algorithm, the Wasserstein-Descent H˙1\dot{\mathbb{H}}^1-Ascent (WDHA) algorithm, for computing the exact barycenter when the input probability density functions are discretized on an mm-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., 1024×10241024 \times 1024 images) 2D synthetic and real data.

2131Generalized Smooth Stochastic Variational Inequalities: Almost Sure Convergence and Convergence Rates

[openreview] [pdf]

Abstract This paper focuses on solving a stochastic variational inequality (SVI) problem under relaxed smoothness assumption for a class of structured non-monotone operators. The SVI problem has attracted significant interest in the machine learning community due to its immediate application to adversarial training and multi-agent reinforcement learning. In many such applications, the resulting operators do not satisfy the smoothness assumption. To address this issue, we focus on the generalized smoothness assumption and consider two well-known stochastic methods with clipping, namely, projection and Korpelevich. For these clipped methods, we provide the first almost-sure convergence results without making any assumptions on the boundedness of either the stochastic operator or the stochastic samples. Furthermore, we provide the first in-expectation convergence rate results for these methods under a relaxed smoothness assumption.

2132Long Context Compression with Activation Beacon

[openreview] [pdf]

Abstract Long context compression is a critical research problem due to its significance in reducing the high computational and memory costs associated with LLMs. In this paper, we propose Activation Beacon, a plug-in module for transformer-based LLMs that targets effective, efficient, and flexible compression of long contexts. To achieve this, our method introduces the following technical designs.We directly compress the activations (i.e. keys and values at every layer), rather than leveraging soft prompts to relay information (which constitute a major bottleneck to encapsulate the complex information within long contexts).We tailor the compression workflow, where each fine-grained input unit is progressively compressed, enabling high-quality compression and efficient computation during both training and inference.We train the model through compression-based auto-regression, making full use of plain texts and instructional data to optimize the model’s compression performance.During training, we randomly sample a compression ratio at each step, teaching the model to support a wide range of compression configurations.Extensive evaluations are conducted on various long-context tasks whose lengths (e.g., 128K) may far exceed the maximum training length (20K), such as document understanding, few-shot learning, and Needle-in-a-Haystack. Whilst existing methods struggle to handle these challenging tasks, Activation Beacon maintains a comparable performance to the uncompressed baseline across various scenarios, achieving a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.

2133Towards Understanding Token Selection in Self-Attention: Successes and Pitfalls in Learning Random Walks

[openreview] [pdf]

Abstract As a key component of the transformer architecture, the self-attention mechanism is known for its capability to perform token selection, which can often significantly enhance model performance. However, when and how self-attention can be trained to perform effective token selection remains poorly understood in theory. In this paper, we study the problem of using a single self-attention layer to learn random walks on circles. We theoretically demonstrate that, after training with gradient descent, the self-attention layer can successfully learn the Markov property of the random walk, and achieve optimal next-token prediction accuracy by focusing on the correct parent token. In addition, we also study the performance of a single self-attention layer in learning relatively simpler “deterministic walks” on circles. Surprisingly, in this case, our findings indicate that the self-attention model trained with gradient descent consistently yields next-token prediction accuracy no better than a random guess. This counter-intuitive observation that self-attention can learn random walks but struggles with deterministic walks reveals a potential issue in self-attention: when there are multiple highly informative tokens, self-attention may fail to properly utilize any of them.

2134Suppressing recency bias through implicit task in task-agnostic continual adaptation for foundation language models

[openreview] [pdf]

Abstract Foundation language models have significantly advanced natural language processing but face challenges such as catastrophic forgetting when adapting to dynamic environments with diverse tasks. Recently, among the continual learning (CL) methods for these models, model architecture expansion methods have been spotlighted due to the growth of parameter-efficient fine-tuning (PEFT) methods. However, these methods need to store past PEFT adapters for each task and require task identifiers (task IDs) to distinguish each task, thus limiting their applicability in task-agnostic settings. They also overlook recency bias, where models focus overly on current tasks at the expense of past knowledge. To address these issues, we propose suppressing recency bias (SRB) by using the concept of implicit tasks. SRB assigns a fixed-size adapter to an implicit task, recursively storing historical knowledge through arithmetic operations with current adapters at every time step instead of task IDs. This arithmetic mitigates recency bias by integrating non-overlapping information between historical and current adapters. Our approach requires only simple arithmetic operations without backpropagation, minimizing additional computation, and allocates a fixed-size adapter to the implicit task, resulting in low memory requirements. We evaluate SRB on CL benchmarks for foundational LMs. Experimental results demonstrate that SRB outperforms state-of-the-art methods, achieving superior generalization performance across various task sequences and models by effectively mitigating recency bias.

2135From Graph Embedding to LKH: Bridging Learning and Heuristics for a Streamlined General TSP Solver

[openreview] [pdf]

Abstract The Traveling Salesman Problem (TSP) is known as one of the most notorious NP-hard combinatorial optimization problems. In recent decades, researchers from fields such as computer science, operations research, and artificial intelligence including deep learning (DL) have made numerous attempts on the problem. Among the works, the Lin-Kernighan-Helsgaun (LKH) heuristic algorithm is one of the most competent methods for obtaining optimal or near-optimal solutions. Despite the rapid development in DL-based solvers, few of them can defeat LKH in terms of both running efficiency and solution quality across different distributions. In this paper, we would introduce a very novel approach that enhances LKH with graph embedding (GE) techniques in solving general TSP (distances can be non-metric and asymmetric), named as Embed-LKH. It is presented as two stages: i) in the GE stage, it transforms the distances to transition probabilities, then conduct GE given the transition probabilities, and finally it uses the learned embeddings to construct the so-called `ghost distances’; ii) in the LKH stage, LKH generates candidates based on the ghost distances but searches tours according to the original distances. As the experiments show, compared with the original LKH counterpart, in most cases, our approach can obtain better solutions within the same amount of trials across six distance distributions (non-metric and asymmetric: normal, uniform, exponential, metric and symmetric: Euclidean 2D/10D/50D) and two problem scales (TSP-100/1000). The source files, running scripts, and data will be made publicly available after the review.

2136Demystifying GNN Distillation by Replacing the GNN

[openreview] [pdf]

Abstract It has recently emerged that Multilayer Perceptrons (MLPs) can achieve excellent performance on graph node classification, but only if they distill a previously-trained Graph Neural Network (GNN). This finding is confusing; if MLPs are expressive enough to perform node classification, what is the role of the GNNs? This paper aims to answer this question. Rather than suggesting a new technique, we aim to demystify GNN distillation methods. Through our analysis, we identify the key properties of GNNs that enable them to serve as effective regularizers, thereby overcoming limited training data. We validate our analysis by demonstrating an MLP training process that successfully leverages GNN-like properties without actually training a GNN.

2137Optimizing Attention

[openreview] [pdf]

Abstract The attention mechanism is an important part of transformer architectures. It en- ables the network to compare samples within a sequence. Before the comparison is performed, tokens are multiplied by trainable matrices. These matrices can constitute a significant part of the total number of parameters. Their size creates problems on systems with limited cache in the compute unit, especially if there is limited bandwidth between compute unit and memory. In particular, GPUs on mobile devices suffer from this double bottleneck. Prior works mitigate this problem for instance by storing low-rank approxima- tions, quantization or minimizing the amount of data that needs to be transferred. In this paper, an alternative to the traditional attention mechanism is proposed which does not require any trainable matrices to perform the attention. The idea rests upon solving optimization problems, whereby memory is substituted for compute. It will be shown however, that the computational demand can be re- duced such that auto-differentiation becomes possible. An experimental evalua- tion shows that the proposed algorithm performs favorable compared with several baselines.

2138Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning

[openreview] [pdf]

Abstract Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently proposed for discrete settings and shown to be a promising metric for robotic exploration problems. In this work, we propose using BE as a principled exploration objective for systematically generating datasets that provide diverse state space coverage in complex, continuous, potentially high-dimensional domains. To achieve this, we extend the notion of BE to continuous settings, derive tractable kk-nearest neighbor estimators, provide theoretical guarantees for these estimators, and develop practical reward functions that can be used with standard RL methods to learn BE-maximizing policies. Using standard MuJoCo environments, we experimentally compare the performance of offline RL algorithms for a variety of downstream tasks on datasets generated using BE, R’{e}nyi, and Shannon entropy-maximizing policies. We find that offline RL algorithms trained on datasets collected using BE outperform those trained on datasets collected using Shannon entropy on all tasks considered, and on 80% of the tasks compared to datasets collected using R’{e}nyi entropy.

2139GODA: Goal-conditioned Data Augmentation

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) enables policy learning from pre-collected offline datasets, relaxing the need to interact directly with the environment. However, limited by the quality of offline datasets, it generally fails to learn well-qualified policies in suboptimal datasets. To address datasets with insufficient optimal demonstrations, we introduce Goal-cOnditioned Data Augmentation (GODA), a novel goal-conditioned diffusion-based method for augmenting samples with higher quality. Leveraging recent advancements in generative modeling, GODA incorporates a return-oriented goal condition with various selection mechanisms. Specifically, we introduce a controllable scaling technique to provide enhanced return-based guidance during data sampling. GODA learns a comprehensive distribution representation of the original offline datasets while generating new data with selectively higher-return goals, thereby maximizing the utility of limited optimal demonstrations. Furthermore, we propose a novel adaptive gated conditioning method for processing noised inputs and conditions, enhancing the capture of goal-oriented guidance. We conduct experiments on the D4RL benchmark and real-world challenges, specifically traffic signal control (TSC) tasks, to demonstrate GODA’s effectiveness in enhancing data quality and superior performance compared to state-of-the-art data augmentation methods across various offline RL algorithms. Our code will be publicly accessible upon review.

2140Context Matters: Leveraging Contextual Features for Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasts are often influenced by exogenous contextual features in addition to their corresponding history. For example, in financial settings, it is hard to accurately predict a stock price without considering public sentiments and policy decisions in the form of news articles, tweets, etc. Though this is common knowledge, the current state-of-the-art (SOTA) forecasting models fail to incorporate such contextual information, owing to its heterogeneity and multimodal nature. To address this, we introduce ContextFormer, a novel plug-and-play method to surgically integrate multimodal contextual information into existing pre-trained forecasting models. ContextFormer effectively distills forecast-specific information from rich multimodal contexts, including categorical, continuous, time-varying, and even textual information, to significantly enhance the performance of existing base forecasters. ContextFormer outperforms SOTA forecasting models by up to 30% on a range of real-world datasets spanning energy, traffic, environmental, and financial domains.

2141DMQR-RAG: Diverse Multi-Query Rewriting in Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Large language models often encounter challenges with static knowledge and hallucinations, which undermine their reliability. Retrieval-augmented generation (RAG) mitigates these issues by incorporating external information. However, user queries frequently contain noise and intent deviations, necessitating query rewriting to improve the relevance of retrieved documents. In this paper, we introduce DMQR-RAG, a Diverse Multi-Query Rewriting framework designed to improve the performance of both document retrieval and final responses in RAG. Specifically, we investigate how queries with varying information quantities can retrieve a diverse array of documents, presenting four rewriting strategies that operate at different levels of information to enhance the performance of baseline approaches. Additionally, we propose an adaptive strategy selection method that minimizes the number of rewrites while optimizing overall performance. Our methods have been rigorously validated through extensive experiments conducted in both academic and industry settings.

2142Federated Dynamical Low-Rank Training with Global Loss Convergence Guarantees

[openreview] [pdf]

Abstract In this work, we propose a federated dynamical low-rank training (FeDLRT) scheme to reduce client compute and communication costs - two significant per- formance bottlenecks in horizontal federated learning. Our method builds upon dynamical low-rank splitting schemes for manifold-constrained optimization to create a global low-rank basis of network weights, which enables client training on a small coefficient matrix. A consistent global low-rank basis allows us to incorpo- rate a variance correction scheme and prove global loss descent and convergence to a stationary point. Dynamic augmentation and truncation of the low-rank bases automatically optimizes computing and communication resource utilization. We demonstrate the efficiency of FeDLRT in an array of computer vision benchmarks and show a reduction of client compute and communication costs by up to an order of magnitude with minimal impacts on global accuracy.

2143RLHF with Inconsistent Multi-Agent Feedback Under General Function Approximation: A Theoretical Perspective

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) has been widely studied, as a method for leveraging feedback from human evaluators to guide the learning process. However, existing theoretical analyses typically assume that the human feedback is generated by the ground-truth reward function. This may not be true in practice, because the reward functions in human minds for providing feedback are usually different from the ground-truth reward function, e.g., due to diverse personal experiences and inherent biases. Such inconsistencies could lead to undesirable outcomes when applying existing algorithms, particularly when considering feedback from heterogeneous agents. Therefore, in this paper, we make the first effort to investigate a more practical and general setting of RLHF, where feedback could be generated by multiple agents with reward functions differing from the ground truth. To address this challenge, we develop a new algorithm with novel ideas for handling inconsistent multi-agent feedback, including a Steiner-Point-based confidence set to exploit the benefits ofmulti-agentfeedback and a new weighted importance sampling method to manage complexity issues arising frominconsistency. Our theoretical analysis develops new methods to demonstrate the optimality of our algorithm. This result is the first of its kind to demonstrate the fundamental impact and potential of inconsistent multi-agent feedback in RLHF.

2144TimeRAG: It’s Time for Retrieval-Augmented Generation in Time-Series Forecasting

[openreview] [pdf]

Abstract Time-series data are essential for forecasting tasks across various domains. While Large Language Models (LLMs) have excelled in many areas, they encounter significant challenges in time-series forecasting, particularly in extracting relevant information from extensive temporal datasets. Unlike textual data, time-series data lack explicit retrieval ground truths, complicating the retrieval process. To tackle these issues, we present TimeRAG, a novel retrieval-augmented approach tailored for time-series forecasting. Our method uniquely applies to continuous and complex temporal sequences, and it is trained using LLM feedback, effectively addressing the absence of ground truth and aligning the priorities of the retriever and the LLM. Experimental results demonstrate the effectiveness of TimeRAG, highlighting its ability to significantly enhance forecasting performance and showcasing the potential of LLMs in time-series prediction tasks.

2145SONNET: Solar-disaggregation-based Day-ahead Probabilistic Net Load Forecasting with Transformers

[openreview] [pdf]

Abstract The global transition towards sustainable energy sources has positioned solar power as a cornerstone of modern electricity systems, underscoring the critical need for advanced forecasting techniques in grid management. Accurate net load forecasting is crucial for efficient and reliable power grid operations, especially with the rapid deployment of behind-the-meter (BTM) renewable energy sources such as rooftop solar. Notably, BTM solar generation is neither controlled nor monitored by utilities and hence only net load data are observed. Different from load forecasting, net load forecasting faces new challenges because BTM solar, a major component of net load, behaves very differently from and is much more variable than loads. To exploit the distinct natures of solar generation and load and unlock their predictive potentials, we propose SONNET{\bf SONNET}, which stands for SO{\bf SO}lar-disaggregatioN{\bf N}-based NE{\bf NE}t load forecasting with T{\bf T}ransformers. It is a novel probabilistic net load forecasting method based on disaggregating net loads into solar generation and loads and feeding both into the predictors. The method further features a) an enhanced Transformer architecture that integrates both historical and future input data, employing a combination of self-attention and cross-attention mechanisms, and b) a data augmentation method that enhances the robustness of net load forecasts against weather forecast errors. Extensive experiments are conducted based on the comprehensive real-world data set from a recent net load forecasting competition organized by the U.S. Department of Energy (DOE). It is demonstrated that our proposed method both improves the accuracy and reduces the uncertainty of net load forecasts. Notably, our proposed method significantly outperforms the state-of-the-art. The proposed techniques also have broad applications for energy and/or general forecasting-related problems.

2146A Precompute-Then-Adapt Approach for Efficient Graph Condensation

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have shown great success in leveraging complex relationships in data but face significant computational challenges when dealing with large-scale graphs. To tackle this issue, graph condensation methods aim to compress large graphs into smaller, synthetic ones that can be efficiently used for GNN training. Recent approaches, particularly those based on trajectory matching, have achieved state-of-the-art (SOTA) performance in graph condensation tasks. Trajectory-based techniques match the training behavior on a condensed graph closely with that on the original graph, typically by guiding the trajectory of model parameters during training. However, these methods require repetitive re-training of GNNs during the condensation process, making them impractical for large graphs due to their high computational cost, \eg, taking up to 22 days to condense million-node graphs. In this paper, we propose a novel Precompute-then-Adapt graph condensation framework that overcomes this limitation by separating the condensation process into a one-time precomputation stage and a one-time adaptation learning stage. Remarkably, even with only the precomputation stage, which typically takes seconds, our method surpasses or matches SOTA results on 3 out of 7 benchmark datasets. Extensive experiments demonstrate that our approach achieves better or comparable accuracy while being 96× to 2,455× faster in condensation time compared to SOTA methods, significantly enhancing the practicality of GNNs for large-scale graph applications. Our code and data are available at \url{https://anonymous.4open.science/r/GCPA-F6F9/}.

2147Enhancing Trust-Region Bayesian Optimization via Derivatives of Gaussian Processes

[openreview] [pdf]

Abstract Bayesian Optimization (BO) has been widely applied to optimize expensive black-box functions while retaining sample efficiency. However, scaling BO to high-dimensional spaces remains challenging. Existing literature proposes performing standard BO in several local trust regions (TuRBO) for heterogeneous modeling of the objective function and avoiding over-exploration. Despite its advantages, using local Gaussian Processes (GPs) reduces sampling efficiency compared to a global GP. To enhance sampling efficiency while preserving heterogeneous modeling, we propose to construct several local quadratic models using gradients and Hessians from a global GP, and select new sample points by solving the bound-constrained quadratic program. We provide a convergence analysis and demonstrate through experimental results that our method enhances the efficacy of TuRBO and outperforms a wide range of high-dimensional BO techniques on synthetic functions and real-world applications.

2148Language Models Can Help to Learn High-Performing Cost Functions for Recourse

[openreview] [pdf]

Abstract Algorithmic recourse is a specialised variant of counterfactual explanation, concerned with offering actionable recommendations to individuals who have received adverse outcomes from automated systems. Most recourse algorithms assume access to a cost function, which quantifies the effort involved in following recommendations. Such functions are useful for filtering down recourse options to those which are most actionable. In this study, we explore the use of large language models (LLMs) to help label data for training recourse cost functions, while preserving important factors such as transparency, fairness, and performance. We find that while LLMs do generally align with human judgements of cost, and can label data for the training of effective cost functions, a high-level schematic of the function parameters should be engineered into the labelling prompt to maximise performance. Previously, recourse cost definitions have mainly relied on heuristics and missed the complexities of feature dependencies and fairness attributes, which has drastically limited their usefulness. Our results show that it is possible to train a high-performing, interpretable cost function by consulting an LLM via careful prompt engineering. Furthermore, these cost functions can be customised to add or remove biases as befitting the domain and problem. Overall, this study suggests a simple, accessible method for accurately quantifying notions of cost, effort, or distance between data points that correlate with human intuition, with possible applications throughout the explainable AI field.

2149Llamas (mostly) think in English: On Causal Interventions in the Latent Language of Transformers

[openreview] [pdf]

Abstract Previous research on the Llama-2 family of Large Language Models (LLMs) suggested a correlation indicating the use of English as a intermediary language within these models for tasks in non-English languages. We improve on this by demonstrating a causal relationship. By intervening on the intermediate layers during a forward pass, we show that projecting out the activations onto a subspace corresponding to the correct prediction in English impairs the model’s ability to make correct predictions on non-English translation tasks. Projecting onto an unrelated English subspace, or a related subspace in a non-English language, has little effect, demonstrating that this family of models store concepts that have a high similarity to the corresponding concept in English in the residual stream.

2150Attention Head Purification: A New Perspective to Harness CLIP for Domain Generalization

[openreview] [pdf]

Abstract Domain Generalization (DG) aims to learn a model from multiple source domains to achieve satisfactory performance on unseen target domains. Recent works introduce CLIP to DG tasks due to its superior image-text alignment and zero-shot performance. Previous methods either utilize full fine-tuning or prompt learning paradigms to harness CLIP for DG tasks. Those works focus on avoiding catastrophic forgetting of the original knowledge encoded in CLIP but ignore that the knowledge encoded in CLIP in nature may contain domain-specific cues that constrain its domain generalization performance. In this paper, we propose a new perspective to harness CLIP for DG, i.e., attention head purification. We observe that different attention heads may encode different properties of an image and selecting heads appropriately may yield remarkable performance improvement across domains. Based on such observations, we purify the attention heads of CLIP from two levels, including task-level purification and domain-level purification. For task-level purification, we design head-aware LoRA to make each head more adapted to the task we considered. For domain-level purification, we perform head selection via a simple gating strategy. We utilize MMD loss to encourage masked head features to be more domain-invariant to emphasize more generalizable properties/heads. During training, we jointly perform task-level purification and domain-level purification. We conduct experiments on various representative DG benchmarks. Though simple, extensive experiments demonstrate that our method performs favorably against previous state-of-the-arts.

2151Oldie but Goodie: Re-illuminating Label Propagation on Graphs with Partially Observed Features

[openreview] [pdf]

Abstract In real-world graphs, we often encounter missing feature situations where a few or the majority of node features, e.g., sensitive information, are missed. Although the recently proposed Feature Propagation algorithm mitigates such situations to some degree, it falls short when only partial features are available, sometimes performing worse than traditional structure-based graph models. To overcome this limitation, we spotlight a classical algorithm, Label Propagation (Oldie), and further illuminate its potential, especially when only a partial feature is available. Now called by Goodie, it takes a hybrid approach to obtain embeddings from the Label Propagation branch and Feature Propagation branch. To do so, we first design a GNN-based decoder that enables the Label Propagation branch to output hidden embeddings that align with those of the FP branch. Then, Goodie automatically captures the significance of structure and feature information thanks to the newly designed Structure-Feature Attention. Followed by a novel Pseudo-Label contrastive learning that differentiates the contribution of each positive pair within pseudo-labels originating from the LP branch, Goodie outputs the final prediction for the unlabeled nodes. Through extensive experiments, we demonstrate that our proposed model, Goodie, outperforms the existing state-of-the art methods not only when only a few features are available but also in abundantly available situations.

2152A Manifold Perspective on the Statistical Generalization of Graph Neural Networks

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) extend convolutional neural networks to operate on graphs. Despite their impressive performances in various graph learning tasks, the theoretical understanding of their generalization capability is still lacking. Previous GNN generalization bounds ignore the underlying graph structures, often leading to bounds that increase with the number of nodes – a behavior contrary to the one experienced in practice. In this paper, we take a manifold perspective to establish the statistical generalization theory of GNNs on graphs sampled from a manifold in the spectral domain. As demonstrated empirically, we prove that the generalization bounds of GNNs decrease linearly with the size of the graphs in the logarithmic scale, and increase linearly with the spectral continuity constants of the filter functions. Notably, our theory explains both node-level and graph-level tasks. Our result has two implications: i) guaranteeing the generalization of GNNs to unseen data over manifolds; ii) providing insights into the practical design of GNNs, i.e., restrictions on the discriminability of GNNs are necessary to obtain a better generalization performance. We demonstrate our generalization bounds of GNNs using synthetic and multiple real-world datasets.

2153FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning of foundation models. However, applying LoRA in federated learning environments, where data is distributed across multiple clients, presents unique challenges. Existing methods rely on traditional federated averaging of LoRA adapters, resulting in inexact updates. To address this, we propose Federated Exact LoRA, or FedEx-LoRA, which adds a residual error term to the pretrained frozen weight matrix. Our approach achieves exact updates with minimal computational and communication overhead, preserving LoRA’s efficiency. We evaluate the method on various Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks, showing consistent performance gains over state-of-the-art methods across multiple settings. Through extensive analysis, we quantify that the deviations in updates from the ideal solution are significant, highlighting the need for exact aggregation. Our method’s simplicity, efficiency, and broad applicability position it as a promising solution for accurate and effective federated fine-tuning of foundation models.

2154T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching

[openreview] [pdf]

Abstract Sampling from diffusion probabilistic models (DPMs) is often expensive for high-quality image generation and typically requires many steps with a large model. In this paper, we introduce sampling Trajectory Stitching (T-Stitch), a simple yet efficient technique to improve the sampling efficiency with little or no generation degradation. Instead of solely using a large DPM for the entire sampling trajectory, T-Stitch first leverages a smaller DPM in the initial steps as a cheap drop-in replacement of the larger DPM and switches to the larger DPM at a later stage. Our key insight is that different diffusion models learn similar encodings under the same training data distribution and smaller models are capable of generating good global structures in the early steps. Extensive experiments demonstrate that T-Stitch is training-free, generally applicable for different architectures, and complements most existing fast sampling techniques with flexible speed and quality trade-offs. On DiT-XL, for example, 40% of the early timesteps can be safely replaced with a 10x faster DiT-S without performance drop on class-conditional ImageNet generation. We further show that our method can also be used as a drop-in technique to not only accelerate the popular pretrained stable diffusion (SD) models but also improve the prompt alignment of stylized SD models from the public model zoo. Finally, the explicit model allocation strategy of T-Stitch significantly reduces the need of training or searching, delivering high deployment efficiency.

2155Joint Gradient Balancing for Data Ordering in Finite-Sum Multi-Objective Optimization

[openreview] [pdf]

Abstract In finite-sum optimization problems, the sample orders for parameter updates can significantly influence the convergence rate of optimization algorithms. While numerous sample ordering techniques have been proposed in the context of single-objective optimization, the problem of sample ordering in finite-sum multi-objective optimization has not been thoroughly explored. To address this gap, we propose a sample ordering method called JoGBa, which finds the sample orders for multiple objectives by jointly performing online vector balancing on the gradients of all objectives. Our theoretical analysis demonstrates that this approach outperforms the standard baseline of random ordering and accelerates the convergence rate for the MGDA algorithm. Empirical evaluation across various datasets with different multi-objective optimization algorithms further demonstrates that JoGBa can achieve faster convergence and superior final performance than other data ordering strategies.

2156Understanding Gradient Descent through the Training Jacobian

[openreview] [pdf]

Abstract We examine the geometry of neural network training using the Jacobian of trained network parameters with respect to their initial values. Our analysis reveals low-dimensional structure in the training process which is dependent on the input data but largely independent of the labels. We find that the singular value spectrum of the Jacobian matrix consists of three distinctive regions: a “chaotic” region of values orders of magnitude greater than one, a large “bulk” region of values extremely close to one, and a “stable” region of values less than one. Along each bulk direction, the left and right singular vectors are nearly identical, indicating that perturbations to the initialization are carried through training almost unchanged. These perturbations have virtually no effect on the network’s output in-distribution, yet do have an effect far out-of-distribution. While the Jacobian applies only locally around a single initialization, we find substantial overlap in bulk subspaces for different random seeds.

2157Retrieval-Augmented Decision Transformer: External Memory for In-context RL

[openreview] [pdf]

Abstract In-context learning (ICL) is the ability of a model to learn a new task by observing a few exemplars in its context. While prevalent in NLP, this capability has recently also been observed in Reinforcement Learning (RL) settings. Prior in-context RL methods, however, require entire episodes in the agent’s context. Given that complex environments typically lead to long episodes with sparse rewards, these methods are constrained to simple environments with short episodes. To address these challenges, we introduce Retrieval-Augmented Decision Transformer (RA-DT). RA-DT employs an external memory mechanism to store past experiences from which it retrieves only sub-trajectories relevant for the current situation. The retrieval component in RA-DT does not require training and can be entirely domain-agnostic. We evaluate the capabilities of RA-DT on grid-world environments, robotics simulations, and procedurally-generated video games. On grid worlds, RA-DT outperforms baselines, while using only a fraction of their context length. Furthermore, we illuminate the limitations of current in-context RL methods on complex environments and discuss future directions. To facilitate future research, we release datasets for four of the considered environments.

2158Large Language Model Alignment via Inverse Reinforcement Learning from Demonstrations

[openreview] [pdf]

Abstract Aligning Large Language Models (LLMs) is crucial for enhancing their safety and utility. However, existing methods, primarily based on preference datasets, face challenges such as noisy labels, high annotation costs, and privacy concerns. In this work, we introduceAlignment from Demonstrations(AfD), a novel approach leveraging high-quality demonstration data to overcome these challenges. We formalize AfD within a sequential decision-making framework, highlighting its unique challenge of missing reward signals. Drawing insights from forward and inverse reinforcement learning, we introduce divergence minimization objectives for AfD. Analytically, we elucidate the mass-covering and mode-seeking behaviors of various approaches, explaining when and why certain methods are superior. Practically, we propose a computationally efficient algorithm that extrapolates over a tailored reward model for AfD. We validate our key insights through experiments on the Harmless and Helpful tasks, demonstrating their strong empirical performance while maintaining simplicity.

2159Which Experiences Are Influential for RL Agents? Efficiently Estimating The Influence of Experiences

[openreview] [pdf]

Abstract In reinforcement learning (RL) with experience replay, experiences stored in a replay buffer influence the RL agent’s performance. Information about how these experiences influence the agent’s performance is valuable for various purposes, such as identifying experiences that negatively influence underperforming agents. One method for estimating the influence of experiences is the leave-one-out (LOO) method. However, this method is usually computationally prohibitive. In this paper, we present Policy Iteration with Turn-over Dropout (PIToD), which efficiently estimates the influence of experiences. We evaluate how accurately PIToD estimates the influence of experiences and its efficiency compared to LOO. We then apply PIToD to amend underperforming RL agents, i.e., we use PIToD to estimate negatively influential experiences for the RL agents and to delete the influence of these experiences. We show that RL agents’ performance is significantly improved via amendments with PIToD.

2160Brain-inspired Multi-View Incremental Learning for Knowledge Transfer and Retention

[openreview] [pdf]

Abstract The human brain exhibits remarkable proficiency in dynamic learning and adaptation, seamlessly integrating prior knowledge with new information, thereby enabling flexible memory retention and efficient transfer across multiple views. In contrast, traditional multi-view learning methods are predominantly designed for static and fixed-view datasets, leading to the notorious “view forgetting phenomenon”, where the introduction of new views leads to the erosion of prior knowledge. This phenomenon starkly contrasts with the brain’s remarkable ability to continuously integrate and migrate past knowledge, ensuring both the retention of old information and the assimilation of new insights. This oversight presents a critical challenge: how to efficiently learn and integrate new views while simultaneously preserving knowledge from previously acquired views and enabling flexible knowledge transfer across diverse perspectives.Inspired by underlying neural processing mechanisms, we propose a view transfer learning framework named Hebbian View Orthogonal Projection (HVOP), which realizes efficient knowledge migration and sharing between multi-view data. HVOP constructs a knowledge transfer space (KTS), where the KTS reduces the interference between the old and the new views through an orthogonal learning mechanism. By further incorporating recursive lateral connections and Hebbian learning, the proposed model endows the learning process with brain-like dynamic adaptability, enhancing knowledge transfer and integration, and bringing the model closer to human cognition. We extensively validate the proposed model on node classification tasks and demonstrate its superior performance in knowledge retention and transfer compared to traditional methods. Our results underscore the potential of biologically inspired mechanisms in advancing multi-view learning and mitigating the view forgetting phenomenon.

2161Multiplayer Federated Learning: Reaching Equilibrium with Less Communications

[openreview] [pdf]

Abstract Traditional Federated Learning (FL) approaches assume collaborative clients with aligned objectives working towards a shared global model. However, in many real-world scenarios, clients act as rational players with individual objectives and strategic behaviors, a concept that existing FL frameworks are not equipped to adequately address. To bridge this gap, we introduce Multiplayer Federated Learning (MpFL), a novel framework that models the clients in the FL environment as players in a game-theoretic context, aiming to reach an equilibrium. In this scenario, each player tries to optimize their own utility function, which may not align with the collective goal. Within MpFL, we propose Per-Player Local SGD (PEARL-SGD), an algorithm in which each player/client performs local updates independently and periodically communicates with other players. We theoretically analyze PEARL-SGD under different step-size selections and prove that it reaches an equilibrium with less communications compared to its non-local counterpart. Finally, we verify our theoretical findings through numerical experiments.

2162Multimodal Instruction Tuning with Hybrid State Space Models

[openreview] [pdf]

Abstract Handling lengthy context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models (MLLMs) in applications such as processing high-resolution images or high frame rate videos. The rise in image resolution and frame rate substantially increases computational demands due to the increased number of input tokens. This challenge is further exacerbated by the quadratic complexity with respect to sequence length of the self-attention mechanism. Most prior works either pre-train models with long contexts, overlooking the efficiency problem, or attempt to reduce the context length via downsampling (e.g., identify the key image patches or frames) to decrease the context length, which may result in information loss. To circumvent this issue while keeping the remarkable effectiveness of MLLMs, we propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications. Our multimodal model can effectively process long context input exceeding 100k tokens, outperforming existing models across various benchmarks. Remarkably, our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models, with efficiency gains increasing as image resolution or video frames rise. Furthermore, our model is the first to be trained on low-resolution images or low-frame-rate videos while being capable of inference on high-resolution images and high-frame-rate videos, offering flexibility for inference in diverse scenarios.

2163Cognitive Overload Attack: Prompt Injection for Long Context

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in performing tasks across various domains without needing explicit retraining. This capability, known as In-Context Learning (ICL), while impressive, exposes LLMs to a variety of adversarial prompts and jailbreaks that manipulate safety-trained LLMs into generating undesired or harmful output. In this paper, we propose a novel interpretation of ICL in LLMs through the lens of cognitive neuroscience, by drawing parallels between learning in human cognition with ICL. We applied the principles of Cognitive Load Theory in LLMs and empirically validate that similar to human cognition, LLMs also suffer from \emph{cognitive overload}—a state where the demand on cognitive processing exceeds the available capacity of the model, leading to potential errors. Furthermore, we demonstrated how an attacker can exploit ICL to jailbreak LLMs through deliberately designed prompts that induce cognitive overload on LLMs, thereby compromising the safety mechanisms of LLMs. We empirically validate this threat model by crafting various cognitive overload prompts and show that advanced models such as GPT-4, Claude-3.5 Sonnet, Claude-3 OPUS, LLAMA-3-70B-Instruct, Gemini-1.0-Pro, and Gemini-1.5-Pro can be successfully jailbroken, with attack success rates of up to 99.99%. Our findings highlight critical vulnerabilities in LLMs and underscore the urgency of developing robust safeguards. We propose integrating insights from cognitive load theory into the design and evaluation of LLMs to better anticipate and mitigate the risks of adversarial attacks. By expanding our experiments to encompass a broader range of models and by highlighting vulnerabilities in LLMs’ ICL, we aim to ensure the development of safer and more reliable AI systems.

2164DiffDeID: a Multi-conditional Diffusion-based Method for High Fidelity Face De-indentification with Diversity

[openreview] [pdf]

Abstract Face de-identification is a critical task that aims to obscure true identities while preserving other facial attributes. Current methodologies typically involve disentangling identity features within a latent space and leveraging adversarial training to balance privacy with utility, often at the cost of a trade-off between two. To surmount these limitations, we introduce DiffDeID, a novel approach grounded in diffusion models. This method incrementally safeguards identity and sustains utility, all while ensuring enhanced interpretability. Our method employs a Latent Diffusion-based ID Sample to generate authentic identity embeddings that are obfuscated from the original identity, thereby providing users with diverse options. Additionally, a multi-condition diffusion model is utilized for facial images, ensuring the retention of image utility. We further introduce a novel training and inference paradigm, utilizing the unified architecture tailored for video facial de-identification tasks. The robustness of our method is attributed to its powerful 3D prior and meticulous generation design, enabling natural identity protection, generation of high-quality details, and robustness across various attributes. Through extensive experimentation, we demonstrate that DiffDeID surpasses previous methodologies.

2165Variational Inference for Self-Supervised Speech Models Fine-tuning on Downstream Tasks

[openreview] [pdf]

Abstract Despite the growing interest in self-supervised speech models, recent research has primarily focused on modifying upstream model architectures and pretraining techniques, with less attention given to how features from self-supervised models are used. In this paper, we explore the use of variational inference to enhance the performance of self-supervised audio models in downstream tasks. We hypothesize that adaptively reweighting the outputs of the model layers is crucial to improving performance on these tasks. We extensively evaluate our method alongside widely used baselines, demonstrating that understanding sample-specific information is essential for improved performance on several tasks. Our proposed method surpasses existing approaches and generalizes to various speech tasks, including automatic speech recognition, speaker verification, and emotion recognition. Finally, we analyze our method to provide deeper insight into the importance of our modifications.

2166Transformers Can Navigate Mazes With Multi-Step Prediction

[openreview] [pdf]

Abstract Despite their remarkable success in language modeling, transformers trained to predict the next token in a sequence struggle with long-term planning. This limitation is particularly evident in tasks requiring foresight to plan multiple steps ahead such as maze navigation. The standard next single token prediction objective, however, offers no explicit mechanism to predict multiple steps ahead---or revisit the path taken so far. Consequently, in this work we study whether explicitly predicting multiple steps ahead (and backwards) can improve transformers’ maze navigation. We train under identical settings, parameter-matched transformers from scratch to navigate mazes of varying types and sizes with standard next token prediction and MLM-U: an objective explicitly predicting multiple steps ahead and backwards. We find MLM-U considerably improves transformers’ ability to navigate mazes compared to standard next token prediction across maze types and complexities. We also find MLM-U training is 4x more sample efficient and converges 2x faster in terms of GPU training hours relative to next token training. Finally, for more complex mazes we find MLM-U benefits from scaling to larger transformers. Remarkably, we find transformers trained with MLM-U outperform larger transformers trained with next token prediction using additional supervision from A* search traces. We hope these findings underscore the promise of learning objectives to advance transformers’ capacity for long-term planning.

2167Shh, don’t say that! Domain Certification in LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) are often deployed to do constrained tasks, with narrow domains. For example, customer support bots can be built on top of LLMs, relying on their broad language understanding and capabilities to enhance performance. However, these LLMs are adversarially susceptible, potentially generating outputs outside the intended domain. To formalize, assess and mitigate this risk, we introduce \emph{domain certification}; a guarantee that accurately characterizes the out-of-domain behavior of language models. We then propose a simple yet effective approach dubbed VALID that provides adversarial bounds as a certificate. Finally, we evaluate our method across a diverse set of datasets, demonstrating that it yields meaningful certificates.

2168Solving hidden monotone variational inequalities with surrogate losses

[openreview] [pdf]

Abstract Deep learning has proven to be effective in a wide variety of loss minimization problems. However, many applications of interest, like minimizing projected Bellman error and min-max optimization, cannot be modelled as minimizing a scalar loss function but instead correspond to solving a variational inequality (VI) problem. This difference in setting has caused many practical challenges as naive gradient-based approaches from supervised learning tend to diverge and cycle in the VI case. In this work, we propose a principled surrogate-based approach compatible with deep learning to solve VIs. We show that our surrogate-based approach has three main benefits: (1) under assumptions that are realistic in practice (when hidden monotone structure is present, interpolation, and sufficient optimization of the surrogates), it guarantees convergence, (2) it provides a unifying perspective of existing methods, and (3) is amenable to existing deep learning optimizers like ADAM. Experimentally, we demonstrate our surrogate-based approach is effective in min-max optimization and minimizing projected Bellman error. Furthermore, in the deep reinforcement learning case, we propose a novel variant of TD(0) which is more compute and sample efficient.

2169WeatherGFM: Learning a Weather Generalist Foundation Model via In-context Learning

[openreview] [pdf]

Abstract The Earth’s weather system involves intricate weather data modalities and diverse weather understanding tasks, which hold significant value to human life. Existing data-driven models focus on single weather understanding tasks (e.g., weather forecasting). While these models have achieved promising results, they fail to tackle various complex tasks within a single and unified model. Moreover, the paradigm that relies on limited real observations for a single scenario hinders the model’s performance upper bound. Inspired by the in-context learning paradigm from visual foundation models and large language models, in this paper, we introduce the first generalist weather generalist foundation model (WeatherGFM) to address weather understanding tasks in a unified manner. Specifically, we first unify the representation and definition for diverse weather understanding tasks. Subsequently, we design weather prompt formats to handle different weather data modalities, including single, multiple, and temporal modalities. Finally, we adopt a visual prompting question-answering paradigm for the training of unified weather understanding tasks. Extensive experiments indicate that our WeatherGFM can effectively handle up to ten weather understanding tasks, including weather forecasting, super-resolution, weather image translation, and post-processing. Our method also showcases generalization ability on unseen tasks.

2170iREPO:implicit Reward Pairwise Difference based Empirical Preference Optimization

[openreview] [pdf]

Abstract While astonishingly capable, large Language Models (LLM) can sometimes produce outputs that deviate from human expectations. Such deviations necessitate an alignment phase to prevent disseminating untruthful, toxic, or biased information. Traditional alignment methods based on reinforcement learning often struggle with the identified instability, whereas preference optimization methods are limited by their overfitting to pre-collected hard-label datasets. In this paper, we propose a novel LLM alignment framework named iiREPO, which utilizes implicit Reward pairwise difference regression for Empirical Preference Optimization. Particularly, iiREPO employs self-generated datasets labeled by empirical human (or AI annotator) preference to iteratively refine the aligned policy through a novel regression-based loss function. Furthermore, we introduce an innovative algorithm backed by theoretical guarantees for achieving optimal results under ideal assumptions and providing a practical performance-gap result without such assumptions. Experimental results with Phi-2 and Mistral-7B demonstrate that iiREPO effectively achieves self-alignment using soft-label, self-generated responses and the logit of empirical AI annotators. Furthermore, our approach surpasses preference optimization baselines in evaluations using the Language Model Evaluation Harness and Multi-turn benchmarks.

2171Flat-LoRA: Low-Rank Adaption over a Flat Loss Landscape

[openreview] [pdf]

Abstract Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computational and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, provides an efficient way to fine-tune models by optimizing only a low-rank matrix. Despite recent progress made in improving LoRA’s performance, the connection between the LoRA optimization space and the original full parameter space is often overlooked. A solution that appears flat in the LoRA space may exist sharp directions in the full parameter space, potentially harming generalization performance. In this paper, we propose Flat-LoRA, an efficient approach that seeks a low-rank adaptation located in a flat region of the full parameter space. Instead of relying on the well-established sharpness-aware minimization approach, which can incur significant computational and memory burdens, we utilize random weight perturbation with a Bayesian expectation loss objective to maintain training efficiency and design a refined perturbation generation strategy for improved performance. Experiments on natural language processing and image classification tasks with various architectures demonstrate the effectiveness of our approach.

2172Playing the Fool: Jailbreaking Large Language Models with Out-of-Distribution Strategies

[openreview] [pdf]

Abstract Despite the remarkable versatility of Large Language Models (LLMs) and Multimodal-LLMs (MLLMs) to generalize across both language and vision tasks, LLMs and MLLMs have shown vulnerability to jailbreaking, generating textual outputs that undermine safety, ethical, and bias standards when exposed to harmful or sensitive inputs. With the recent advancement of safety-alignment via preference-tuning from human feedback, LLMs and MLLMs have been equipped with safety guardrails to yield safe, ethical, and fair responses with regard to harmful inputs. However, despite the significance of safety-alignment, research on the vulnerabilities remains largely underexplored. In this paper, we investigate the vulnerability of the safety-alignment, examining its ability to consistently provide safety guarantees for out-of-distribution(OOD)-ifying harmful inputs that may fall outside the aligned data distribution. Our key observation is that OOD-ifying the vanilla harmful inputs highly increases the uncertainty of the model to discern the malicious intent within the input, leading to a higher chance of being jailbroken. Exploiting this vulnerability, we propose JOOD, a new Jailbreak strategy via generating OOD-ifying inputs beyond the safety-alignment with diverse visual and textual transformation techniques. Specifically, even simple mixing-based techniques such as image mixup prove highly effective in OOD-ifying the harmful inputs by increasing the uncertainty of the model, thereby facilitating the bypass of the safety-alignment. Experimental results across diverse jailbreak scenarios demonstrate that JOOD effectively jailbreaks recent proprietary LLMs and MLLMs such as GPT-4 and GPT-4V with high attack success rate, which previous attack approaches have consistently struggled to jailbreak.

2173Fundamental Limitations on Subquadratic Alternatives to Transformers

[openreview] [pdf]

Abstract The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and had become the time bottleneck for transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost linear time alternative.In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time – whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason – cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.

2174Trajectory-LLM: A Language-based Data Generator for Trajectory Prediction in Autonomous Driving

[openreview] [pdf]

Abstract Vehicle trajectory prediction is a crucial aspect of autonomous driving, which requires extensive trajectory data to train prediction models to understand the complex, varied, and unpredictable patterns of vehicular interactions. However, acquiring real-world data is expensive, so we advocate using Large Language Models (LLMs) to generate abundant and realistic trajectories of interacting vehicles efficiently. These models rely on textual descriptions of vehicle-to-vehicle interactions on a map to produce the trajectories. We introduce Trajectory-LLM (Traj-LLM), a new approach that takes brief descriptions of vehicular interactions as input and generates corresponding trajectories. Unlike language-based approaches that translate text directly to trajectories, Traj-LLM uses reasonable driving behaviors to align the vehicle trajectories with the text. This results in an “interaction-behavior-trajectory” translation process. We have also created a new dataset, Language-to-Trajectory (L2T), which includes 240K textual descriptions of vehicle interactions and behaviors, each paired with corresponding map topologies and vehicle trajectory segments. By leveraging the L2T dataset, Traj-LLM can adapt interactive trajectories to diverse map topologies. Furthermore, Traj-LLM generates additional data that enhances downstream prediction models, leading to consistent performance improvements across public benchmarks.

2175FredNormer: Frequency Domain Normalization for Non-stationary Time Series Forecasting

[openreview] [pdf]

Abstract Recent normalization-based methods have shown great success in tackling the distribution shift issue, facilitating non-stationary time series forecasting. Since these methods operate in the time domain, they may fail to fully capture the dynamic patterns that are more apparent in the frequency domain, leading to suboptimal results. This paper first theoretically analyzes how normalization methods affect frequency components. We prove that the current normalization methods that operate in the time domain uniformly scale non-zero frequencies, and thus, they struggle to determine components that contribute to more robust forecasting. Therefore, we propose FredNormer, which observes datasets from a frequency perspective and adaptively up-weights the key frequency components. To this end, FredNormer consists of two components: a statistical metric that normalizes the input samples based on their frequency stability and a learnable weighting layer that adjusts stability and introduces sample-specific variations. Notably, FredNormer is a plug-and-play module, which does not compromise the efficiency compared to existing normalization methods. Extensive experiments show that FredNormer improves the averaged MSE of backbone forecasting models by 33.3% and 55.3% on the ETTm2 dataset. Compared to the baseline normalization methods, FredNormer achieves 18 top-1 results and 6 top-2 results out of 28 settings. Our code is available at:https://anonymous.4open.science/r/ICLR2025-13956-8F84

2176EMERGENCE OF GROUNDED, OPTIMALLY COMPOSITIONAL SPATIAL LANGUAGE AMONG HOMOGENEOUS AGENTS

[openreview] [pdf]

Abstract A mechanism of effective communication is integral to human existence. An essential aspect of a functional communication scheme among a rational human population involves an efficient, adaptive, and coherent apparatus to convey one’s goal to others. Such an effective macro characteristic can emerge in a finite population through adaptive learning via trial and error at the individual (micro) level, with nearly consistent individual learning faculty and experience across the population. In this paper, we study and hypothesize pertinent aspects of glossogenetics, specifically primal human communication mechanisms, through computational modeling. In particular, we model the process as a language game within the fabric of a decentralized, multi-agent deep reinforcement learning setting, where the agents with local learning and neural cognitive faculties interact through a series of dialogues. Our homogeneous agents seek to achieve the principle of least effort and overcome the poverty of stimulus through efficient concept selection, guided feedback and mirror learning. In our examinations, we observe the emergence of successful and structured communication among static and dynamic agent populations through consistent and continual learning.

2177Optimizing importance weighting in the presence of sub-population shifts

[openreview] [pdf]

Abstract A distribution shift between the training and test data can severely harm performance of machine learning models. Importance weighting addresses this issue by assigning different weights to data points during training. We argue that existing heuristics for determining the weights are suboptimal, as they neglect the increase of the variance of the estimated model due to the limited sample size of the training data. We interpret the optimal weights in terms of a bias-variance trade-off, and propose a bi-level optimization procedure in which the weights and model parameters are optimized simultaneously. We apply this framework to existing importance weighting techniques for last-layer retraining of deep neural networks in the presence of sub-population shifts and show empirically that optimizing weights significantly improves generalization performance.

2178Resolving Complex Social Dilemmas by Aligning Preferences with Counterfactual Regret

[openreview] [pdf]

Abstract Social dilemmas are situations where gains from cooperation are possible but misaligned incentives make it hard to find and stabilize prosocial joint behavior. In such situations selfish behaviors may harm the social good. In spatiotemporally complex social dilemmas, the barriers to cooperation that emerge from misaligned incentives interact with obstacles that stem from spatiotemporal complexity. In this paper, we propose a multi-agent reinforcement learning algorithm which aims to find cooperative resolutions for such complex social dilemmas. Agents maximize their own interests while also helping others, regardless of the actions their co-players take. This approach disentangles the causes of selfish reward from the causes of prosocial reward. Empirically, our method outperforms multiple baseline methods in several complex social dilemma environments.

2179Learning Label Distribution with Subtasks

[openreview] [pdf]

Abstract Label distribution learning (LDL) is a novel learning paradigm that emulates label polysemy by assigning label distributions over the label space. However, recent LDL work seems to exhibit a notable contradiction: 1) some existing LDL methods employ auxiliary tasks to enhance performance, which narrows their focus to specific domains, thereby lacking generalization capability; 2) conversely, LDL methods without auxiliary tasks rely on losses tailored solely to label distributions of the primary task, lacking additional supervised information to guide the learning process. In this paper, we propose S\mathcal{S}-LDL, a novel and minimalist solution that partitions the label distribution of the primary task into subtask label distributions, i.e., a form of pseudo-supervised information, to reconcile the above contradiction. S\mathcal{S}-LDL encompasses two key aspects: 1) an algorithm capable of generating subtasks without any extra knowledge, with subtasks deemed valid and reconstructable via our analysis; and 2) a plug-and-play framework seamlessly compatible with existing LDL methods, and even adaptable to derivative tasks of LDL. Experiments demonstrate that S\mathcal{S}-LDL is effective and efficient. To the best of our knowledge, this represents the first endeavor to address LDL via subtasks. The code will soon be available on GitHub to facilitate reproducible research.

2180Probabilistic Conformal Prediction with Approximate Conditional Validity

[openreview] [pdf]

Abstract We develop a new method for generating prediction sets that combines the flexibility of conformal methods with an estimate of the conditional distribution PYX\textup{P}_{Y \mid X}. Existing methods, such as conformalized quantile regression and probabilistic conformal prediction, usually provide only a marginal coverage guarantee. In contrast, our approach extends these frameworks to achieve approximately conditional coverage, which is crucial for many practical applications. Our prediction sets adapt to the behavior of the predictive distribution, making them effective even under high heteroscedasticity. While exact conditional guarantees are infeasible without assumptions on the underlying data distribution, we derive non-asymptotic bounds that depend on the total variation distance of the conditional distribution and its estimate. Using extensive simulations, we show that our method consistently outperforms existing approaches in terms of conditional coverage, leading to more reliable statistical inference in a variety of applications.

2181Fair Clustering in the Sliding Window Model

[openreview] [pdf]

Abstract We study streaming algorithms for proportionally fair clustering (a notion originally suggested by Chierichetti et al. (2017) in the sliding window model. We show that although there exist efficient streaming algorithms exist in the insertion-only model, surprisingly no algorithm can achieve finite ratio without violating the fairness constraint in sliding window. Hence, the problem of fair clustering is a rare separation between the insertion-only streaming model and the sliding window model. On the other hand, we show that if the fairness constraint by a multiplicative ε\varepsilon factor, there exists a (1+ε)(1 + \varepsilon)-approximate sliding window algorithm that uses poly(kε1logn)\text{poly}(k\varepsilon^{-1}\log n) space. This achieves essentially the best parameters (up to degree in the polynomial) provided the aforementioned lower bound. We also implement a number of empirical evaluations on real datasets to complement our theoretical results.

2182Proxy Denoising for Source-Free Domain Adaptation

[openreview] [pdf]

Abstract Source-Free Domain Adaptation (SFDA) aims to adapt a pre-trained source model to an unlabeled target domain with no access to the source data. Inspired by the success of large Vision-Language (ViL) models in many applications, the latest research has validated ViL’s benefit for SFDA by using their predictions as pseudo supervision. However, we observe that ViL’s supervision could be noisy and inaccurate at an unknown rate, potentially introducing additional negative effects during adaption. To address this thus-far ignored challenge, we introduce a novel Proxy Denoising (ProDe) approach. The key idea is to leverage the ViL model as a proxy to facilitate the adaptation process towards the latent domain-invariant space. Concretely, we design a proxy denoising mechanism to correct ViL’s predictions. This is grounded on a proxy confidence theory that models the dynamic effect of proxy’s divergence against the domain-invariant space during adaptation. To capitalize the corrected proxy, we further derive a mutual knowledge distilling regularization. Extensive experiments show that ProDe significantly outperforms the current state-of-the-art alternatives under both conventional closed-set setting and the more challenging open-set, partial-set and generalized SFDA settings. Our code will be released.

2183Closed-Loop Long-Horizon Robotic Planning via Equilibrium Sequence Modeling

[openreview] [pdf]

Abstract In the endeavor to make autonomous robots take actions, task planning is a major challenge that requires translating high-level task descriptions into long-horizon action sequences. Despite recent advances in language model agents, they remain prone to planning errors and limited in their ability to plan ahead. To address these limitations in robotic planning, we advocate a self-refining scheme that iteratively refines a draft plan until an equilibrium is reached. Remarkably, this process can be optimized end-to-end from an analytical perspective without the need to curate additional verifiers or reward models, allowing us to train self-refining planners in a simple supervised learning fashion. Meanwhile, a nested equilibrium sequence modeling procedure is devised for efficient closed-loop planning that incorporates useful feedback from the environment (or an internal world model). Our method is evaluated on the VirtualHome-Env benchmark, showing advanced performance with better scaling for inference computation. Code is available athttps://github.com/anonymous-iclr-2025/equilibrium-planner.

2184Wait, That’s Not an Option: LLM Robustness with Incorrect Multiple-Choice Options

[openreview] [pdf]

Abstract Decision-making under full alignment requires balancing between reasoning and faithfulness - a challenge for large language models (LLMs). This study explores whether LLMs prioritize following instructions over reasoning and truth when given “misleading” instructions, such as “Respond solely with A or B”, even when neither option is correct. We introduce a new metric called “reflective judgment”, which sheds new light on the relationship between the pre-training and post-training alignment schemes. In tasks ranging from basic arithmetic to domain-specific assessments, models like GPT-4o, o1-mini, or Claude 3 Opus adhered to instructions correctly but failed to reflect on the validity of the provided options. Contrary, models from the Llama 3.1 family (8B, 70B, 405B) or base Qwen2.5 (7B, 14B, 32B) families exhibit improved refusal rates with size, indicating a scaling effect. We also observed that alignment techniques, though intended to enhance reasoning, sometimes weakened the models’ ability to reject incorrect instructions, leading them to follow flawed prompts uncritically. Finally, we have also conducted a parallel human study revealing similar patterns in human behavior and annotations. We highlight how popular RLHF datasets might disrupt either training or evaluation due to annotations exhibiting poor reflective judgement.

2185A Generalist Hanabi Agent

[openreview] [pdf]

Abstract Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well on any other setting than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card-game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game-setting (e.g., 2-player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation- and action-space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents ---agents that are themselves unable to do so.

2186Cohort Squeeze: Beyond a Single Communication Round per Cohort in Cross-Device Federated Learning

[openreview] [pdf]

Abstract Virtually all federated learning (FL) methods, including FedAvg, operate in the following manner: i) an orchestrating server sends the current model parameters to a cohort of clients selected via certain rule, ii) these clients then independently perform a local training procedure (e.g., via SGD or Adam) using their own training data, and iii) the resulting models are shipped to the server for aggregation. This process is repeated until a model of suitable quality is found. A notable feature of these methods is that each cohort is involved in a single communication round with the server only. In this work we challenge this algorithmic design primitive and investigate whether it is possible to “squeeze more juice” out of each cohort than what is possible in a single communication round. Surprisingly, we find that this is indeed the case, and our approach leads to up to 74% reduction in the total communication cost needed to train a FL model in the cross-device setting. Our method is based on a novel variant of the stochastic proximal point method (SPPM-AS) which supports a large collection of client sampling procedures some of which lead to further gains when compared to classical client selection approaches.

2187Tokens on Demand: Token Condensation as Training-free Test-time Adaptation

[openreview] [pdf]

Abstract In this work, we introduce Token Condensation as Adaptation (TCA), a training-free approach designed to mitigate distribution shifts encountered by vision-language models (VLMs) during test-time inference. TCA bridges distribution gaps at the patch level by condensing image tokens that exhibit low attentiveness to the token. Recognizing the token may correspond to universal concepts, TCA identifies and tracks the most reliable tokens that align specifically with target classes from historical data streams. To achieve this, we propose a context token reservoir (CTR), which retains tokens with the lowest uncertainty as ``anchors" to guide the preservation of class-relevant tokens during inference. These anchors, in turn, act as token-level classifiers to correct VLM predictions and improve visual-text alignment. Utilizing anchors sampled from CTR, TCA condenses tokens through two operations: (1) pruning class-irrelevant tokens that consistently rank low across all attention heads to reach cross-head consensus on their irrelevance, and (2) merging the remaining class-ambiguous tokens into representative centers using coreset selection, maintaining linear computational complexity. As the first method to explore token efficiency in test-time adaptation, TCA consistently demonstrates superior performance across cross-dataset and out-of-distribution adaptation tasks, reducing GFLOPs by 12.2% to 48.9% while achieving accuracy improvements up to 21.4% against the strongest baseline without introducing additional parameters.

2188Meta ControlNet: Enhancing Task Adaptation via Meta Learning

[openreview] [pdf]

Abstract Diffusion-based image synthesis has attracted extensive attention recently. In particular, ControlNet that uses image-based prompts exhibits powerful capability in image tasks such as canny edge detection and generates images well aligned with these prompts. However, vanilla ControlNet generally requires extensive training of around 5000 steps to achieve a desirable control for a single task. Recent context-learning approaches have improved its adaptability, but mainly for edge-based tasks, and rely on paired examples. Thus, two important open issues are yet to be addressed to reach the full potential of ControlNet: (i) zero-shot control for certain tasks and (ii) faster adaptation for non-edge-based tasks. In this paper, we introduce a novel Meta ControlNet method, which adopts the task-agnostic meta learning technique and features a new layer freezing design. Meta ControlNet significantly reduces learning steps to attain control ability from 5000 to 1000. Further, Meta ControlNet exhibits direct zero-shot adaptability in edge-based tasks without any finetuning, and achieves control within only 100 finetuning steps in more complex non-edge tasks such as Human Pose.

2189Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning

[openreview] [pdf]

Abstract Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to lack of principled exploration mechanism within them. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, \emph{Langevin Soft Actor Critic} (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin QQ updates, parallel tempering for exploring multiple modes of the posterior of the QQ function and diffusion synthesized state-action samples regularized with QQ action gradients. Our extensive experiments demonstrate that LSAC outperforms or matches the performance of mainstream model-free RL algorithms for continuous control tasks. Notably, LSAC marks the first successful application of a Langevin Monte Carlo (LMC) based Thompson sampling in continuous control tasks with continuous action spaces, setting a new benchmark for future research in the field.

2190What Matters in Transformers? Not All Attention is Needed

[openreview] [pdf]

Abstract While scaling Transformer-based large language models (LLMs) has demonstrated promising performance across various tasks, it also introduces redundant archi- tectures, posing efficiency challenges for real-world deployment. Despite some recognition of redundancy in LLMs, the variability of redundancy across different architectures in transformers, such as MLP and Attention layers, is under-explored. In this work, we investigate redundancy across different modules within Trans- formers, including Blocks, MLP, and Attention layers, using a similarity-based metric. Surprisingly, despite the critical role of attention layers in distinguishing transformers from other architectures, we found that a large portion of these layers exhibit excessively high similarity and can be pruned without degrading perfor- mance. For instance, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop by pruning half of the attention layers. Furthermore, by tracing model checkpoints throughout the training process, we observed that attention layer redundancy is inherent and consistent across training stages. Additionally, we further propose a method that jointly drops Attention and MLP layers, allowing us to more aggressively drop additional layers. For instance, when dropping 31 layers (Attention + MLP), Llama-2-13B still retains 90% of the performance on the MMLU task. Our work provides valuable insights for future network architecture design. The code will be released upon acceptance.

2191Revisiting Random Walks for Learning on Graphs

[openreview] [pdf]

Abstract We revisit a recent model class for machine learning on graphs, where a random walk on a graph produces a machine-readable record, and this record is processed by a deep neural network to directly make vertex-level or graph-level predictions. We refer to these stochastic machines as random walk neural networks (RWNNs), and through principled analysis, show that we can design them to be isomorphism invariant while capable of universal approximation of graph functions in probability. A useful finding is that almost any kind of record of random walk guarantees probabilistic invariance as long as the vertices are anonymized. This enables us, for example, to record random walks in plain text and adopt a language model to read these text records to solve graph tasks. We further establish a parallelism to message passing neural networks using tools from Markov chain theory, and show that over-smoothing in message passing is alleviated by construction in RWNNs, while over-squashing manifests as probabilistic under-reaching. We empirically demonstrate RWNNs on a range of problems, verifying our theoretical analysis and demonstrating the use of language models for separating strongly regular graphs where the 3-WL test fails, and transductive classification on arXiv citation network.

2192Which Attention Heads Matter for In-Context Learning?

[openreview] [pdf]

Abstract Large language models (LLMs) exhibit impressive in-context learning (ICL) capability, enabling them to generate relevant responses from a handful of task demonstrations in the prompt. Prior studies have suggested two different explanations for the mechanisms behind ICL: induction heads that find and copy relevant tokens, and function vector (FV) heads whose activations compute a latent encoding of the ICL task. To better understand which of the two distinct mechanisms drives ICL, we study induction heads and FV heads in 12 language models.Our study reveals that in all 12 models, few-shot ICL is driven primarily by FV heads: ablating FV heads decreases few-shot ICL accuracy significantly more than ablating induction heads, especially in larger models. We also find that FV and induction heads are connected: many FV heads start as induction heads during training before transitioning to the FV mechanism. This leads us to speculate that induction heads facilitate the learning of the more complex FV mechanism for ICL. Finally, the prevalence of FV and induction heads varies with architecture, which questions strong versions of the “universality” hypothesis: findings from interpretability research are not always generalizable across models.

2193Can Data be Myopic? Outlier Detection in High-Dimensional Tabular Data via Subspaces

[openreview] [pdf]

Abstract Outlier detection in high-dimensional tabular data is an important task in data mining, essential for many downstream tasks and applications. Existing unsupervised outlier detection algorithms face one or more problems, including inlier assumption (IA), curse of dimensionality (CD), and multiple views (MV). To address these issues, we introduce Generative Subspace Adversarial Active Learning (GSAAL), a novel approach that uses a Generative Adversarial Network with multiple adversaries. These adversaries learn the marginal class probability functions over different data subspaces, while a single generator in the full space models the entire distribution of the inlier class. GSAAL is specifically designed to address the MV limitation while also handling the IA and CD, being the only method to do so. We provide a mathematical formulation of MV, convergence guarantees for the discriminators, and scalability results for GSAAL. Our extensive experiments demonstrate the effectiveness and scalability of GSAAL, highlighting its superior performance compared to other popular OD methods, especially in MV scenarios.

2194Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing

[openreview] [pdf]

Abstract Recent advancements in text-guided diffusion models have unlocked powerful image manipulation capabilities, yet balancing reconstruction fidelity and editability for real images remains a significant challenge. In this work, we introduce TaskOriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks by optimizing prompt embeddings within the extended P ∗ space. By leveraging distinct embeddings across different U-Net layers and time steps, TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability. This hierarchical editing mechanism categorizes tasks into structure, appearance, and global edits, optimizing only those embeddings unaffected by the current editing task. Extensive experiments on benchmark dataset reveal TODInv’s superior performance over existing methods, delivering both quantitative and qualitative enhancements while showcasing its versatility with few-step diffusion model.

2195ADAPT: Attentive Self-Distillation and Dual-Decoder Prediction Fusion for Continual Panoptic Segmentation

[openreview] [pdf]

Abstract Panoptic segmentation, which unifies semantic and instance segmentation into a single task, has witnessed considerable success on predefined tasks. However, traditional methods tend to struggle with catastrophic forgetting and poor generalization when learning from a continuous stream of new tasks. Continual learning, emerged to tackle these challenges, has garnered increasing attention in recent years. Nonetheless, our study reveals that existing continual panoptic segmentation (CPS) methods often suffer from efficiency or scalability issues. To address these limitations, we propose a novel dual-decoder framework that incorporates attentive self-distillation and prediction fusion to efficiently preserve prior knowledge while facilitating model generalization. Specifically, we freeze the majority of model weights up to the pixel decoder, which is shared between the teacher and student models, thus enabling efficient knowledge distillation with only a single forward pass. Attentive self-distillation then adaptively distills useful knowledge from the old classes without distracting from non-object regions, which mitigates the inherent bias toward newly learned tasks. Additionally, query-level fusion (QLF) is devised to seamlessly integrate the output of the dual decoders without incurring scale inconsistency. Crucially, the computational overhead of our approach remains nearly constant, regardless of the number of continual learning steps or the number of classes introduced at each step. Our method achieves state-of-the-art performance on the ADE20K benchmark.

2196Declarative characterizations of direct preference alignment algorithms

[openreview] [pdf]

Abstract Recent direct preference alignment algorithms (DPA), such as DPO, have shown great promise in aligning large language models to human preferences. While this has motivated the development on many new variants of the original DPO loss, understanding the the differences between these recent proposals, as well as developing new DPA loss functions, remains a formidable challenge given the lack of a technical and conceptual framework for reasoning about the underlying semantics of these algorithms. In this paper, we attempt to remedy this by formalizing DPA losses in terms of discrete reasoning problems. Specifically, we ask: Given an existing DPA loss, can we systematically derive a symbolic expression that characterizes its semantics? We show how this formal view of preference learning sheds new light on both the size and structure of DPA loss landscape, making it possible to not only rigorously characterize the relationships between recent proposals but to derive new loss functions from first principles. We also couple our formal findings with empirical results on a hitherto untested class of single model preference loss functions. Our experiments reveal interesting connections between symbolic constraint complexity and the empirical success and training dynamics of the corresponding losses, insights we believe can give useful guidance to AI practitioners working on AI alignment.

2197Causal Information Prioritization for Efficient Reinforcement Learning

[openreview] [pdf]

Abstract Current Reinforcement Learning (RL) methods often suffer from sample-inefficiency, resulting from blind exploration strategies that neglect causal relationships among states, actions, and rewards. Although recent causal approaches aim to address this problem, they lack grounded modeling of reward-guided causal understanding of states and actions for goal-orientation, thus impairing learning efficiency. To tackle this issue, we propose a novel method named Causal Information Prioritization (CIP) that improves sample efficiency by leveraging factored MDPs to infer causal relationships between different dimensions of states and actions with respect to rewards, enabling the prioritization of causal information. Specifically, CIP identifies and leverages causal relationships between states and rewards to execute counterfactual data augmentation to prioritize high-impact state features under the causal understanding of the environments. Moreover, CIP integrates a causality-aware empowerment learning objective, which significantly enhances the agent’s execution of reward-guided actions for more efficient exploration in complex environments. To fully assess the effectiveness of CIP, we conduct extensive experiments across 39 tasks in 5 diverse continuous control environments, encompassing both locomotion and manipulation skills learning with pixel-based and sparse reward settings. Experimental results demonstrate that CIP consistently outperforms existing RL methods across a wide range of scenarios.

2198Efficient Adaptive Federated Optimization

[openreview] [pdf]

Abstract Adaptive optimization plays a pivotal role in federated learning, where simultaneous server and client-side adaptivity have been shown to be essential for maximizing its performance. However, the scalability of jointly adaptive systems is often constrained by limited resources in communication and memory. In this paper, we introduce a class of efficient adaptive algorithms, named FedAda2FedAda^2, designed specifically for large-scale, cross-device federated environments. FedAda2FedAda^2 optimizes communication efficiency by avoiding the transfer of preconditioners between the server and clients, while simultaneously utilizing memory-efficient adaptive optimizers on the client-side to reduce extra on-device memory cost. Theoretically, we demonstrate that FedAda2FedAda^2 achieves the same convergence rates for general, non-convex objectives as its more resource-intensive counterparts that directly integrate joint adaptivity. Empirically, we showcase the benefits of joint adaptivity and the effectiveness of FedAda2FedAda^2 on both image and text datasets.

2199Adaptive Strategy Evolution for Generating Tailored Jailbreak Prompts against Black-Box Safety-Aligned LLMs

[openreview] [pdf]

Abstract While safety-aligned Large Language Models (LLMs) have been secured by extensive alignment with human feedback, they remain vulnerable to jailbreak attacks that exploit prompt manipulation to generate harmful outputs. Investigating these jailbreak methods, particularly in black-box scenarios, allows us to explore the inherent limitations of such LLMs and provides insights into possible improvements. However, existing black-box jailbreak methods either overly rely on red-teaming LLMs to execute sophisticated reasoning tasks, such as diagnosing failure cases, determining improvement directions, and rewriting prompts, which pushes them beyond their inherent capabilities and introduces uncertainty and inefficiency into the refinement process, or they are confined to rigid, manually predefined strategy spaces, limiting their performance ceiling. To enable a sustained and deterministic exploration with clear directional guidance, we propose the novel Adaptive Strategy Evolution (ASE) framework. Specifically, ASE innovatively decomposes jailbreak strategies into modular key components, dramatically enhancing both the flexibility and expansiveness of the strategy space. This also allows us to shift focus from directly optimizing prompts to optimizing the jailbreak strategies. Then, by leveraging a genetic algorithm (GA) for strategy components’ selection and mutation, ASE could replace the uncertainties of LLM-based self-adjustment with a more systematic and deterministic optimization process. Additionally, we have also designed a new fitness evaluation, that emphasizes the independence of scoring criteria, provides highly accurate and reliable feedback, enabling precise and targeted refinement of jailbreak strategies. Experimental results further demonstrate that ASE achieves superior jailbreak success rates (JSR) compared to existing state-of-the-art methods, especially against the most advanced safety-aligned LLMs like GPT-4o, Claude-3.5, and even o1.

2200InvestAlign: Align LLMs with Investor Decision-Making under Herd Behavior

[openreview] [pdf]

Abstract Studying investor decision-making processes under herd behavior is of great significance in microeconomics and behavioral finance. Large Language Models (LLMs) can be leveraged to assist in solving complex investment problems. However, the investment decisions generated by existing LLMs often deviate from real-user data. One method to align LLMs with investor decision-making processes is Supervised Fine-Tuning (SFT), which requires a substantial amount of real-user data that is costly to collect and raises concerns about privacy and security. In this work, we propose InvestAlign, a low-cost and high-quality method that constructs large-scale SFT training datasets based on theoretical solutions to a similar and simpler optimal investment problem, rather than the original complex one. We theoretically demonstrate that fine-tuning LLMs with these datasets leads to faster parameter convergence compared to using real-user data. By fine-tuning LLMs, we obtain InvestAgents, which align more closely with real-user data than pre-SFT LLMs in both the simple and original complex problems. This highlights InvestAlign as a promising approach with the potential to address complex optimal investment problems and align LLMs with investor decision-making processes in economics and finance.

2201INSTRUCTION-FOLLOWING LLMS FOR TIME SERIES PREDICTION: A TWO-STAGE MULTIMODAL APPROACH

[openreview] [pdf]

Abstract We introduce Text-Informed Time Series Prediction (TITSP), an innovative multimodal framework that integrates textual knowledge with temporal dynamics using Large Language Models (LLMs). TITSP employs a two-stage process that bridges numerical data with rich contextual information for enhanced forecasting accuracy and interpretability.In the first stage, we present AutoPrompter, which captures temporal dependencies from time series data and aligns them with semantically meaningful text embeddings.In the second stage, these aligned embeddings are refined by incorporating task-specific textual instructions through LLM. We evaluate TITSP on several multimodal time series prediction tasks, demonstrating substantial improvements over state-of-the-art baselines. Quantitative results reveal significant gains in predictive performance, while qualitative analyses show that textual context enhances interpretability and actionable insights. Our findings indicate that integrating multimodal inputs not only improves prediction accuracy but also fosters more intuitive, user-centered forecasting

2202Large Language Models Assume People are More Rational than We Really are

[openreview] [pdf]

Abstract In order for AI systems to communicate effectively with people, they must understand how we make decisions. However, people’s decisions are not always rational, so the implicit internal models of human decision-making in Large Language Models (LLMs) must account for this. Previous empirical evidence seems to suggest that these implicit models are accurate --- LLMs offer believable proxies of human behavior, acting how we expect humans would in everyday interactions. However, by comparing LLM behavior and predictions to a large dataset of human decisions, we find that this is actually not the case: when both simulating and predicting people’s choices, a suite of cutting-edge LLMs (GPT-4o & 4-Turbo, Llama-3-8B & 70B, Claude 3 Opus) assume that people are more rational than we really are. Specifically, these models deviate from human behavior and align more closely with a classic model of rational choice --- expected value theory. Interestingly, people also tend to assume that other people are rational when interpreting their behavior. As a consequence, when we compare the inferences that LLMs and people draw from the decisions of others using another psychological dataset, we find that these inferences are highly correlated. Thus, the implicit decision-making models of LLMs appear to be aligned with the human expectation that other people will act rationally, rather than with how people actually act.

2203Mitigating Graph Covariate Shift via Score-based Out-of-distribution Augmentation

[openreview] [pdf]

Abstract Distribution shifts between training and testing datasets significantly impair the model performance on graph learning. A commonly-taken causal view in graph invariant learning suggests that stable features of graphs are causally associated with labels, whereas unstable environmental features lead to distribution shifts. In particular, covariate shifts caused by unseen environmental features in test graphs underscore the critical need for out-of-distribution (OOD) generalization. Existing graph augmentation methods designed to address the covariate shift often disentangle the stable and environmental features in the input space, and selectively perturb or mixup the environmental features. However, such perturbation-based methods heavily rely on an accurate separation of stable and environmental features, and their exploration ability is confined to existing environmental features in the training distribution. To overcome these limitations, we introduce a novel approach using score-based graph generation strategies that synthesize unseen environmental features while preserving the validity and stable features of overall graph patterns. Our comprehensive empirical evaluations demonstrate the enhanced effectiveness of our method in improving graph OOD generalization.

2204ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination

[openreview] [pdf]

Abstract Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve exploration efficiency. However, the planning process of LLMs is limited within texts and it is difficult to represent the spatial occupancy and geometry layout only by texts. Both are important for making rational navigation decisions. In this work, we seek to unleash the spatial perception and planning ability of Vision-Language Models (VLMs), and explore whether the VLM, with only on-board camera captured RGB/RGB-D stream inputs, can efficiently finish the visual navigation tasks in a mapless manner. We achieve this by developing the imagination-powered navigation framework ImagineNav, which imagines the future observation images at valuable robot views and translates the complex navigation planning process into a rather simple best-view image selection problem for VLM. To generate appropriate candidate robot views for imagination, we introduce the Where2Imagine module, which is distilled to align with human navigation habits. Finally, to reach the VLM preferred views, an off-the-shelf point-goal navigation policy is utilized. Empirical experiments on the challenging open-vocabulary object navigation benchmarks demonstrates the superiority of our proposed system.

2205Instance-Aware Graph Prompt Learning

[openreview] [pdf]

Abstract Graph neural networks stand as the predominant technique for graph representation learning owing to their strong expressive power, yet the performance highly depends on the availability of high-quality labels in an end-to-end manner. Thus the pretraining and fine-tuning paradigm has been proposed to mitigate the label cost issue. Subsequently, the gap between the pretext tasks and downstream tasks has spurred the development of graph prompt learning which inserts a set of graph prompts into the original graph data with minimal parameters while preserving competitive performance. However, the current exploratory works are still limited since they all concentrate on learning fixed task-specific prompts which may not generalize well across the diverse instances that the task comprises. To tackle this challenge, we introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper, aiming to generate distinct prompts tailored to different input instances. The process involves generating intermediate prompts for each instance using a lightweight architecture, quantizing these prompts through trainable codebook vectors, and employing the exponential moving average technique to ensure stable training. Extensive experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.

2206Can Knowledge Editing Really Correct Hallucinations?

[openreview] [pdf]

Abstract Large Language Models (LLMs) suffer from hallucinations, referring to the non-factual information in generated content, despite their superior capacities across different tasks. Meanwhile, knowledge editing has become a burgeoning paradigm that is designed to correct the erroneous factual knowledge encoded in LLMs for its advantage of avoiding retraining from scratch. However, one common issue of existing evaluation datasets for knowledge editing is that they do not ensure LLMs actually generate hallucinated answers to the evaluation questions before editing. When LLMs are evaluated on such datasets after being edited by different techniques, it is hard to directly adopt the performance to assess the effectiveness of different knowledge editing methods in correcting hallucinations. Thus, the fundamental question remains insufficiently validated: Can knowledge editing really correct hallucinations in LLMs? Then, we proposed HalluEditBench to holistically benchmark knowledge editing methods in correcting real-world hallucinations. First, we rigorously construct a massive hallucination dataset with 9 domains, 26 topics and more than 6,000 hallucinations. Then, we assess the performance of knowledge editing methods in a holistic way on five dimensions including Efficacy, Generalization, Portability, Locality, and Robustness. Through HalluEditBench, we have provided new insights into the potentials and limitations of different knowledge editing methods in correcting hallucinations, which could inspire more future improvements and facilitate the progress in the field of knowledge editing. Data and code are available at here.

2207Rethinking Shapley Value for Negative Interactions in Non-convex Games

[openreview] [pdf]

Abstract We study causal interaction for payoff allocation in cooperative game theory, including quantifying feature attribution for deep learning models. Most feature attribution methods mainly stem from the criteria from the Shapley value, which provides a unique payoff vector for players by marginalizing contributions in a cooperative game. However, interactions between players in the game do not exactly appear in the original formulation of the Shapley value. In this work, we clarify the role of interactions in computing the Shapley value by reformulation and discuss implicit assumptions from a game-theoretical perspective. Our theoretical analysis demonstrates that classical payoff allocation in a cooperative game assumes the convexity of the game, which is equivalent to non-negative interactions between players. When negative interactions exist, common in deep learning models, attributions or payoffs can be underrated by the efficiency axiom in this classical setup. We suggest a new allocation rule that decomposes contributions into interactions and aggregates positive parts for non-convex games. Furthermore, we propose an approximation algorithm to reduce the cost of interaction computation which can be applied for differentiable functions such as deep learning models. Our approach mitigates counter-intuitive phenomena where even features highly relevant to the decision are assigned low attribution in the previous approaches.

[openreview] [pdf]

Abstract Understanding the intrinsic causal structure of time-series data is crucial for effective real-world interventions and decision-making. While several studies address the Time-Series Causal Discovery (TSCD) problem, the lack of high-quality datasets may limit the progress and evaluation of new methodologies. Many available datasets are derived from simplistic simulations, while real-world datasets are often limited in quantity, variety, and lack of ground-truth knowledge describing temporal causal relations. In this paper, we propose CausalDiffusion, the first diffusion model capable of generating multiple causally related time-series alongside a ground-truth causal graph, which abstracts their mutual temporal dependencies. CausalDiffusiom employs a causal reconstruction of the output time-series, allowing it to be trained exclusively on time-series data. Our experiments demonstrate that CausalDiffusion outperforms state-of-the-art methods in generating realistic time-series, with causal graphs that closely resemble those of real-world phenomena. Finally, we provide a benchmark of widely used TSCD algorithms, highlighting the benefits of our synthetic data with respect to existing solutions.

2209On Logical Extrapolation for Mazes with Recurrent and Implicit Networks

[openreview] [pdf]

Abstract Recent work has suggested that certain neural network architectures---particularly recurrent neural networks (RNNs) and implicit neural networks (INNs)--- are capable oflogical extrapolation. That is, one may train such a network on easy instances of a specific task and then apply it successfully to more difficult instances of the same task. In this paper, we revisit this idea and show that (i) The capacity for extrapolation is less robust than previously suggested. Specifically, in the context of a maze-solving task, we show that while INNs (and some RNNs) are capable of generalizing to larger maze instances, they fail to generalize along axes of difficulty other than maze size. (ii) Models that are explicitly trained to converge to a fixed point (e.g. the INN we test) are likely to do so when extrapolating, while models that are not (e.g. the RNN we test) may exhibit more exotic limiting behaviour such as limit cycles,even whenthey correctly solve the problem. Our results suggest that (i) further study intowhysuch networks extrapolate easily along certain axes of difficulty yet struggle with others is necessary, and (ii) analyzing thedynamicsof extrapolation may yield insights into designing more efficient and interpretable logical extrapolators.

2210Enhancing Graph Of Thought: Enhancing Prompts with LLM Rationales and Dynamic Temperature Control

[openreview] [pdf]

Abstract We introduce Enhancing Graph Of Thoughts (EGOT), a method designed to enhance the performance of large language models (LLMs) on complex reasoning tasks. EGOT automates the process of generating accurate responses using given data and a base prompt. The process consists of several steps: It obtain an initial response from the answering node using the base prompt. Evaluation node evaluates the response and generates reasoning for it, utilizing the score’s probabilities to enhance evaluation accuracy. The reasoning from both the answering node and the evaluation node is aggregated to identify the problem in the response. This aggregated reasoning is incorporated into the base prompt to obtain an enhanced response. These steps are organized in a graph architecture, where the final leaf nodes are merged to produce a final response. As the graph descends, the temperature is lowered using Cosine Annealing and scoring, to explore diverse responses with earlier nodes and to focus on precise responses with later nodes. The minimum temperature in Cosine Annealing is adjusted based on scoring, ensuring that nodes with low scores continue to explore diverse responses, while those with high scores confirm accurate responses. In sorting 256 elements using GPT-4o mini, EGOT performs 88.31% accuracy, respectively, GoT (Graph Of Thought) performance has 84.37%. In the frozen lake problem using GPT-4o, EGOT averages 0.55 jumps or falls into the hole, while TOT (Tree Of Thoughts) averages 0.89.

2211Differentially private learners for heterogeneous treatment effects

[openreview] [pdf]

Abstract Patient data is widely used to estimate heterogeneous treatment effects and understand the effectiveness and safety of drugs. Yet, patient data includes highly sensitive information that must be kept private. In this work, we aim to estimate the conditional average treatment effect (CATE) from observational data under differential privacy. Specifically, we present DP-CATE, a novel framework for CATE estimation that isdoubly robustand ensuresdifferential privacyof the estimates. For this, we build upon non-trivial tools from semi-parametric and robust statistics to exploit the connection between privacy and model robustness. Our framework is highly general and applies to any two-stage CATE meta-learner with a Neyman-orthogonal loss function. It can be used with all machine learning models employed for nuisance estimation. We further provide an extension of DP-CATE where we employ RKHS regression to release the complete doubly robust CATE function while ensuring differential privacy. We demonstrate the effectiveness of DP-CATE across various experiments using synthetic and real-world datasets. To the best of our knowledge, we are the first to provide a framework for CATE estimation that is doubly robust and differentially private.

2212Stochastic Online Conformal Prediction with Semi-Bandit Feedback

[openreview] [pdf]

Abstract Conformal prediction has emerged as an effective strategy for uncertainty quantification by modifying a model to output sets of labels instead of a single label. These prediction sets come with the guarantee that they contain the true label with high probability. However, conformal prediction typically requires a large calibration dataset of i.i.d. examples. We consider the online learning setting, where examples arrive over time, and the goal is to construct prediction sets dynamically. Departing from existing work, we assume semi-bandit feedback, where we only observe the true label if it is contained in the prediction set. For instance, consider calibrating a document retrieval model to a new domain; in this setting, a user would only be able to provide the true label if the target document is in the prediction set of retrieved documents. We propose a novel conformal prediction algorithm targeted at this setting, and prove that it obtains sublinear regret compared to the optimal conformal predictor. We evaluate our algorithm on a retrieval task, an image classification task, and an auction price-setting task, and demonstrate that it empirically achieves good performance compared to several baselines.

2213RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards

[openreview] [pdf]

Abstract Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. To adapt LLMs for RAG pipelines, current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. This supervised fine-tuning (SFT) approach focuses on equipping LLMs to handle diverse RAG tasks using different instructions. However, it trains RAG modules to overfit training signals and overlooks the varying data preferences among agents within the RAG system. In this paper, we propose a Differentiable Data Rewards (DDR) method, which end-to-end trains RAG systems by aligning data preferences between different RAG modules. DDR works by collecting the rewards to optimize each agent with a rollout method. This method prompts agents to sample some potential responses as perturbations, evaluates the impact of these perturbations on the whole RAG system, and subsequently optimizes the agent to produce outputs that improve the performance of the RAG system. Our experiments on various knowledge-intensive tasks demonstrate that DDR significantly outperforms the SFT method, particularly for LLMs with smaller-scale parameters that depend more on the retrieved knowledge. Additionally, DDR exhibits a stronger capability to align the data preference between RAG modules. The DDR method makes generation module more effective in extracting key information from documents and mitigating conflicts between parametric memory and external knowledge. All codes will be released via GitHub.

2214Detecting Out-of-Distribution through the Lens of Neural Collapse

[openreview] [pdf]

Abstract Out-of-Distribution (OOD) detection is essential for safe deployment; however, existing detectors exhibit generalization discrepancies and cost concerns. To address this, we propose a highly versatile and efficient OOD detector inspired by the trend of Neural Collapse on practical models, without requiring complete collapse. By analyzing this trend, we discover that features of in-distribution (ID) samples cluster closer to the weight vectors compared to features of OOD samples. Additionally, we reveal that ID features tend to expand in space to structure a simplex Equiangular Tight Framework, which explains the prevalent observation that ID features reside further from the origin than OOD features. Taking both insights from Neural Collapse into consideration, our OOD detector utilizes feature proximity to weight vectors and further complements this perspective by using feature norms to filter OOD samples. Extensive experiments on \emph{off-the-shelf} models demonstrate the efficiency and effectiveness of our OOD detector across diverse classification tasks and model architectures, mitigating generalization discrepancies and improving \emph{overall} performance.

2215Sequence Denoising with Self-Augmentation for Knowledge Tracing

[openreview] [pdf]

Abstract Knowledge tracing (KT) aims to predict students’ future knowledge levels based on their historical interaction sequences. Most KT methods rely on interaction data between students and questions to assess knowledge states and these approaches typically assume that the interaction data is reliable. In fact, on the one hand, factors such as guessing or slipping could inevitably bring in noise in sequences. On the other hand, students’ interaction sequences are often sparse, which could amplify the impact of noise, further affecting the accurate assessment of knowledge states. Although data augmentation which is always adopted in KT could alleviate data sparsity, it also brings noise again during the process. Therefore, denoising strategy is urgent and it should be employed not only on the original sequences but also on the augmented sequences. To achieve this goal, we adopt a plug and play denoising framework in our method. The denoising technique is adopted not only on the original and the enhanced sequences separately during the data augmentation process, but also we explore the hard noise through the comparison between the two streams. During the denoising process, we employ a novel strategy for selecting data samples to balance the hard and soft noise leveraging Singular Value Decomposition (SVD). This approach optimizes the ratio of explicit to implicit denoising and combines them to improve feature representation. Extensive experiments on four real-world datasets demonstrate that our method not only enhances accuracy but also maintains model interpretability.

2216TimeCAT: Hierarchical Context-Aware Transformer with Dynamic Grouping for Time Series Forecasting

[openreview] [pdf]

Abstract Transformer-based models have achieved significant success in time series forecasting by modeling global dependencies through self-attention mechanisms. However, these models often rely on fixed patch settings with locality constraints, tokenizing time series into spatially connected sub-series. This approach can hinder the capture of semantic relationships and lead to computational inefficiencies, especially when dealing with long sequences with complex temporal dependencies. In this work, we introduce \textbf{TimeCAT}—a \underline{Time} series \underline{C}ontext-\underline{A}ware \underline{T}ransformer that dynamically groups input sequences into semantically coherent groups, enabling efficient modeling of both local and global dependencies. By appending group and global tokens, TimeCAT facilitates fine-grained information exchange through a novel \emph{Context-Aware Mixing Block}, which utilizes self-attention and MLP mixing operations. This hierarchical approach efficiently models long sequences by processing inputs in structured contexts, reducing computational overhead without sacrificing accuracy. Experiments on several challenging real-world datasets demonstrate that TimeCAT achieves consistent state-of-the-art performance, significantly improving forecasting accuracy and computational efficiency over existing methods. This advancement enhances the Transformer family with improved performance, generalization ability, and better utilization of sequence information.

2217Attack on LLMs: LoRA Once, Backdoor Everywhere in the Share-and-Play Ecosystem

[openreview] [pdf]

Abstract Finetuning large language models (LLMs) with LoRA has gained significant popularity due to its simplicity and effectiveness. Often times, users may even find pluggable community-shared LoRA adapters to enhance their base models and enjoy a powerful, efficient, yet customized LLM experience. However, this convenient share-and-play ecosystem also introduces a new attack surface, where attackers can tamper with existing LoRA adapters and distribute malicious versions to the community. Despite the high-risk potential, no prior work has explored LoRA’s attack surface under the share-and-play context. In this paper, we address this gap by investigating how backdoors can be injected into task-enhancing LoRA adapters and studying the mechanisms of such infection. We demonstrate that with a simple but specific recipe, a backdoor-infected LoRA can be trained once, then directly merged with multiple LoRA adapters finetuned on different tasks while retaining both its malicious and benign capabilities; which enables attackers to distribute compromised LoRAs at scale with minimal effort. Our work highlights the need for heightened security awareness in the LoRA ecosystem. Warning: the paper contains potentially offensive content generated by models.

2218The Pitfalls of Memorization: When Memorization Hurts Generalization

[openreview] [pdf]

Abstract Neural networks often learn simple explanations that fit the majority of the data while memorizing exceptions that deviate from these explanations. This leads to poor generalization if the learned explanations are spurious. In this work, we formalize the interplay between memorization and generalization\textit{the interplay between memorization and generalization}, showing that spurious correlations would particularly lead to poor generalization when are combined with memorization. Memorization can reduce the training loss to zero, leaving no incentive for learning robust, generalizable patterns. To address this issue, we introduce memorization-aware training\textit{memorization-aware training} (MAT). MAT leverages the flip side of memorization by using held-out predictions to shift a model’s logits, guiding it towards learning robust patterns that remain invariant from training to test, thereby enhancing generalization under distribution shifts.

2219UrbanPlanBench: A Comprehensive Assessment of Urban Planning Abilities in Large Language Models

[openreview] [pdf]

Abstract Urban planning is a professional discipline that shapes our daily surroundings, which demands multifaceted domain knowledge and relies heavily on human expertise. The advent of Large Language Models (LLMs) holds promise for revolutionizing such a field by the pre-trained world knowledge. However, the extent to which these models can assist human practitioners remains largely unexplored. In this paper, we introduce a comprehensive benchmark, PlanBench, tailored to evaluate the efficacy of LLMs in urban planning, which encompasses fundamental principles, professional knowledge, and management and regulations, aligning closely with the qualifications expected of human planners. Through extensive evaluation, we reveal a significant imbalance in the acquisition of planning knowledge among LLMs, with even the most proficient models falling short of meeting professional standards. For instance, we observe that 70% of LLMs achieve subpar performance in understanding planning regulations compared to other aspects. Besides the benchmark, we present the largest-ever supervised fine-tuning (SFT) dataset, PlanText, for LLMs in urban planning, comprising over 30,000 instruction pairs sourced from urban planning exams and textbooks. Our findings demonstrate that fine-tuned models exhibit enhanced performance in memorization tests and comprehension of urban planning knowledge, while there exists significant room for improvement, particularly in tasks requiring domain-specific terminology and reasoning. Our benchmark, dataset, and associated evaluation and fine-tuning toolsets aim to catalyze the integration of LLMs into practical urban computing, fostering a symbiotic relationship between human expertise and machine intelligence.

2220Trajectory-Class-Aware Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract In the context of multi-agent reinforcement learning,generalizationis a challenge to solve various tasks that may require different joint policies or coordination without relying on policies specialized for each task. We refer to this type of problem as amulti-goal task, and we train agents to be versatile in this multi-goal task through a single training process. To address this challenge, we introduce TRajectory-class-Aware Multi-Agent reinforcement learning (TRAMA). In TRAMA, agents recognize a task type by identifying the class of trajectories they are experiencing through partial observations, and the agents use this trajectory awareness or prediction as additional information for action policy. To this end, we introduce three primary objectives in TRAMA: (a) constructing a quantized latent space to generate trajectory embeddings that reflect key similarities among them; (b) conducting trajectory clustering using these trajectory embeddings; and (c) building a trajectory-class-aware policy. Specifically for (c), we introduce a trajectory-class predictor that performs agent-wise predictions on the trajectory class; and we design a trajectory-class representation model for each trajectory class. Each agent takes actions based on this trajectory-class representation along with its partial observation for task-aware execution. The proposed method is evaluated on various tasks including multi-goal tasks built upon StarCraft II. Empirical results show further performance improvements over state-of-the-art baselines.

2221A Truncated Newton Method for Optimal Transport

[openreview] [pdf]

Abstract Developing a contemporary optimal transport (OT) solver requires navigating trade-offs among several critical requirements: GPU parallelization, scalability to high-dimensional problems, theoretical convergence guarantees, empirical performance in terms of precision versus runtime, and numerical stability in practice. With these challenges in mind, we introduce a specialized truncated Newton algorithm for entropic regularized OT. In addition to proving that locally quadratic convergence is possible without assuming a Lipschitz Hessian, we provide strategies to maximally exploit the high rate of local convergence in practice. Our GPU-parallel algorithm exhibits exceptionally favorable runtime performance, achieving high precision orders of magnitude faster than many existing alternatives. This is evidenced by wall-clock time experiments on 4096-dimensional MNIST and color transfer problems. The scalability of the algorithm is showcased on an extremely large OT problem with n106n \approx 10^6, solved approximately under weak entopric regularization.

2222Towards a General Time Series Anomaly Detector with Adaptive Bottlenecks and Dual Adversarial Decoders

[openreview] [pdf]

Abstract Time series anomaly detection plays a vital role in a wide range of applications. Existing methods require training one specific model for each dataset, which exhibits limited generalization capability across different target datasets, hindering anomaly detection performance in various scenarios with scarce training data. Aiming at this problem, we propose constructing a general time series anomaly detection model, which is pre-trained on extensive multi-domain datasets and can subsequently apply to a multitude of downstream scenarios. The significant divergence of time series data across different domains presents two primary challenges in building such a general model: (1) meeting the diverse requirements of appropriate information bottlenecks tailored to different datasets in one unified model, and (2) enabling distinguishment between multiple normal and abnormal patterns, both are crucial for effective anomaly detection in various target scenarios. To tackle these two challenges, we propose a General time series anomaly Detector with Adaptive Bottlenecks and Dual Adversarial Decoders (DADA), which enables flexible selection of bottlenecks based on different data and explicitly enhances clear differentiation between normal and abnormal series. We conduct extensive experiments on nine target datasets from different domains. After pre-training on multi-domain data, DADA, serving as a zero-shot anomaly detector for these datasets, still achieves competitive or even superior results compared to those models tailored to each specific dataset.

2223Adam-mini: Use Fewer Learning Rates To Gain More

[openreview] [pdf]

Abstract We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., 1/v1/\sqrt{v}). By delving into the Hessian structure of neural nets, we find Adam’s vv might not function at its full potential as effectively as we expected. We find that 90\geq 90% of these learning rates in vv could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We then provide one simple way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 39M to 13B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama 2-7B on 2×2\times A800-80GB GPUs, which saves 33% wall-clock time for pre-training.

2224Truncated Consistency Models

[openreview] [pdf]

Abstract Consistency models have recently been introduced to accelerate the generation speed of diffusion models by directly predicting the solution (data) of the probability flow ODE (PF ODE) from initial noise. However, the training of consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints. This task is much more challenging than the ultimate objective of one-step generation, which only concerns the PF ODE’s noise-to-data mapping. We empirically find that this training paradigm limits the one-step generation performance of consistency models. To address this issue, we generalize consistency training to the truncated time range, which allows the model to ignore denoising tasks at earlier time steps and focus its capacity on generation. We propose a new parameterization of the consistency function and a two-stage training procedure that prevent the truncated-time training from collapsing to a trivial solution. Experiments on CIFAR-10 and ImageNet 64×6464\times64 datasets show that our method achieves better one-step and two-step FIDs than the state-of-the-art consistency models such as iCT-deep, using more than 2×\times smaller networks.

2225Rapid Selection and Ordering of In-Context Demonstrations via Prompt Embedding Clustering

[openreview] [pdf]

Abstract While Large Language Models (LLMs) excel at in-context learning (ICL) using just a few demonstrations, their performances are sensitive to demonstration orders. The reasons behind this sensitivity remain poorly understood. In this paper, we investigate the prompt embedding space to bridge the gap between the order sensitivity of ICL with inner workings of decoder-only LLMs, uncovering the clustering property: prompts sharing the first and last demonstrations have closer embeddings. We explain this property through extensive theoretical analyses and empirical evidences. Our finding suggests that the positional encoding and the causal attention mask are key contributors to the clustering phenomenon. Leveraging this clustering insight, we introduce Cluster-based Search, a novel method that accelerates the selection and ordering of demonstrations in self-adaptive ICL settings. Our approach substantially decreases the time complexity from factorial to quadratic, saving 92% to nearly 100% execution time while maintaining comparable performance to exhaustive search.

2226Wasserstein-Regularized Conformal Prediction under General Distribution Shift

[openreview] [pdf]

Abstract Conformal prediction yields a prediction set with guaranteed 1α1-\alpha coverage of the true target under the i.i.d. assumption, which can fail and lead to a gap between 1α1-\alpha and the actual coverage. Prior studies bound the gap using total variation distance, which cannot identify the gap changes under distribution shift at different α\alpha, thus serving as a weak indicator of prediction set validity. Besides, existing methods are mostly limited to covariate shifts, while general joint distribution shifts are more common in practice but less researched. In response, we first propose a Wasserstein distance-based upper bound of the coverage gap and analyze the bound using probability measure pushforwards between the shifted joint data and conformal score distributions, enabling a separation of the effect of covariate and concept shifts over the coverage gap. We exploit the separation to design algorithms based on importance weighting and regularized representation learning (WR-CP) to reduce the Wasserstein bound with a finite-sample error bound. WR-CP achieves a controllable balance between conformal prediction accuracy and efficiency. Experiments on six datasets prove that WR-CP can reduce coverage gaps to 3.1% across different confidence levels and outputs prediction sets 38% smaller than the worst-case approach on average.

2227Horizon-Length Prediction: Advancing Fill-in-the-Middle Capabilities for Code Generation with Lookahead Planning

[openreview] [pdf]

Abstract Fill-in-the-Middle (FIM) has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which reorders original training sequences and then performs regular next-token prediction (NTP), often leads to models struggling to generate content that aligns smoothly with the surrounding context. Crucially, while existing works rely on rule-based post-processing to circumvent this weakness, such methods are not practically usable in open-domain code completion tasks as they depend on restrictive, dataset-specific assumptions (e.g., generating the same number of lines as in the ground truth). Moreover, model performance on FIM tasks deteriorates significantly without these unrealistic assumptions.We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens (i.e., horizon length) at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different models and sizes shows that HLP significantly improves FIM performance by up to 24% relatively on diverse benchmarks, across file-level and repository-level, and without resorting to unrealistic post-processing methods. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP only incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.

2228Large Language Models Are Natural Video Popularity Predictors

[openreview] [pdf]

Abstract Predicting video popularity is typically formalized as a supervised learning problem, where models classify videos as popular or unpopular. Traditional approaches rely heavily on meta-information and aggregated user engagement data, but video popularity is highly context-dependent, influenced by cultural, social, and temporal factors that these approaches fail to capture. We argue that Large Language Models (LLMs), with their deep contextual awareness, are well-suited to address these challenges. A key difficulty, however, lies in bridging the modality gap between pixel-based video data and token-based LLMs. To overcome this, we transform frame-level visual data into sequential text representations using Vision-Language Models (VLMs), enabling LLMs to process multimodal video content—titles, frame-based descriptions, and captions—and capture rich contextual information for more accurate predictions. Evaluating on a newly introduced dataset of 17,000 videos, we show that while a supervised neural network using content embeddings achieved 80% accuracy, our LLM-based method reached 82% without fine-tuning. A combined approach, integrating the neural network’s predictions into the LLM, further improved accuracy to 85.5%. Additionally, the LLM generates interpretable hypotheses explaining its predictions based on theoretically sound attributes. Survey-based manual validations confirm the quality of these hypotheses and address concerns about hallucinations in the video-to-text conversion process. Our findings highlight that LLMs, equipped with textually transformed multimodal representations, offer a powerful, interpretable, and data-efficient solution to the context-dependent challenge of video popularity prediction.

2229CleanerCLIP: Fine-grained Counterfactual Semantic Augmentation for Backdoor Defense in Contrastive Learning

[openreview] [pdf]

Abstract Multimodal contrastive models like CLIP are increasingly vulnerable to data-poisoning backdoor attacks. Existing defense methods primarily target the pretraining phase. However, with the rise of open-source communities, pretrained models are now freely available for download and fine-tuning. These models may carry unknown security risks, posing significant threats to downstream users. This highlights the need for lightweight defense strategies tailored specifically for the fine-tuning stage. Current defenses during fine-tuning include: finetuning with clean data; and using unimodal self-supervised techniques like CleanCLIP, which has represented the state-of-the-art (SOTA). However, these methods rely on strengthening clean feature representations to mitigate attacks, making them ineffective against more stealthy backdoor techniques, such as BadCLIP, which leverage covert toxic features. To overcome this limitation, we propose a finetuning defense mechanism based on fine-grained counterfactual text semantic augmentation. By modifying small portions of text during fine-tuning, our approach disrupts the association between backdoor triggers and target features. We evaluate our method against six attack algorithms and conduct comprehensive zero-shot classification on ImageNet1K. Experimental results demonstrate that our method achieves SOTA performance in fine-tuning defense. Specifically, when facing the novel BadCLIP attack, our method surpasses CleanCLIP, reducing the Attack Success Rate (ASR) by 52.02% in the Top-1 and 63.88% in the Top-10 classifications.

2230MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL

[openreview] [pdf]

Abstract Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for TD Learning (MAD-TD) uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD’s ability to combat value overestimation, and its practical stability gains for continued learning.

2231DeCoRe: Decoding by Contrasting Retrieval Heads to Mitigate Hallucinations

[openreview] [pdf]

Abstract Large Language Models (LLMs) often hallucinate, producing unfaithful or factually incorrect outputs by misrepresenting the provided context or incorrectly recalling internal knowledge. Recent studies have identified specific attention heads within the Transformer architecture, known as retrieval heads, responsible for extracting relevant contextual information. We hypothesise that masking these retrieval heads can induce hallucinations and that contrasting the outputs of the base LLM and the masked LLM can reduce hallucinations. To this end, we propose Decoding by Contrasting Retrieval Heads (DeCoRe), a novel training-free decoding strategy that amplifies information found in the context and model parameters. DeCoRe mitigates potentially hallucinated responses by dynamically contrasting the outputs of the base LLM and the masked LLM, using conditional entropy as a guide. Our extensive experiments confirm that DeCoRe significantly improves performance on tasks requiring high contextual faithfulness, such as summarisation (XSum by 18.6%), instruction following (MemoTrap by 10.9%), and open-book question answering (NQ by 2.4% and NQ-Swap by 5.5%).

2232Identification of Mean-Field Dynamics using Transformers

[openreview] [pdf]

Abstract This paper investigates the use of transformer architectures to approximate the mean-field dynamics of interacting particle systems exhibiting collective behavior. Such systems are fundamental in modeling phenomena across physics, biology, and engineering, including gas dynamics, opinion formation, biological networks, and swarm robotics. The key characteristic of these systems is that the particles are indistinguishable, leading to permutation-equivariant dynamics. We demonstrate that transformers, which inherently possess permutation equivariance, are well-suited for approximating these dynamics. Specifically, we prove that if a finite-dimensional transformer can effectively approximate the finite-dimensional vector field governing the particle system, then the expected output of this transformer provides a good approximation for the infinite-dimensional mean-field vector field. Leveraging this result, we establish theoretical bounds on the distance between the true mean-field dynamics and those obtained using the transformer. We validate our theoretical findings through numerical simulations on the Cucker-Smale model for flocking, and the mean-field system for training two-layer neural networks.

2233Fairness-Aware Graph Learning: A Benchmark

[openreview] [pdf]

Abstract Fairness-aware graph learning has gained increasing attention in recent years. Nevertheless, there lacks a comprehensive benchmark to evaluate and compare different fairness-aware graph learning methods, which blocks practitioners from choosing appropriate ones for broader real-world applications. In this paper, we present an extensive benchmark on ten representative fairness-aware graph learning methods. Specifically, we design a systematic evaluation protocol and conduct experiments on seven real-world datasets to evaluate these methods from multiple perspectives, including group fairness, individual fairness, the balance between different fairness criteria, and computational efficiency. Our in-depth analysis reveals key insights into the strengths and limitations of existing methods. Additionally, we provide practical guidance for applying fairness-aware graph learning methods in applications. To the best of our knowledge, this work serves as an initial step towards comprehensively understanding representative fairness-aware graph learning methods to facilitate future advancements in this area.

2234Communication-Efficient Federated Low-Rank Update Algorithm and its Connection to Implicit Regularization

[openreview] [pdf]

Abstract Federated Learning (FL) faces significant challenges related to communication efficiency and heterogeneity. To address these issues, we explore the potential of using low-rank updates. Our theoretical analysis reveals that client’s loss exhibits a higher rank structure (gradients span higher rank subspaces of Hessian) compared to the server’s loss. Based on this insight, we hypothesize that constraining client-side optimization to a low-rank subspace could provide an implicit regularization effect. Consequently, we propose FedLoRU, a general low-rank update framework for FL. Our framework enforces low-rank client-side updates and accumulates these updates to form a higher-rank model. Additionally, variants of FedLoRU can adapt to environments with statistical and model heterogeneity by employing multiple or hierarchical low-rank updates. Experimental results demonstrate that FedLoRU performs comparably to full-rank algorithms and exhibits robustness to heterogeneous and large numbers of clients.

2235On the Almost Sure Convergence of the Stochastic Three Points Algorithm

[openreview] [pdf]

Abstract The stochastic three points (STP) algorithm is a derivative-free optimization technique designed for unconstrained optimization problems in Rd\mathbb{R}^d. In this paper, we analyze this algorithm for three classes of functions : smooth functions that may lack convexity, smooth convex functions, and smooth functions that are strongly convex. Our work provides the first almost sure convergence results of the STP algorithm, alongside some convergence results in expectation. For the class of smooth functions, we establish that the best gradient iterate of the STP algorithm converges almost surely to zero at a rate arbitrarily close to o(1T)o(\frac{1}{\sqrt{T}}), where TT is the number of iterations. Furthermore, within the same class of functions, we establish both almost sure convergence and convergence in expectation of the final gradient iterate towards zero. For the class of smooth convex functions, we establish that f(θT)f(\theta^T) converges to infθRdf(θ)\inf_{\theta \in \mathbb{R}^d} f(\theta) almost surely at a rate arbitrarily close to o(1T)o(\frac{1}{T}), and in expectation at a rate of O(dT)O(\frac{d}{T}) where dd is the dimension of the space. Finally, for the class of smooth functions that are strongly convex, we establish that when step sizes are obtained by approximating the directional derivatives of the function, f(θT)f(\theta^T) converges to infθRdf(θ)\inf_{\theta \in \mathbb{R}^d} f(\theta) in expectation at a rate of O((1μdL)T)O((1-\frac{\mu}{dL})^T), and almost surely at a rate arbitrarily close to o((1μdL)T)o((1-\frac{\mu}{dL})^T), where μ\mu and LL are the strong convexity and smoothness parameters of the function.

2236Active Learning for Continual Learning: Keeping the Past Alive in the Present

[openreview] [pdf]

Abstract Continual learning (CL)enables deep neural networks to adapt to ever-changing data distributions. In practice, there may be scenarios where annotation is costly, leading toactive continual learning (ACL), which performsactive learning (AL)for the CL scenarios when reducing the labeling cost by selecting the most informative subset is preferable. However, conventional AL strategies are not suitable for ACL, as they focus solely on learning the new knowledge, leading tocatastrophic forgettingof previously learned tasks. Therefore, ACL requires a new AL strategy that can balance the prevention of catastrophic forgetting and the ability to quickly learn new tasks. In this paper, we proposeAccuACL,Accumulated informativeness-basedActiveContinualLearning, by achieving an optimal balance between the two required capabilities of ACL, as well as alleviating the scalability issue of Fisher information-based AL. Extensive experiments demonstrate that AccuACL significantly outperforms AL baselines across various CL algorithms, increasing the average accuracy and forgetting by 23.8% and 17.0%, respectively, in average.

2237μLO: Compute-Efficient Meta-Generalization of Learned Optimizers

[openreview] [pdf]

Abstract Learned optimizers (LOs) can significantly reduce the wall-clock training time of neural networks, substantially reducing training costs. However, they can struggle to optimize unseen tasks (meta-generalize), especially when training networks much larger than those seen during meta-training. To address this, we derive the Maximal Update Parametrization (μ\muP) for two popular learned optimizer architectures and propose a simple meta-training recipe for μ\mu-parameterized LOs (μ\muLOs). Our empirical evaluation demonstrates that LOs meta-trained with our recipe substantially improve meta-generalization to wider unseen tasks when compared to LOs trained under standard parametrization (e.g., as they are trained in existing work). When applying our μ\muLOs, each trained for less than 250 GPU-hours, to large-width models we are often able to match or exceed the performance of pre-trained VeLO, the most performant publicly available learned optimizer, meta-trained with 4000 TPU-months of compute. We also observe that learned optimizers trained with our μ\muLO recipe also exhibit substantially improved meta-generalization to deeper networks (5×5\times meta-training) and remarkable generalization to much longer training horizons (25×25\times meta-training).

2238Understand Clean Generalization and Robust Overfitting in Adversarial Training from Two Theoretical Views: Representation Complexity and Training Dynamics

[openreview] [pdf]

Abstract Similar to surprising performance in the standard deep learning, deep nets trained by adversarial training also generalize well for unseen clean data (natural data). However, despite adversarial training can achieve low robust training error, there exists a significant robust generalization gap. We call this phenomenon the Clean Generalization and Robust Overfitting (CGRO). In this work, we study the CGRO phenomenon in adversarial training from two views: representation complexity and training dynamics. Specifically, we consider a binary classification setting with NN separated training data points. First, we prove that, based on the assumption that we assume there is poly(D)\operatorname{poly}(D)-size clean classifier (where DD is the data dimension), ReLU net with only O(ND)O(N D) extra parameters is able to leverages robust memorization to achieve the CGRO, while robust classifier still requires exponential representation complexity in worst case. Next, we focus on a structured-data case to analyze training dynamics, where we train a two-layer convolutional network with O(ND)O(N D) width against adversarial perturbation. We then show that a three-stage phase transition occurs during learning process and the network provably converges to robust memorization regime, which thereby results in the CGRO. Besides, we also empirically verify our theoretical analysis by experiments in real-image recognition datasets.

2239When LLMs Play the Telephone Game: Cumulative Changes and Attractors in Iterated Cultural Transmissions

[openreview] [pdf]

Abstract As large language models (LLMs) start interacting with each other and generating an increasing amount of text online, it becomes crucial to better understand how information is transformed as it passes from one LLM to the next. While significant research has examined individual LLM behaviors, existing studies have largely overlooked the collective behaviors and information distortions arising from iterated LLM interactions. Small biases, negligible at the single output level, risk being amplified in iterated interactions, potentially leading the content to evolve towards attractor states. In a series oftelephone game experiments, we apply a transmission chain design borrowed from the human cultural evolution literature: LLM agents iteratively receive, produce, and transmit texts from the previous to the next agent in the chain. By tracking the evolution of texttoxicity,positivity,difficulty, andlengthacross transmission chains, we uncover the existence of biases and attractors, and study their dependence on the initial text, the instructions, language model, and model size. For instance, we find that more open-ended instructions lead to stronger attraction effects compared to more constrained tasks. We also find that different text properties display different sensitivity to attraction effects, withtoxicityleading to stronger attractors thanlength. These findings highlight the importance of accounting for multi-step transmission dynamics and represent a first step towards a more comprehensive understanding of LLM cultural dynamics.

2240LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation

[openreview] [pdf]

Abstract We introduce LLaVA-MoD, a novel framework designed to enable the efficient training of small-scale Multimodal Language Models (ss-MLLM) distilling knowledge from large-scale MLLM (ll-MLLM). Our approach tackles two fundamental challenges in MLLM distillation. First, we optimize the network structure of ss-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the language model, striking a balance between computational efficiency and model expressiveness. Second, we propose a progressive knowledge transfer strategy for comprehensive knowledge transfer. This strategy begins with mimic distillation, where we minimize the Kullback-Leibler (KL) divergence between output distributions to enable ss-MLLM to emulate ss-MLLM’s understanding. Following this, we introduce preference distillation via Preference Optimization (PO), where the key lies in treating ll-MLLM as the reference model. During this phase, the ss-MLLM’s ability to discriminate between superior and inferior examples is significantly enhanced beyond ll-MLLM, leading to a better ss-MLLM that surpasses ll-MLLM, particularly in hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD surpasses existing works across various benchmarks while maintaining a minimal activated parameters and low computational costs. Remarkably, LLaVA-MoD-2B surpasses Qwen-VL-Chat-7B with an average gain of 8.8%, using merely 0.30.3% of the training data and 23% trainable parameters. The results underscore LLaVA-MoD’s ability to effectively distill comprehensive knowledge from its teacher model, paving the way for developing efficient MLLMs.

2241It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) involves training policy models (PMs) and reward models (RMs) to align language models with human preferences. Instead of focusing solely on PMs and RMs independently, we propose to examine their interactions during fine-tuning, introducing the concept of \textbf{seamlessness}. Our study starts with observing the saturation phenomenon, where continual improvements in RM and PM do not translate into RLHF progress. Our analysis shows that RMs fail to assign proper scores to PM responses, resulting in a 35% mismatch rate with human preferences, highlighting a significant discrepancy between PM and RM. To measure seamlessness between PM and RM without human effort, we propose an automatic metric, SEAM. SEAM quantifies the discrepancies between PM and RM judgments induced by data samples. We validate the effectiveness of SEAM in data selection and model augmentation. Our experiments demonstrate that (1) using SEAM-filtered data for RL training improves RLHF performance by 4.5%, and (2) SEAM-guided model augmentation results in a 4% performance improvement over standard augmentation methods.

2242Credit-based self organizing maps: training deep topographic networks with minimal performance degradation

[openreview] [pdf]

Abstract In the primate neocortex, neurons with similar function are often found to be spatially close. Kohonen’s self-organizing map (SOM) has been one of the most influential approaches for simulating brain-like topographical organization in artificial neural network models. However, integrating these maps into deep neural networks with multitude of layers has been challenging, with self-organized deep neural networks suffering from substantially diminished capacity to perform visual recognition. We identified a key factor leading to the performance degradation in self-organized topographical neural network models: the discord between predominantly bottom-up learning updates in the self-organizing maps, and those derived from top-down, credit-based learning approaches. To address this, we propose an alternative self organization algorithm, tailored to align with the top-down learning processes in deep neural networks. This model not only emulates critical aspects of cortical topography but also significantly narrows the performance gap between non-topographical and topographical models. This advancement underscores the substantial importance of top-down assigned credits in shaping topographical organization. Our findings are a step in reconciling topographical modeling with the functional efficacy of neural network models, paving the way for more intricate and accurate simulations of brain-like neural architectures.

2243Don’t Take Things Out of Context: Attention Intervention for Enhancing Chain-of-Thought Reasoning in Large Language Models

[openreview] [pdf]

Abstract Few-shot Chain-of-Thought (CoT) significantly enhances the reasoning capabilities of large language models (LLMs), functioning as a whole to guide these models in generating reasoning steps toward final answers. However, we observe that isolated segments, words, or tokens within CoT demonstrations can unexpectedly disrupt the generation process of LLMs. The model may overly concentrate on certain local information present in the demonstration, introducing irrelevant noise into the reasoning process and potentially leading to incorrect answers. In this paper, we investigate the underlying mechanism of CoT through dynamically tracing and manipulating the inner workings of LLMs at each output step, which demonstrates that tokens exhibiting specific attention characteristics are more likely to induce the model to take things out of context; these tokens directly attend to the hidden states tied with prediction, without substantial integration of non-local information. Building upon these insights, we propose a Few-shot Attention Intervention method (FAI) that dynamically analyzes the attention patterns of demonstrations to accurately identify these tokens and subsequently make targeted adjustments to the attention weights to effectively suppress their distracting effect on LLMs. Comprehensive experiments across multiple benchmarks demonstrate consistent improvements over baseline methods, with a remarkable 5.91% improvement on the AQuA dataset, further highlighting the effectiveness of FAI.

2244On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

[openreview] [pdf]

Abstract Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance generalization performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.

2245AGLP: A Graph Learning Perspective for Semi-supervised Domain Adaptation

[openreview] [pdf]

Abstract In semi-supervised domain adaptation (SSDA), the model aims to leverage partially labeled target domain data along with a large amount of labeled source domain data to enhance its generalization capability for the target domain. A key advantage of SSDA is its ability to significantly reduce reliance on labeled data, thereby lowering the costs and time associated with data preparation. Most existing SSDA methods utilize information from domain labels and class labels but overlook the structural information of the data. To address this issue, this paper proposes a graph learning perspective (AGLP) for semi-supervised domain adaptation. We apply the graph convolutional network to the instance graph which allow structural information to propagate along the weighted graph edges. The proposed AGLP model has several advantages. First, to the best of our knowledge, this is the first work to model structural information in SSDA. Second, the proposed model can effectively learn domain-invariant and semantic representations, reducing domain discrepancies in SSDA. Extensive experimental results on multiple standard benchmarks demonstrate that the proposed AGLP algorithm outperforms state-of-the-art semi-supervised domain adaptation methods.

2246Prediction Via Shapley Value Regression

[openreview] [pdf]

Abstract Shapley values have several desirable properties for explaining black-box model predictions, which come with strong theoretical support. Traditionally, Shapley values are computed post-hoc, leading to additional computational cost at inference time. To overcome this, we introduce ViaSHAP, a novel approach that learns a function to compute Shapley values, from which the predictions can be derived directly by summation. We explore two learning approaches based on the universal approximation theorem and the Kolmogorov-Arnold representation theorem. Results from a large-scale empirical investigation are presented, in which the predictive performance of ViaSHAP is compared to state-of-the-art algorithms for tabular data, where the implementation using Kolmogorov-Arnold Networks showed a superior performance. It is also demonstrated that the explanations of ViaSHAP are accurate, and that the accuracy is controllable through the hyperparameters.

2247DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads

[openreview] [pdf]

Abstract Deploying long-context large language models (LLMs) is essential but poses significant computational and memory challenges. Caching all Key and Value (KV) states across all attention heads consumes substantial memory. Existing KV cache pruning methods either damage the long-context capabilities of LLMs or offer only limited efficiency improvements. In this paper, we identify that only a fraction of attention heads, a.k.a, Retrieval Heads, are critical for processing long contexts and require full attention across all tokens. In contrast, all other heads, which primarily focus on recent tokens and attention sinks—referred to as Streaming Heads—do not require full attention. Based on this insight, we introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads, which reduces both LLM’s decoding and pre-filling memory and latency without compromising its long-context abilities. DuoAttention uses a lightweight, optimization-based algorithm with synthetic data to identify retrieval heads accurately. Our method significantly reduces long-context inference memory by up to 2.55×\times for MHA and 1.67×\times for GQA models while speeding up decoding by up to 2.18×\times and 1.50×\times and accelerating pre-filling by up to 1.73×\times and 1.63×\times for MHA and GQA models, respectively, with minimal accuracy loss compared to full attention. Notably, combined with quantization, DuoAttention enables Llama-3-8B decoding with 3.33 million context length measured on a single A100 GPU. Code and dataset will be released upon publication.

2248Large Language Models are Demonstration Pre-Selectors for Themselves

[openreview] [pdf]

Abstract In-context learning with large language models (LLMs) delivers strong few-shot performance by choosing few-shot demonstrations from the entire training dataset. However, previous few-shot in-context learning methods, which calculate similarity scores for choosing demonstrations, incur high computational costs by repeatedly retrieving large-scale datasets for each query. This is due to their failure to recognize that not all demonstrations are equally informative, and many less informative demonstrations can be inferred from a core set of highly informative ones. To this end, we propose FEEDER (FEw yet Essential Demonstration prE-selectoR), a novel \emph{pre-selection} framework that identifies a core subset of demonstrations containing the most informative examples. This subset, referred to as the FEEDER set, consists of demonstrations that capture both the ‘‘sufficiency’’ and ‘‘necessity’’ information to infer the entire dataset. Notice that FEEDER is selected before the few-shot in-context learning, enabling more efficient few-shot demonstrations choosing in a smaller set. To identify FEEDER, we propose a novel effective tree based algorithm. Once selected, it can replace the original dataset, leading to improved efficiency and prediction accuracy in few-shot in-context learning. Additionally, FEEDER also benefit fine-tuning LLMs, we propose a bi-level optimization method enabling more efficient training without sacrificing performance when datasets become smaller. Our experiments are on 6 text classification datasets, 1 reasoning dataset, and 1 semantic-parsing dataset, across 6 LLMs (ranging from 335M to 7B parameters), demonstrate that: (i) In few-shot inference, FEEDER achieves superior (or comparable) performance while utilizing only half the input training data. (ii) In fine-tuning, FEEDER significantly boosts the performance of LLMs.

2249Anomaly Detection by Context Contrasting

[openreview] [pdf]

Abstract Anomaly detection focuses on identifying samples that deviate from the norm. When working with high-dimensional data such as images, a crucial requirement for detecting anomalous patterns is learning lower-dimensional representations that capture concepts of normality. Recent advances in self-supervised learning have shown great promise in this regard. However, many successful self-supervised anomaly detection methods assume prior knowledge about anomalies to create synthetic outliers during training. Yet, in real-world applications, we often do not know what to expect from unseen data, and we can solely leverage knowledge about normal data. In this work, we propose Con2_2, which learns representations through context augmentations that allow us to observe samples from two distinct perspectives while keeping the invariances of normal data. Con2_2 learns rich representations of context-augmented samples by clustering them according to their context while simultaneously aligning their positions across clusters. At test time, representations of anomalies that do not adhere to the invariances of normal data then deviate from their respective context cluster. Learning representations in such a way thus allows us to detect anomalies without making assumptions about anomalous data.

2250CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL

[openreview] [pdf]

Abstract In addressing the challenges of improving large language model (LLM) performance for Text-to-SQL tasks, we propose a new framework, CHASE-SQL, that is comprised of innovative strategies that leverage judiciously-designed test-time compute in multi-agent modeling to enhance candidate generation and selection. Our approach leverages LLMs’ intrinsic knowledge to generate diverse and high-quality SQL candidates using different LLM generators with: (1) a divide-and-conquer method that decomposes complex queries into manageable sub-queries in a single LLM call; (2) chain-of-thought reasoning based on query execution plans, reflecting the steps a database engine takes during execution; and (3) a unique instance-aware synthetic example generation technique, which offers specific few-shot demonstrations tailored to test questions. To identify the best candidate, a selection agent is employed to rank the candidates through pairwise comparisons with a fine-tuned binary-candidates selection LLM. This selection approach has been demonstrated more robust over alternatives. The proposed generators-selector framework not only enhances the quality and diversity of SQL queries but also outperforms previous methods. Overall, our proposed CHASE-SQL achieves the state-of-the-art execution accuracy of 73.0 % and 73.01% on the test set and development set of the notable BIRD Text-to-SQL dataset bench-mark, rendering CHASE-SQL the top submission of the leaderboard (at the time of paper submission)

2251Reducing class-wise confusion for incremental learning with disentangled manifolds

[openreview] [pdf]

Abstract Class incremental learning (CIL) aims to enable models to continuously learn new classes without catastrophically forgetting old ones. A promising direction is to learn and use prototypes of classes during incremental updates. Despite simplicity and intuition, we find that such methods suffer from inadequate representation capability and unsatisfied confusion caused by distribution drift. In this paper, we develop a Confusion-REduced AuTo-Encoder classifier (CREATE) for CIL. Specifically, our method employs a lightweight auto-encoder module to learn each compact class manifold in latent subspace, constraining samples well reconstructed only on the semantically correct auto-encoder. Thus, the representation stability and capability of class distributions are enhanced, alleviating the potential class-wise confusion problem. To further distinguish the drifted features, we propose a confusion-aware latent space separation loss that ensures exemplars are closely distributed in their corresponding low-dimensional manifold while keeping away from the distributions of drifted features from other classes. Our method demonstrates stronger representational capacity by learning disentangled manifolds and reduces class confusion caused by drift. Extensive experiments on multiple datasets and settings show that CREATE outperforms other state-of-the-art methods up to 5.41%.

2252Limits to scalable evaluation at the frontier: LLM as judge won’t beat twice the data

[openreview] [pdf]

Abstract High quality annotations are increasingly a bottleneck in the explosively growing machine learning ecosystem. Scalable evaluation methods that avoid costly annotation have therefore become an important research ambition. Many hope to use strong existing models in lieu of costly labels to provide cheap model evaluations. Unfortunately, this method of using models as judges introduces biases, such as self-preferencing, that can distort model comparisons. An emerging family of debiasing tools promises to fix these issues by using a few high quality labels to debias a large number of model judgments. In this paper, we study how far such debiasing methods, in principle, can go. Our main result shows that when the judge is no more accurate than the evaluated model, no debiasing method can decrease the required amount of ground truth labels by more than half. Our result speaks to the severe limitations of the LLM-as-a-judge paradigm at the evaluation frontier where the goal is to assess newly released models that are possibly better than the judge. Through an empirical evaluation, we demonstrate that the sample size savings achievable in practice are even more modest than what our theoretical limit suggests. Along the way, our work provides new observations about debiasing methods for model evaluation, and points out promising avenues for future work.

2253Decentralized primal-dual actor-critic with entropy regularization for safe multi-agent reinforcement learning

[openreview] [pdf]

Abstract We investigate the decentralized safe multi-agent reinforcement learning (MARL) problem based on homogeneous multi-agent systems, where agents aim to maximize the team-average return and the joint policy’s entropy, while satisfying safety constraints associated to the cumulative team-average cost. A mathematical model referred to as a homogeneous constrained Markov game is formally characterized, based on which policy sharing provably preserves the optimality of our safe MARL problem. An on-policy decentralized primal-dual actor-critic algorithm is then proposed, where agents utilize both local gradient updates and consensus updates to learn local policies, without the requirement for a centralized trainer. Asymptotic convergence is proven using multi-timescale stochastic approximation theory under standard assumptions. Thereafter, a practical off-policy version of the proposed algorithm is developed based on the deep reinforcement learning training architecture. The effectiveness of our practical algorithm is demonstrated through comparisons with solid baselines on three safety-aware multi-robot coordination tasks in continuous action spaces.

2254Scalable and Certifiable Graph Unlearning: Overcoming the Approximation Error Barrier

[openreview] [pdf]

Abstract Graph unlearning has emerged as a pivotal research area for ensuring privacy protection, given the widespread adoption of Graph Neural Networks (GNNs) in applications involving sensitive user data. Among existing studies, certified graph unlearning is distinguished by providing robust privacy guarantees. However, current certified graph unlearning methods are impractical for large-scale graphs because they necessitate the costly re-computation of graph propagation for each unlearning request. Although numerous scalable techniques have been developed to accelerate graph propagation for GNNs, their integration into certified graph unlearning remains uncertain as these scalable approaches introduce approximation errors into node embeddings. In contrast, certified graph unlearning demands bounded model error on exact node embeddings to maintain its certified guarantee.To address this challenge, we present ScaleGUN, the first approach to scale certified graph unlearning to billion-edge graphs. ScaleGUN integrates the approximate graph propagation technique into certified graph unlearning, offering certified guarantees for three unlearning scenarios: node feature, edge and node unlearning. Extensive experiments on real-world datasets demonstrate the efficiency and unlearning efficacy of ScaleGUN. Remarkably, ScaleGUN accomplishes (ϵ,δ)=(1,104)(\epsilon,\delta)=(1,10^{-4}) certified unlearning on the billion-edge graph ogbn-papers100M in 20 seconds for a 5,000 random edge removal request -- of which only 5 seconds are required for updating the node embeddings -- compared to 1.91 hours for retraining and 1.89 hours for re-propagation. Our code is available athttps://anonymous.4open.science/r/ScaleGUN-5921.

2255Adversarially Robust Out-of-Distribution Detection Using Lyapunov-Stabilized Embeddings

[openreview] [pdf]

Abstract Despite significant advancements in out-of-distribution (OOD) detection, existing methods still struggle to maintain robustness against adversarial attacks, compromising their reliability in critical real-world applications. Previous studies have attempted to address this challenge by exposing detectors to auxiliary OOD datasets alongside adversarial training. However, the increased data complexity inherent in adversarial training, and the myriad of ways that OOD samples can arise during testing, often prevent these approaches from establishing robust decision boundaries. To address these limitations, we propose AROS, a novel approach leveraging neural ordinary differential equations (NODEs) with Lyapunov stability theorem in order to obtain robust embeddings for OOD detection. By incorporating a tailored loss function, we apply Lyapunov stability theory to ensure that both in-distribution (ID) and OOD data converge to stable equilibrium points within the dynamical system. This approach encourages any perturbed input to return to its stable equilibrium, thereby enhancing the model’s robustness against adversarial perturbations. To not use additional data, we generate fake OOD embeddings by sampling from low-likelihood regions of the ID data feature space, approximating the boundaries where OOD data are likely to reside. To then further enhance robustness, we propose the use of an orthogonal binary layer following the stable feature space, which maximizes the separation between the equilibrium points of ID and OOD samples. We validate our method through extensive experiments across several benchmarks, demonstrating superior performance, particularly under adversarial attacks. Notably, our approach improves robust detection performance from 37.8% to 80.1% on CIFAR-10 vs. CIFAR-100 and from 29.0% to 67.0% on CIFAR-100 vs. CIFAR-10.

2256Variational Diffusion Posterior Sampling with Midpoint Guidance

[openreview] [pdf]

Abstract Diffusion models have recently shown considerable potential in solving Bayesian inverse problems when used as priors. However, sampling from the resulting denoising posterior distributions remains a challenge as it involves intractable terms. To tackle this issue, state-of-the-art approaches formulate the problem as that of sampling from a surrogate diffusion model targeting the posterior and decompose its scores into two terms: the prior score and an intractable guidance term. While the former is replaced by the pre-trained score of the considered diffusion model, the guidance term has to be estimated. In this paper, we propose a novel approach that utilises a decomposition of the transitions which, in contrast to previous methods, allows a trade-off between the complexity of the intractable guidance term and that of the prior transitions. We validate the proposed approach through extensive experiments on linear and nonlinear inverse problems, including challenging cases with latent diffusion models as priors, and demonstrate its effectiveness in reconstructing electrocardiogram (ECG) from partial measurements for accurate cardiac diagnosis.

2257Semantic or Covariate? A Study on the Intractable Case of Out-of-Distribution Detection

[openreview] [pdf]

Abstract The primary goal of out-of-distribution (OOD) detection tasks is to identify inputs with semantic shifts, i.e., if samples from novel classes are absent in the in-distribution (ID) dataset used for training, we should reject these OOD samples rather than misclassifying them into existing ID classes. However, we find the current definition of “semantic shift” is ambiguous, which renders certain OOD testing protocols intractable for the post-hoc OOD detection methods based on a classifier trained on the ID dataset. In this paper, we offer a more precise definition of the Semantic Space and the Covariate Space for the ID distribution, allowing us to theoretically analyze which types of OOD distributions make the detection task intractable. To avoid the flaw in the existing OOD settings, we further define the “Tractable OOD” setting which ensures the distinguishability of OOD and ID distributions for the post-hoc OOD detection methods. Finally, we conduct several experiments to demonstrate the necessity of our definitions and validate the correctness of our theorems.

2258Counterfactual Generative Modeling with Variational Causal Inference

[openreview] [pdf]

Abstract Estimating an individual’s potential outcomes under counterfactual treatments is a challenging task for traditional causal inference and supervised learning approaches when the outcome is high-dimensional (e.g. gene expressions, facial images) and covariates are relatively limited. In this case, to predict one’s outcomes under counterfactual treatments, it is crucial to leverage individual information contained in its high-dimensional observed outcome in addition to the covariates. Prior works using variational inference in counterfactual generative modeling have been focusing on neural adaptations and model variants within the conditional variational autoencoder formulation, which we argue is fundamentally ill-suited to the notion of counterfactual in causal inference. In this work, we present a novel variational Bayesian causal inference framework and its theoretical backings to fittingly handle counterfactual generative modeling tasks, through which we are able to conduct counterfactual supervision end-to-end during training without any counterfactual samples, and encourage latent disentanglement that aids the correct identification of causal effect in counterfactual generations. In experiments, we demonstrate the advantage of our framework compared to state-of-the-art models in counterfactual generative modeling on multiple benchmarks.

2259Principal-Agent Reinforcement Learning: Orchestrating AI Agents with Contracts

[openreview] [pdf]

Abstract The increasing deployment of AI is shaping the future landscape of the internet, which is set to become an integrated ecosystem of AI agents. Orchestrating the interaction among AI agents necessitates decentralized, self-sustaining mechanisms that harmonize the tension between individual interests and social welfare. In this paper we tackle this challenge by synergizing reinforcement learning with principal-agent theory from economics. Taken separately, the former allows unrealistic freedom of intervention, while the latter struggles to scale in sequential settings. Combining them achieves the best of both worlds. We propose a framework where a principal guides an agent in a Markov Decision Process (MDP) using a series of contracts, which specify payments by the principal based on observable outcomes of the agent’s actions. We present and analyze a meta-algorithm that iteratively optimizes the policies of the principal and agent, showing its equivalence to a contraction operator on the principal’s Q-function, and its convergence to subgame-perfect equilibrium. We then scale our algorithm with deep Q-learning and analyze its convergence in the presence of approximation error, both theoretically and through experiments with randomly generated binary game-trees. Extending our framework to multiple agents, we apply our methodology to the combinatorial Coin Game. Addressing this multi-agent sequential social dilemma is a promising first step toward scaling our approach to more complex, real-world instances.

2260Temporal Graph Rewiring with Expander Graphs

[openreview] [pdf]

Abstract Evolving relations in real-world networks are often modelled by temporal graphs. Temporal Graph Neural Networks (TGNNs) emerged to model evolutionary behaviour of such graphs by leveraging the message passing primitive at the core of Graph Neural Networks (GNNs). It is well-known that GNNs are vulnerable to several issues directly related to the input graph topology, such as under-reaching and over-squashing---we argue that these issues can often get exacerbated in temporal graphs, particularly as the result of stale nodes and edges. While graph rewiring techniques have seen frequent usage in GNNs to make the graph topology more favourable for message passing, they have not seen any mainstream usage on TGNNs. In this work, we propose Temporal Graph Rewiring (TGR), the first approach for graph rewiring on temporal graphs, to the best of our knowledge. TGR constructs message passing highways between temporally distant nodes in a continuous-time dynamic graph by utilizing expander graph propagation, a prominent framework used for graph rewiring on static graphs which makes minimal assumptions on the underlying graph structure. On the challenging TGB benchmark, TGR achieves state-of-the-art results on tgbl-review, tgbl-coin, tgbl-comment and tgbl-flight datasets at the time of writing. For tgbl-review, TGR has 50.5% improvement in MRR over the base TGN model and 22.2% improvement over the base TNCN model. The significant improvement over base models demonstrates clear benefits of temporal graph rewiring.

2261KEYPOINT-GUIDED 4D GAUSSIAN SPLATTING WITH DECOUPLED SPATIO-TEMPORAL FLOW REFINEMENT

[openreview] [pdf]

Abstract We propose KG4D, a novel method for generating time-aware 4D representations from a single static image or video. Previous methods largely rely on weak su- pervision signals, failing to introduce fine-grained supervision necessary for cap- turing detailed spatio-temporal dynamics. In contrast, our approach employs Har- monic Spatio-temporal Encoding (HSE) to achieve efficient spatio-temporal sep- aration during training, allowing the model to represent dynamic scene changes more accurately. Furthermore, Keypoint Feature Calibration (KFC) ensures pre- cise pose consistency, and Wasserstein Gradient Flow (WGF) enhances motion coherence, effectively reducing artifacts. Comprehensive evaluation and ablations demonstrate that our proposed KG4D outperforms existing state-of-the-art meth- ods on various benchmarks in dynamic 4D generation and novel viewpoint syn- thesis, validating its effectiveness and superior generation capability.

2262Embodied Instruction Following in Unknown Environments

[openreview] [pdf]

Abstract Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goal of task planning and scene exploration is aligned for human instruction. For the task planner, we generate the feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method can achieve 45.09% success rate in 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.

2263Does Deep Active Learning Work in the Wild?

[openreview] [pdf]

Abstract Deep active learning (DAL) methods have shown significant improvements in sample efficiency compared to simple random sampling. While these studies are valuable, they nearly always assume that optimal DAL hyperparameter (HP) settings are known in advance, or optimize the HPs through repeating DAL several times with different HP settings. Here, we argue that in real-world settings, orin the wild, there is significant uncertainty regarding good HPs, and their optimization contradicts the premise of using DAL (i.e., we require labeling efficiency). In this study, we evaluate the performance of eleven modern DAL methods on eight benchmark problems as we vary a key HP shared by all methods: the pool ratio. Despite adjusting only one HP, our results indicate that eight of the eleven DAL methods sometimes underperform relative to simple random sampling and some frequently perform worse. Only three methods always outperform random sampling (albeit narrowly), and we find that these methods all utilize diversity to select samples - a relatively simple criterion. Our findings reveal the limitations of existing DAL methods when deployedin the wild, and present this as an important new open problem in the field.

2264Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach

[openreview] [pdf]

Abstract In the field of long-tailed recognition, the Decoupled Training paradigm has shown exceptional promise by dividing training into two stages: representation learning and classifier re-training. While previous work has tried to improve both stages simultaneously, this complicates isolating the effect of classifier re-training. Recent studies reveal that simple regularization can produce strong feature representations, highlighting the need to reassess classifier re-training methods. In this study, we revisit classifier re-training methods based on a unified feature representation and re-evaluate their performances. We propose two new metrics, Logits Magnitude and Regularized Standard Deviation, to compare the differences and similarities between various methods. Using these two newly proposed metrics, we demonstrate that when the Logits Magnitude across classes is nearly balanced, further reducing its overall value can effectively decrease errors and disturbances during training, leading to better model performance. Based on our analysis using these metrics, we observe that adjusting the logits could improve model performance, leading us to develop a simple label over-smoothing approach to adjust the logits without requiring prior knowledge of class distribution. This method softens the original one-hot labels by assigning a probability slightly higher than 1K\frac{1}{K} to the true class and slightly lower than 1K\frac{1}{K} to the other classes, where KK is the number of classes. Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist2018.

2265Solving New Tasks by Adapting Internet Video Knowledge

[openreview] [pdf]

Abstract Video generative models, beyond enabling the production of astounding visual creations, offer a promising pathway for unlocking novel, text-conditioned robotic behaviors, whether utilized as a video planner or as a policy supervisor. When pretrained on internet-scale datasets, such video models intimately understand alignment with natural language, and can thus facilitate novel text-conditioned behavior generalization. At the same time, however, they may not be sensitive to the specificities of the particular environment in which a policy of interest is to be learned. On the other hand, video modeling over in-domain examples of robotic behavior naturally encode environment-specific intricacies, but the scale of available demonstrations may not be sufficient to support generalization to unseen tasks via natural language specification. In this work, we investigate different adaptation techniques that integrate in-domain information into large-scale pretrained video models, and explore the extent to which they enable novel text-conditioned generalization for robotic tasks. Furthermore, we highlight the individual data and training requirements of each approach, which range from utilizing only a few still frames illustrating the subject of interest, to direct finetuning over videos labelled with text descriptions. We successfully demonstrate across robotic environments that adapting powerful video models with small scales of example data can successfully facilitate generalization to novel behaviors, both when utilized as policy supervisors, and as visual planners.

2266Counterfactual Outcome Estimation in Time Series via Sub-treatment Group Alignment and Random Temporal Masking

[openreview] [pdf]

Abstract Estimating counterfactual outcomes in time series from observational data is important for effective decision-making in many fields, such as determining the optimal timing for a medical intervention. However, this task is challenging, primarily because of the unobservability of counterfactual outcomes and the complexity of confounding in time series. To this end, we introduce a representation learning-based framework for counterfactual estimation in time series with two novel techniques:Sub-treatment Group Alignment (SGA)andRandom Temporal Masking (RTM). The first technique focuses on reducing confounding at each time point. While the common approach is to align the distributions of different treatment groups in the latent space, our proposed approach, SGA, first identifiessub-treatment groupsthrough Gaussian Mixture Models (GMMs) and subsequently aligns the corresponding sub-groups. We demonstrate that, both theoretically and empirically, SGA achieves improved alignment, thus leading to more effective deconfounding. The second technique, RTM, masks covariates at random time steps with Gaussian noises. This approach promotes the time series models to select information not only important for the outcome estimation at current time point but also crucial for the time points in the future where the covariates are masked out, thus preserving thecausal informationand reducing the risk of overfitting to factual outcomes. We observe in experiments on synthetic and semi-synthetic datasets that applying SGA and RTM individually improves counterfactual outcome estimation, and when combined, they achieve state-of-the-art performance.

2267Towards Universal Certified Robustness with Multi-Norm Training

[openreview] [pdf]

Abstract Existing certified training methods can only train models to be robust against a certain perturbation type (e.g. ll_\infty or l2l_2). However, an ll_\infty certifiably robust model may not be certifiably robust against l2l_2 perturbation (and vice versa) and also has low robustness against other perturbations (e.g. geometric transformation). To this end, we propose the first multi-norm certified training framework \textbf{CURE}, consisting of a new l2l_2 deterministic certified training defense and several multi-norm certified training methods, to attain better \emph{union robustness} when training from scratch or fine-tuning a pre-trained certified model. Further, we devise bound alignment and connect natural training with certified training for better union robustness. Compared with SOTA certified training, \textbf{CURE} improves union robustness up to 22.822.8% on MNIST, 23.923.9% on CIFAR-10, and 8.08.0% on TinyImagenet. Further, it leads to better generalization on a diverse set of challenging unseen geometric perturbations, up to 6.86.8% on CIFAR-10. Overall, our contributions pave a path towards \textit{universal certified robustness}.

2268Out-of-Distribution Detection in Class Incremental Learning

[openreview] [pdf]

Abstract Class incremental learning (CIL) aims to learn a model that can not only incrementally accommodate new classes, but also maintain the learned knowledge of old classes. Out-of-distribution (OOD) detection in CIL is to retain this incremental learning ability, while being able to reject unknown samples that are drawn from different distributions of the learned classes. This capability is crucial to the safety of deploying CIL models in open worlds.However, despite remarkable advancements in the respective CIL and OOD detection, there lacks a systematic and large-scale benchmark to assess the capability of advanced CIL models in detecting OOD samples. To fill this gap, in this study we design a comprehensive empirical study to establish such a benchmark, namedOpenCIL, offering a unified protocol for enabling CIL models with different OOD detectors using two principled OOD detection frameworks. One key observation we find through our comprehensive evaluation is that the CIL models can be severely biased towards the OOD samples and newly added classes when they are exposed to open environments. Motivated by this, we further propose a novel approach for OOD detection in CIL, namely Bi-directional Energy Regularization (BER), which is specially designed to mitigate these two biases in different CIL models by having energy regularization on both old and new classes. Extensive experiments show that BER can substantially improve the OOD detection capability across a range of CIL models, achieving state-of-the-art performance on the OpenCIL benchmark.

2269Open-Vocabulary Customization from CLIP via Data-Free Knowledge Distillation

[openreview] [pdf]

Abstract Vision-language models such as CLIP have demonstrated strong zero-shot performance, but their considerable size and inefficient inference limit customizable deployment for users. While knowledge distillation is a solution, it still requires the original data, which is not always available due to copyrights and privacy concerns. For many users seeking open-vocabulary customization, Data-Free Knowledge Distillation (DFKD) emerges as a promising direction. Upon rethinking DFKD, we find that existing methods fail on CLIP due to their heavy reliance on BatchNorm layers, which are unexpectedly unusable in CLIP. Based on our findings, we adopt image-text matching to achieve DFKD for CLIP, enabling customization based on arbitrary class texts. This involves (i) inversing a surrogate dataset from CLIP based on text prompts; and (ii) distilling a student model from CLIP using the surrogate dataset. Specifically, we introduce style dictionary diversification to enhance the diversity of synthetic images. To prevent uncontrollable semantics introduced by diversification, we propose a class consistency maintaining strategy to ensure the consistency of synthetic images. Based on synthetic images with various styles, we further propose meta knowledge distillation to train the student model with good generalization ability. Moreover, we introduce a simple yet effective method to enable customization based on few example images. Comprehensive experiments showcase the superiority of our approach across twelve customized tasks, achieving a 9.33% improvement compared to existing DFKD methods.

2270TimeRAF: Retrieval-Augmented Foundation model for Zero-shot Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting plays a crucial role in data mining, driving rapid advancements across numerous industries. With the emergence of large models, time series foundation models (TSFMs) have exhibited remarkable generalization capabilities, such as zero-shot learning, through large-scale pre-training. Meanwhile, Retrieval-Augmented Generation (RAG) methods are widely employed to enhance the performance of foundation models on unseen data, allowing models to access to external knowledge. In this paper, we introduceTimeRAF, aRetrieval-AugmentedForecasting model that enhance zero-shot time series forecasting through retrieval-augmented techniques. We develop customized time series knowledge bases that are tailored to the specific forecasting tasks. TimeRAF employs an end-to-end learnable retriever to extract valuable information from the knowledge base. Additionally, we propose Channel Prompting for knowledge integration, which effectively extracts relevant information from the retrieved knowledge along the channel dimension. Extensive experiments demonstrate the effectiveness of our model, showing significant improvement across various domains and datasets.

2271The Last Iterate Advantage: Empirical Auditing and Principled Heuristic Analysis of Differentially Private SGD

[openreview] [pdf]

Abstract We propose a simple heuristic privacy analysis of noisy clipped stochastic gradient descent (DP-SGD) in the setting where only the last iterate is released and the intermediate iterates remain hidden. Namely, our heuristic assumes a linear structure for the model.We show experimentally that our heuristic is predictive of the outcome of privacy auditing applied to various training procedures. Thus it can be used prior to training as a rough estimate of the final privacy leakage. We also probe the limitations of our heuristic by providing some artificial counterexamples where it underestimates the privacy leakage.The standard composition-based privacy analysis of DP-SGD effectively assumes that the adversary has access to all intermediate iterates, which is often unrealistic. However, this analysis remains the state of the art in practice. While our heuristic does not replace a rigorous privacy analysis, it illustrates the large gap between the best theoretical upper bounds and the privacy auditing lower bounds and sets a target for further work to improve the theoretical privacy analyses.

2272Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance

[openreview] [pdf]

Abstract The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs for post-generation correction. In response to these limitations, we proposeMitigating hallucinAtion via image-gRounded guIdaNcE(MARINE), a framework that is bothtraining-freeandAPI-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework’s flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs’ generations.

2273DEAL: High-Efficacy Privacy Attack on Retrieval-Augmented Generation Systems via LLM Optimizer

[openreview] [pdf]

Abstract Retrieval-Augmented Generation (RAG) technology provides a powerful means of combining private databases with large language models (LLMs). In a typical RAG system, a set of documents is retrieved from a private database and inserted into the final prompt, which is then fed into the LLM. Existing research has shown that an attacker can use a simple manually designed attack suffix to induce LLM to output private documents in prompt with high probability. However, in this paper, we demonstrate that the privacy leakage risk exhibited by using this simple manual attack suffix is significantly underestimated. We propose a novel attack method called Documents Extraction Attack via LLM-Optimizer (DEAL). DEAL leverages an LLM as optimizer to iteratively refine attack strings, inducing the RAG model to reveal private data in its responses. Notably, our attack method does not require any knowledge about the target LLM, including its gradient information or model type. Instead, the attack can be executed solely through query access to the RAG model. We evaluate the effectiveness of our attack on multiple LLM architectures, including Qwen2, Llama3.1, and GPT-4o, across different attack tasks such as Entire Documents Extraction and Private Identity Information (PII) Extraction. Under the same permission setting as the existing method, the Mean Rouge-L Recall (MRR) of our method can reach more than 0.95 on average in the Entire Documents Extraction task, and we can steal PII from the retrieved documents with close to 99% accuracy in the PII Extraction task, highlighting the risk of privacy leakage in RAG systems.

2274Supervised Dimension Contrastive Learning

[openreview] [pdf]

Abstract Self-supervised learning has emerged as an effective pre-training strategy for representation learning using large-scale unlabeled data. However, models pre-trained with self-supervised learning still require supervised fine-tuning to achieve optimal task-specific performance. Due to the lack of label utilization, it is difficult to accurately distinguish between positive and hard negative samples. Supervised contrastive learning methods address the limitation by leveraging labels, but they focus on global representations, leading to limited feature diversity and high cross-correlation between representation dimensions. To address these challenges, we propose Supervised Dimension Contrastive Learning, a novel approach that combines supervision with dimension-wise contrastive learning. Inspired by redundancy reduction techniques like Barlow Twins, this approach reduces cross-correlation between embedding dimensions while enhancing class discriminability. The aggregate function combines the embedding dimensions to generate predicted class variables, which are optimized to correlate with their corresponding class labels. Orthogonal regularization is applied to ensure the full utilization of all dimensions by enforcing full-rankness in the aggregate function. We evaluate our method on both in-domain supervised classification tasks and out-of-domain transfer learning tasks, demonstrating its superior performance compared to traditional supervised learning, supervised contrastive learning, and self-supervised learning methods. Our results show that the proposed method effectively reduces inter-dimensional correlation and enhances class discriminability, proving its generalizability across various downstream tasks.

2275Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models

[openreview] [pdf]

Abstract Conversational Large Language Models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different jailbreak types circumvent safeguards, this paper analyses model activations on different jailbreak inputs. We find that it is possible to extract a jailbreak vector from a single class of jailbreaks that works to mitigate jailbreak effectiveness from other classes. This may indicate that different kinds of effective jailbreaks operate via similar internal mechanisms. We investigate a potential common mechanism of harmfulness feature suppression, and provide evidence for its existence by looking at the harmfulness vector component. These findings offer actionable insights for developing more robust jailbreak countermeasures and lay the groundwork for a deeper, mechanistic understanding of jailbreak dynamics in language models.

2276BALCONI: BALancing CONtext and Internal Knowledge For Training Flexible LLMs

[openreview] [pdf]

Abstract The faithfulness to the context is significant for large language models (LLMs) in tasks such as Retrieval-Augmented Generation (RAG) or Information Extraction. However, LLMs can exhibit a “stubborn” reliance on their internal knowledge, which leads to failure in maintaining faithfulness to the context. Ideally, a model should leverage the given context if the user instruction requires to, yet remain correctness based on internal knowledge when the instruction does not provide the context. Considering such scenarios, we propose a balanced benchmark, FaithfulBench, to evaluate the faithfulness of LLMs, together with internal knowledge correctness in LLMs and evaluate whether the improvement in faithfulness would affect internal knowledge. Extensive experiments show that LLMs can be unfaithful to the context to some extent and in the Multi-choice QA, we observe an obvious negative correlation between faithfulness and internal knowledge correctness across different LLMs. Then based on the analysis of faithfulness enhancement methods, we find that instruction tuning using counterfactual data can significantly improve the model’s context faithfulness, but compromise the model’s internal knowledge. To address such a issue, we propose a straightforward yet effective approach BALCONI training by training with mixup data of factual requests, context requests, and NoAns (I cannot tell the answer from the context) requests. Experiments on our benchmark and a context-based machine translation task demonstrate that BALCONI can achieve a well-balanced effect in improving the balanced faithfulness and internal knowledge.

2277A Unified Theoretical Framework for Understanding Difficult-to-learn Examples in Contrastive Learning

[openreview] [pdf]

Abstract Unsupervised contrastive learning has shown significant performance improvements in recent years, often approaching or even rivaling supervised learning in various tasks. However, its learning mechanism is fundamentally different from that of supervised learning. Previous works have shown that difficult-to-learn examples (well-recognized in supervised learning as examples around the decision boundary), which are essential in supervised learning, contribute minimally in unsupervised settings. In this paper, perhaps surprisingly, we find that the direct removal of difficult-to-learn examples, although reduces the sample size, can boost the downstream classification performance of contrastive learning. To uncover the reasons behind this, we develop a theoretical framework modeling the similarity between different pairs of samples. Guided by this theoretical framework, we conduct a thorough theoretical analysis revealing that the presence of difficult-to-learn examples negatively affects the generalization of contrastive learning. Furthermore, we demonstrate that the removal of these examples, and techniques such as margin tuning and temperature scaling can enhance its generalization bounds, thereby improving performance. Empirically, we propose a simple and efficient mechanism for selecting difficult-to-learn examples and validate the effectiveness of the aforementioned methods, which substantiates the reliability of our proposed theoretical framework.

2278Unsupervised Reinforcement Learning by Maximizing Skill Density Deviation

[openreview] [pdf]

Abstract Unsupervised Reinforcement Learning (RL) aims to discover diverse behaviors that can accelerate the learning of downstream tasks. Previous methods typically focus on entropy-based exploration or empowerment-driven skill learning. However, entropy-based exploration struggles in large-scale state spaces (e.g., images), and empowerment-based methods with Mutual Information (MI) estimations have limitations in state exploration. To address these challenges, we propose a novel skill discovery objective that maximizes the deviation of the state density of one skill from the explored regions of other skills, encouraging inter-skill state diversity similar to the initial MI objective. For state-density estimation, we construct a novel conditional autoencoder with soft modularization for different skill policies in high-dimensional space. To incentivize intra-skill exploration, we formulate an intrinsic reward based on the learned autoencoder that resembles count-based exploration in a compact latent space. Through extensive experiments in challenging state and image-based tasks, we find our method learns meaningful skills and achieves superior performance in various downstream tasks.

22793D Affordance Reconstruction from Egocentric Demonstration Video

[openreview] [pdf]

Abstract Developing robots capable of generalized skills remains an exceedingly challenging task. Drawing from psychology, the concept of affordance has emerged as a promising intermediate representation to guide robot manipulation. However, prior work has primarily focused on 2D affordances from video, neglecting critical spatial information such as camera positioning, absolute position, depth and geometry. In this paper, we present a novel training-free method that constructs 3D affordances from egocentric demonstration videos. To address the challenge of insufficient static, high-quality frames for 3D reconstruction in egocentric videos, we employ the 3D foundational model DUST3R, which reconstructs scenes from sparse images without requiring COLMAP. We analyze videos using hand detection to identify contact times and 2D contact points, reconstruct these interactions using DUST3R, and project the 2D contact points into 3D space using gaussian heatmaps. Finally, we derive hand trajectories through 3D hand pose estimation and process them using linear regression to integrate the spatiotemporal dynamics of human-object interactions. We demonstrate the effectiveness of our method on the ego4d-exo dataset for seven real-world hand-object manipulation tasks in cooking scenes.

2280Branches: A Fast Dynamic Programming and Branch & Bound Algorithm for Optimal Decision Trees

[openreview] [pdf]

Abstract Decision Tree (DT) Learning is a fundamental problem in Interpretable Machine Learning, yet it poses a formidable optimisation challenge. Despite numerous efforts dating back to the early 1990’s, practical algorithms have only recently emerged, primarily leveraging Dynamic Programming (DP) and Branch & Bound (B&B) techniques. These methods fall into two categories: algorithms like DL8.5, MurTree and STreeD utilise an efficient DP strategy but lack effective bounds for pruning the search space; while algorithms like OSDT and GOSDT employ more efficient pruning bounds but at the expense of a less refined DP strategy. We introduce Branches, a new algorithm that combines the strengths of both approaches. Using DP and B&B with a novel analytical bound for efficient pruning, Branches offers both speed and sparsity optimisation. Unlike other methods, it also handles non-binary features. Theoretical analysis shows its lower complexity compared to existing methods, and empirical results confirm that Branches outperforms the state-of-the-art in speed, iterations, and optimality.

2281Provably Efficient Linear Bandits with Instantaneous Constraints in Non-Convex Feature Spaces

[openreview] [pdf]

Abstract In linear stochastic bandits, tasks with instantaneous hard constraints present significant challenges, particularly when the feature space is non-convex or discrete. This is especially relevant in applications such as financial management, recommendation systems, and medical treatment selection, where safety constraints appear in non-convex forms or where decisions must often be made within non-convex and discrete sets. In these systems, bandit methods rely on the ability of feature functions to extract critical features. However, in contrast to the star-convexity assumption commonly discussed in the literature, these feature functions often lead to non-convex and more complex feature spaces. In this paper, we investigate linear bandits and introduce a method that operates effectively in a non-convex feature space while satisfying instantaneous hard constraints at each time step. We demonstrate that our method, with high probability, achieves a regret of O~(d(1+τϵι)T)\tilde{\mathcal{O}}\big( d (1+\frac{\tau}{\epsilon \iota}) \sqrt{T}\big) and meets the instantaneous hard constraints, where dd represents the feature space dimension, TT the total number of rounds, and τ\tau a safety related parameter. The constant parameters ϵ\epsilon and ι\iota are related to our localized assumptions around the origin and the optimal point. In contrast, standard safe linear bandit algorithms that rely on the star-convexity assumption often result in linear regret. Furthermore, our approach handles discrete action spaces while maintaining a comparable regret bound. Moreover, we establish an information-theoretic lower bound on the regret of Ω(max(d1)T,1ϵι2)\Omega \left( \max{ \sqrt{(d-1)T}, \frac{1}{\epsilon \iota^2} } \right) for Tmax(d1,32eϵι2)T \geq \max (d-1, \frac{32 e}{\epsilon \iota^2}), emphasizing the critical role of ϵ\epsilon and ι\iota in the regret upper bound. Lastly, we provide numerical results to validate our theoretical findings.

2282Structure-Preserving Text-Based Editing for Few-Step Diffusion Models

[openreview] [pdf]

Abstract Text-based image editing aims to generate an image that corresponds to the given text prompt, but with the structure of the original source image. Existing methods often rely on attention maps in diffusion models (DMs) for structure preservation, as these features are considered to play a primary role in determining the spatial layout. However, we find that these methods struggle to preserve the spatial layout when applied to few-step DMs (e.g., SDXL-Turbo), limiting their use cases to the slower multi-step DMs (e.g., Stable Diffusion). In this work, we investigate the limitations of these approaches in terms of intermediate feature representations. Our findings indicate that for few-step DMs, the attention layers have less influence in determining the structure. To tackle this, we localize layers within the network that better control spatial layout and inject these features during the editing process. Additionally, we disentangle structural information from other features to avoid conflicts between the injected features and the text prompt. This ensures that the edited image faithfully follows the prompt while preserving the source structure. Our method outperforms existing text-based editing baselines.

2283Temporal-Difference Variational Continual Learning

[openreview] [pdf]

Abstract A crucial capability of Machine Learning models in real-world applications is the ability to continuously learn new tasks. This adaptability allows them to respond to potentially inevitable shifts in the data-generating distribution over time. However, in Continual Learning (CL) settings, models often struggle to balance learning new tasks (plasticity) with retaining previous knowledge (memory stability). Consequently, they are susceptible to Catastrophic Forgetting, which degrades performance and undermines the reliability of deployed systems. Variational Continual Learning methods tackle this challenge by employing a learning objective that recursively updates the posterior distribution and enforces it to stay close to the latest posterior estimate. Nonetheless, we argue that these methods may be ineffective due to compounding approximation errors over successive recursions. To mitigate this, we propose new learning objectives that integrate the regularization effects of multiple previous posterior estimations, preventing individual errors from dominating future posterior updates and compounding over time. We reveal insightful connections between these objectives and Temporal-Difference methods, a popular learning mechanism in Reinforcement Learning and Neuroscience. We evaluate the proposed objectives on challenging versions of popular CL benchmarks, demonstrating that they outperform standard Variational CL methods and non-variational baselines, effectively alleviating Catastrophic Forgetting.

2284Comet: A Communication-efficient and Performant Approximation for Private Transformer Inference

[openreview] [pdf]

Abstract The prevalent use of Transformer-like models, exemplified by ChatGPT in modern language processing applications, underscores the critical need for enabling private inference essential for many cloud-based services reliant on such models. However, current privacy-preserving frameworks impose significant communication burden, especially for non-linear computation in Transformer model. In this paper, we introduce a novel plug-in method Comet to effectively reduce the communication cost without compromising the inference performance. We second introduce an efficient approximation method to eliminate the heavy communication in finding good initial approximation. We evaluate our Comet on Bert and RoBERTa models with GLUE benchmark datasets, showing up to 3.9 less communication and 3.5 speedups while keep competitive model performance compared to the prior art.

2285Uncoupled and Convergent Learning in Monotone Games under Bandit Feedback

[openreview] [pdf]

Abstract We study the problem of no-regret learning algorithms for general monotone and smooth games and their last-iterate convergence properties. Specifically, we investigate the problem under bandit feedback and strongly uncoupled dynamics, which allows modular development of the multi-player system that applies to a wide range of real applications. We propose a mirror-descent-based algorithm, which converges in O(T1/4)O(T^{-1/4}) and is also no-regret. The result is achieved by a dedicated use of two regularizations and the analysis of the fixed point thereof. The convergence rate is further improved to O(T1/2)O(T^{-1/2}) in the case of strongly monotone games. Motivated by practical tasks where the game evolves over time, the algorithm is extended to time-varying monotone games. We provide the first non-asymptotic result in converging monotone games and give improved results for equilibrium tracking games.

2286Zero-Shot Task-Level Adaptation via Coarse-to-Fine Policy Refinement and Holistic-Local Contrastive Representations

[openreview] [pdf]

Abstract Meta-reinforcement learning offers a mechanism for zero-shot adaptation, enabling agents to handle new tasks with parametric variation in real-world environments. However, existing methods still struggle with task-level adaptation, which demands generalization beyond simple variations within tasks, thereby limiting their practical effectiveness. This limitation stems from several challenges, including the poor task representations and inefficient policy learning, resulting from the underutilization of hierarchical structure inherent in task-level adaptation. To address these challenges, we propose a Coarse-to-Fine Policy Refinement combined with a Holistic-Local Contrastive Representation method to enable effective zero-shot policy adaptation. Specifically, in terms of policy learning, we use task language instructions as prior knowledge to select skill-specific expert modules as a coarse policy. This coarse policy is then refined by a fine policy generated through a hypernetwork, producing a task-aware policy based on task representations. Additionally, for task representation, we employ contrastive learning from both holistic and local perspectives to enhance task representations for more effective policy adaptation. Experimental results demonstrate that our method significantly improves learning efficiency and zero-shot adaptation on new tasks, outperforming previous methods by approximately 42.3% and 45.4% in success rate on the Meta-World ML-10 and ML-45 benchmarks, respectively.

2287Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation

[openreview] [pdf]

Abstract Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on “action,” which involves behavior cloning from extensive collections of robotic data, while the other emphasizes “vision,” enhancing model generalization by pre-training representations or generative models, also referred to as world models, using large-scale visual datasets. This paper presents an end-to-end paradigm that predicts actions using inverse dynamics models conditioned on the robot’s forecasted visual states, named Predictive Inverse Dynamics Models (PIDM). By closing the loop between vision and action, the end-to-end PIDM can be a better scalable action learner. In practice, we use Transformers to process both visual states and actions, naming the model Seer. It is initially pre-trained on large-scale robotic datasets, such as DROID, and can be adapted to real-world scenarios with a little fine-tuning data. Thanks to large-scale, end-to-end training and the continuous synergy between vision and action at each execution step, Seer significantly outperforms state-of-the-art methods across both simulation and real-world experiments. It achieves improvements of 13% on the LIBERO-LONG benchmark, 22% on CALVIN ABC-D, and 43% in real-world tasks. Notably, it demonstrates superior generalization for novel objects, lighting conditions, and environments under high-intensity disturbances. Code and models will be publicly available.

2288Foundation of Scalable Constraint Learning from Human Feedback

[openreview] [pdf]

Abstract Constraint learning from human feedback (CLHF) has garnered significant interest in the domain of safe reinforcement learning (RL) due to the challenges associated with designing constraints that elicit desired behaviors. However, a comprehensive theoretical analysis of CLHF is still missing. This paper addresses this gap by establishing a theoretical foundation. Concretely, trajectory-wise feedback, which is the most natural form of feedback, is shown to be helpful only for learning chance constraints. Building on this insight, we propose and theoretically analyze algorithms for CLHF and for solving chance constrained RL problems. Our algorithm is empirically shown to outperform an existing algorithm.

2289Estimating Statistical Similarity Between Product Distributions

[openreview] [pdf]

Abstract We investigate the problem of computing the \emph{statistical or total variation (TV) similarity} between distributions PP and QQ, which is defined as sTV(P,Q):=1dTV(P,Q)s_{\mathrm{TV}}(P,Q) := 1 - d_{\mathrm{TV}}(P, Q), where dTVd_{\mathrm{TV}} is the total variation distance between PP and QQ. Statistical similarity is a basic measure of similarity between distributions with several natural interpretations. We focus on the case when PP and QQ are products of Bernoulli trials. Recent work has established, somewhat surprisingly, that even for this simple class of distributions exactly computing the TV distance (and hence statistical similarity) is #P\mathsf{P}-hard. This motivates the question of designing multiplicative approximation algorithms for these computational tasks. It is known that the TV distance computation admits a fully polynomial-time deterministic approximation scheme (FPTAS). It remained an open question whether efficient approximation schemes exist for estimating the statistical similarity between two product distributions. In this work, we affirmatively answer this question by designing an FPTAS for estimating the statistical similarity between two product distributions. To obtain our result, we introduce a new variant of the knapsack problem, which we call multidimensional Masked Knapsack problem, and design an FPTAS to estimate the number of solutions to this problem. This result might be of independent interest.

2290Learning to Ground VLMs without Forgetting

[openreview] [pdf]

Abstract Spatial awareness is key to enable embodied multimodal AI systems. Yet, without vast amounts of spatial supervision, current Visual Language Models (VLMs) struggle at this task. In this paper, we introduce LynX, a framework that equips pretrained VLMs with visual grounding ability without forgetting their existing image and language understanding skills. To this end, we propose a Dual Mixture of Experts module that modifies only the decoder layer of the language model, using one frozen Mixture of Experts (MoE) pre-trained on image and language understanding and another learnable MoE for new grounding capabilities. This allows the VLM to retain previously learned knowledge and skills, while acquiring what is missing. To train the model effectively, we generate a high-quality synthetic dataset we call SCouT, which mimics human reasoning in visual grounding. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process, thereby simplifying the task of visual grounding. We evaluate LynX on several object detection and visual grounding datasets, demonstrating strong performance in object detection, zero-shot localization and grounded reasoning while maintaining its original image and language understanding capabilities on seven standard benchmark datasets.

2291Efficient Exploration and Discriminative World Model Learning with an Object-Centric Abstraction

[openreview] [pdf]

Abstract In the face of difficult exploration problems in reinforcement learning, we study whether giving an agent an object-centric mapping (describing a set of items and their attributes) allow for more efficient learning. We found this problem is best solved hierarchically by modelling items at a higher level of state abstraction to pixels, and attribute change at a higher level of temporal abstraction to primitive actions. This abstraction simplifies the transition dynamic by making specific future states easier to predict. We make use of this to propose a fully model-based algorithm that learns a discriminative world model, plans to explore efficiently with only a count-based intrinsic reward, and can subsequently plan to reach any discovered (abstract) states.We demonstrate the model’s ability to (i) efficiently solve single tasks, (ii) transfer zero-shot and few-shot across item types and environments, and (iii) plan across long horizons. Across a suite of 2D crafting and MiniHack environments, we empirically show our model significantly out-performs state-of-the-art low-level methods (without abstraction), as well as performant model-free and model-based methods using the same abstraction. Finally, we show how to reinforce learn low level object-perturbing policies, and supervise learn the object mapping itself.

2292Deliberate Reasoning for LLMs as Structure-aware Planning with Accurate World Model

[openreview] [pdf]

Abstract Enhancing the reasoning capabilities of large language models (LLMs) remains a key challenge, especially for tasks that require complex, multi-step decision-making. Humans excel at these tasks by leveraging deliberate planning with an internal world model to simulate the potential outcomes of various actions. Inspired by this, we propose a novel multi-step reasoning framework for LLMs, referred to as Structure-aware Planning with Accurate World Model (SWAP). Unlike previous approaches that rely solely on Chain-of-Thought (CoT) reasoning in natural language, SWAP incorporates structural information to guide the reasoning process via a world model and provides a soft verification mechanism over the steps. Moreover, SWAP overcomes the challenge of accurate world state predictions in complex reasoning tasks by introducing a Generator-Discriminator architecture, which enables more reliable world modeling. Specifically, the generator predicts the next state, and the discriminator ensures alignment with the logical consistency required by the problem context. SWAP also encourages the policy model to explore a broad range of potential actions to prevent premature convergence. By resolving the bottlenecks of generation diversity for both actions and states using diversity-based modeling (DBM) and improving discrimination accuracy through contrastive ranking (CR), SWAP significantly enhances the reasoning performance of LLMs. We evaluate SWAP across diverse reasoning-intensive benchmarks including math reasoning, logical reasoning, and coding tasks. Extensive experiments demonstrate that SWAP achieves substantial improvements over the baselines and consistently outperforms existing LLMs of similar sizes.

2293Distribution-Specific Agnostic Conditional Classification With Halfspaces

[openreview] [pdf]

Abstract We study “selective” or “conditional” classification problems under an agnostic setting. Classification tasks commonly focus on modeling the relationship between features and categories that captures the vast majority of data. In contrast to common machine learning frameworks, conditional classification intends to model such relationships only on a subset of the data defined by some selection rule. Most work on conditional classification either solves the problem in a realizable setting or does not guarantee the error is bounded compared to an optimal solution. In this work, we consider selective/conditional classification by sparse linear classifiers for subsets defined by halfspaces, and give both positive as well as negative results for Gaussian feature distributions. On the positive side, we present the first PAC-learning algorithm for homogeneous halfspace selectors with error guaranteeO(opt), whereoptis the smallest conditional classification error over the given class of classifiers and homogeneous halfspaces. On the negative side, we find that, under cryptographic assumptions, approximating the conditional classification loss within a small additive error is computationally hard even under Gaussian distribution. We prove that approximating conditional classification is at least as hard as approximating agnostic classification in both additive and multiplicative form.

2294Spatial-temporal Graph Attention Network for Forex Forecasting with Hierarchical Transformer

[openreview] [pdf]

Abstract The foreign exchange market, with its daily trading volume reaching nearly trillions of dollars, presents significant opportunities for the application of advanced predictive analytics. Traditional exchange rate forecasting methods often overlook the interdependencies between currencies and struggle with long-range data dependencies, leading to challenges in capturing the true market dynamics. To overcome these limitations, this paper introduces a novel Spatial-Temporal Graph Attention Network with Hierarchical Transformer (STGAT). Our model innovatively combines spatial graph convolutions with a dual-view temporal transformer-based mechanism, utilizing a Temporal Linearity Graph Attention Network (TLGAT) to account for currency relations in a time-sensitive manner. By integrating a linear attention mechanism for enhanced efficiency and capturing both local and global sequential data embeddings, STGAT provides a framework based on a hierarchical transformer for predicting exchange rates. We validate our approach on exchange rates of seventeen currencies over 2,092 trading days, demonstrating superior performance compared to state-of-the-art models.

2295Locate-then-edit for Multi-hop Factual Recall under Knowledge Editing

[openreview] [pdf]

Abstract The locate-then-edit paradigm has shown significant promise for knowledge editing (KE) in Large Language Models (LLMs). While previous methods perform well on single-hop fact recall tasks, they consistently struggle with multi-hop factual recall tasks involving newly edited knowledge. In this paper, leveraging tools in mechanistic interpretability, we first identify that in multi-hop tasks, LLMs tend to retrieve implicit subject knowledge from deeper MLP layers, unlike single-hop tasks, which rely on earlier layers. This distinction explains the poor performance of current methods in multi-hop queries, as they primarily focus on editing shallow layers, leaving deeper layers unchanged. To address this, we propose IFMET, a novel locate-then-edit KE approach designed to edit both shallow and deep MLP layers. IFMET employs multi-hop editing prompts and supplementary sets to locate and modify knowledge across different reasoning stages. Experimental results demonstrate that IFMET significantly improves performance on multi-hop factual recall tasks, effectively overcoming the limitations of previous locate-then-edit methods.

2296H2IL-MBOM: A Hierarchical World Model Integrating Intent and Latent Strategy as Opponent Modeling in Multi-UAV Game

[openreview] [pdf]

Abstract In the mixed cooperative-competitive scenario, the uncertain decisions of agents on both sides not only render learning non-stationary but also pose a threat to each other’s security. Existing methods either predict policy beliefs based on opponents’ interactive actions, goals, and rewards or predict trajectories and intents solely from local historical observations. However, the above private information is unavailable and these methods neglect the underlying dynamics of the environment and relationship between intentions, latent strategies, actions, and trajectories for both sides. To address these challenges, we propose a Hierarchical Interactive Intent-Latent-Strategy-Aware World Model based Opponent Model (H2IL-MBOM) and the Mutual Self-Observed Adversary Reasoning PPO (MSOAR-PPO) to enables both parties to dynamically and interactively predict multiple intentions and latent strategies, along with their trajectories based on self observation. Concretely, the high-level world model fuses related observations regarding opponents and multi-learnable intention queries to anticipate future intentions and trajectories of opponents and incorporate anticipated intentions into the low-level world model to infer how opponents’ latent strategies react and their influence on the trajectories of cooperative agents. We validate the effectiveness of the method and demonstrate its superior performance through comparisons with state-of-the-art model-free reinforcement learning and opponent modeling methods in more challenging settings involving multi-agent close-range air-combat environments with missiles.

2297Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models

[openreview] [pdf]

Abstract Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first pioneering Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Main results on Starcraft Multi-Agent Challenge (SMAC) and additional results on MAMujoco show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.

2298Time-Dependent Mirror Flows and Where to Find Them

[openreview] [pdf]

Abstract Explicit regularization and implicit bias are often studied separately, though in practice, they act in tandem. However, their interplay remains poorly understood. In this work, we show that explicit regularization modifies the behavior of implicit bias and provides a mechanism to control its strength. By incorporating explicit regularization into the mirror flow framework, we present a general approach to better understand implicit biases and their potential in guiding the design of optimization problems. Our primary theoretical contribution is the characterization of regularizations and parameterizations that induce a time-dependent Bregman potential, with a discussion of the implications of its temporal variation. Importantly, our framework encompasses single-layer attention, and application to sparse coding. Extending beyond our core assumptions, we apply this framework to LoRA finetuning, revealing an implicit bias towards sparsity.

2299Convergent Privacy Loss of Noisy-SGD without Convexity and Smoothness

[openreview] [pdf]

Abstract We study the Differential Privacy (DP) guarantee of hidden-state Noisy-SGD algorithms over a bounded domain. Standard privacy analysis for Noisy-SGD assumes all internal states are revealed, which leads to a divergent R’enyi DP bound with respect to the number of iterations. Ye & Shokri (2022) and Altschuler & Talwar (2022) proved convergent bounds for smooth (strongly) convex losses, and raise open questions about whether these assumptions can be relaxed. We provide positive answers by proving convergent R’enyi DP bound for non-convex non-smooth losses, where we show that requiring losses to have H"older continuous gradient is sufficient. We also provide a strictly better privacy bound compared to state-of-the-art results for smooth strongly convex losses. Our analysis relies on the improvement of shifted divergence analysis in multiple aspects, including forward Wasserstein distance tracking, identifying the optimal shifts allocation, and the H"older reduction lemma. Our results further elucidate the benefit of hidden-state analysis for DP and its applicability.

2300Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation

[openreview] [pdf]

Abstract Retrieval-augmented generation (RAG) addresses the limitation of large language models (LLMs) in achieving up-to-date information by integrating external knowledge sources, but it is hindered by noisy or irrelevant retrieved data, leading to reduced accuracy. Additionally, most RAG methods rely on task-specific supervision, reducing their adaptability across domains. To overcome these challenges, we propose WinnowRAG, a novel multi-agent debate-based RAG framework. WinnowRAG operates in two stages: in Stage I, query-aware clustering groups similar documents, with each cluster assigned to an LLM agent for generating personalized responses. A critic LLM then consolidates these answers, forming super-agents. In Stage II, the super-agents engage in a structured discussion to filter out incorrect or irrelevant information, ensuring only relevant knowledge is used for final response generation. Crucially, WinnowRAG is unsupervised and leverages pretrained LLMs without requiring fine-tuning, making it easily adaptable to various tasks. The experiments on various realistic datasets demonstrate the effectiveness of WinnowRAG over state-of-the-art baselines.

2301Composing Novel Classes: A Concept-Driven Approach to Generalized Category Discovery

[openreview] [pdf]

Abstract We tackle the generalized category discovery (GCD) problem, which aims to discover novel classes in unlabeled datasets by leveraging the knowledge of known classes. Previous works utilize the known class knowledge through shared representation spaces. Despite their progress, our analysis experiments show that novel classes can achieve impressive clustering results on the feature space of a known class pre-trained model, suggesting that existing methods may not fully utilize known class knowledge. To address it, we introduce a novel concept learning framework for GCD, named ConceptGCD, that categorizes concepts into two types: derivable and underivable from known class concepts, and adopts a stage-wise learning strategy to learn them separately. Specifically, our framework first extracts known class concepts by a known class pre-trained model and then produces derivable concepts from them by a generator layer with a covariance-augmented loss. Subsequently, we expand the generator layer to learn underivable concepts in a balanced manner ensured by a concept score normalization strategy and integrate a contrastive loss to preserve previously learned concepts. Extensive experiments on various benchmark datasets demonstrate the superiority of our approach over the previous state-of-the-art methods.

2302Towards Unifying Interpretability and Control: Evaluation via Intervention

[openreview] [pdf]

Abstract With the growing complexity and capability of large language models (LLMs), a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, seldom addressing both, with the connection between interpretation and control more broadly remaining tenuous. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess these methods’ practical utility and efficacy. To address the aforementioned issues, we propose intervention as a fundamental goal of interpretability and introduce success criteria to evaluate how well methods are able to control model behavior through interventions. We unify and extend four popular interpretability methods—sparse autoencoders, logit lens, tuned lens, and probing—into an abstract encoder-decoder framework. This framework maps intermediate latent representations to human-interpretable feature spaces, enabling interventions on these interpretable features, which can then be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics: intervention success rate and the coherence-intervention tradeoff, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) although current methods allow for intervention, they are inconsistent across various models and features, (2) lens-based methods outperform others in achieving simple, concrete interventions, and (3) interventions often compromise model performance and coherence, underperforming simpler alternatives, such as prompting, for steering model behavior and highlighting a critical shortcoming of current interpretability approaches in real-world applications requiring control. Code is made available for replicability.

2303Convergence Of Consistency Model With Multistep Sampling Under General Data Assumptions

[openreview] [pdf]

Abstract Diffusion models accomplish remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions to map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumption and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling.

2304Online-to-Offline RL for Agent Alignment

[openreview] [pdf]

Abstract Reinforcement learning (RL) has shown remarkable success in training agents to achieve high-performing policies, particularly in domains like Game AI where simulation environments enable efficient interactions. However, despite their success in maximizing these returns, such online-trained policies often fail to align with human preferences concerning actions, styles, and values. The challenge lies in efficiently adapting these online-trained policies to align with human preferences, given the scarcity and high cost of collecting human behavior data. In this work, we formalize the problem asonline-to-offlineRL and propose ALIGNment of Game AI to Preferences (ALIGN-GAP), an innovative approach for alignment of well-trained game agents to human preferences. Our method features a carefully designed reward model that encodes human preferences from limited offline data and incorporates curriculum-based preference learning to align RL agents with targeted human values. Experiments across diverse environments and preference types demonstrate the performance of ALIGN-GAP, achieving effective alignment with human preferences.

2305Weak-to-Strong Trustworthiness: Eliciting Trustworthiness with Weak Supervision

[openreview] [pdf]

Abstract The rapid proliferation of generative AI, especially large language models (LLMs), has led to their integration into a variety of applications. An emergent behavior known as weak-to-strong generalization—where a strong model trained on a weak model’s outputs surpasses the weak model in task performance—has gained significant attention. However, whether trustworthiness properties such as robustness, fairness and privacy can be transferred in a similar manner remains an open question. In this work, we investigate this critical question by exploring the transfer of trustworthiness properties from weak to strong models via weak-to-strong generalization. Specifically, we examine whether a strong model can inherit or even enhance trustworthiness attributes when fine-tuned on a weak model’s outputs. We refer to this as weak-to-strong trustworthiness. To this end, we introduce two novel approaches aimed at improving trustworthiness transfer between weak and strong models: 1) Weak Trustworthiness Finetuning (Weak TFT), which applies trustworthiness regularization during the fine-tuning of the weak model, and 2) Weak and Weak-to-Strong Trustworthiness Finetuning (Weak+WTS TFT), which extends trustworthiness regularization to both the weak model and the strong model during fine-tuning. Our experimental evaluation on real-world datasets (Adult, OOD Style Transfer, AdvGLUE++, and Enron Emails) reveals that while some trustworthiness properties, such as fairness, adversarial, and OOD robustness, show significant improvement in transfer when both models were regularized, others like privacy do not exhibit signs of weak-to-strong trustworthiness. As the first study to explore the transfer of trustworthiness properties via weak-to-strong generalization, our work provides valuable insights into the potential and limitations of this method. Our findings highlight the importance of systematically studying trustworthiness transfer to develop AI systems that are not only accurate but also ethically aligned and reliable in critical applications.

2306Adversarial Attacks on Cooperative Multi-agent Bandits

[openreview] [pdf]

Abstract Cooperative multi-agent multi-armed bandits (CMA2B) consider the collaborative efforts of multiple agents in a shared multi-armed bandit game. We study latent vulnerabilities exposed by this collaboration and consider adversarial attacks on a few agents with the goal of influencing the decisions of the rest. More specifically, we study adversarial attacks on CMA2B in both homogeneous settings, where agents operate with the same arm set, and heterogeneous settings, where agents may have distinct arm sets. In the homogeneous setting, we propose attack strategies that, by targeting just one agent, convince all agents to select a particular target arm To(T)T-o(T) times while incurring o(T)o(T) attack costs in TT rounds. In the heterogeneous setting, we prove that a target arm attack requires linear attack costs and propose attack strategies that can force a maximum number of agents to suffer linear regrets while incurring sublinear costs and only manipulating the observations of a few target agents. Numerical experiments validate the effectiveness of our proposed attack strategies.

2307Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy

[openreview] [pdf]

Abstract Large Language Models (LLMs) are susceptible to security and safety threats, such as prompt injection, prompt extraction, and harmful requests. One major cause of these vulnerabilities is the lack of an instruction hierarchy. Modern LLM architectures treat all inputs equally, failing to distinguish between and prioritize various types of instructions, such as system messages, user prompts, and data. As a result, lower-priority user prompts may override more critical system instructions, including safety protocols. Existing approaches to achieving instruction hierarchy, such as delimiters and instruction-based training, do not address this issue at the architectural level. We introduce the I\textbf{I}nstructional S\textbf{S}egment E\textbf{E}mbedding (ISE) technique, inspired by BERT, to modern large language models, which embeds instruction priority information directly into the model. This approach enables models to explicitly differentiate and prioritize various instruction types, significantly improving safety against malicious prompts that attempt to override priority rules. Our experiments on the Structured Query and Instruction Hierarchy benchmarks demonstrate an average robust accuracy increase of up to 15.75% and 18.68%, respectively. Furthermore, we observe an improvement in instruction-following capability of up to 4.1% evaluated on AlpacaEval. Overall, our approach offers a promising direction for enhancing the safety and effectiveness of LLM architectures.

2308TRENDy: Temporal Regression of Effective Nonlinear Dynamics

[openreview] [pdf]

Abstract Spatiotemporal dynamics pervade the natural sciences, from the morphogen dynamics underlying patterning in animal pigmentation to the protein waves controlling cell division. A central challenge lies in understanding how controllable parameters induce qualitative changes in system behavior called bifurcations. This endeavor is made particularly difficult in realistic settings where governing partial differential equations (PDEs) are unknown and data is limited and noisy. To address this challenge, we propose TRENDy (Temporal Regression of Effective Nonlinear Dynamics), an equation-free approach to learning low-dimensional, predictive models of spatiotemporal dynamics. Following classical work in spatial coarse-graining, TRENDy first maps input data to a low-dimensional space of effective dynamics through a cascade of multiscale filtering operations. Our key insight is the recognition that these effective dynamics can be fit by a neural ordinary differential equation (NODE) having the same parameter space as the input PDE. The preceding filtering operations strongly regularize the phase space of the NODE, making TRENDy significantly more robust to noise compared to existing methods. We train TRENDy to predict the effective dynamics of synthetic and real data representing dynamics from across the physical and life sciences. We then demonstrate how our framework can automatically locate both Turing and Hopf bifurcations in unseen regions of parameter space. We finally apply our method to the analysis of spatial patterning of the ocellated lizard through development. Our results show how TRENDy’s synthesis of classical multiscale methods with techniques from data-driven dynamical systems forms a powerful tool for the study and control of spatiotemporal dynamics.

2309Data Attribution for Multitask Learning

[openreview] [pdf]

Abstract Data attribution quantifies the influence of individual training data points on machine learning models, aiding in their interpretation and improvement. While prior work has primarily focused on single-task learning (STL), this work extends data attribution to multitask learning (MTL). Data attribution in MTL presents new opportunities for interpreting and improving MTL models while also introducing unique technical challenges. On the opportunity side, data attribution in MTL offers a natural way to efficiently measure task relatedness, a key factor that impacts the effectiveness of MTL. However, the shared and task-specific parameters in MTL models present challenges that require specialized data attribution methods. In this paper, we propose the MultiTask Influence Function (MTIF), a novel data attribution method tailored for MTL. MTIF leverages the structure of MTL models to efficiently estimate the impact of removing data points or excluding tasks on the predictions of specific target tasks, providing both data-level and task-level influence analysis. Extensive experiments on both linear and neural network models show that MTIF effectively approximates leave-one-out and leave-one-task-out effects. Moreover, MTIF facilitates fine-grained data selection, consistently improving model performance in MTL, and provides interpretable insights into task relatedness. Our work establishes a novel connection between data attribution and MTL, offering an efficient and scalable solution for measuring task relatedness and enhancing MTL models.

2310Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

[openreview] [pdf]

Abstract No absctract

2311Nexus: Specialization meets Adaptability for Efficiently Training Mixture of Experts

[openreview] [pdf]

Abstract Efficiency, specialization, and adaptability to new data distributions are qualities that are hard to combine in current Large Language Models. The Mixture of Experts (MoE) architecture has been the focus of significant research because its inherent conditional computation enables such desirable properties. In this work, we focus on “upcycling” dense expert models into an MoE, aiming to improve specialization while also adding the ability to adapt to new tasks easily. We introduce Nexus, an enhanced MoE architecture with adaptive routing where the model learns to project expert embeddings from domain representations. This approach allows Nexus to flexibly add new experts after the initial upcycling through separately trained dense models, without requiring large-scale MoE training for unseen data domains. Our experiments show that Nexus achieves a relative gain of up to 2.1% over the baseline for initial upcycling, and a 18.8% relative gain for extending the MoE with a new expert by using limited finetuning data. This flexibility of Nexus is crucial to enable an open-source ecosystem where every user continuously assembles their own MoE-mix according to their needs.

2312Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning LLMs

[openreview] [pdf]

Abstract Through supervised fine-tuning or reinforcement learning with human feedback, large language models can achieve a certain level of safety alignment during instruction fine-tuning. However, thesesafety guardrailsare often fragile, as models can easily generate harmful content after downstream fine-tuning. Although various methods have been proposed to mitigate this, our paper shifts focus to the durability of safety guardrails, beginning with their formation in the upstream alignment stages. The central question we explore is:Can we construct more durable safety guardrails for specific downstream tasks to ensure models remain safe after fine-tuning?Our experiments demonstrate that the durability of these safety guardrails is closely tied to the similarity between upstream and downstream datasets: higher similarity results in more fragile guardrails after fine-tuning, whereas lower similarity results in more durable guardrails. This finding highlights the importance of dataset diversity and privacy in upstream alignment data. Ensuring the diversity of the alignment dataset, which allows downstream datasets to be less similar to it, enhances the guardrail durability for fine-tuning. Maintaining its privacy prevents the exposure of alignment data that adversaries could exploit. Thus, we advocate for a dual strategy: prioritizing both the privacy and diversity of upstream alignment datasets to fortify safety guardrails against potential threats, ensuring long-term model robustness in real-world applications.

2313Unlearning Mapping Attack: Exposing Hidden Vulnerabilities in Machine Unlearning

[openreview] [pdf]

Abstract As machine learning becomes increasingly data-dependent, concerns over privacy and content regulation among data owners have intensified. Machine Unlearning has emerged as a promising solution, allowing for the removal of specific data from pre-trained systems to protect user privacy and regulate information. Existing research on Machine Unlearning has shown considerable success in eliminating the influence of certain data while preserving model performance. However, the resilience of Machine Unlearning to malicious attacks has not been thoroughly examined. In this paper, we investigate the hidden vulnerabilities within current Machine Unlearning techniques. We propose a novel adversarial attack, the Unlearning Mapping Attack (UMA), capable of undermining the unlearning process without altering its procedures. Through experiments on both generative and discriminative tasks, we demonstrate the susceptibility of existing unlearning techniques to UMA. These findings highlight the need to reassess unlearning objectives across various tasks, prompting the introduction of a Robust Unlearning standard that prioritizes protection against adversarial threats. Our extensive studies show the successful adaptation of current unlearning methods to this robust framework. The Python implementation will be made publicly available upon acceptance of the paper.

2314EVA-Gaussian: 3D Gaussian-based Real-time Human Novel View Synthesis under Diverse Camera Settings

[openreview] [pdf]

Abstract The feed-forward based 3D Gaussian Splatting method has demonstrated exceptional capability in real-time human novel view synthesis. However, existing approaches are restricted to dense viewpoint settings, which limits their flexibility in free-viewpoint rendering across a wide range of camera view angle discrepancies. To address this limitation, we propose a real-time pipeline named EVA-Gaussian for 3D human novel view synthesis across diverse camera settings. Specifically, we first introduce an Efficient cross-View Attention (EVA) module to accurately estimate the position of each 3D Gaussian from the source images. Then, we integrate the source images with the estimated Gaussian position map to predict the attributes and feature embeddings of the 3D Gaussians. Moreover, we employ a recurrent feature refiner to correct artifacts caused by geometric errors in position estimation and enhance visual fidelity. To further improve synthesis quality, we incorporate a powerful anchor loss function for both 3D Gaussian attributes and human face landmarks. Experimental results on the THuman2.0 and THumansit datasets showcase the superiority of our EVA-Gaussian approach in rendering quality across diverse camera settings. Project page:https://anonymousiclr2025.github.io/iclr2025/EVA-Gaussian.

2315Subgraph Federated Learning for Local Generalization

[openreview] [pdf]

Abstract Federated Learning (FL) on graphs enables collaborative model training to enhance performance without compromising the privacy of each client. However, existing methods often overlook the mutable nature of graph data, which frequently introduces new nodes and leads to shifts in label distribution. Since they focus solely on performing well on each client’s local data, they are prone to overfitting to their local distributions (i.e., local overfitting), which hinders their ability to generalize to unseen data with diverse label distributions. In contrast, our proposed method, FedLoG, effectively tackles this issue by mitigating local overfitting. Our model generates global synthetic data by condensing the reliable information from each class representation and its structural information across clients. Using these synthetic data as a training set, we alleviate the local overfitting problem by adaptively generalizing the absent knowledge within each local dataset. This enhances the generalization capabilities of local models, enabling them to handle unseen data effectively. Our model outperforms baselines in our proposed experimental settings, which are designed to measure generalization power to unseen data in practical scenarios. Our code is available athttps://anonymous.4open.science/r/FedLoG-89EE

2316Computational Limits of Low-Rank Adaptation for Transformer Models

[openreview] [pdf]

Abstract We study the computational limits of Low-Rank Adaptation (LoRA) for finetuning transformer-based models using fine-grained complexity theory. Our key observation is that the existence of low-rank decompositions within the gradient computation of LoRA adaptation leads to possible algorithmic speedup. This allows us to (i) identify a phase transition behavior of efficiency and (ii) prove the existence of nearly linear algorithms by controlling the LoRA update computation term by term, assuming the Strong Exponential Time Hypothesis (SETH). For the former, we identify a sharp transition in the efficiency of all possible rank-rr LoRA update algorithms for transformers, based on specific norms resulting from the multiplications of the input sequence XX, pretrained weights W{W^\star}, and adapter matrices αBA/r\alpha BA/r. Specifically, we derive a shared upper bound threshold for such norms, and show that efficient (sub-quadratic) approximation algorithms of LoRA exist only below this threshold. For the latter, we prove the existence of nearly linear approximation algorithms for LoRA adaptation by utilizing the hierarchical low-rank structures of LoRA gradients and approximating the gradients with a series of chained low-rank approximations. To showcase our theory, we consider two practical scenarios: partial (e.g., only WVW_V and WQW_Q) and full adaptations (e.g., WQW_Q, WVW_V, and WKW_K) of weights in attention heads.

2317Action Mapping for Reinforcement Learning in Continuous Environments with Constraints

[openreview] [pdf]

Abstract Deep reinforcement learning (DRL) has had success across various domains, but applying it to environments with constraints remains challenging due to poor sample efficiency and slow convergence. Recent literature explored incorporating model knowledge to mitigate these problems, particularly through the use of models that assess the feasibility of proposed actions. However, integrating feasibility models efficiently into DRL pipelines in environments with continuous action spaces is non-trivial. We propose a novel strategy, termed action mapping, that leverages feasibility models to streamline the learning process. By decoupling the learning of feasible actions from policy optimization, action mapping allows DRL agents to focus on selecting the optimal action from a reduced feasible action set. We demonstrate through experiments that action mapping significantly improves training performance in constrained environments with continuous action spaces, especially with imperfect feasibility models.

2318Communication-efficient Algorithms Under Generalized Smoothness Assumptions

[openreview] [pdf]

Abstract We provide the first proof of convergence for normalized error feedback algorithms across a wide range of machine learning problems. Despite their popularity and efficiency in training deep neural networks, traditional analyses of error feedback algorithms rely on the smoothness assumption that does not capture the properties of objective functions in these problems. Rather, these problems have recently been shown to satisfy generalized smoothness assumptions, and the theoretical understanding of error feedback algorithms under these assumptions remains largely unexplored. Moreover, to the best of our knowledge, all existing analyses under generalized smoothness either i) focus on single-node settings or ii) make unrealistically strong assumptions for distributed settings, such as requiring data heterogeneity, and almost surely bounded stochastic gradient noise variance. In this paper, we propose distributed error feedback algorithms that utilize normalization to achieve the O(1/K)\mathcal{O}(1/\sqrt{K}) convergence rate for nonconvex problems under generalized smoothness. Our analyses apply for distributed settings without data heterogeneity conditions, and enable stepsize tuning that is independent of problem parameters. Additionally, we provide strong convergence guarantees of normalized error feedback algorithms for stochastic settings. Finally, we show that normalized EF21, due to its larger allowable stepsizes, outperforms EF21 on various tasks, including the minimization of polynomial functions, logistic regression, and ResNet-20 training.

2319Conservative Contextual Bandits: Beyond Linear Representations

[openreview] [pdf]

Abstract Conservative Contextual Bandits (CCBs) address safety in sequential decision making by requiring that an agent’s policy, along with minimizing regret, also satisfies a safety constraint: the performance is not worse than a baseline policy (e.g., the policy that the company has in production) by more than (1+α)(1+\alpha) factor. Prior work developed UCB-style algorithms for this problem in the multi-armed (Wu et al., 2016) and contextual linear (Kazerouni et al., 2017) settings. However, in practice the cost of the arms is often a non-linear function, and therefore existing UCB algorithms are ineffective in such settings. In this paper, we consider CCBs beyond the linear case and develop two algorithms C-SquareCB\mathtt{C\text{-}SquareCB} and C-FastCB\mathtt{C\text{-}FastCB}, using Inverse Gap Weighting (IGW) based exploration and an online regression oracle. We show that the safety constraint is satisfied in high probability and that the regret for C-SquareCB\mathtt{C\text{-}SquareCB} is sub-linear in horizon TT, while the the regret for C-FastCB\mathtt{C\text{-}FastCB} is first-order and is sub-linear in LL^*, the cumulative loss of the optimal policy. Subsequently, we use a neural network for function approximation and online gradient descent as the regression oracle to provide O~(KT+K/α)\tilde{\mathcal{O}}\big(\sqrt{KT} + K/\alpha\big) and O~(KL+K(1+1/α))\tilde{\mathcal{O}}\big(\sqrt{KL^*} + K (1 + 1/\alpha)\big) regret bounds respectively. Finally, we demonstrate the efficacy of our algorithms on real world data, and show that they significantly outperform the existing baseline while maintaining the performance guarantee.

2320Standard Gaussian Process Can Be Excellent for High-Dimensional Bayesian Optimization

[openreview] [pdf]

Abstract A long-standing belief holds that Bayesian Optimization (BO) with standard Gaussian processes (GP) --- referred to as standard BO --- underperforms in high-dimensional optimization problems. While this belief seems plausible, it lacks both robust empirical evidence and theoretical justification. To address this gap, we present a systematic investigation. First, through a comprehensive evaluation across eleven widely used benchmarks, we found that while the popular Square Exponential (SE) kernel often leads to poor performance, using Mat’ern kernels enables standard BO to consistently achieve top-tier results, frequently surpassing methods specifically designed for high-dimensional optimization. Second, our theoretical analysis reveals that the SE kernel’s failure primarily stems from improper initialization of the length-scale parameters, which are commonly used in practice but can cause gradient vanishing in training. We provide a probabilistic bound to characterize this issue, showing that Mat’ern kernels are less susceptible and can robustly handle much higher dimensions. Third, we propose a simple robust initialization strategy that dramatically improves the performance of the SE kernel, bringing it close to state-of-the-art methods, without requiring any additional priors or regularization. We prove another probabilistic bound that demonstrates how the gradient vanishing issue can be effectively mitigated with our method. Our findings advocate for a re-evaluation of standard BO’s potential in high-dimensional settings.

2321Learning Graph Invariance by Harnessing Spuriosity

[openreview] [pdf]

Abstract Recently, graph invariant learning has become the \textit{de facto} approach to tackle the Out-of-Distribution (OOD) generalization failure in graph representation learning. They generically follow the framework of invariant risk minimization to capture the invariance of graph data from different environments. Despite some success, it remains unclear to what extent existing approaches have captured invariant features for OOD generalization on graphs. In this work, we find that representative OOD methods such as IRM and VRex, and their variants on graph invariant learning may have captured a limited set of invariant features. To tackle this challenge, we propose LIRS, a novel learning framework designed to Learn graph Invariance by Removing Spurious features. Different from most existing approaches that \textit{directly} learn the invariant features, LIRS takes an \textit{indirect} approach by first learning the spurious features and then removing them from the ERM-learned features, which contains both spurious and invariant features. We demonstrate that learning the invariant graph features in an \textit{indirect} way can learn a more comprehensive set of invariant features. Moreover, our proposed method outperforms the second-best method by as much as 25.50% across all competitive baseline methods, highlighting its effectiveness in learning graph invariant features.

2322Inference Scaling for Long-Context Retrieval Augmented Generation

[openreview] [pdf]

Abstract The scaling of inference computation has unlocked the potential of long-contextlarge language models (LLMs) across diverse settings. For knowledge-intensivetasks, the increased compute is often allocated to incorporate more external knowl-edge. However, without effectively utilizing such knowledge, solely expandingcontext does not always enhance performance. In this work, we investigate infer-ence scaling for retrieval augmented generation (RAG), exploring strategies beyondsimply increasing the quantity of knowledge. We focus on two inference scalingstrategies: in-context learning and iterative prompting. These strategies provideadditional flexibility to scale test-time computation (e.g., by increasing retrieveddocuments or generation steps), thereby enhancing LLMs’ ability to effectivelyacquire and utilize contextual information. We address two key questions: (1) Howdoes RAG performance benefit from thescaling of inference computationwhenoptimally configured? (2) Can we predict the optimaltest-time compute allocationfor a given budget by modeling the relationship between RAG performance andinference parameters? Our observations reveal that increasing inference computa-tion leads to nearly linear gains in RAG performance when optimally allocated, arelationship we describe as theinference scaling laws for RAG. Building on this,we further develop thecomputation allocation modelto estimate RAG performanceacross different inference configurations. The model predicts optimal inferenceparameters under various computation constraints, which align closely with theexperimental results. By applying these optimal configurations, we demonstratethat scaling inference compute on long-context LLMs achieves up to 58.9% gainson benchmark datasets compared to standard RAG.

2323OracleMamba: A Dynamic Market-Guided and Time State Selection Framework for Robust Stock Prediction

[openreview] [pdf]

Abstract Stock price prediction is a complex challenge due to the inherent volatility of financial markets and the influence of diverse factors such as macroeconomic conditions, capital flows, and market sentiment. Recent joint stock forecasting models focus on extracting temporal patterns from individual stock price series and combining them to model stock correlations. However, these models face two critical limitations: first, in long-term predictions, they retain both informative and excessive states, amplifying noise and increasing complexity; second, in short-term predictions, they prioritize market indices and technical indicators, neglecting the real-time influence of market sentiment, which can drive price movements independent of traditional indicators. While state space models (SSMs) like Mamba improve efficiency and capture long-distance relationships, they still underperform compared to Transformer-based models. To address these challenges, we propose OracleMamba, a novel framework that integrates a dynamic market-guided module for short-term forecasting and a SelectiveMamba module for long-term forecasting. The dynamic market-guided module fuses objective market data and subjective sentiment analysis to enhance short-term prediction accuracy. The SelectiveMamba module efficiently captures both spectral and temporal features using a 3D scan mechanism, which extracts and filters key signals from the time-series data. By integrating spectral features to identify market rhythms and temporal features to track price movements over time, the SelectiveMamba module reduces noise and preserves critical information for long-term forecasts. This framework significantly improves both model efficiency and accuracy, outperforming existing approaches across real-world stock prediction tasks.

2324Offline Inverse Constrained Reinforcement Learning for Safe-Critical Decision Making in Healthcare

[openreview] [pdf]

Abstract Reinforcement Learning (RL) applied in healthcare can lead to unsafe medical decisions and treatment, such as excessive dosages or abrupt changes, often due to agents overlooking common-sense constraints. Consequently, Constrained Reinforcement Learning (CRL) is a natural choice for safe decisions. However, specifying the exact cost function is inherently difficult in healthcare. Recent Inverse Constrained Reinforcement Learning (ICRL) is a promising approach that infers constraints from expert demonstrations. ICRL algorithms model Markovian decisions in an interactive environment. These settings do not align with the practical requirement of a decision-making system in healthcare, where decisions rely on historical treatment recorded in an offline dataset. To tackle these issues, we propose the Constraint Transformer (CT). Specifically, 1) we utilize a causal attention mechanism to incorporate historical decisions and observations into the constraint modeling, while employing a Non-Markovian layer for weighted constraints to capture critical states. 2) A generative world model is used to perform exploratory data augmentation, enabling offline RL methods to simulate unsafe decision sequences. In multiple medical scenarios, empirical results demonstrate that CT can capture unsafe states and achieve strategies that approximate lower mortality rates, reducing the occurrence probability of unsafe behaviors.

2325Cauchy-Schwarz Fairness Regularizer

[openreview] [pdf]

Abstract In this paper, we propose a novel approach to fair machine learning, the Cauchy-Schwarz fairness regularizer, which minimizes the Cauchy-Schwarz divergence between the prediction distribution and sensitive attributes. While existing methods effectively reduce bias as indicated by low values on specific fairness metrics, they frequently struggle to achieve a balanced performance across various fairness definitions. For example, many approaches may successfully attain low demographic parity yet still demonstrate significant disparities in equal opportunity. Theoretical studies have shown that the Cauchy-Schwarz divergence provides a tighter bound compared to the Kullback-Leibler divergence and gap parity, suggesting its potential to improve fairness in machine learning models. Our empirical evaluation, conducted on four tabular datasets and one image dataset, demonstrates that the Cauchy-Schwarz fairness regularizer achieves a more balanced performance across fairness metrics while maintaining satisfactory utility. It outperforms existing fairness approaches, providing a superior trade-off between fairness and utility. In addition, the Cauchy-Schwarz fairness regularizer is a versatile, plug-and-play fairness regularizer that can be easily integrated into various machine learning models to promote fairness.

2326Dealing with Frequency Collapse in Time Series Embeddings by Post-Embedding reMapping

[openreview] [pdf]

Abstract Transformer-based methods have made significant strides in time series forecasting tasks in recent years. However, we observe underfitting in numerous samples, e.g., pattern shifts or excessive deviation in extreme value regions when testing the transform-based model that converges on the training set. Through the proposed spectral analysis of adjacent embedding sequences, we identify a frequency collapse issue in the embedding features generated by the top layer of the transformer backbone. To address this, we propose the Post-Embedding ReMapping (PErM) strategy that improves the frequency-domain representation of embeddings using fixed non-linear functions. Both two kinds of PErM functions that we insert into the model can effectively resolve the frequency collapse issue and lead to significant improvements in prediction performance. Experimental results show that our method outperforms state-of-the-art algorithms across multiple datasets. We will release our code after the review phase.

2327Pareto Low-Rank Adapters: Efficient Multi-Task Learning with Preferences

[openreview] [pdf]

Abstract Multi-task trade-offs in machine learning can be addressed via Pareto Front Learning (PFL) methods that parameterize the Pareto Front (PF) with a single model. PFL permits to select the desired operational point during inference, contrary to traditional Multi-Task Learning (MTL) that optimizes for a single trade-off decided prior to training. However, recent PFL methodologies suffer from limited scalability, slow convergence, and excessive memory requirements, while exhibiting inconsistent mappings from preference to objective space. We introduce PaLoRA, a novel parameter-efficient method that addresses these limitations in two ways. First, we augment any neural network architecture with task-specific low-rank adapters and continuously parameterize the Pareto Front in their convex hull. Our approach steers the original model and the adapters towards learning general and task-specific features, respectively. Second, we propose a deterministic sampling schedule of preference vectors that reinforces this division of labor, enabling faster convergence and strengthening the validity of the mapping from preference to objective space throughout training. Our experiments show that PaLoRA outperforms state-of-the-art MTL and PFL baselines across various datasets, scales to large networks, reducing the memory overhead 23.831.723.8-31.7 times compared with competing PFL baselines in scene understanding benchmarks.

2328Towards Synergistic Path-based Explanations for Knowledge Graph Completion: Exploration and Evaluation

[openreview] [pdf]

Abstract Knowledge graph completion (KGC) aims to alleviate the inherent incompleteness of knowledge graphs (KGs), a crucial task for numerous applications such as recommendation systems and drug repurposing. The success of knowledge graph embedding (KGE) models provokes the question about the explainability: ``\textit{Which the patterns of the input KG are most determinant to the prediction}?‘’ Particularly, path-based explainers prevail in existing methods because of their strong capability for human understanding. In this paper, based on the observation that a fact is usually determined by the synergy of multiple reasoning chains, we propose a novel explainable framework, dubbed KGExplainer, to explore synergistic pathways. KGExplainer is a model-agnostic approach that employs a perturbation-based greedy search algorithm to identify the most crucial synergistic paths as explanations within the local structure of target predictions. To evaluate the quality of these explanations, KGExplainer distills an evaluator from the target KGE model, allowing for the examination of their fidelity. We experimentally demonstrate that the distilled evaluator has comparable predictive performance to the target KGE. Experimental results on benchmark datasets demonstrate the effectiveness of KGExplainer, achieving a human evaluation accuracy of 83.3% and showing promising improvements in explainability. Code is available at \url{https://anonymous.4open.science/r/KGExplainer-33A0}

2329Personalized Federated Learning via Variational Massage Passing

[openreview] [pdf]

Abstract Conventional federated learning (FL) aims to train a unified machine learning model that fits data distributed across various agents. However, statistical heterogeneity arising from diverse data resources renders the single global model trained by FL ineffective for all clients. Personalized federated learning (pFL) has been proposed to primarily address this challenge by tailoring individualized models to each client’s specific dataset while integrating global information during feature aggregation. Achieving efficient pFL necessitates the accurate estimation of global feature information across all the training data. Nonetheless, balancing the personalization of individual models with the global consensus of feature information remains a significant challenge in existing approaches. In this paper, we propose pFedVMP, a novel pFL approach that employs variational message passing (VMP) to design feature aggregation protocols. By leveraging the first-order and second-order statistical information, pFedVMP yields more precise estimates of the distributions of model parameters and global feature centroids. Additionally, pFedVMP is effective in boosting training accuracy and preventing overfitting by regularizing local training with global feature centroids. Extensive experiments on heterogeneous data conditions demonstrate that pFedVMP surpasses state-of-the-art methods in both effectiveness and fairness.

2330Montessori-Instruct: Generate Influential Training Data Tailored for Student Learning

[openreview] [pdf]

Abstract Synthetic data has been widely used to train large language models, but their generative nature inevitably introduces noisy, non-informative, and misleading learning signals. In this paper, we propose Montessori-Instruct, a novel data synthesis framework that tailors the data synthesis ability of the teacher language model toward the student language model’s learning process. Specifically, we utilize local data influence of synthetic training data points on students to characterize students’ learning preferences. Then, we train the teacher model with Direct Preference Optimization (DPO) to generate synthetic data tailored toward student learning preferences. Experiments with Llama3-8B-Instruct (teacher) and Llama3-8B (student) on Alpaca Eval and MT-Bench demonstrate that Montessori-Instruct significantly outperforms standard synthesis methods by 18.35% and 46.24% relatively. Our method also beats data synthesized by a stronger teacher model, GPT-4o. Further analysis confirms the benefits of teacher’s learning to generate more influential training data in the student’s improved learning, the advantages of local data influence in accurately measuring student preferences, and the robustness of Montessori-Instruct across different student models. Our code, data, and models will be open-sourced.

2331FreeTraj: Tuning-Free Trajectory Control via Noise Guided Video Diffusion

[openreview] [pdf]

Abstract Diffusion model has demonstrated remarkable capability in video generation, which further sparks interest in introducing trajectory control into the generation process. While existing works mainly focus on training-based methods (e.g., conditional adapter), we argue that diffusion model itself allows decent control over the generated content without requiring any training. In this study, we introduce a tuning-free framework to achieve trajectory-controllable video generation, by imposing guidance on both noise construction and attention computation. Specifically, 1) we first show several instructive phenomena and analyze how initial noises influence the motion trajectory of generated content. 2) Subsequently, we propose FreeTraj, a tuning-free approach that enables trajectory control by modifying noise sampling and attention mechanisms. 3) Furthermore, we extend FreeTraj to facilitate longer and larger video generation with controllable trajectories. Equipped with these designs, users have the flexibility to provide trajectories manually or opt for trajectories automatically generated by the LLM trajectory planner. Extensive experiments validate the efficacy of our approach in enhancing the trajectory controllability of video diffusion models. Generated video samples are available at the anonymous website:https://FreeTraj.github.io.

2332MARS: A Malignity-Aware Backdoor Defense in Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) is a distributed paradigm aimed at protecting participant data privacy by exchanging model parameters to achieve high-quality model training. However, this distributed nature also makes FL highly vulnerable to backdoor attacks. Notably, the recently proposed state-of-the-art (SOTA) attack, 3DFed (SP2023), uses an indicator mechanism to determine whether the backdoor models have been accepted by the defender and adaptively optimizes backdoor models, rendering existing defenses ineffective. In this paper, we first reveal that the failure of existing defenses lies in the employment of empirical statistical measures that are loosely coupled with backdoor attacks. Motivated by this, we propose a Malignity-Aware backdooR defenSe (MARS) that leverages backdoor energy (BE) to indicate the malicious extent of each neuron. To amplify malignity, we further extract the most prominent BE values from each model to form a concentrated backdoor energy (CBE). Finally, a novel Wasserstein distance-based clustering method is introduced to effectively identify backdoor models. Extensive experiments demonstrate that MARS can defend against SOTA backdoor attacks and significantly outperforms existing defenses.

2333Adversarial Robustness of Graph Transformers

[openreview] [pdf]

Abstract Existing studies have shown that Message-Passing Graph Neural Networks (MPNNs) are highly susceptible to adversarial attacks. In contrast, despite the increasing importance of Graph Transformers (GTs), their robustness properties are unexplored. Thus, for the purpose of robustness evaluation, we design the first adaptive attacks for GTs. We provide general design principles for strong gradient-based attacks on GTs w.r.t. structure perturbations and instantiate our attack framework for five representative and popular GT architectures. Specifically, we study GTs with specialized attention mechanisms and Positional Encodings (PEs) based on random walks, pair-wise shortest paths, and the Laplacian spectrum. We evaluate our attacks on multiple tasks and threat models, including structure perturbations on node and graph classification and node injection for graph classification. Our results reveal that GTs can be catastrophically fragile in many cases. Consequently, we show how to leverage our adaptive attacks for adversarial training, substantially improving robustness.

2334AlignIQL: Policy Alignment in Implicit Q-Learning through Constrained Optimization

[openreview] [pdf]

Abstract Implicit Q-learning (IQL) serves as a strong baseline for offline RL, which never needs to evaluate actions outside of the dataset through quantile regression. However, it is unclear how to recover the implicit policy from the learned implicit Q-function and whether IQL can utilize weighted regression for policy extraction. IDQL reinterprets IQL as an actor-critic method and gets weights of implicit policy, however, this weight only holds for the optimal value function under certain critic loss functions. In this work, we introduce a different way to solve the implicit policy-finding problem\textit{implicit policy-finding problem} (IPF) by formulating this problem as an optimization problem. Based on this optimization problem, we further propose two practical algorithms AlignIQL and AlignIQL-hard, which inherit the advantages of decoupling actor from critic in IQL and provide insights into why IQL can use weighted regression for policy extraction. Compared with IQL and IDQL, we find that our method keeps the simplicity of IQL and solves the implicit policy-finding problem. Experimental results on D4RL datasets show that our method achieves competitive or superior results compared with other SOTA offline RL methods. Especially in complex sparse reward tasks like AntMaze and Adroit, our method outperforms IQL and IDQL by a significant margin.

2335MotifDisco: Motif Causal Discovery For Time Series Motifs

[openreview] [pdf]

Abstract Many time series, particularly health data streams, can be best understood as a sequence of phenomenon or events, which we call motifs. A time series motif is a short trace segment which may implicitly capture an underlying phenomenon within the time series. Specifically, we focus on glucose traces collected from continuous glucose monitors (CGMs), which inherently contain motifs representing underlying human behaviors such as eating and exercise. The ability to identify and quantify causal relationships amongst motifs can provide a mechanism to better understand and represent these patterns, useful for improving deep learning and generative models and for advanced technology development (e.g., personalized coaching and artificial insulin delivery systems). However, no previous work has developed causal discovery methods for time series motifs. Therefore, in this paper we develop MotifDisco (motif disco-very of causality), a novel causal discovery framework to learn causal relations amongst motifs from time series traces. We formalize a notion of Motif Causality (MC), inspired from Granger Causality and Transfer Entropy, and develop a Graph Neural Network-based framework that learns causality between motifs by solving an unsupervised link prediction problem. We also integrate MC with three model use cases of forecasting, anomaly detection and clustering, to showcase the use of MC as a building block for other downstream tasks. Finally, we evaluate our framework and find that Motif Causality provides a significant performance improvement in all use cases.

2336Generalization from Starvation: Hints of Universality in LLM Knowledge Graph Learning

[openreview] [pdf]

Abstract Motivated by interpretability and reliability, we investigate how neural networks represent knowledge during graph learning. We find hints of universality, where equivalent representations are learned across a range of model sizes (from 102 to 109 parameters) and contexts (MLP toy models, LLM in-context learning and LLM training). We show that these attractor representations optimize generalization to unseen examples by exploiting properties of knowledge graph relations (e.g. symmetry and meta-transitivity). We find experimental support for such universality by showing that LLMs and simpler neural networks can be successfully stitched, i.e., by stitching the first part of one model to the last part of another, mediated only by an affine or almost affine transformation. We hypothesize that this dynamic toward simplicity and generalization is driven by ``intelligence from starvation”: where overfitting is minimized by pressure to minimize the use of resources that are either scarce or competed for against other tasks.

2337Glider: Global and Local Instruction-Driven Expert Router

[openreview] [pdf]

Abstract The availability of performant pre-trained models has led to a proliferation of fine-tuned expert models that are specialized to a particular domain or task. This has enabled the creation of powerful and adaptive routing-based “Model MoErging" methods with the goal of using expert modules to create an aggregate system with improved performance or generalization. However, existing MoErging methods often prioritize generalization to unseen tasks at the expense of performance on held-in tasks. This limitation adversely impacts practical applicability, as real-world deployments require robust performance across both known and novel tasks. We observe that current token-level routing mechanisms neglect the global semantic context of the input task. This token-wise independence hinders effective expert selection, particularly for held-in tasks, as routing decisions fail to incorporate the holistic semantic properties of the task. To address this, we propose a novel method, Global and Local Instruction Driven Expert Router (GLIDER) that integrates a multi-scale routing mechanism, encompassing a semantic global router and a learned local router. As recent LLMs demonstrate advanced reasoning capabilities for semantic-related contexts, the global router leverages this ability to enhance expert selection. By utilizing the input query and an LLM, the router generates semantic task instructions that guide the retrieval of the most relevant experts across all layers. This global guidance is complemented by a local router that facilitates token-level routing decisions within each module, enabling finer control and enhanced performance on unseen and challenging tasks. Our experiments using T5-based expert models for T0 and FLAN tasks demonstrate that GLIDER achieves substantially improved held-in performance while maintaining strong generalization on held-out tasks. Additionally, we perform ablations experiments to dive deeper into the components of GLIDER and plot routing distributions to show that GLIDER can effectively retrieve correct expert for held-in tasks while also demonstrating compositional capabilities for held-out tasks. Our experiments highlight the importance of our multi-scale routing that leverages LLM-driven semantic reasoning for MoErging methods. Our code is attached as supplementary material.

2338From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency

[openreview] [pdf]

Abstract Chain-of-thought (CoT) significantly enhances the reasoning performance of large language models (LLM). While current theoretical studies often attribute this improvement to increased expressiveness and computational capacity, we argue that expressiveness is not the primary limitation in the LLM regime, as current large models will fail on simple tasks. Using a parity-learning setup, we demonstrate that CoT can substantially improve sample efficiency even when the representation power is sufficient. Specifically, with CoT, a transformer can learn the function within polynomial samples, whereas without CoT, the required sample size is exponential. Additionally, we show that CoT simplifies the learning process by introducing sparse sequential dependencies among input tokens, and leads to a sparse and interpretable attention. We validate our theoretical analysis with both synthetic and real-world experiments, confirming that sparsity in attention layers is a key factor of the improvement induced by CoT.

2339Source-Free Target Domain Confidence Calibration

[openreview] [pdf]

Abstract In this study, we consider the setup of source-free domain adaptation and address the challenge of calibrating the confidence of a model adapted to the target domain using only unlabeled data. The primary challenge in addressing uncertainty calibration is the absence of labeled data which prevents computing the accuracy of the adapted network on the target domain. We address this by leveraging pseudo-labels generated from the source model’s predictions to estimate the true, unobserved accuracy. We demonstrate that, although the pseudo-labels are noisy, the network accuracy calculated using these pseudo-labels is similar to the accuracy obtained with the correct labels. We validate the effectiveness of our calibration approach by applying it to standard domain adaptation datasets and show that it achieves results comparable to, or even better than, previous calibration methods that relied on the availability of labeled source data.

2340Inverse decision-making using neural amortized Bayesian actors

[openreview] [pdf]

Abstract Bayesian observer and actor models have provided normative explanations for many behavioral phenomena in perception, sensorimotor control, and other areas of cognitive science and neuroscience. They attribute behavioral variability and biases to interpretable entities such as perceptual and motor uncertainty, prior beliefs, and behavioral costs. However, when extending these models to more naturalistic tasks with continuous actions, solving the Bayesian decision-making problem is often analytically intractable. Inverse decision-making, i.e. performing inference over the parameters of such models given behavioral data, is computationally even more difficult. Therefore, researchers typically constrain their models to easily tractable components, such as Gaussian distributions or quadratic cost functions, or resort to numerical approximations. To overcome these limitations, we amortize the Bayesian actor using a neural network trained on a wide range of parameter settings in an unsupervised fashion. Using the pre-trained neural network enables performing efficient gradient-based Bayesian inference of the Bayesian actor model’s parameters. We show on synthetic data that the inferred posterior distributions are in close alignment with those obtained using analytical solutions where they exist. Where no analytical solution is available, we recover posterior distributions close to the ground truth. We then show how our method allows for principled model comparison and how it can be used to disentangle factors that may lead to unidentifiabilities between priors and costs. Finally, we apply our method to empirical data from three sensorimotor tasks and compare model fits with different cost functions to show that it can explain individuals’ behavioral patterns.

2341Differentially Private Federated Learning with Time-Adaptive Privacy Spending

[openreview] [pdf]

Abstract Federated learning (FL) with differential privacy (DP) provides a framework for collaborative machine learning, enabling clients to train a shared model while adhering to strict privacy constraints. The framework allows each client to have an individual privacy guarantee, e.g., by adding different amounts of noise to each client’s model updates. One underlying assumption is that all clients spend their privacy budgets uniformly over time (learning rounds). However, it has been shown in the literature that learning in early rounds typically focuses on more coarse-grained features that can be learned at lower signal-to-noise ratios while later rounds learn fine-grained features that benefit from higher signal-to-noise ratios. Building on this intuition, we propose a time-adaptive DP-FL framework that expends the privacy budget non-uniformly across both time and clients. Our framework enables each client to save privacy budget in early rounds so as to be able to spend more in later rounds when additional accuracy is beneficial in learning more fine-grained features. We theoretically prove utility improvements in the case that clients with stricter privacy budgets spend budgets unevenly across rounds, compared to clients with more relaxed budgets, who have sufficient budgets to distribute their spend more evenly. Our practical experiments on standard benchmark datasets support our theoretical results and show that, in practice, our algorithms improve the privacy-utility trade-offs compared to baseline schemes.

2342Data Pruning by Information Maximization

[openreview] [pdf]

Abstract In this paper, we present InfoMax, a novel data pruning method, also known as coreset selection, designed to maximize the information content of selected samples while minimizing redundancy. By doing so, InfoMax enhances the overall informativeness of the coreset. The information of individual samples is measured by importance scores, which capture their influence or difficulty in model learning. To quantify redundancy, we use pairwise sample similarities, based on the premise that similar samples contribute similarly to the learning process. We formalize the coreset selection problem as a discrete quadratic programming (DQP) task, with the objective of maximizing the total information content, represented as the sum of individual sample contributions minus the redundancies introduced by similar samples within the coreset. To ensure practical scalability, we introduce an efficient gradient-based solver, complemented by sparsification techniques applied to the similarity matrix and dataset partitioning strategies. This enables InfoMax to seamlessly scale to datasets with millions of samples. Extensive experiments demonstrate the superior performance of InfoMax in various data pruning tasks, including image classification, vision-language pre-training, and instruction tuning for large language models.

2343Cuff-KT: Tackling Learners’ Real-time Learning Pattern Adjustment via Tuning-Free Knowledge State-Guided Model Updating

[openreview] [pdf]

Abstract Knowledge Tracing (KT) is a core component of Intelligent Tutoring Systems, modeling learners’ knowledge state to predict future performance and provide personalized learning support. Current KT models simply assume that training data and test data follow the same distribution. However, this is challenged by the continuous changes in learners’ patterns. In reality, learners’ patterns change irregularly at different stages (e.g.e.g., different semesters) due to factors like cognitive fatigue and external stress. Additionally, there are significant differences in the patterns of learners from various groups (e.g.e.g., different classes), influenced by social cognition, resource optimization, etc. We refer to these distribution changes at different stages and from different groups as intra-learner shift and inter-learner shift, respectively---a task introduced, which we refer to as Real-time Learning Pattern Adjustment (RLPA). Existing KT models, when faced with RLPA, lack sufficient adaptability, because they fail to timely account for the dynamic nature of different learners’ evolving learning patterns. Current strategies for enhancing adaptability rely on retraining, which leads to significant overfitting and high time cost problem. To address this, we propose Cuff-KT, comprising a controller and a generator. The controller assigns value scores to learners, while the generator generates personalized parameters for selected learners. Cuff-KT adapts to distribution changes fast and flexibly without fine-tuning. Experiments on one classic and two latest datasets demonstrate that Cuff-KT significantly improves current KT models’ performance under intra- and inter-learner shifts, with an average relative increase of 7% on AUC, effectively tackling RLPA. Our code and datasets are available athttps://anonymous.4open.science/r/Cuff-KT.

2344Offline vs. Online Learning in Model-based RL: Lessons for Data Collection Strategies

[openreview] [pdf]

Abstract Data collection is crucial for learning robust world models in model-based reinforcement learning. The most prevalent strategies are to actively collect trajectories by interacting with the environment during online training or training on offline datasets. At first glance, the nature of learning task-agnostic environment dynamics makes world models a good candidate for effective offline training. However, the effects of online vs. offline data on world models and thus on the resulting task performance have not been thoroughly studied in the literature. In this work, we investigate both paradigms in model-based settings, conducting experiments on 31 different environments. First, we showcase that online agents outperform their offline counterparts. We identify a key challenge behind performance degradation of offline agents: encountering Out-of-Distribution states at test time. This issue arises because the data collected online primarily benefits the online agent by learning from its own mistakes, but it leaves many states unvisited. As a result, the offline agent suffers from insufficient coverage of the state space. We demonstrate that this issue can be mitigated by allowing for additional online interactions in a fixed or adaptive schedule, restoring the performance of online training with limited interaction data. We also showcase that incorporating exploration data helps mitigate the performance degradation of offline agents. Based on our insights, we recommend adding exploration data when collecting large datasets, as current efforts predominantly focus on expert data alone.

2345Entering Real Social World! Benchmarking the Theory of Mind and Socialization Capabilities of LLMs from a First-person Perspective

[openreview] [pdf]

Abstract In the social world, humans possess the capability to infer and reason about others’ mental states (such as emotions, beliefs, and intentions), known as Theory of Mind (ToM). Simultaneously, humans’ own mental states evolve in response to social situations, a capability we refer to as \textit{socialization}. Together, these capabilities form the foundation of human social interaction. In the era of artificial intelligence (AI), especially with the development of large language models (LLMs), we raise intriguing questions: How do LLMs perform in terms of ToM and \textit{socialization} capabilities? And more broadly, can these AI models truly enter and navigate the real social world? Existing research evaluating LLMs’ ToM and \textit{socialization} capabilities by positioning LLMs as passive observers from a third-person perspective, rather than as active participants. However, compared to the third-person perspective, observing and understanding the world from an ego-centric first-person perspective is a natural approach for both humans and AI agents. The ToM and \textit{socialization} capabilities of LLMs from a first-person perspective, a crucial attribute for advancing embodied AI agents, remain unexplored. To answer the aforementioned questions and bridge the research gap, we introduce \textit{EgoSocialArena}, a novel framework designed to evaluate and investigate the ToM and \textit{socialization} capabilities of LLMs from a first-person perspective. It encompasses two evaluation environments: static environment and interactive environment, with seven scenarios: Daily Life, Counterfactual, New World, Blackjack, Number Guessing, and Limit Texas Hold’em, totaling 2,195 data entries. With \textit{EgoSocialArena}, we have conducted a comprehensive evaluation of nine advanced LLMs and observed some key insights regarding the future development of LLMs as well as the capabilities levels of the most advanced LLMs currently available.

2346Successor Representations Enable Emergent Compositional Instruction Following

[openreview] [pdf]

Abstract Behavioral cloning (BC) has seen widespread adoption in scalable robot learning pipelines. These methods struggle to perform compositional generalization, where a new out-of-distribution evaluation task can be viewed as a sequence of simpler in-distribution steps. We augment goal-conditioned BC methods with a temporal alignment loss that learns to associate present and future states. This approach is able to generalize to novel composite tasks specified as goal images or language instructions, without assuming any additional reward supervision or explicit subtask planning. We evaluate our approach across diverse tabletop robotic manipulation tasks, showing substantial improvements for tasks specified with either language or goal images.

2347fine-tuning with very large dropout

[openreview] [pdf]

Abstract It is impossible today to pretend that the practice of machine learning is compatible with the idea that training and testing data follow the same distribution. Several authors have recently used ensemble techniques to show how scenarios involving multiple data distributions are best served by representations that are both richer than those obtained by regularizing for the best in-distribution performance, and richer than those obtained under the influence of the implicit sparsity bias of common stochastic gradient procedures.This contribution investigates the use of very high dropout rates instead of ensembles to obtain such rich representations. Although training a deep network from scratch using such dropout rates is virtually impossible, fine-tuning a large pre-trained model under such conditions is not only possible but also achieves out-of-distribution performances that exceed those of both ensembles and weight averaging methods such as model soups.This result has practical significance because the importance of the fine-tuning scenario has considerably grown in recent years. This result also provides interesting insights on the nature of rich representations and on the intrinsically linear nature of fine-tuning a large network using a comparatively small dataset.

2348A Meta-Learning Approach to Bayesian Causal Discovery

[openreview] [pdf]

Abstract Discovering a unique causal structure is difficult due to both inherent identifiability issues, and the consequences of finite data. As such, uncertainty over causal structures, such as those obtained from a Bayesian posterior, are often necessary for downstream tasks. Finding an accurate approximation to this posterior is challenging, due to the large number of possible causal graphs, as well as the difficulty in the subproblem of finding posteriors over the functional relationships of the causal edges. Recent works have used Bayesian meta learning to view the problem of posterior estimation as a supervised learning task. Yet, these methods are limited as they cannot reliably sample from the posterior over causal structures and fail to encode key properties of the posterior, such as correlation between edges and permutation equivariance with respect to nodes. To address these limitations, we propose a Bayesian meta learning model that allows for sampling causal structures from the posterior and encodes these key properties. We compare our meta-Bayesian causal discovery against existing Bayesian causal discovery methods, demonstrating the advantages of directly learning a posterior over causal structure.

2349PersonalLLM: Tailoring LLMs to Individual Preferences

[openreview] [pdf]

Abstract As LLMs become capable of complex tasks, there is growing potential for personalized interactions tailored to the subtle and idiosyncratic preferences of the user. We present a public benchmark, PersonalLLM, focusing on adapting LLMs to provide maximal benefits for a particular user. Departing from existing alignment benchmarks that implicitly assume uniform preferences, we curate open-ended prompts paired with many high-quality answers over which users would be expected to display heterogeneous latent preferences. Instead of persona prompting LLMs based on high-level attributes (e.g., user race or response length), which yields homogeneous preferences relative to humans, we develop a method that can simulate a large user base with diverse preferences from a set of pre-trained reward models. Our dataset and generated personalities offer an innovative testbed for developing personalization algorithms that grapple with continual data sparsity---few relevant feedback from the particular user---by leveraging historical data from other (similar) users. We explore basic in-context learning and meta-learning baselines to illustrate the utility of PersonalLLM and highlight the need for future methodological development.

2350Differentiable Implicit Solver on Graph Neural Networks for Forward and Inverse Problems

[openreview] [pdf]

Abstract Partial differential equations (PDEs) on unstructured grids can be solved using message passing on a graph neural network (GNN). Implicit time-stepping schemes are often favored, especially for parabolic PDEs, due to their stability properties. In this work, we develop a fully differentiable implicit solver for unstructured grids. We evaluate its performance across four key tasks: a) forward modeling of stiff evolutionary and static problems; b) the inverse problem of estimating equation coefficients; c) the inverse problem of estimating the right-hand side; and d) graph coarsening to accelerate forward modeling. The increased stability and differentiability of our solver enable excellent results in reducing the complexity of forward modeling and efficiently solving related inverse problems. This makes it a promising tool for geoscience and other physics-based applications.

2351Manifold K-means withℓ2,p-Norm Maximization

[openreview] [pdf]

Abstract Although a variety of different methods have emerged in the field of clustering, K-means still occupies an important position, and many advanced clustering methods even rely on the K-means to achieve effective cluster detection. However, the sensitivity of K-means to the selection of the initial cluster center and its limited ability to handle nonlinear separable data somewhat restrict its clustering performance. In order to overcome the limitations of K-means, we draw inspiration from manifold learning and redefine K-means as a manifold K-means clustering framework. This framework supports various types of distance matrices, thus facilitating the efficient processing of nonlinear separable data. A unique advantage of this approach is that it does not require the calculation of the cluster center, while it maintains the consistency between manifold structure and cluster labels. Additionally, we highlight the significant role of the 2,p\ell_{2,p}-norm; by maximizing the 2,p\ell_{2,p}-norm, we can ensure the balance of classes in the clustering process, which is also supported by theoretical analysis. The results from extensive experiments across multiple databases substantiate the superiority of our proposed model.

2352Looped Transformers for Length Generalization

[openreview] [pdf]

Abstract Recent work has shown that Transformers trained from scratch can successfully solve various arithmetic and algorithmic tasks, such as adding numbers and computing parity. While these Transformers generalize well on unseen inputs of the same length, they struggle with length generalization, i.e., handling inputs of unseen lengths. In this work, we demonstrate that looped Transformers with an adaptive number of steps significantly improve length generalization. We focus on tasks with a known iterative solution, involving multiple iterations of a RASP-L operation—a length-generalizable operation that can be expressed by a finite-sized Transformer. We train looped Transformers using our proposed learning algorithm and observe that they learn highly length-generalizable solutions for various tasks.

2353Consistency Checks for Language Model Forecasters

[openreview] [pdf]

Abstract Forecasting is a task that is difficult to evaluate: the ground truth can only be known in the future. Recent work showing LLM forecasters rapidly approaching human-level performance begs the question: how can we benchmark and evaluate these forecastersinstantaneously? Following the consistency check framework, we measure the performance of forecasters in terms of the consistency of their predictions on different logically-related questions. We propose a new, general consistency metric based onarbitrage: for example, if a forecasting AI illogically predicts that both the Democratic and Republican parties have 60% probability of winning the 2024 US presidential election, an arbitrageur could trade against the forecaster’s predictions and make a profit. We build an automated evaluation system that generates a set of base questions, instantiates consistency checks from these questions, elicits the predictions of the forecaster, and measures the consistency of the predictions. We then build a standard, proper-scoring-rule forecasting benchmark, and show that our (instantaneous) consistency metrics correlate strongly with LLM forecasters’ ground truth Brier scores (which are only known in the future). We also release a consistency benchmark that resolves in 2028, providing a long-term evaluation tool for forecasting.

2354Multivariate Time-series Forecasting with SPACE: Series Prediction Augmented by Causality Estimation

[openreview] [pdf]

Abstract The analysis of multivariate time series (MTS) presents a complex yet crucial task with substantial applications in areas such as weather forecasting, policy formulation, and stock market prediction. It is important to highlight three key characteristics of MTS that contribute to the challenging and multifaceted nature of their analysis: (i) their interrelationships are represented through causal relationships rather than mere similarities; (ii) they convey information across multiple independent factors; and (iii) their dynamics often arise from inherent temporal dependencies. While conventional time series analysis frameworks often fail to capture one or more of these aspects, resulting in incomplete or even misleading conclusions, we propose an end-to-end trainable S\textbf{S}eries P\textbf{P}rediction model A\textbf{A}ugmented by C\textbf{C}ausality E\textbf{E}stimation (SPACE) to address these limitations. This model effectively incorporates temporal dependencies and causal relationships, featuring a temporal embedding and a transfer entropy-based Cross-TE module designed to enhance predictions through causality-augmented mechanisms. Experiments demonstrate that SPACE achieves state-of-the-art results on challenging real-world time series prediction tasks, showing its effectiveness and versatility.

2355Rethinking Invariance Regularization in Adversarial Training to Improve Robustness-Accuracy Trade-off

[openreview] [pdf]

Abstract Adversarial training often suffers from a robustness-accuracy trade-off, where achieving high robustness comes at the cost of accuracy. One approach to mitigate this trade-off is leveraging invariance regularization, which encourages model invariance under adversarial perturbations; however, it still leads to accuracy loss. In this work, we closely analyze the challenges of using invariance regularization in adversarial training and understand how to address them. Our analysis identifies two key issues: (1) a “gradient conflict” between invariance and classification objectives, leading to suboptimal convergence, and (2) the mixture distribution problem arising from diverged distributions between clean and adversarial inputs. To address these issues, we propose Asymmetric Representation-regularized Adversarial Training (ARAT), which incorporates asymmetric invariance loss with stop-gradient operation and a predictor to avoid gradient conflict, and a split-BatchNorm (BN) structure to resolve the mixture distribution problem. Our detailed analysis demonstrates that each component effectively addresses the identified issues, offering novel insights into adversarial defense. ARAT shows superiority over existing methods across various settings. Finally, we discuss the implications of our findings to knowledge distillation-based defenses, providing a new perspective on their relative successes.

2356Task Descriptors Help Transformers Learn Linear Models In-Context

[openreview] [pdf]

Abstract Large language models (LLMs) exhibit strong in-context learning (ICL) ability, which allows the model to make predictions on new examples based on the given prompt. Recently, a line of research (Von Oswald et al., 2023; Aky¨urek et al., 2023; Ahn et al., 2023; Mahankali et al., 2023; Zhang et al., 2024) considered ICL for a simple linear regression setting and showed that the forward pass of Transformers is simulating some variants of gradient descent (GD) algorithms on the in-context examples. In practice, the input prompt usually contains a task descriptor in addition to in-context examples. We investigate how the task description helps ICL in the linear regression setting. Our results show that gradient flow converges to a global minimum for a simple linear model with task descriptors. At the global minimum, the Transformer learns to use the task descriptor effectively to improve its performance. Empirically, we verify our results by showing that the weights converge to the predicted global minimum and Transformers indeed perform better with task descriptors.

2357A Theoretical Framework for Partially-Observed Reward States in RLHF

[openreview] [pdf]

Abstract The growing deployment of reinforcement learning from human feedback (RLHF) calls for a deeper theoretical investigation of its underlying models. The prevalent models of RLHF do not account for neuroscience-backed, partially-observed "internal states’’ that can affect human feedback, nor do they accommodate intermediate feedback during an interaction. Both of these can be instrumental in speeding up learning and improving alignment. To address these limitations, we model RLHF as reinforcement learning with partially observed reward-states (PORRL). We accommodate two kinds of feedback — cardinal and dueling feedback. We first demonstrate that PORRL subsumes a wide class of RL problems, including traditional RL, RLHF, and reward machines. For cardinal feedback, we present two model-based methods (POR-UCRL, POR-UCBVI). We give both cardinal regret and sample complexity guarantees for the methods, showing that they improve over naive history-summarization. We then discuss the benefits of a model-free method like GOLF with naive history-summarization in settings with recursive internal states and dense intermediate feedback. For this purpose, we define a new history aware version of the Bellman-eluder dimension and give a new guarantee for GOLF in our setting, which can be exponentially sharper in illustrative examples. For dueling feedback, we show that a naive reduction to cardinal feedback fails to achieve sublinear dueling regret. We then present the first explicit reduction that converts guarantees for cardinal regret to dueling regret. In both feedback settings, we show that our models and guarantees generalize and extend existing ones.

2358Better autoregressive regression with LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) have proven successful on many machine learning tasks, including those that do not involve language generation. In specific, LLMs have been shown to be effective in solving regression, where the targets are real-numbers. One common approach is to fine tune the LLM based on the log-perplexity loss and use autoregressive sampling at the inference time. Another approach relies on adding a predictive head and finetuning it with a suitable loss. Despite the success, there has not been a study on the principled ways of using decoder LLMs for regression. In this work we compare different prior works under a unified view, and introduce RAFT, regression-aware fine-tuning, a novel approach based on the Bayes-optimal decision rule. We demonstrate how RAFT improves over established baselines on several benchmarks and model families.

2359Clustering on Skewed Cost Distributions

[openreview] [pdf]

Abstract In this paper, we tackle the problem of (k,z)(k,z)-clustering, a generalization of the well-known kk-means, kk-medians and kk-medoids problems that is known to be APX hard, i.e., impossible to approximate within a multiplicative factor of 1.06 in polynomial time for nn and kk unless P=NP. Due to the APX-hardness, the fastest (1+ε)(1+\varepsilon)-approximation scheme proposed by Feldman et al. (2007), exhibits a run time with a polynomial dependency on nn, but an exponential dependency 2O~(k/ε)2^{\tilde{\mathcal{O}}(k/\varepsilon)} on kk. We observe that a (1+ε)(1+\varepsilon)-approximation in truly polynomial time is feasible if the data sets exhibit sufficiently skewed distributions. Indeed in practical scenarios, data sets often exhibit a heavy skewness, leading to the overall clustering cost disproportionately dominated by a few clusters. We propose a novel algorithm that adapts the traditional local search technique to effectively manage (s,1εz+1)(s, 1- \varepsilon^{z+1})-skewed datasets with a run time of (nk/ε)O(s+1/ε)(nk/\varepsilon)^{\mathcal{O}(s+1/\varepsilon)} for discrete case and O~(nk)+(klogn)O~(s+1/ε)\tilde{\mathcal{O}}(nk) + (k \log n)^{\tilde{\mathcal{O}}(s+1/\varepsilon)} for continuous case. Our method is particularly effective with Zipfian distributions with exponent p>1p>1, where s=O(1ε(z+1)/(p1))s = \mathcal{O}\left(\frac{1}{\varepsilon^{(z+1)/(p-1)}}\right).

2360MambaTS: Improved Selective State Space Models for Long-term Time Series Forecasting

[openreview] [pdf]

Abstract In recent years, Transformers have become the de-facto architecture for long-term sequence forecasting (LTSF), yet they face challenges associated with the self-attention mechanism, including quadratic complexity and permutation invariant bias. This raises an important question: \emph{do we truly need the self-attention mechanism to establish long-range dependencies in LTSF?} Recognizing the significance of causal relationships in multivariate LTSF, we propose MambaTS, which leverages causal relationships to model global dependencies across time and variables through a single linear scan. However, causal graphs are often unknown. To address this, we introduce variable-aware scan along time (VAST), which dynamically discovers variable relationships during training and decodes the optimal variable scan order by solving the shortest path visiting all nodes problem during inference. MambaTS employs the latest Mamba model as its backbone. We suggest that the causal convolution in Mamba is unnecessary due to the presence of independent variables, leading to the development of the Temporal Mamba Block (TMB). To mitigate model overfitting, we further incorporate a dropout mechanism for selective parameters in TMB. Extensive experiments conducted on eight public datasets demonstrate that MambaTS achieves new state-of-the-art performance.

2361Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision

[openreview] [pdf]

Abstract The performance and reasoning capabilities of Large Multi-modal Models (LMMs) is dependent on the size and quality of their training datasets. However, collecting datasets that support chain-of-thought instruction tuning is highly challenging. Existing video instruction tuning datasets are often derived by prompting large language models with video captions to generate question-answer pairs, which makes them predominantly descriptive rather than reasoning-focused. Meanwhile, many labeled video datasets with diverse labels and supervision exist -- however, we find that their integration into LMMs is non-trivial. Herein, we present Video\underline{\text{Video}} Self\underline{\text{S}}\text{elf}-Training\underline{\text{T}}\text{raining} with\text{with} augmented\underline{\text{a}}\text{ugmented} Reasoning\underline{\text{R}}\text{easoning} (Video-STaR), the first self-training approach for video instruction tuning. Video-STaR allows the utilization ofanylabeled video dataset for video instruction tuning. In Video-STaR, an LMM cycles between instruction generation and finetuning, which we show (I) improves general video understanding and (II) adapts LMMs to novel downstream tasks with existing supervision. During instruction generation, an LMM is prompted to propose an answer. The answers are then filtered only to those that contain the original video labels, and the LMM is then re-trained on the generated dataset. By training exclusively on generated answers containing the correct video labels, Video-STaR leverages these existing labels as weak supervision for video instruction tuning. Our results demonstrate that Video-STaR-augmented LMMs achieve notable improvements in (I) general Video QA, where TempCompass performance improved by 6.1%,and(II) downstream tasks, with a 9.9% increase in Kinetics700-QA accuracy and a 4.0% improvement in action quality assessment on FineDiving, while also exhibiting better interpretability.

2362Generalized Group Data Attribution

[openreview] [pdf]

Abstract Data Attribution (DA) methods quantify the influence of individual training data points on model outputs and have broad applications such as explainability, data selection, and noisy label identification. However, existing DA methods are often computationally intensive, limiting their applicability to large-scale machine learning models. To address this challenge, we introduce the Generalized Group Data Attribution (GGDA) framework, which computationally simplifies DA by attributing to groups of training points instead of individual ones. GGDA is a general framework that subsumes existing attribution methods and can be applied to new DA techniques as they emerge. It allows users to optimize the trade-off between efficiency and fidelity based on their needs. Our empirical results demonstrate that GGDA applied to popular DA methods such as Influence Functions, TracIn, and TRAK results in upto 10x-50x speedups over standard DA methods while gracefully trading off attribution fidelity. For downstream applications such as dataset pruning and noisy label identification, we demonstrate that GGDA significantly improves computational efficiency and maintains effectiveness, enabling practical applications in large-scale machine learning scenarios that were previously infeasible.

2363Evolve: Evaluating and Optimizing LLMs For Exploration

[openreview] [pdf]

Abstract Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs’ (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments that include both context-free and contextual bandits of varying task difficulties to benchmark LLMs’ performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithmic guided support during inference; and through knowledge distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on the different factors, such as task difficulty and data representations, that influence the efficiency of LLM exploration. Additionally, we provide empirical measurements on the convergence rate of different exploration strategies introduced.

2364Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) is a powerful approach for data-driven decision-making and control. Compared to model-free methods, offline model-based reinforcement learning (MBRL) explicitly learns world models from a static dataset and uses them as surrogate simulators, improving the data efficiency and enabling the learned policy to potentially generalize beyond the dataset support. However, there could be various MDPs that behave identically on the offline dataset and so dealing with the uncertainty about the true MDP can be challenging. In this paper, we propose modeling offline MBRL as a Bayes Adaptive Markov Decision Process (BAMDP), which is a principled framework for addressing model uncertainty. We further introduce a novel Bayes Adaptive Monte-Carlo planning algorithm capable of solving BAMDPs in continuous state and action spaces with stochastic transitions. This planning process is based on Monte Carlo Tree Search and can be integrated into offline MBRL as a policy improvement operator in policy iteration. Our “RL + Search” framework follows in the footsteps of superhuman AIs like AlphaZero, improving on current offline MBRL methods by incorporating more computation input. The proposed algorithm significantly outperforms state-of-the-art model-based and model-free offline RL methods on twelve D4RL MuJoCo benchmark tasks and three target tracking tasks in a challenging, stochastic tokamak control simulator.

2365Constrained Skill Discovery: Quadruped Locomotion with Unsupervised Reinforcement Learning

[openreview] [pdf]

Abstract Representation learning and unsupervised skill discovery can allow robots to acquire diverse and reusable behaviors without the need for task-specific rewards. In this work, we learn a latent representation by maximizing the mutual information between skills and states subject to a distance constraint, using unsupervised reinforcement learning. Our method improves upon prior constrained skill discovery methods by replacing the latent transition maximization with a norm-matching objective. This not only results in a much a richer state space coverage, but allows the robot to learn more stable and easily controllable locomotive behaviors. In robotics this is particularly important, because state transition-maximizing behaviors can result in highly dangerous motions. We successfully deployed the learned policy on a real ANYmal quadruped robot and demonstrated that the robot can accurately reach arbitrary points of the Cartesian state space in a zero-shot manner, using only an intrinsic skill discovery and standard regularization rewards.

2366Precedence-Constrained Winter Value for Effective Graph Data Valuation

[openreview] [pdf]

Abstract Data valuation is essential for quantifying data’s worth, aiding in assessing data quality and determining fair compensation. While existing data valuation methods have proven effective in evaluating the value of Euclidean data, they face limitations when applied to the increasingly popular graph-structured data. Particularly, graph data valuation introduces unique challenges, primarily stemming from the intricate dependencies among nodes and the exponential growth in value estimation costs. To address the challenging problem of graph data valuation, we put forth an innovative solution, Precedence-Constrained Winter (PC-Winter) Value, to account for the complex graph structure. Furthermore, we develop a variety of strategies to address the computational challenges and enable efficient approximation of PC-Winter. Extensive experiments demonstrate the effectiveness of PC-Winter across diverse datasets and tasks.

2367Rate of Approximation by Flows: A Case Study on the Eikonal Equation

[openreview] [pdf]

Abstract Previous works have demonstrated the universal approximation capability of residual networks through their continuous idealization as flow maps of dynamical systems. However, informative results on their approximation rates in terms of depth (corresponding to time) are generally lacking. From the viewpoint of approximation theory, a major difficulty in addressing this gap lies in identifying an appropriate target space for the approximation problem. In this paper, we introduce a restrictive but useful target function space comprised of solutions to the eikonal equations, a type of first-order nonlinear partial differential equation, to investigate the approximation rates of flow map families. We provide an estimate of the approximation error within this space, which is notably different from classical rate estimates based directly on the smoothness of target functions. This theoretical result further inspires a new learning-based algorithm for solving the eikonal equation. Experimental results validate the effectiveness of our proposed algorithm, including its robustness to spatial resolution and solution regularity, as well as transferability among similar problems.

2368Improving Fairness and Mitigating MADness in Generative Models

[openreview] [pdf]

Abstract Generative models unfairly penalize data belonging to minority classes, suffer from model autophagy disorder (MADness), and learn biased estimates of the underlying distribution parameters. Our theoretical and empirical results show that training generative models with intentionally designed hypernetworks leads to models that 1) are more fair when generating datapoints belonging to minority classes 2) are more stable in a self-consumed (i.e., MAD) setting, and 3) learn parameters that are less statistically biased. To further mitigate unfairness, MADness, and bias, we introduce a regularization term that penalizes discrepancies between a generative model’s estimated weights when trained on real data versus its own synthetic data. To facilitate training existing deep generative models within our framework, we offer a scalable implementation of hypernetworks that automatically generates a hypernetwork architecture for any given generative model.

2369From Loops to Oops: Fallback Behaviors of Language Models Under Uncertainty

[openreview] [pdf]

Abstract Large language models (LLMs) often exhibit undesirable behaviors, such as hallucinations and sequence repetitions. We propose to view these behaviors as fallbacks that models exhibit under epistemic uncertainty, and investigate the connection between them. We categorize fallback behaviors — sequence repetitions, degenerate text, and hallucinations — and extensively analyze them in models from the same family that differ by the amount of pretraining tokens, parameter count, or the inclusion of instruction-following training. Our experiments reveal a clear and consistent ordering of fallback behaviors, across all these axes: the more advanced an LLM is (i.e., trained on more tokens, has more parameters, or instruction-tuned), its fallback behavior shifts from sequence repetitions, to degenerate text, and then to hallucinations. Moreover, the same ordering is observed during the generation of a single sequence, even for the best-performing models; as uncertainty increases, models shift from generating hallucinations to producing degenerate text and finally sequence repetitions. Lastly, we demonstrate that while common decoding techniques, such as random sampling, alleviate unwanted behaviors like sequence repetitions, they increase harder-to-detect hallucinations.

2370Improving model robustness against noise with safe haven activations

[openreview] [pdf]

Abstract Quantized neural networks (QNNs) are often used in edge AI because they reduce memory and computational demands. In practical applications such as control systems, medical imaging, and robotics, controlling input noise is crucial for enhancing system robustness. Thus, improving the noise resilience of QNNs is an important challenge in achieving effective edge AI applications. In this paper, we investigate the impact of input noise on QNN performance and propose the safe haven activation quantization (SHAQ) method. This approach leverages the characteristics of the quantization function to constrain outputs before quantization within a more noise-resilient ‘safe’ range, effectively reducing the impact of noise across quantized layers. Our methods achieve state-of-the-art, 73.11% accuracy with 2-bit activations under the fast gradient sign method (FGSM) adversarial attacks with an epsilon of 8/255 on the CIFAR-10 dataset. Furthermore, we extend our methods into a plug-and-play solution we call quantized helmet (QH), comprising a series of quantized layers that can be integrated into any unquantized neural network to enhance its noise robustness. Our experimental code and analysis are open-source and publicly accessible.

2371CPT: Consistent Proxy Tuning for Black-box Optimization

[openreview] [pdf]

Abstract Black-box tuning has attracted recent attention due to that the structure or inner parameters of advanced proprietary models are not accessible. Recently, Proxy-tuning provides a test-time output adjustment for tuning black-box language models.It applies the difference of the output logits before and after tuning a smaller white-box “proxy” model to improve the black-box model. However, this technique serves only as a decoding-time algorithm, leading to an inconsistency between training and testing which potentially limits overall performance. To address this problem, we introduce Consistent Proxy Tuning (CPT), a simple yet effective black-box tuning method. Different from Proxy-tuning, CPT additionally exploits the frozen large black-box model and another frozen small white-box model, ensuring consistency between training-stage optimization objective and test-time proxies. This consistency benefits Proxy-tuning and enhances model performance. Note that our method focuses solely on logit-level computation, which makes it model-agnostic and applicable to any task involving logit classification. Extensive experimental results demonstrate the superiority of our CPT in both black-box tuning of Large-Language Models (LLMs) and Vision-Language Models (VLMs) across various datasets.

2372Automated Rewards via LLM-Generated Progress Functions

[openreview] [pdf]

Abstract Large Language Models (LLMs) have the potential to automate reward engineering by leveraging their broad domain knowledge across various tasks. However, they often need many iterations of trial-and-error to generate effective reward functions. This process is costly because evaluating every sampled reward function requires completing the full policy optimization process for each function. In this paper, we introduce an LLM-driven reward generation framework that is able to produce state-of-the-art policies on the challenging Bi-DexHands benchmark with 20×\times fewer reward function samples than the prior state-of-the-art work. Our key insight is that we reduce the problem of generating task-specific rewards to the problem of coarsely estimating task progress. Our two-step solution leverages the task domain knowledge and the code synthesis abilities of LLMs to author progress functions that estimate task progress from a given state. Then, we use this notion of progress to discretize states, and generate count-based intrinsic rewards using the low-dimensional state space. We show that the combination of LLM-generated progress functions and count-based intrinsic rewards is essential for our performance gains, while alternatives such as generic hash-based counts or using progress directly as a reward function fall short.

2373Tri-Tense Former: Capturing Dynamic Traffic Flow Using Tri-Tense Attention for Traffic Forecasting

[openreview] [pdf]

Abstract Accurate traffic forecasting is essential to enable advanced utilization of intelligent transportation systems. However, forecasting models often struggle to capture the complex spatio-temporal dependencies of traffic data, as they typically handle spatial and temporal dependencies separately. To overcome this limitation, we introduce the Tri-Tense Former (TTformer), a novel approach that captures spatio-temporal relationships through three tense-specific attention modules. We categorize traffic flow into three tense dimensions: past-to-present (present-perfect), present, and future. Each tense-specific attention module captures the dependencies within its respective traffic flow. Furthermore, to address incomplete traffic data, we improve the robustness of the model by employing contrastive learning with negative filtering technique that operates regardless of predefined adjacency matrices. TTformer significantly outperforms existing models by more effectively capturing spatio-temporal dependencies and improving traffic forecasting accuracy.

2374Radial Basis Operator Networks

[openreview] [pdf]

Abstract Operator networks are designed to approximate nonlinear operators, which provide mappings between infinite-dimensional spaces such as function spaces. These networks are playing an increasingly important role in machine learning, with their most notable contributions in the field of scientific computing. Their significance stems from their ability to handle the type of data often encountered in scientific applications. For instance, in climate modeling or fluid dynamics, input data typically consists of discretized continuous fields (like temperature distributions or velocity fields). We introduce the radial basis operator network (RBON), which represents a significant advancement as the first operator network capable of learning an operator in both the time domain and frequency domain when adjusted to accept complex-valued inputs. Despite the small, single hidden-layer structure, the RBON boasts small L2L^2 relative test error for both in- and out-of-distribution data (OOD) of less than 1×1071\times 10^{-7} in some benchmark cases. Moreover, the RBON maintains small error on OOD data from entirely different function classes from the training data.

2375OSM+: Cloud-native Open Street Map Data System for City-wide Experiments

[openreview] [pdf]

Abstract Road network data can provide rich information about cities and thus become the base for various urban research. However, processing large-volume world-wide road network data requires intensive computing resources and the processed results might be different to be unified for benchmark downstream tasks. Therefore, in this paper, we process the OpenStreetMap data and release a structured world-wide 1-billion-node road network graph database with high accessibility and usability. We have presented three illustrative use cases, traffic prediction task, city boundary detection task and traffic policy control task. Moreover, for the well-investigated traffic prediction task, we release a new benchmark with 31 datasets, which is much more comprehensive than the previously frequently-used datasets. While for the relatively novel traffic policy control task, we release a new 6 city datasets with much larger scale than the previous datasets. Along with the OSM+ dataset, the release of data converters facilitates the integration of multimodal spatial-temporal data based on map information for large model training, thereby expediting the process of uncovering compelling scientific insights.

2376GoodDrag: Towards Good Practices for Drag Editing with Diffusion Models

[openreview] [pdf]

Abstract In this paper, we introduce GoodDrag, a novel approach to improve the stability and image quality of drag editing. Unlike existing methods that struggle with accumulated perturbations and often result in distortions, GoodDrag introduces an AlDD framework that alternates between drag and denoising operations within the diffusion process, effectively improving the fidelity of the result. We also propose an information-preserving motion supervision operation that maintains the original features of the starting point for precise manipulation and artifact reduction. In addition, we contribute to the benchmarking of drag editing by introducing a new dataset, Drag100, and developing dedicated quality assessment metrics, Dragging Accuracy Index and Gemini Score, utilizing Large Multimodal Models. Extensive experiments demonstrate that the proposed GoodDrag compares favorably against the state-of-the-art approaches both qualitatively and quantitatively. The source code and data will be released.

2377Improved Training Technique for Latent Consistency Models

[openreview] [pdf]

Abstract Consistency models are a new family of generative models capable of producing high-quality samples in either a single step or multiple steps. Recently, consistency models have demonstrated impressive performance, achieving results on par with diffusion models in the pixel space. However, the success of scaling consistency training to large-scale datasets, particularly for text-to-image and video generation tasks, is determined by performance in the latent space. In this work, we analyze the statistical differences between pixel and latent spaces, discovering that latent data often contains highly impulsive outliers, which significantly degrade the performance of iCT \citep{song2023improved} in the latent space. To address this, we replace Pseudo-Huber losses with Cauchy losses, effectively mitigating the impact of outliers. Additionally, we introduce a diffusion loss at early timesteps and employ optimal transport (OT) coupling to further enhance performance. Lastly, we introduce the adaptive scaling-cc scheduler to manage the robust training process and adopt Non-scaling LayerNorm in the architecture to better capture the statistics of the features and reduce outlier impact. With these strategies, we successfully train latent consistency models capable of high-quality sampling with one or two steps, significantly narrowing the performance gap between latent consistency and diffusion models.

2378Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models

[openreview] [pdf]

Abstract Requiring a large language model to generate intermediary reasoning steps has been shown to be an effective way of boosting performance. In fact, instruction tuning on these intermediary reasoning steps improves model performance. In this work, we present a novel method of further improving performance by requiring models to compare multiple reasoning chains before generating a solution in a single inference step. We call this method Divergent CoT (DCoT). We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, LLMs. Through a rigorous set of experiments spanning a wide range of tasks that require various reasoning types, we show that fine-tuning on DCoT consistently improves performance over the CoT baseline across model families and scales (1.3B to 70B). Through a combination of empirical and manual evaluation, we additionally show that these performance gains stem from models generating multiple divergent reasoning chains in a single inference step, indicative of the enabling of self-correction in language models. Our code and data are publicly available.

2379What do we learn from inverting CLIP models?

[openreview] [pdf]

Abstract We employ an inversion-based approach to examine CLIP models. Our examination reveals that inverting CLIP models results in the generation of images that exhibit semantic alignment with the specified target prompts. We leverage these inverted images to gain insights into various aspects of CLIP models, such as their ability to blend concepts and inclusion of gender biases. We notably observe instances of NSFW (Not Safe For Work) images during model inversion. This phenomenon occurs even for semantically innocuous prompts, like `a beautiful landscape,’ as well as for prompts involving the names of celebrities.

2380Unlocking Point Processes through Point Set Diffusion

[openreview] [pdf]

Abstract Point processes model the distribution of random point sets in mathematical spaces, such as spatial and temporal domains, with applications in fields like seismology, neuroscience, and economics. Existing statistical and machine learning models for point processes are predominantly constrained by their reliance on the characteristic intensity function, introducing an inherent trade-off between efficiency and flexibility. In this paper, we introduce Point Set Diffusion, a diffusion-based latent variable model that can represent arbitrary point processes on general metric spaces without relying on the intensity function. By directly learning to stochastically interpolate between noise and data point sets, our approach enables efficient, parallel sampling and flexible generation for complex conditional tasks defined on the metric space. Experiments on synthetic and real-world datasets demonstrate that Point Set Diffusion achieves state-of-the-art performance in unconditional and conditional generation of spatial and spatiotemporal point processes while providing up to orders of magnitude faster sampling than autoregressive baselines.

2381Nested Gloss Makes Large Language Models Lost

[openreview] [pdf]

Abstract Large language models (LLMs) have succeeded significantly in various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on high-cost computational extrapolations, which may not be practical or efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method to take advantage of the LLMs’ personification capabilities to construct a virtual, nested scene\textit{a virtual, nested scene}, allowing it to realize an adaptive way to escape the usage control in a normal scenario. Empirically, the contents induced by our approach can achieve leading harmfulness rates with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs, e.g.\textit{e.g.}, Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o.

2382BeST - A Novel Source Selection Metric for Transfer Learning

[openreview] [pdf]

Abstract One of the most fundamental, and yet relatively less explored, goals in transfer learning is the efficient means of selecting top candidates from a large number of previously trained models (optimized for various “source” tasks) that would perform the best for a new “target” task with a limited amount of data. In this paper, we undertake this goal by developing a novel task-similarity metric (BeST) and an associated method that consistently performs well in identifying the most transferrable source(s) for a given task. In particular, our design employs an innovative quantization-level optimization procedure in the context of classification tasks that yields a measure of similarity between a source model and the given target data. The procedure uses a concept similar to early stopping (usually implemented to train deep neural networks (DNNs) to ensure generalization) to derive a function that approximates the transfer learning mapping without training. The advantage of our metric is that it can be quickly computed to identify the top candidate(s) for a given target task before a computationally intensive transfer operation (typically using DNNs) can be implemented between the selected source and the target task. As such, our metric can provide significant computational savings for transfer learning from a selection of a large number of possible source models. Through extensive experimental evaluations, we establish that our metric performs well over different datasets and varying numbers of data samples.

2383Customizing Reinforcement Learning Agent with Multi-Objective Preference Control

[openreview] [pdf]

Abstract Practical reinforcement learning (RL) usually requires agents to be optimized for multiple potentially conflicting criteria, e.g. speed vs. safety. Although Multi-Objective RL (MORL) algorithms have been studied in previous works, their trained agents often lack precise controllability of the delicate trade-off among multiple objectives. Hence, the resulting agent is not versatile in aligning with customized requests from different users. To bridge the gap, we develop ``Preference control (PC) RL’', which aims to train a meta-policy that takes user preference as input controlling the generation of a trajectory on the Pareto frontier adhering to the preference. To this end, we train a preference-conditioned meta-policy by our proposed preference-regularized MORL algorithm. The achieved meta-policy performs as a multi-objective optimizer that can produce user-desired solutions on the Pareto frontier. The proposed algorithm is analyzed and its convergence and controllability are theoretically justified. Experiments from discrete toy examples to higher-dimension robotic control tasks and experiments with more than two objectives are conducted to show its performance. In these experiments, PCRL-trained policies show significantly better controllability than existing approaches and can generate Pareto optimal solutions with better diversity and utilities.

2384Learning K-U-Net in Constant Complexity with Application to Time Series Forecasting

[openreview] [pdf]

Abstract Training deep models for time series forecasting is a critical task with an inherent challenge of time complexity. While current methods generally ensure linear time complexity, our observations on temporal redundancy show that high-level features are learned 99.5% slower than low-level features. To address this issue, we introduce a new exponentially weighted stochastic gradient descent algorithm designed to achieve constant time complexity in deep learning models. We prove that the theoretical complexity of this learning method is constant. Evaluation of this method on Kernel U-Net (K-U-Net) on synthetic datasets shows a significant reduction in complexity while improving the accuracy of the test set.

2385Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation

[openreview] [pdf]

Abstract Recent advances in pre-trained Vision Language Models (VLMs) have shown promising potential through \textit{prompt learning} in effectively adapting to downstream tasks without requiring additional annotated paired datasets. To supplement text information in VLMs dependently trained on correlation with vision data, new approaches leveraging Large Language Models (LLM) in prompts have been proposed, enhancing robustness to unseen and diverse data. Existing methods query LLM for text-based responses (i.e., \textit{descriptions}) to utilize in prompts, but this approach has limitations: high variability and low reliability. In this work, we propose \textbf{De}scription-free \textbf{Mul}ti-prompt Learning(\textbf{DeMul}) for image recognition task, a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLM into prompts. By adopting a description-free approach, prompts can encapsulate richer semantics and still be defined as continuous vectors to optimize, thereby eliminating the need for discrete pre-defined templates. Additionally, in a multi-prompt setting, we have empirically shown the potential of using prompt weighting to reflect the importance of different prompts during training. Experimental results demonstrate that our approach achieves superior performance across 11 recognition datasets.

2386SCOPE: Scalable and Adaptive Evaluation of Misguided Safety Refusal in LLMs

[openreview] [pdf]

Abstract The rapid progress of foundation models has amplified AI safety risks, prompting the development and deployment of alignment techniques and safety measures such as reinforcement learning with human feedback and supervised safety fine-tuning. However, these safety mechanisms can inadvertently cause models to reject benign requests that contain keywords or syntax linked to unsafe content in training data, leading to misguided safety refusals (or over-cautiousness). Existing benchmarks for assessing these refusals are limited by their static nature and reliance on manual efforts. To address this, we introduce SCOPE, an automated pipeline that dynamically generates false refusal benchmarks from any given red-teaming dataset. This facilitates continuous adaptation to the evolving landscape of refusal behaviors introduced by growing red-teaming efforts. Our evaluation across 29 models demonstrates the widespread issue of misguided refusals in existing LLMs and identifies spurious features that trigger these behaviors. Furthermore, we demonstrate that the generated benchmarks facilitate the development of more effective countermeasures to mitigate these misguided refusals.

2387HaDeMiF: Hallucination Detection and Mitigation in Large Language Models

[openreview] [pdf]

Abstract The phenomenon of knowledge hallucinations has raised substantial concerns about the security and reliability of deployed large language models (LLMs). Current methods for detecting hallucinations primarily depend on manually designed individual metrics, such as prediction uncertainty and consistency, and fall short in effectively calibrating model predictions, thus constraining their detection accuracy and applicability in practical applications. In response, we propose an advanced framework, termed HaDeMiF, for detecting and mitigating hallucinations in LLMs. Specifically, hallucinations within the output and semantic spaces of LLMs are comprehensively captured through two compact networks—a novel, interpretable tree model known as the Deep Dynamic Decision Tree (D3T) and a Multilayer Perceptron (MLP)—which take as input a set of prediction characteristics and the hidden states of tokens, respectively. The predictions of LLMs are subsequently calibrated using the outputs from the D3T and MLP networks, aiming to mitigate hallucinations and enhance the reliability of the model generations. HaDeMiF can be applied during both the inference and fine-tuning phases of LLMs, introducing less than 2% of the parameters relative to the LLMs through the training of two small-scale networks. Extensive experiments conducted on multiple prevalent LLMs conclusively demonstrate the effectiveness of our framework in hallucination detection and model calibration across text generation tasks with responses of varying lengths.

2388A Theory for Token-Level Harmonization in Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance large language models (LLMs). Studies show that while RAG provides valuable external information (benefit), it may also mislead LLMs (detriment) with noisy or incorrect retrieved texts. Although many existing methods attempt to preserve benefit and avoid detriment, they lack a theoretical explanation for RAG. The benefit and detriment in the next token prediction of RAG remain a ‘black box’ that cannot be quantified or compared in an explainable manner, so existing methods are data-driven, need additional utility evaluators or post-hoc. This paper takes the first step towards providing a theory to explain and trade off the benefit and detriment in RAG. We model RAG as the fusion between distributions of LLMs’ knowledge and distributions of retrieved texts. Then, we formalize the trade-off between the value of external knowledge (benefit) and its potential risk of misleading LLMs (detriment) in next token prediction of RAG by distribution difference in this fusion. Finally, we prove that the actual effect of RAG on the token, which is the comparison between benefit and detriment, can be predicted without any training or accessing the utility of retrieval. Based on our theory, we propose a practical novel method, Tok-RAG, which achieves collaborative generation between the pure LLM and RAG at token level to preserve benefit and avoid detriment. Experiments in real-world tasks using LLMs such as OPT, LLaMA-2, and Mistral show the effectiveness of our method and support our theoretical findings. Code is in supplemental material and will be released on GitHub after acceptance.

2389BLEND: Behavior-guided Neural Population Dynamics Modeling via Privileged Knowledge Distillation

[openreview] [pdf]

Abstract Modeling the nonlinear dynamics of neuronal populations represents a key pursuit in computational neuroscience. Recent research has increasingly focused on jointly modeling neural activity and behavior to unravel their interconnections. Despite significant efforts, these approaches often necessitate either intricate model designs or oversimplified assumptions. Given the frequent absence of perfectly paired neural-behavioral datasets in real-world scenarios when deploying these models, a critical yet understudied research question emerges: how to develop a model that performs well using only neural activity as input at inference, while benefiting from the insights gained from behavioral signals during training?To this end, we proposeBLEND, theBehavior-guided neuraLpopulation dynamics modElling framework via privileged kNowledgeDistillation. By considering behavior as privileged information, we train a teacher model that takes both behavior observations (privileged features) and neural activities (regular features) as inputs. A student model is then distilled using only neural activity. Unlike existing methods, our framework is model-agnostic and avoids making strong assumptions about the relationship between behavior and neural activity. This allows BLEND to enhance existing neural dynamics modeling architectures without developing specialized models from scratch. Extensive experiments across neural population activity modeling and transcriptomic neuron identity prediction tasks demonstrate strong capabilities of BLEND, reporting over 50% improvement in behavioral decoding and over 15% improvement in transcriptomic neuron identity prediction after behavior-guided distillation. Furthermore, we empirically explore various behavior-guided distillation strategies within the BLEND framework and present a comprehensive analysis of effectiveness and implications for model performance.

2390Intelligence at the Edge of Chaos

[openreview] [pdf]

Abstract We explore the emergence of intelligent behavior in artificial systems by investigating how the complexity of rule-based systems influences the capabilities of models trained to predict these rules. Our study focuses on elementary cellular automata (ECA), simple yet powerful one-dimensional systems that generate behaviors ranging from trivial to highly complex. By training distinct Large Language Models (LLMs) on different ECAs, we evaluated the relationship between the complexity of the rules’ behavior and the intelligence exhibited by the LLMs, as reflected in their performance on downstream tasks. Our findings reveal that rules with higher complexity lead to models exhibiting greater intelligence, as demonstrated by their performance on reasoning and chess move prediction tasks. Both uniform and periodic systems, and often also highly chaotic systems, resulted in poorer downstream performance, highlighting a sweet spot of complexity conducive to intelligence. We conjecture that intelligence arises from the ability to predict complexity and that creating intelligence may require only exposure to complexity.

2391Not Every Image is Worth a Thousand Words: Quantifying Originality in Stable Diffusion

[openreview] [pdf]

Abstract This work addresses the challenge of quantifying originality in text-to-image (T2I) generative diffusion models, with a focus on copyright originality. We begin by evaluating T2I models’ ability to innovate and generalize through controlled experiments, revealing that stable diffusion models can effectively recreate unseen elements with sufficiently diverse training data. Then, our key insight is that concepts and combinations of image elements the model is familiar with, and saw more during training, are more concisly represented in the model’s latent space. We hence propose a method that leverages textual inversion to measure the originality of an image based on the number of tokens required for its reconstruction by the model. Our approach is inspired by legal definitions of originality and aims to assess whether a model can produce original content without relying on specific prompts or having the training data of the model. We demonstrate our method using both a pre-trained stable diffusion model and a synthetic dataset, showing a correlation between the number of tokens and image originality. This work contributes to the understanding of originality in generative models and has implications for copyright infringement cases.

2392M3C: a Multi-Domain Multi-Objective, Mixed-Modality Framework for Cost-Effective, Industry Scale Recommendation

[openreview] [pdf]

Abstract The ever-expanding landscape of products, surfaces, policies, and regulations poses significant challenges for recommendation systems, leading to data fragmentation and prohibitive hikes in infrastructure costs. To address these challenges, we propose M3C, a holistic co-design of model, data and efficiency strategies. M3C (1) partitions the recommendation space to allow better representation learning and encourage knowledge sharing within a subspace; (2) covers each partition using a hierarchy of foundational and vertical networks tailored to handle multi- domain, multi-objective tasks with mixed-modal inputs; (3) forms a unified data representation that utilizes heterogeneous signals across domains, objectives and optimization goals to alleviate data fragmentation, label sparsity, and to enhance knowledge sharing; (4) improves execution efficiency and lowers costs with a suite of stability and throughput optimizations. We show that across a diverse set of tasks on public and industry datasets, M3C delivers up to 1% lower LogLoss compared to 10 state-of-the-art baselines, while improving system efficiency by up to 20%. Furthermore, in a large-scale industry setting our deployment of M3C has resulted in 7% top-line metrics improvement in online tests with 10% capacity savings.

2393Mixture-of-Diffusers: Dual-Stage Diffusion Model for Improved Time Series Generation

[openreview] [pdf]

Abstract Synthetic Time Series Generation (TSG) is a crucial task for data augmentation and various downstream applications. While TSG has advanced, its effectiveness often relies on the availability of extensive training datasets, posing challenges in data-scarce scenarios. Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) have shown promise, but they frequently struggle to capture the complex temporal dynamics and interdependencies inherent in time series data. To address these limitations, we propose a novel generative framework, Mixture-of-Diffusers (MoD). This approach decomposes the diffusion process into a collection of specialized diffusers, each designed to model specific patterns at distinct noise levels. Early-stage diffusers focus on capturing overarching global and coarse patterns, while late-stage diffusers specialize in capturing fine-grained details as the noise level diminishes. This hierarchical decomposition empowers MoD to learn robust representations and generate realistic time series samples. The model is trained using a combination of multi-objective loss functions, ensuring both temporal consistency and alignment with the true data distribution. Extensive experiments on a diverse range of real-world and simulated time series datasets demonstrate the superior performance of MoD compared to state-of-the-art TSG generative models. Furthermore, rigorous evaluations incorporating both qualitative and quantitative metrics, coupled with assessments of downstream task performance on long-term generation and scarce time series data (see Figure 1), collectively validate the efficacy of our proposed approach.

2394Alice in Wonderland: Simple Tasks Reveal Severe Generalization and Basic Reasoning Deficits in State-Of-the-Art Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) are often described as being instances of foundation models - that is, models that possess strong generalization and therefore transfer robustly across various tasks and conditions in few-show or zero-shot manner, while exhibiting scaling laws that predict generalization improvement when increasing the pre-training scale. These claims of strong generalization and advanced reasoning function enabling it rely on measurements by various standardized benchmarks where state-of-the-art (SOTA) models score high. We demonstrate here a dramatic breakdown of generalization and basic reasoning of all SOTA models which claim strong function, including advanced models like GPT-4 or Claude 3 Opus trained at the largest scales, using a simple, short, conventional common sense problem formulated in concise natural language, easily solvable by humans (AIW problem). The breakdown is dramatic as it manifests in strong performance fluctuations on the simple problem across its mild variations that should not affect problem solving at all, while also often expressing strong overconfidence in the wrong solutions, backed up by plausible sounding explanation-like confabulations. Various standard interventions in an attempt to get the right solution, like chain-of-thought prompting, or urging the models to reconsider the wrong solutions again by multi step re-evaluation, fail. We take these observations to the scientific and technological community to stimulate re-assessment of the claimed capabilities of current generation of LLMs. Such re-assessment also requires common action to create standardized benchmarks that would allow proper detection of such deficits in generalization and reasoning that obviously remain undiscovered by current state-of-the-art evaluation procedures, where SOTA LLMs obtain high scores. Code for reproducing experiments in the paper and raw experiments data can be found athttps://anonymous.4open.science/r/AITW_anonymous-69A6/

2395One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs

[openreview] [pdf]

Abstract Safety alignment in large language models (LLMs) is increasingly compromised by jailbreak attacks, which can manipulate these models to generate harmful or unintended content. Investigating these attacks is crucial for uncovering model vulnerabilities. However, many existing jailbreak strategies fail to keep pace with the rapid development of defense mechanisms, such as defensive suffixes, rendering them ineffective against defended models. To tackle this issue, we introduce a novel attack method called ArrAttack, specifically designed to target defended LLMs. ArrAttack automatically generates robust jailbreak prompts capable of bypassing various defense measures. This capability is supported by a universal robustness judgment model that, once trained, can perform robustness evaluation for any target model with a wide variety of defenses. By leveraging this model, we can rapidly develop a robust jailbreak prompt generator that efficiently converts malicious input prompts into effective attacks. Extensive evaluations reveal that ArrAttack significantly outperforms existing attack strategies, demonstrating strong transferability across both white-box and black-box models, including GPT-4 and Claude-3. Our work bridges the gap between jailbreak attacks and defenses, providing a fresh perspective on generating robust jailbreak prompts.

2396Learning to Solve Differential Equation Constrained Optimization Problems

[openreview] [pdf]

Abstract Differential equations (DE) constrained optimization plays a critical role in numerous scientific and engineering fields, including energy systems, aerospace engineering, ecology, and finance, where optimal configurations or control strategies must be determined for systems governed by ordinary or stochastic differential equations. Despite its significance, the computational challenges associated with these problems have limited their practical use. To address these limitations, this paper introduces a learning-based approach to DE-constrained optimization that combines techniques from proxy optimization and neural differential equations. The proposed approach uses a dual-network architecture, with one approximating the control strategies, focusing on steady-state constraints, and another solving the associated DEs. This combination enables the approximation of optimal strategies while accounting for dynamic constraints in near real-time. Experiments across problems in energy optimization and finance modeling show that this method provides full compliance with dynamic constraints and it produces results up to 25 times more precise than other methods which do not explicitly model the system’s dynamic equations.

2397Transformers Handle Endogeneity in In-Context Linear Regression

[openreview] [pdf]

Abstract We explore the capability of transformers to address endogeneity in in-context linear regression. Our main finding is that transformers inherently possess a mechanism to handle endogeneity effectively using instrumental variables (IV). First, we demonstrate that the transformer architecture can emulate a gradient-based bi-level optimization procedure that converges to the widely used two-stage least squares (2SLS) solution at an exponential rate. Next, we propose an in-context pretraining scheme and provide theoretical guarantees showing that the global minimizer of the pre-training loss achieves a small excess loss. Our extensive experiments validate these theoretical findings, showing that the trained transformer provides more robust and reliable in-context predictions and coefficient estimates than the 2SLS method, in the presence of endogeneity.

2398Fine-tuned In-Context Learning Transformers are Excellent Tabular Data Classifiers.

[openreview] [pdf]

Abstract The recently introduced TabPFN pretrains an In-Context Learning (ICL) transformer on synthetic data to perform tabular data classification. In this work, we extend TabPFN to the fine-tuning setting, resulting in a significant performance boost. We also discover that fine-tuning enables ICL-transformers to create complex decision boundaries, a property regular neural networks do not have. Based on this observation, we propose to pretrain ICL-transformers on a new forest dataset generator which creates datasets that are unrealistic, but have complex decision boundaries. TabForest, the ICL-transformer pretrained on this dataset generator, shows better fine-tuning performance when pretrained on more complex datasets. Additionally, TabForest outperforms TabPFN on some real-world datasets when fine-tuning, despite having lower zero-shot performance due to the unrealistic nature of the pretraining datasets. By combining both dataset generators, we create TabForestPFN, an ICL-transformer that achieves excellent fine-tuning performance and good zero-shot performance.

2399Causal Bayesian Optimization with Unknown Causal Graphs

[openreview] [pdf]

Abstract Causal Bayesian Optimization (CBO) is a methodology designed to optimize an outcome variable by leveraging known causal relationships through targeted interventions. Traditional CBO methods require a fully and accurately specified causal graph, which is a limitation in many real-world scenarios where such graphs are unknown. To address this, we propose a new method for the CBO framework that operates without prior knowledge of the causal graph. We demonstrate through theoretical analysis and empirical validation that focusing on the direct causal parents of the target variable is sufficient for optimization. Our method learns a Bayesian posterior over the direct parents of the target variable. This allows us to optimize the outcome variable while simultaneously learning the causal structure. Our contributions include a derivation of a closed-form posterior distribution for the linear case. In the nonlinear case, we present a Gaussian Process (GP) approximation that still enables CBO in cases where the posterior is not tractable. The proposed method performs competitively with existing benchmarks and scales well to larger graphs, making it a practical tool for real-world applications where causal information is incomplete.

2400DP-GPL: Differentially Private Graph Prompt Learning

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have shown remarkable performance in various applications. Recently, graph prompt learning has emerged as a powerful GNN training paradigm, inspired by advances in language and vision models. Here, a GNN is pre-trained on public data and then adapted to sensitive tasks using lightweight graph prompts. However, using prompts from sensitive data poses privacy risks. In this work, we are the first to investigate these risks in graph prompts by instantiating a membership inference attack that reveals significant privacy leakage. We also find that the standard privacy method, DP-SGD, fails to provide practical privacy-utility trade-offs in graph prompt learning, likely due to the small number of sensitive data points used to learn the prompts. As a solution, we propose two algorithms, DP-GPL and DP-GPL+W, for differentially private graph prompt learning based on the PATE framework, that generate a graph prompt with differential privacy guarantees. Our evaluation across various graph prompt learning methods, GNN architectures, and pre-training strategies demonstrates that our algorithms achieve high utility at strong privacy, effectively mitigating privacy concerns while preserving the powerful capabilities of prompted GNNs.

2401Amulet: ReAlignment During Test Time for Personalized Preference Adaptation of LLMs

[openreview] [pdf]

Abstract How to align large language models (LLMs) with user preferences from a static general dataset has been frequently studied. However, user preferences are usually personalized, changing, and diverse. This leads to the problem that the actual user preferences often do not coincide with those trained by the model developers in the practical use of LLMs. Since we cannot collect enough data and retrain for every demand, researching efficient real-time preference adaptation methods based on the backbone LLMs during test time is important. To this end, we introduceAmulet, a novel, training-free framework that formulates the decoding process of every token as a separate online learning problem with the guidance of simple user-provided prompts, thus enabling real-time optimization to satisfy users’ personalized preferences. To reduce the computational cost brought by this optimization process for each token, we additionally provide a closed-form solution for each iteration step of the optimization process, thereby reducing the computational time cost to a negligible level. The detailed experimental results demonstrate that Amulet can achieve significant performance improvements in rich settings with combinations of different LLMs, datasets, and user preferences, while maintaining acceptable computational efficiency.

2402DLGrapher: Dual Latent Diffusion for Attributed Graph Generation

[openreview] [pdf]

Abstract Graphs for applications like social data and financial transactions are particularly complex, with large node counts and high-dimensional features. State-of-the-art diffusion graph synthesizers model the node structure via discrete diffusion and are, unfortunately, limited to small-scale graphs with few to no features. In contrast, continuous diffusion models capture rich node features well, but have issues faithfully modelling connectivity. In this paper, we design DLGrapher, a dual latent diffusion framework for jointly synthesizing large graph structures and high-dimension node features. DLGrapher models node features and structure as a joint latent representation. Structure-wise, we design a reversible coarsening scheme to merge pairs of similar neighboring nodes and their respective edges after encoding node features through a structure-aware variational autoencoder. To capture the dependencies between node features and the graph structure, DLGrapher trains a single diffusion over a dual denoising objective, one for the continuous node representations and another for the discrete edge connectivity. We extensively evaluate DLGrapher’s performance on three complex social graph datasets against baselines combining tabular and graph synthesizers. Our solution fares 12.9x better at statistically capturing feature-structure interaction and 25.2% better at downstream tasks thanks to the dual diffusion on average and the latent compressed representation increases throughput by 2.5X. Furthermore, we maintain competitive synthesis quality for simple-featured molecular graphs and structure-only synthetic graphs while drastically reducing computation in the latter case.

2403Regularized Conditional Optimal Transport for Feature Learning and Generalization Bounds

[openreview] [pdf]

Abstract This paper develops the regularized conditional optimal transport for feature learning in an embedding space. Instead of using joint distributions of data, we introduce conditional distributions to some reference conditional distributions in terms of the Kullback-Leibler (KL) divergence. Using conditional distributions provides the flexibility in controlling the transferring range of given data points. When the alternating optimization technique is employed to solve our model, it is interesting to find that conditional and marginal distributions have closed-form solutions. Moreover, the use of conditional distributions facilitates the derivation of the generalization bound of our model via the Rademacher complexity, which characterizes its convergence speed in terms of the number of samples. By optimizing the anchors (centroids) defined in the model, we also employ optimal transport and autoencoders to explore an embedding space of samples in the clustering problem. In the experimental part, we demonstrate that the proposed model achieves promising performance on some learning tasks. Moreover, we construct a conditional Wasserstein classifier to classify set-valued objects.

2404Advancing Table Understanding of Large Language Models via Feature Re-ordering

[openreview] [pdf]

Abstract Large Language Models (LLMs) exhibit exceptional proficiency in comprehending human language. Despite their significant success across a wide array of tasks, including text generation, translation, question answering, and even code generation, understanding tabular data remains a challenging task. Especially, tabular data lacks an intrinsic order of the different features (table fields), whereas LLMs take only sequential inputs. Consequently, an artificial order is imposed, the impact of which on the performance of LLMs has not yet been thoroughly investigated. Surprisingly, as discovered in this work, this artificially induced order bias dramatically influences the performance of LLMs on tasks related to tabular data. Mitigating the order bias presents a significant challenge. To address this, we propose a simple and cost-effective method, Re-Ordering Tabular feATures fOR LLM (ROTATOR-LLM), to conduct test-time compute without fine-tuning the base LLM. Aiming at optimizing the feature order of tabular data and boosting LLMs’ capability to better understand the data semantics, ROTATOR-LLM re-frames the ordering problem as a feature trajectory generation task. A dynamic programming based meta-controller is trained to auto-regressively generate an individualized feature trajectory for each data instance via accumulative value estimation of the serialized feature input through the LLM’s final performance metrics. Model performance is maximized by iteratively selecting features across different steps. Experimental results on multiple datasets and LLMs show close to or over 20% performance boosts via features reordered by ROTATOR-LLM against the un-ordered counterpart. Also, it outperforms State-Of-The-Art tabular LLM methods with significant margin. Moreover, meta-controller demonstrates strong transferability: the tested LLMs gain performance enhancements when utilizing a meta-controller trained on one of them.

2405Single-agent Poisoning Attacks Suffice to Ruin Multi-Agent Learning

[openreview] [pdf]

Abstract We investigate the robustness of multi-agent learning in strongly monotone games with bandit feedback. While previous research has developed learning algorithms that achieve last-iterate convergence to the unique Nash equilibrium (NE) at a polynomial rate, we demonstrate that all such algorithms are vulnerable to adversaries capable of poisoning even a single agent’s utility observations. Specifically, we propose an attacking strategy such that for any given time horizon TT, the adversary can mislead any multi-agent learning algorithm to converge to a point other than the unique NE with a corruption budget that grows sublinearly in TT. To further understand the inherent robustness of these algorithms, we characterize the fundamental trade-off between convergence speed and the maximum tolerable total utility corruptions for two example algorithms, including the state-of-the-art one. Our theoretical and empirical results reveal an intrinsic efficiency-robustness trade-off: the faster an algorithm converges, the more vulnerable it becomes to utility poisoning attacks. To the best of our knowledge, this is the first work to identify and characterize such a trade-off in the context of multi-agent learning.

2406Balancing Model Efficiency and Performance: Adaptive Pruner for Long-tailed Data

[openreview] [pdf]

Abstract Long-tailed distribution datasets are prevalent in many machine learning tasks, yet existing neural network models still face significant challenges when handling such data. This paper proposes a novel adaptive pruning strategy, LTAP (Long-Tailed Adaptive Pruner), aimed at balancing model efficiency and performance to better address the challenges posed by long-tailed data distributions. LTAP introduces multi-dimensional importance scoring criteria and designs a dynamic weight adjustment mechanism to adaptively determine the pruning priority of parameters for different classes. By focusing on protecting parameters critical for tail classes, LTAP significantly enhances computational efficiency while maintaining model performance. This method combines the strengths of long-tailed learning and neural network pruning, overcoming the limitations of existing approaches in handling imbalanced data. Extensive experiments demonstrate that LTAP outperforms existing methods on various long-tailed datasets, achieving a good balance between model compression rate, computational efficiency, and classification accuracy. This research provides new insights into solving model optimization problems in long-tailed learning and is significant for improving the performance of neural networks on imbalanced datasets. The code is available at \url{https://anonymous.4open.science/r/AEFCDAISJ/README.md}.

2407MoS: Unleashing Parameter Efficiency of Low-Rank Adaptation with Mixture of Shards

[openreview] [pdf]

Abstract The rapid scaling of large language models necessitates more lightweight finetuning methods to reduce the explosive GPU memory overhead when numerous customized models are served simultaneously. Targeting more parameter-efficient low-rank adaptation (LoRA), parameter sharing presents a promising solution. Empirically, our research into high-level sharing principles highlights the indispensable role of differentiation in reversing the detrimental effects of pure sharing. Guided by this finding, we propose Mixture of Shards (MoS), incorporating both inter-layer and intra-layer sharing schemes, and integrating four nearly cost-free differentiation strategies, namely subset selection, pair dissociation, vector sharding, and shard privatization. Briefly, it selects a designated number of shards from global pools with a Mixture-of-Experts (MoE)-like routing mechanism before sequentially concatenating them to low-rank matrices. Hence, it retains all the advantages of LoRA while offering enhanced parameter efficiency, and effectively circumvents the drawbacks of peer parameter-sharing methods. Our empirical experiments demonstrate approximately 8×8\times parameter savings in a standard LoRA setting. The ablation study confirms the significance of each component. Our insights into parameter sharing and MoS method may illuminate future developments of more parameter-efficient finetuning methods.

2408Recipes for Unbiased Reward Modeling Learning: An Empirically Study

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) enhances the alignment between humans and large language models (LLMs), with Reward Models (RMs) playing a pivotal role. RLHF and sampling techniques, such as Best-of-N, require RMs to provide reliable rewards to guide policy training or sample selection. However, despite the advancement of LLMs, critical issues in RMs persist, such as overestimation on out-of-distribution (OOD) data (also known as reward hacking) and a preference for verbose outputs (length bias). These issues undermine the reliability of RM-generated rewards. Training an unbiased RM requires addressing these challenges, yet there is a lack of in-depth analysis on RMs. In this paper, we first decompose the RM training pipeline and identify three key aspects critical for developing an unbiased RM: 1) model architectures, 2) training paradigms, and 3) the influence of preference data. For each aspect, we conduct thorough empirical studies, revealing several insightful design considerations. Building on our findings, we develop an RM capable of mitigating the identified issues. This study represents the first comprehensive examination of various challenges from a holistic perspective in RM training, offering in-depth analyses of essential concerns and providing guidance for training unbiased RMs that can accurately guide downstream policies. The relevant code and models will be made publicly available.

2409Feature-Based Online Bilateral Trade

[openreview] [pdf]

Abstract Bilateral trade models the problem of facilitating trades between a seller and a buyer having private valuations for the item being sold. In the online version of the problem, the learner faces a new seller and buyer at each time step, and has to post a price for each of the two parties without any knowledge of their valuations. We consider a scenario where, at each time step, before posting prices the learner observes a context vector containing information about the features of the item for sale. The valuations of both the seller and the buyer follow an unknown linear function of the context. In this setting, the learner could leverage previous transactions in an attempt to estimate private valuations. We characterize the regret regimes of different settings, taking as a baseline the best context-dependent prices in hindsight. First, in the setting in which the learner has two-bit feedback and strong budget balance constraints, we propose an algorithm with O(logT)O(\log T) regret. Then, we study the same set-up with noisy valuations, providing a tight O~(T2/3)\widetilde O(T^{2/3}) regret upper bound. Finally, we show that loosening budget balance constraints allows the learner to operate under more restrictive feedback. Specifically, we show how to address the one-bit, global budget balance setting through a reduction from the two-bit, strong budget balance setup. This established a fundamental trade-off between the quality of the feedback and the strictness of the budget constraints.

2410On the Surprising Efficacy of Online Self-Improvement for Embodied Multimodal Foundation Models

[openreview] [pdf]

Abstract Foundation models trained on web-scale data have revolutionized robotics, but their application to low-level control remains largely limited to behavioral cloning. Drawing inspiration from the sample efficiency and success of reinforcement learning (RL) fine-tuning in large language models (LLMs), we propose a two-stage approach suited to robotics. The first stage, Supervised Fine-Tuning (SFT), fine-tunes pre-trained foundation models using goal-conditioned behavioral cloning and “steps-to-go” prediction objectives. In the second stage, this foundation enables the extraction of a well-shaped reward function and a success detector, eliminating the need for manual reward engineering and real-world instrumentation, and allowing robots to practice autonomously with minimal human supervision. Our experiments on both real-world and simulated robots demonstrate that the combination of SFT and online Self-Improvement is significantly more sample-efficient than supervised learning alone. Furthermore, the combination of our proposed approach with web-scale pre-trained foundation models enables rapid acquisition of new skills, allowing robots to generalize far beyond the behaviors observed in the imitation learning datasets used during training. These findings highlight the transformative potential of combining pre-trained foundation models with online fine-tuning to unlock new levels of autonomy and skill acquisition in robotics.

2411Chain of Ideas: Revolutionizing Research in Idea Development with LLM Agents

[openreview] [pdf]

Abstract Effective research ideation is a critical step for scientific research. However, the exponential increase in scientific literature makes it challenging for researchers to stay current with recent advances and identify meaningful research directions. Recent developments in large language models~(LLMs) suggest a promising avenue for automating the generation of novel research ideas. However, existing methods for idea generation either trivially prompt LLMs or directly expose LLMs to extensive literature without indicating useful information. Inspired by the research process of human researchers, we propose a Chain-of-Ideas (CoI) agent, an LLM-based agent that organizes relevant literature in a chain structure to effectively mirror the progressive development in a research domain. This organization facilitates LLMs to capture the current advancements in research, thereby enhancing their ideation capabilities. Furthermore, we propose Idea Arena, an evaluation protocol that can comprehensively evaluate idea generation methods from different perspectives, aligning closely with the preferences of human researchers. Experimental results indicate that the CoI agent consistently outperforms other methods and shows comparable quality as humans in research idea generation. Moreover, our CoI agent is budget-friendly, with a minimum cost of $0.50 to generate a candidate idea and its corresponding experimental design.

2412T2V-Turbo-v2: Enhancing Video Model Post-Training through Data, Reward, and Conditional Guidance Design

[openreview] [pdf]

Abstract In this paper, we focus on enhancing a diffusion-based text-to-video (T2V) model during the post-training phase by distilling a highly capable consistency model from a pretrained T2V model. Our proposed method, T2V-Turbo-v2, introduces a significant advancement by integrating various supervision signals, including high-quality training data, reward model feedback, and conditional guidance, into the consistency distillation process. Through comprehensive ablation studies, we highlight the crucial importance of tailoring datasets to specific learning objectives and the effectiveness of learning from diverse reward models for enhancing both the visual quality and text-video alignment. Additionally, we highlight the vast design space of conditional guidance strategies, which centers on designing an effective energy function to augment the teacher ODE solver. We demonstrate the potential of this approach by extracting motion guidance from the training datasets and incorporating it into the ODE solver, showcasing its effectiveness in improving the motion quality of the generated videos with the improved motion-related metrics from VBench and T2V-CompBench. Empirically, our T2V-Turbo-v2 establishes a new state-of-the-art result on VBench,with a Total score of 85.13, surpassing proprietary systems such as Gen-3 and Kling.

2413Taming Overconfidence in LLMs: Reward Calibration in RLHF

[openreview] [pdf]

Abstract Language model calibration refers to the alignment between the confidence of the model and the actual performance of its responses. While previous studies point out the overconfidence phenomenon in Large Language Models (LLMs) and show that LLMs trained with Reinforcement Learning from Human Feedback (RLHF) are overconfident with a more sharpened output probability, in this study, we reveal that RLHF tends to lead models to express verbalized overconfidence in their own responses. We investigate the underlying cause of this overconfidence and demonstrate that reward models used for Proximal Policy Optimization (PPO) exhibit inherent biases towards high-confidence scores regardless of the actual quality of responses. Building upon this insight, we propose two PPO variants: PPO-M: PPO\underline{PPO} with Calibrated Reward M\underline{M}odeling and PPO-C: PPO\underline{PPO} with Calibrated Reward C\underline{C}alculation. PPO-M integrates explicit confidence scores in reward model training, which calibrates reward models to better capture the alignment between response quality and verbalized confidence. PPO-C adjusts the reward score during PPO based on the difference between the current reward and the moving average of past rewards. Both PPO-M and PPO-C can be seamlessly integrated into the current PPO pipeline and do not require additional golden labels. We evaluate our methods on both Llama3-8B\texttt{Llama3-8B} and Mistral-7B\texttt{Mistral-7B} across six diverse datasets including multiple-choice and open-ended generation. Experiment results demonstrate that both of our methods can reduce calibration error and maintain performance comparable to standard PPO. We further show that they do not compromise model capabilities in open-ended conversation settings.

2414Discovering Temporally Compositional Neural Manifolds with Switching Infinite GPFA

[openreview] [pdf]

Abstract Gaussian Process Factor Analysis (GPFA) is a powerful latent variable model for extracting low-dimensional manifolds underlying population neural activities. However, one limitation of standard GPFA models is that the number of latent factors needs to be pre-specified or selected through heuristic-based processes, and that all factors contribute at all times. We propose the infinite GPFA model, a fully Bayesian non-parametric extension of the classical GPFA by incorporating an Indian Buffet Process (IBP) prior over the factor loading process, such that it is possible to infer a potentially infinite set of latent factors, and the identity of those factors that contribute to neural firings in a compositional manner at each time point. Learning and inference in the infinite GPFA model is performed through variational expectation-maximisation, and we additionally propose scalable extensions based on sparse variational Gaussian Process methods. We empirically demonstrate that the infinite GPFA model correctly infers dynamically changing activations of latent factors on a synthetic dataset. By fitting the infinite GPFA model to population activities of hippocampal place cells during spatial navigation, we identify non-trivial and behaviourally meaningful dynamics in the neural encoding process.

2415Anomaly Detection Exposed: Imagining Anomalies Were Normal

[openreview] [pdf]

Abstract Deep learning-based methods have achieved a breakthrough in image anomaly detection, but their complexity introduces a considerable challenge to understanding why an instance is predicted to be anomalous. We introduce a novel explanation method that generates multiple alternative modifications for each anomaly, capturing diverse concepts of anomalousness. Each modification is trained to be perceived as normal by the anomaly detector. The method provides a semantic explanation of the mechanism that triggered the anomaly detector, allowing users to explore ``what-if scenarios.‘’ Qualitative and quantitative analyses across various image datasets demonstrate that applying this method to state-of-the-art anomaly detectors provides high-quality semantic explanations.

2416SWIFT: Mapping Sub-series with Wavelet Decomposition Improves Time Series Forecasting

[openreview] [pdf]

Abstract In this paper, we propose SWIFT\textit{SWIFT}, a lightweight model that is not only powerful, but also efficient in deployment and inference for Long-term Time Series Forecasting (LTSF). Our model is based on two key points: 1. decomposition of sequences using wavelet transform. 2. using only one shared single layer for sub-series’ mapping. We conduct comprehensive experiments, and the results show that SWIFT\textit{SWIFT} achieves state-of-the-art (SOTA) performance on multiple datasets, offering a promising method for edge computing and deployment in this task. Moreover, it is noteworthy that the number of parameters in SWIFT\textit{SWIFT} is only 25% of what it would be with a single-layer linear model for time-domain prediction.

2417Revised NTK Analysis of Optimization and Generalization with Its Extensions to Arbitrary Initialization

[openreview] [pdf]

Abstract Recent theoretical works based on the neural tangent kernel (NTK) have shed light on the optimization and generalization of over-parameterized neural networks, and partially bridge the gap between their practical success and classical learning theory. However, the existing NTK-based analysis has a limitation that the scaling of the initial parameter should decrease with respect to the sample size which is contradictory to the practical initialization scheme. To address this issue, in this paper, we present the revised NTK analysis of optimization and generalization of overparametrized neural networks, which successfully remove the dependency on the sample size of the initialization. Based on our revised analysis, we further extend our theory that allow for arbitrary initialization, not limited to Gaussian initialization. Under our initialization-independent analysis, we propose NTK-based regularizer that can improve the model generalization, thereby illustrating the potential to bridge the theory and practice while also supporting our theory. Our numerical simulations demonstrate that the revised theory indeed can achieve the significantly lower generalization error bound compared to existing error bound. Also importantly, the proposed regularizer also corroborate our theory on the arbitrary initialization with fine-tuning scenario, which takes the first step for NTK theory to be promisingly applied to real-world applications.

2418Step-wise Triple-Consistent Diffusion Sampling for Inverse Problems

[openreview] [pdf]

Abstract Diffusion models (DMs) are a class of generative models that allow sampling from a distribution learned over a training set. When applied to solving inverse imaging problems (IPs), the reverse sampling steps of DMs are typically modified to approximately sample from a measurement-conditioned distribution in the image space. However, these modifications may be unsuitable for certain settings (such as in the presence of measurement noise) and non-linear tasks, as they often struggle to correct errors from earlier sampling steps and generally require a large number of optimization and/or sampling steps. To address these challenges, we state three conditions for achieving measurement-consistent diffusion trajectories. Building on these conditions, we propose a new optimization-based sampling method that not only enforces the standard data manifold measurement consistency and forward diffusion consistency, as seen in previous studies, but also incorporates backward diffusion consistency that maintains a diffusion trajectory by optimizing over the input of the pre-trained model at every sampling step. By enforcing these conditions, either implicitly or explicitly, our sampler requires significantly fewer reverse steps. Therefore, we refer to our accelerated method asStep-wiseTriple-Consistent Sampling (SITCOM). Compared to existing state-of-the-art baseline methods, under different levels of measurement noise, our extensive experiments across five linear and three non-linear image restoration tasks demonstrate that SITCOM achieves competitive or superior results in terms of standard image similarity metrics while requiring a significantly reduced run-time across all considered tasks.

2419CirT: Global Subseasonal-to-Seasonal Forecasting with Geometry-inspired Transformer

[openreview] [pdf]

Abstract Accurate Subseasonal-to-Seasonal (S2S) climate forecasting is pivotal for decision-making including agriculture planning and disaster preparedness but is known to be challenging due to its chaotic nature. Although recent data-driven models have shown promising results, their performance is limited by inadequate consideration of geometric inductive biases. Usually, they treat the spherical weather data as planar images, resulting in an inaccurate representation of locations and spatial relations. In this work, we propose the geometric-inspired Circular Transformer (CirT) to model the cyclic characteristic of the graticule, consisting of two key designs: (1) Decomposing the weather data by latitude into circular patches that serve as input tokens to the Transformer; (2) Leveraging Fourier transform in self-attention to capture the global information and model the spatial periodicity. Extensive experiments on the Earth Reanalysis 5 (ERA5) reanalysis dataset demonstrate our model yields a significant improvement over the advanced data-driven models, including PanguWeather and GraphCast, as well as skillful ECMWF systems. Additionally, we empirically show the effectiveness of our model designs and high-quality prediction over spatial and temporal dimensions.

2420DROSIA: Decoupled Representation on Sequential Information Aggregation for Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting is crucial in various fields, including finance, energy consumption, weather, transportation, and network traffic. It necessitates effective and efficient sequence modeling to encapsulate intricate temporal relationships. However, conventional methods often aggregate sequential information into representations of each time point by considering other points in the sequence, thereby ignoring the intra-individual information and suffering from inefficiency. To address these challenges, we introduce a novel approach, DROSIA: Decoupled Representation On Sequential Information Aggregation, which only integrates temporal relationships once as an additional representation for each point, achieving sequential information aggregation in a decoupled fashion. Thus balancing between individual and sequential information, along with a reduction in computational complexity. We select several widely used time series forecasting datasets, and previously top-performing models and baselines, for a comprehensive comparison. The experimental results validate the effectiveness and efficiency of DROSIA, which achieves state-of-the-art performance with only linear complexity. When provided with sufficiently long input data, the channel-independent DROSIA even outperforms the current best channel-dependent model, highlighting its proficiency in sequence modeling and capturing long-distance dependencies. Our code will be made open-source in the subsequent version of this paper.

2421A Causal Study on The Learnability of Formal Languages

[openreview] [pdf]

Abstract Understanding the limitations of neural language models is crucial for knowing what such models are capable of and how they can be used safely. A popular approach to analyzing formal limitations takes the form of training models on formal languages, and studying what aspects of the languages affect model performance. Formal languages can, for instance, be designed using manually constructed grammars or randomly sampled by sampling some type of automata. This provides the researcher with unique control over the features of the language of interest. In this paper, we provide an even more fine-grained approach to targeted model evaluation. We develop a method for controlling specific \emph{string} features, on the corpus level, in the language of a given automaton. This gives us control over properties such as symbol frequencies while keeping everything else intact, enabling a causal study of their importance. To describe our framework formally, we turn to \emph{semirings} and introduce finite state automata over a novel---counting---semiring. We devise algorithms that enable string sampling under varying degrees of interventions and demonstrate the utility of our method through several examples showing how targeted interventions over transition, symbol, and state frequencies can be performed. We then train Transformer and LSTM language models on languages under varying degrees of interventions. Our fine-grained analysis allows us to show that different mechanisms influence the learning behavior of these two architectures.

2422Improving Neural Optimal Transport via Displacement Interpolation

[openreview] [pdf]

Abstract Optimal Transport (OT) theory investigates the cost-minimizing transport map that moves a source distribution to a target distribution. Recently, several approaches have emerged for learning the optimal transport map for a given cost function using neural networks. We refer to these approaches as the OT Map. OT Map provides a powerful tool for diverse machine learning tasks, such as generative modeling and unpaired image-to-image translation. However, existing methods that utilize max-min optimization often experience training instability and sensitivity to hyperparameters. In this paper, we propose a novel method to improve stability and achieve a better approximation of the OT Map by exploiting displacement interpolation, dubbed Displacement Interpolation Optimal Transport Model (DIOTM). We derive the dual formulation of displacement interpolation at specific time tt and prove how these dual problems are related across time. This result allows us to utilize the entire trajectory of displacement interpolation in learning the OT Map. Our method improves the training stability and achieves superior results in estimating optimal transport maps. We demonstrate that DIOTM outperforms existing OT-based models on image-to-image translation tasks.

2423Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

[openreview] [pdf]

Abstract Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings.

2424MSLC: Monte Carlo Tree Search Sampling Guided Local Construction for Solving Large-Scale Traveling Salesman Problem

[openreview] [pdf]

Abstract Neural solvers have achieved promising results in solving small-scale Travelling Salesman Problems (TSP), but inefficiencies arise when tackling larger instances. This paper proposes the MSLC (\textbf{M}onte Carlo Tree Search \textbf{S}ampling Guided \textbf{L}ocal \textbf{C}onstruction) framework, which innovatively integrates a predictive sampling module into the global coarse-grained selection module, MCTS, to achieve mutual integration with the fine-grained local construction module. This integration effectively balances coarse-grained exploration with fine-grained adjustment, thereby improving overall efficiency. This framework offers a novel way to combine autoregressive and non-autoregressive models. Experimental results demonstrate that MSLC effectively balances time and solution quality, outperforming state-of-the-art neural solvers. The performance gap of MSLC is reduced by at least 29.4% (resp. 34.7% or 28.5%) on TSP-500 (resp. TSP-1000 or TSP-10000), compared to the SOTA neural methods.

2425Auto-GDA: Automatic Domain Adaptation for Efficient Grounding Verification in Retrieval Augmented Generation

[openreview] [pdf]

Abstract While retrieval augmented generation (RAG) has been shown to enhance factuality of large language model (LLM) outputs, LLMs still suffer from hallucination, generating incorrect or irrelevant information. One common detection strategy involves prompting the LLM again to assess whether its response is grounded in the retrieved evidence, but this approach is costly. Alternatively, lightweight natural language inference (NLI) models for efficient grounding verification can be used at inference time. While existing pre-trained NLI models offer potential solutions, their performance remains subpar compared to larger models on realistic RAG inputs. RAG inputs are more complex than most datasets used for training NLI models and have characteristics specific to the underlying knowledge base, requiring adaptation of the NLI models to a specific target domain. Additionally, the lack of labeled instances in the target domain makes supervised domain adaptation, e.g., through fine-tuning, infeasible. To address these challenges, we introduce Automatic Generative Domain Adaptation (Auto-GDA). Our framework enables unsupervised domain adaptation through synthetic data generation. Unlike previous methods that rely on handcrafted filtering and augmentation strategies, Auto-GDA employs an iterative process to continuously improve the quality of generated samples using weak labels from less efficient teacher models and discrete optimization to select the most promising augmented samples. Experimental results demonstrate the effectiveness of our approach, with models fine-tuned on synthetic data using Auto-GDA often surpassing the performance of the teacher model and reaching the performance level of LLMs at 10 % of their computational cost.

2426Statistical Test for Anomaly Detections using Variational Auto-Encoders by Selective Inference

[openreview] [pdf]

Abstract Over the past decade, Variational Autoencoders (VAE) have become a widely used tool for anomaly detection (AD), with research advancing from algorithm development to real-world applications. However, a critical challenge remains --- the lack of a reliable method to rigorously assess the reliability of detected anomalies, which restricts its use in high-stakes decision-making tasks such as medical diagnostics. To overcome this limitation, we introduce the VAE-AD Test, a novel approach for quantifying the statistical reliability of VAE-based AD. The key advantage of the VAE-AD Test lies in its ability to properly control the probability of misidentifying anomalies under a pre-specified level of guarantee α\alpha (e.g., 0.05). Specifically, by carefully analyzing the AD process of VAE, which operates through piecewise-linear functions, and leveraging the Selective Inference (SI) framework to assign valid p-values to the detected anomalies, we prove that theoretical control of the false detection rate is achievable. Experiments conducted on both synthetic and real-world datasets robustly support our theoretical results, showcasing the VAE-AD Test’s superior performance. To our knowledge, this is the first work capable of conducting valid statistical inference to assess the reliability of VAE-based AD.

2427Concept Bottleneck Large Language Models

[openreview] [pdf]

Abstract We introduce the Concept Bottleneck Large Language Model (CB-LLM), a pioneering approach to creating inherently interpretable Large Language Models (LLMs). Unlike traditional black-box LLMs that rely on post-hoc interpretation methods with limited neuron function insights, CB-LLM sets a new standard with its built-in interpretability, scalability, and ability to provide clear, accurate explanations. We investigate two essential tasks in the NLP domain: text classification and text generation. In text classification, CB-LLM narrows the performance gap with traditional black-box models and provides clear interpretability. In text generation, we show how interpretable neurons in CB-LLM can be used for concept detection and steering text generation. Our CB-LLMs enable greater interaction between humans and LLMs across a variety of tasks --- a feature notably absent in existing LLMs.

2428Practicalϵ-Exploring Thompson Sampling for Reinforcement Learning with Continuous Controls

[openreview] [pdf]

Abstract Balancing exploration and exploitation is crucial in reinforcement learning (RL). While Thompson Sampling (TS) is a sound and effective exploration strategy, its application to RL with high-dimensional continuous controls remains challenging. We propose Practical ϵ\epsilon-Exploring Thompson Sampling (PETS), a practical approach that addresses these challenges. Since the posterior over the parameters of the action-value function is intractable, we leverage Langevin Monte Carlo (LMC) for sampling. We propose an approach which maintains nn parallel Markov chains to mitigate the issues of nai"{ve} application of LMC. The next step following the posterior sampling in TS involves finding the optimal action under the sampled model of the action-value function. We explore both gradient-based and gradient-free approaches to approximate the optimal action, with extensive experiments. Furthermore, to justify the use of gradient-based optimization to approximate the optimal action, we analyze the regret for TS in the RL setting with continuous controls and show that it achieves the best-known bound previously established for the discrete setting. Our empirical results demonstrate that PETS, as an exploration strategy, can be integrated with leading RL algorithms, enhancing their performance and stability on benchmark continuous control tasks.

2429Audio Prototypical Network for Controllable Music Recommendation

[openreview] [pdf]

Abstract Traditional recommendation systems represent user preferences in dense representations obtained through black-box encoder models. While these models often provide strong recommendation performance, they lack interpretability for users, leaving users unable to understand or control the system’s modeling of their preferences. This limitation is especially challenging in music recommendation, where user preferences are highly personal and often evolve based on nuanced qualities like mood, genre, tempo, or instrumentation. In this paper, we propose an audio prototypical network for controllable music recommendation. This network expresses user preferences in terms of prototypes representative of semantically meaningful features pertaining to musical qualities. We show that the model obtains competitive recommendation performance compared to popular baseline models while also providing interpretable and controllable user profiles.

2430Learning from interval targets

[openreview] [pdf]

Abstract We consider regression problems where the exact real-valued targets are not directly available; instead, supervision is provided in the form of intervals around the targets—that is, only lower and upper bounds are known. Such a “learning from interval targets” setup arises in domains where labeling costs are high or there is inherent uncertainty in the target values. In these settings, traditional regression loss functions, which require exact target values, cannot be directly applied. To address this challenge, we propose two approaches: (i) modifying the regression loss function to be compatible with interval ground truths, and (ii) formulating a min-max problem where we minimize the typical regression loss with respect to the “worst-case” label within the interval. We provide theoretical guarantees for our methods, analyze their computational efficiency, and evaluate their practical performance on real-world datasets.

2431Optimization Insights into Deep Diagonal Linear Networks

[openreview] [pdf]

Abstract Overparameterized models trained with (stochastic) gradient descent are ubiquitous in modern machine learning. These large models achieve unprecedented performance on test data, but their theoretical understanding is still limited. In this paper, we take a step towards filling this gap by adopting an optimization perspective. More precisely, we study the implicit regularization properties of the gradient flow “algorithm” for estimating the parameters of a deep diagonal neural network. Our main contribution is showing that this gradient flow induces a mirror flow dynamic on the model, meaning that it is biased towards a specific solution of the problem depending on the initialization of the network. Along the way, we prove several properties of the trajectory.

2432Mixed-curvature decision trees and random forests

[openreview] [pdf]

Abstract Decision trees (DTs) and their random forest (RF) extensions are workhorses of classification and regression in Euclidean spaces. However, algorithms for learning in non-Euclidean spaces are still limited. We extend DT and RF algorithms to product manifolds: Cartesian products of several hyperbolic, hyperspherical, or Euclidean components. Such manifolds handle heterogeneous curvature while still factorizing neatly into simpler components, making them compelling embedding spaces for complex datasets. Our novel angular reformulation of DTs respects the geometry of the product manifold, yielding splits that are geodesically convex, maximum-margin, and composable. In the special cases of single-component manifolds, our method simplifies to its Euclidean or hyperbolic counterparts, or introduces hyperspherical DT algorithms, depending on the curvature. We benchmark our method on various classification, regression, and link prediction tasks on synthetic data, graph embeddings, mixed-curvature variational autoencoder latent spaces, and empirical data. Compared to six other classifiers, product DTs and RFs ranked first on 21 of 22 single-manifold benchmarks and 18 of 35 product manifold benchmarks, and placed in the top 2 on 53 of 57 benchmarks overall. This highlights the value of product DTs and RFs as straightforward yet powerful new tools for data analysis in product manifolds.

2433Improving Graph Neural Networks by Learning Continuous Edge Directions

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) traditionally employ a message-passing mechanism that resembles diffusion over undirected graphs, which often leads to homogenization of node features and reduced discriminative power in tasks such as node classification. Our key insight for addressing this limitation is to assign fuzzy edge directions---that can vary continuously from node ii pointing to node jj to vice versa---to the edges of a graph so that features can preferentially flow in one direction between nodes to enable long-range information transmission across the graph. We also introduce a novel complex-valued Laplacian for directed graphs with fuzzy edges where the real and imaginary parts represent information flow in opposite directions. Using this Laplacian, we propose a general framework, called Continuous Edge Direction (CoED) GNN, for learning on graphs with fuzzy edges and prove its expressivity limits using a generalization of the Weisfeiler-Leman (WL) graph isomorphism test for directed graphs with fuzzy edges. Our architecture aggregates neighbor features scaled by the learned edge directions and processes the aggregated messages from in-neighbors and out-neighbors separately alongside the self-features of the nodes. Because continuous edge directions are differentiable, we can learn both the edge directions and the GNN weights end-to-end via gradient-based optimization. CoED GNN is particularly well-suited for graph ensemble data where the graph structure remains fixed but multiple realizations of node features are available, such as in gene regulatory networks, web connectivity graphs, and power grids. We demonstrate through extensive experiments on both synthetic and real datasets that learning continuous edge directions significantly improves performance both for undirected and directed graphs compared with existing methods.

2434Teaching with Uncertainty: Unleashing the Potential of Knowledge Distillation in Object Detection

[openreview] [pdf]

Abstract Knowledge distillation (KD) has become a fundamental technique for model compression in object detection tasks. The data noise and training randomness may cause the knowledge of the teacher model to be unreliable, referred to as knowledge uncertainty. Existing methods only transfer this knowledge and could limit the student’s ability to capture and understand the potential ``dark knowledge’'. In this work, we introduce a new strategy that explicitly incorporates knowledge uncertainty, named Uncertainty-Driven Knowledge Extraction and Transfer (UET). Given that the knowledge distribution is unknown and high-dimensional in practice, we introduce a simple yet effective sampling method with Monte Carlo dropout (MC dropout) to estimate the teacher’s knowledge uncertainty. Leveraging information theory, we integrate knowledge uncertainty into the conventional KD process, allowing the student model to benefit from knowledge diversity. UET is a plug-and-play method that integrates seamlessly with existing distillation techniques. We validate our approach through comprehensive experiments across various distillation strategies, detectors, and backbones. Specifically, UET achieves state-of-the-art results, with a ResNet50-based GFL detector obtaining 44.1% mAP on the COCO dataset—surpassing baseline performance by 3.9%.

[openreview] [pdf]

Abstract Inductive link prediction is a significant challenge in knowledge graphs, focusing on predicting potential relations between unseen entities during training. A promising approach is to utilize Graph Neural Networks (GNNs) to extract entity-independent features from surrounding subgraphs. However, existing mainstream subgraph extraction methods may lead to the loss of key entities and relations, resulting in many disconnected reasoning paths that seriously hinder effective message passing. To address this challenge, we propose a novel framework called Common Neighbor Induced Message Passing (CNMP), designed to enhance message passing even when reasoning paths are disconnected. We observe that the common neighbors of two entities must share a reasoning path. Based on this insight, CNMP enhances message passing by updating the distance labels of isolated common neighbors, even if they are unreachable. This allows CNMP to incorporate new connected equivalent relations, facilitating effective message passing. Furthermore, we introduce a CNMP+ strategy that further improves the preservation of entities and relations during the message-passing process. CNMP+ involves maintaining a list of common neighbors at various distances and using a probing strategy to reconstruct complete reasoning paths. Experiments across multiple datasets demonstrate that our method significantly outperforms existing state-of-the-art methods.

2436Time-aware World Model: Adaptive Learning of Task Dynamics

[openreview] [pdf]

Abstract In this work, we introduce Time-Aware World Model, a model-based approach designed to explicitly incorporate the temporal dynamics of environments. By conditioning on the time step size, Δt\Delta t, and training over a diverse range of Δt\Delta t values - rather than relying on a fixed time step size - our model enables learning of both high- and low-frequency task dynamics in real-world control problems. Inspired by the information-theoretic principle that the optimal sampling rate varies depending on the underlying dynamics of different physical systems, our time-aware model enhances both performance and learning efficiency. Empirical evaluations demonstrate that our model consistently outperforms baseline approaches across different observation rates in various control tasks, using the same number of training samples and iterations. We will release our source code on GitHub once the final review decisions are made.

2437Multi-Draft Speculative Sampling: Canonical Architectures and Theoretical Limits

[openreview] [pdf]

Abstract We consider multi-draft speculative sampling, where the proposal sequences are sampled independently from different draft models. At each step, a token-level draft selection scheme takes a list of valid tokens as input and produces an output token whose distribution matches that of the target model. Previous works have demonstrated that the optimal scheme (which maximizes the probability of accepting one of the input tokens) can be cast as a solution to a linear program. In this work we show that the optimal scheme can be decomposed into a two-step solution: in the first step an importance sampling (IS) type scheme is used to select one intermediate token; in the second step (single-draft) speculative sampling is applied to generate the output token. For the case of two identical draft models we further 1) establish a necessary and sufficient condition on the distributions of the target and draft models for the acceptance probability to equal one and 2) provide an explicit expression for the optimal acceptance probability. Our theoretical analysis also motives a new class of token-level selection scheme based on weighted importance sampling. Our experimental results demonstrate consistent improvements in the achievable block efficiency and token rates over baseline schemes in a number of scenarios.

2438LW2G: Learning Whether to Grow for Prompt-based Continual Learning

[openreview] [pdf]

Abstract Continual Learning (CL) aims to learn in non-stationary scenarios, progressively acquiring and maintaining knowledge from sequential tasks. Recent Prompt-based Continual Learning (PCL) has achieved remarkable performance with Pre-Trained Models (PTMs). These approaches grow a prompt sets pool by adding a new set of prompts when learning each new task (prompt learning) and adopt a matching mechanism to select the correct set for each testing sample (prompt retrieval). Previous studies focus on the latter stage by improving the matching mechanism to enhance Prompt Retrieval Accuracy (PRA). To promote cross-task knowledge facilitation and form an effective and efficient prompt sets pool, we propose a plug-in module in the former stage to Learn Whether to Grow (LW2G) based on the disparities between tasks. Specifically, a shared set of prompts is utilized when several tasks share certain commonalities, and a new set is added when there are significant differences between the new task and previous tasks. Inspired by Gradient Projection Continual Learning, our LW2G develops a metric called Hinder Forward Capability (HFC) to measure the hindrance imposed on learning new tasks by surgically modifying the original gradient onto the orthogonal complement of the old feature space. With HFC, an automated scheme Dynamic Growing Approach adaptively learns whether to grow with a dynamic threshold. Furthermore, we design a gradient-based constraint to ensure the consistency between the updating prompts and pre-trained knowledge, and a prompts weights reusing strategy to enhance forward transfer. Extensive experiments show the effectiveness of our method.

2439Revisiting Prefix-tuning: Statistical Benefits of Reparameterization among Prompts

[openreview] [pdf]

Abstract Prompt-based techniques, such as prompt-tuning and prefix-tuning, have gained prominence for their efficiency in fine-tuning large pre-trained models. Despite their widespread adoption, the theoretical foundations of these methods remain limited. For instance, in prefix-tuning, we observe that a key factor in achieving performance parity with full fine-tuning lies in the reparameterization strategy. However, the theoretical principles underpinning the effectiveness of this approach have yet to be thoroughly examined. Our study demonstrates that reparameterization is not merely an engineering trick but is grounded in deep theoretical foundations. Specifically, we show that the reparameterization strategy implicitly encodes a shared structure between prefix key and value vectors. Building on recent insights into the connection between prefix-tuning and mixture of experts models, we further illustrate that this shared structure significantly improves sample efficiency in parameter estimation compared to non-shared alternatives. The effectiveness of prefix-tuning across diverse tasks is empirically confirmed to be enhanced by the shared structure, through extensive experiments in both visual and language domains. Additionally, we uncover similar structural benefits in prompt-tuning, offering new perspectives on its success. Our findings provide theoretical and empirical contributions, advancing the understanding of prompt-based methods and their underlying mechanisms.

2440Shifting the Paradigm: A Diffeomorphism Between Time Series Data Manifolds for Achieving Shift-Invariancy in Deep Learning

[openreview] [pdf]

Abstract Deep learning models often lack shift invariance, making them sensitive to input shifts that cause changes in output. While recent techniques seek to address this for images, our findings show that these approaches fail to provide shift-invariance in time series, where the data generation mechanism is more challenging due to the interaction of low and high frequencies. Worse, they also decrease performance across several tasks. In this paper, we propose a differentiable bijective function that maps samples from their high-dimensional data manifold to another manifold of the same dimension, without any dimensional reduction. Our approach guarantees that samples---when subjected to random shifts---are mapped to a unique point in the data manifold while preserving all task-relevant information without loss. We theoretically and empirically demonstrate that the proposed transformation guarantees shift-invariance in deep learning models without imposing any limits to the shift. Our experiments on five-time series tasks with state-of-the-art methods show that our proposed approach consistently improves the performance while enabling models to achieve complete shift-invariance without modifying or imposing restrictions on the model’s topology. Source code: Double-blind.

2441In-context Fine-tuning for Time-series Foundation Models

[openreview] [pdf]

Abstract Motivated by the recent success of time-series foundation models for zero-shot forecasting, we present a methodology forin-context fine-tuningof a time-series foundation model. In particular, we design a pretrained foundation model that can be prompted (at inference time) with multiple time-series examples, in order to forecast a target time-series into the future. Our foundation model is specifically trained to utilize examples from multiple related time-series in its context window (in addition to the history of the target time-series) to help it adapt to the specific distribution of the target domain at inference time. We show that such a foundation model that uses in-context examples at inference time can obtain much better performance on popular forecasting benchmarks compared to supervised deep learning methods, statistical models as well as other time-series foundation models. Interestingly, our in-context fine-tuning approach even rivals the performance of a foundation model that is explicitly fine-tuned on the target domain.

2442GENERALIZATION, ROBUSTNESS AND ADAPTABILITY OF PROGRESSIVE NEURAL COLLAPSE

[openreview] [pdf]

Abstract Neural networks exhibit the neural collapse phenomenon in multi-class classification tasks, where last-layer features and linear classifier weights converge into a symmetric geometric structure. However, most prior studies have primarily focused on last-layer feature representations or have examined intermediate features using limited, simple architectures and datasets. The mechanisms by which deep neural networks separate data according to class membership across all layers in more complex and realistic scenarios, and how this separation evolves under distribution shifts, remain unclear. In this work, we extend the study of neural collapse to a broader range of architectures and datasets, investigating its progression throughout the network and its implications for generalization, robustness, and domain adaptability. Our findings reveal that well-trained neural networks progressively enhance neural collapse across layers, though a distinct transition phase occurs where this improvement plateaus after the initial layers and is followed by a renewed continuous improvement in the very last layers, with additional layers contributing minimal generalization benefits. Moreover, we observe that this progressive neural collapse pattern remains robust against noisy data, whether the noise occurs in inputs or labels, and that the degree of intermediate separation serves as an effective indicator of noise levels. Additionally, for the learned networks, comparing neural collapse evaluated on noisy data and clean data reveals insights into feature learning and memorization, with the latter primarily occurring in the very last layers. This finding aligns with the neural collapse pattern observed with clean training data. Finally, we show that when a shift occurs between source and target domains, intermediate neural collapse is closely related to downstream target performance.

2443General Scene Adaptation for Vision-and-Language Navigation

[openreview] [pdf]

Abstract Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in persistent environments with relatively consistent physical layouts, visual observations, and language styles from instructors. Such a gap in the task setting presents an opportunity to improve VLN agents by incorporating continuous adaptation to specific environments. To better reflect these real-world conditions, we introduce GSA-VLN (General Scene Adaptation for VLN), a novel task requiring agents to execute navigation instructions within a specific scene and simultaneously adapt to it for improved performance over time. To evaluate the proposed task, one has to address two challenges in existing VLN datasets: the lack of out-of-distribution (OOD) data, and the limited number and style diversity of instructions for each scene. Therefore, we propose a new dataset, GSA-R2R, which significantly expands the diversity and quantity of environments and instructions for the Room-to-Room (R2R) dataset to evaluate agent adaptability in both ID and OOD contexts. Furthermore, we design a three-stage instruction orchestration pipeline that leverages large language models (LLMs) to refine speaker-generated instructions and apply role-playing techniques to rephrase instructions into different speaking styles. This is motivated by the observation that each individual user often has consistent signatures or preferences in their instructions, taking the use case of home robotic assistants as an example. We conducted extensive experiments on GSA-R2R to thoroughly evaluate our dataset and benchmark various methods, revealing key factors enabling agents to adapt to specific environments. Based on our findings, we propose a novel method, Graph-Retained DUET (GR-DUET), which incorporates memory-based navigation graphs with an environment-specific training strategy, achieving state-of-the-art results on all GSA-R2R splits.

2444Towards good practice in boosting the targeted adversarial attack

[openreview] [pdf]

Abstract By accessing only the surrogate model, attackers can craft adversarial perturbations to fool black-box victim models into misclassifying a given image into the target class. However, the misalignment between surrogate models and victim models raises concerns about defining what constitutes a successful targeted attack in a black-box setting. In our work, we empirically identify that the vision-language foundation model CLIP is a natural good indicator to evaluate a good transferable targeted attacks. We find that a successful transferable targeted attack not only confuse the model on the vision modality towards the target class, but also fool the model on the text modality between the original class and target class. Motivated by this finding, we propose a simple yet effective regularization term to boost the existing transferable targeted attacks. We also revisit the feature-based attacks, and propose to boost the performance by enhancing the fine-grained features. Extensive experiments on the ImageNet-1k dataset demonstrate the effectiveness of our proposed methods. We hope our finding can motivate future research on the understanding of targeted attacks and develop more powerful techniques.

2445No Equations Needed: Learning System Dynamics Without Relying on Closed-Form ODEs

[openreview] [pdf]

Abstract Data-driven modeling of dynamical systems is a crucial area of machine learning. In many scenarios, a thorough understanding of the model’s behavior becomes essential for practical applications. For instance, understanding the behavior of a pharmacokinetic model, constructed as part of drug development, may allow us to both verify its biological plausibility (e.g., the drug concentration curve is non-negative and decays to zero in the long term) and to design dosing guidelines (e.g., by looking at the peak concentration and its timing). Discovery of closed-form ordinary differential equations (ODEs) can be employed to obtain such insights by finding a compact mathematical equation and then analyzing it (a two-step approach). However, its widespread use (in pharmacology and other domains) is currently hindered because the analysis process may be time-consuming, requiring substantial mathematical expertise, or even impossible if the equation is too complex. Moreover, if the found equation’s behavior does not satisfy the requirements, editing it or influencing the discovery algorithms to rectify it is challenging as the link between the symbolic form of an ODE and its behavior can be elusive. This paper proposes a conceptual shift to modeling low-dimensional dynamical systems by departing from the traditional two-step modeling process. Instead of first discovering a closed-form equation and then analyzing it, our approach, direct semantic modeling, predicts the semantic representation of the dynamical system (i.e., description of its behavior) directly from data, bypassing the need for complex post-hoc analysis. This direct approach also allows the incorporation of intuitive inductive biases into the optimization algorithm and editing the model’s behavior directly, ensuring that the model meets the desired specifications. Our approach not only simplifies the modeling pipeline but also enhances the transparency and flexibility of the resulting models compared to traditional closed-form ODEs. We validate the effectiveness of this method through extensive experiments, demonstrating its advantages in terms of both performance and practical usability.

2446Adversaries Can Misuse Combinations of Safe Models

[openreview] [pdf]

Abstract Developers try to evaluate whether an AI system can accomplish malicious tasks before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing models for such misuse is inadequate; adversaries can misuse combinations of models even when each individual model is safe. The adversary accomplishes this by first decomposing tasks into subtasks, then solving each subtask with the best-suited model. For example, an adversary might solve challenging-but-benign subtasks with an aligned frontier model, and easy-but-malicious subtasks with a weaker misaligned model. We study two decomposition methods: manual decomposition where a human identifies a natural decomposition of a task, and automated decomposition where a weak model generates benign tasks for a frontier model to solve, then uses the solutions in-context to solve the original task. Using these decompositions, we empirically show that adversaries can create vulnerable code, explicit images, python scripts for hacking, and manipulative tweets at much higher rates with combinations of models than either individual model. Our work suggests that even perfectly-aligned frontier systems can enable misuse without ever producing malicious outputs, and that red-teaming efforts should extend beyond single models in isolation.

2447Online Laplacian-Based Representation Learning in Reinforcement Learning

[openreview] [pdf]

Abstract Representation learning plays a crucial role in reinforcement learning, especially in complex environments with high-dimensional and unstructured states. Effective representations can enhance the efficiency of learning algorithms by improving sample efficiency and generalization across tasks. This paper considers the Laplacian-based framework for representation learning, where the eigenvectors of the Laplacian matrix of the underlying transition graph are leveraged to encode meaningful features from raw sensory observations of the states. Despite the promising algorithmic advances in this framework, it remains an open question whether the Laplacian-based representations can be learned online and with theoretical guarantees along with policy learning. To answer this question, we study online Laplacian-based representation learning, where the graph-based representation is updated simultaneously while the policy is updated by the reinforcement learning algorithm. We design an online optimization formulation by introducing the Asymmetric Graph Drawing Objective (AGDO) and provide a theoretical analysis of the convergence of running online projected gradient descent on AGDO under mild assumptions. Specifically, we show that if the policy learning algorithm induces a bounded drift on the policy, running online projected gradient descent on AGDO exhibits ergodic convergence. Our extensive simulation studies empirically validate the guarantees of convergence to the true Laplacian representation. Furthermore, we provide insights into the compatibility of different reinforcement learning algorithms with online representation learning.

2448Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training

[openreview] [pdf]

Abstract This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by identifying and tackling a refusal position bias within safety tuning data, which compromises the models’ ability to appropriately refuse generating unsafe content. We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position, significantly enhancing their safety capabilities. DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation (MLE) with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response sequence. Our empirical evaluation, conducted using LLaMA3 and Mistral model families across six attack scenarios, demonstrates that our method not only improves model safety without compromising performance but also surpasses well-known models such as GPT-4 in defending against attacks. Importantly, our approach successfully defends recent advanced attack methods that have jailbroken GPT-4 and LLaMA3-70B-Instruct.

2449The Benefit of Being Bayesian in Online Conformal Prediction

[openreview] [pdf]

Abstract Based on the framework of Conformal Prediction (CP), we study the online construction of valid confidence sets given a black-box machine learning model. By converting the target confidence levels into quantile levels, the problem can be reduced to predicting the quantiles (in hindsight) of a sequentially revealed data sequence. Two very different approaches have been studied previously:Direct approach.Assuming the data sequence is iid or exchangeable, one could maintain the empirical distribution of the observed data as an algorithmic belief, and directly predict its quantiles.Indirect approach.As statistical assumptions often do not hold in practice, a recent trend is to consider the adversarial setting and apply first-order online optimization to moving quantile losses (Gibbs and Candes, 2021). It requires knowing the target quantile level beforehand, and suffers from certain validity issues on the obtained confidence sets, due to the associated loss linearization.This paper presents a novel Bayesian CP framework that combines their strengths. Without any statistical assumption, it is able to bothanswer multiple arbitrary confidence level queries online, with provably low regret; andovercome the validity issues suffered by first-order optimization baselines, due to being “data-centric” rather than “iterate-centric”.From a technical perspective, our key idea is to regularize the algorithmic belief of the above direct approach by a Bayesian prior, which “robustifies” it by simulating a non-linearizedFollow the Regularized Leader(FTRL) algorithm on the output. For statisticians, this can be regarded as an online adversarial view of Bayesian inference. Importantly, the proposed belief update backbone is shared by prediction heads targeting different confidence levels, bringing practical benefits analogous to the recently proposed concept ofU-calibration(Kleinberg et al., 2023).

2450Oblivious Unlearning by Learning: Machine Unlearning Without Exposing Erased Data

[openreview] [pdf]

Abstract Machine unlearning enables users to remove the influence of their data from trained models, thus protecting their privacy. However, it is paradoxical that most unlearning methods require users first to upload their to-be-removed data to machine learning servers and notify the servers of their unlearning intentions to prepare appropriate unlearning methods. Both unlearned data and unlearning intentions are sensitive user information. Exposing this information to the server for unlearning operations conflicts with the privacy protection goal. In this paper, we investigate the challenge of implementing unlearning without exposing erased data and unlearning intentions to the server. We propose an Oblivious Unlearning by Learning (OUbL) approach to address this privacy-preserving machine unlearning problem. In OUbL, the users construct a new dataset with synthesized unlearning noise, ensuring that once the server continually updates the model using the original learning algorithm based on this dataset, it can implement unlearning. The server does not need to perform any tailored unlearning operation and remains unaware that the constructed samples are for unlearning. As a result, the process is oblivious to the server regarding unlearning intentions. Additionally, by transforming the original erased data into unlearning noise and distributing this noise across numerous auxiliary samples, our approach protects the privacy of the unlearned data while effectively implementing unlearning. The effectiveness of the proposed OUbL method is evaluated through extensive experiments on three representative datasets across various model architectures and four mainstream unlearning benchmarks. The results demonstrate the significant superiority of OUbL over the state-of-the-art privacy-preserving unlearning benchmarks in terms of both privacy protection and unlearning effectiveness.

2451Ensuring Fair Comparisons in Time Series Forecasting: Addressing Quality Issues in Three Benchmark Datasets

[openreview] [pdf]

Abstract Time series forecasting (TSF) is critical in numerous applications; however, unlike other AI domains where benchmark datasets are meticulously standardized, TSF datasets often suffer from data inconsistencies, missing values, and improper temporal splits. These issues have an impact on model performance and evaluation. This paper addresses these challenges by proposing inconsistency-free versions of three well-known TSF datasets. Our methodology involves identifying and correcting data inconsistencies using a combination of linear interpolation and context-aware imputation strategies. Additionally, we introduce a novel cycle-inclusive data splitting method, which respects the longest cycle in each dataset, ensuring that models are evaluated over meaningful temporal patterns. Through extensive testing of multiple transformer-based models, we demonstrate that our revised datasets and cycle-inclusive splitting lead to more accurate and interpretable forecasting results, as well as fairer comparison of TSF models. Finally, our findings highlight the need for proper dataset refinement and tailored data splitting strategies in TSF tasks, and pave the way for future work in the development of more robust forecasting benchmarks.

2452Neural Dynamic Pricing: Provable and Practical Efficiency

[openreview] [pdf]

Abstract Despite theoretical guarantees of existing dynamic pricing (DP) methods, their strong model assumptions may not reflect real-world conditions and are often unverifiable. This poses major challenges in practice since the performance of an algorithm may significantly degrade if the assumptions are not satisfied. Moreover, many DP algorithms show unfavorable empirical performance due to the lack of data efficiency. To address these challenges, we design a practical contextual DP algorithm that utilizes regression oracles. Our proposed algorithm assumes only Lipschitz continuity on the true conditional probability of purchase. We prove O~(T23regretR(T)13)\tilde{\mathcal{O}}(T^{\frac{2}{3}}\text{regret}_R(T)^{\frac{1}{3}}) regret upper bound where TT is the horizon and regretR(T)\text{regret}_R(T) is the regret of the oracle. The bound is nearly minimax optimal in the canonical case of finite function class, and our analysis generically applies to other function approximators including neural networks. To the best of our knowledge, our work is the first algorithm to utilize the powerful generalization capability of neural networks with provable guarantees in dynamic pricing literature. Extensive numerical experiments show that our algorithm outperforms existing state-of-the-art dynamic pricing algorithms in various settings, which demonstrates both provable efficiency and practicality.

2453GIFT-Eval: A Benchmark for General Time Series Forecasting Model Evaluation

[openreview] [pdf]

Abstract Time series foundation models excel in zero-shot forecasting, handling diverse tasks without explicit training. However, the advancement of these models has been hindered by the lack of comprehensive benchmarks. To address this gap, we introduce theGeneral TIme SeriesForecasTing ModelEvaluation,GIFT-EVAL, a pioneering benchmark aimed at promoting evaluation across diverse datasets. GIFT-EVAL encompasses 28 datasets over 144,000 time series and 177 million data points, spanning seven domains, 10 frequencies, multivariate inputs, and prediction lengths ranging from short to long-term forecasts. To facilitate the effective pretraining and evaluation of foundation models, we also provide a non-leaking pretraining dataset containing approximately 230 billion data points. Additionally, we provide a comprehensive analysis of 17 baselines, which includes statistical models, deep learning models, and foundation models. We discuss each model in the context of various benchmark characteristics and offer a qualitative analysis that spans both deep learning and foundation models. We believe the insights from this analysis, along with access to this new standard zero-shot time series forecasting benchmark, will guide future developments in time series foundation models.

2454Towards Constraint-aware Learning for Resource Allocation in NFV-enabled Networks

[openreview] [pdf]

Abstract Virtual Network Embedding (VNE) is a challenging combinatorial optimization problem that refers to resource allocation associated with hard and multifaceted constraints in network function virtualization (NFV). Existing works for VNE struggle to handle such complex constraints, leading to compromised system performance and stability. In this paper, we propose a \textbf{CON}straint-\textbf{A}ware \textbf{L}earning framework for VNE, named \textbf{CONAL}, to achieve efficient constraint management. Concretely, we formulate the VNE problem as a constrained Markov decision process with violation tolerance. This modeling approach aims to improve both resource utilization and solution feasibility by precisely evaluating solution quality and the degree of constraint violation. We also propose a reachability-guided optimization with an adaptive reachability budget method that dynamically assigns budget values. This method achieves persistent zero violation to guarantee the feasibility of VNE solutions and more stable policy optimization by handling instances without any feasible solution. Furthermore, we propose a constraint-aware graph representation method to efficiently learn cross-graph relations and constrained path connectivity in VNE. Finally, extensive experimental results demonstrate the superiority of our proposed method over state-of-the-art baselines. Our code is available at \href{https://anonymous.4open.science/r/iclr25-conal}{https://anonymous.4open.science/r/iclr25-conal}.

2455TODO: Enhancing LLM Alignment with Ternary Preferences

[openreview] [pdf]

Abstract Aligning large language models (LLMs) with human intent is critical for enhancing their performance across a variety of tasks. Standard alignment techniques, such as Direct Preference Optimization (DPO), often rely on the binary Bradley-Terry (BT) model, which can struggle to capture the complexities of human preferences—particularly in the presence of noisy or inconsistent labels and frequent ties. To address these limitations, we introduce the Tie-rank Oriented Bradley-Terry model (TOBT), an extension of the BT model that explicitly incorporates ties, enabling more nuanced preference representation. Building on this, we propose Tie-rank Oriented Direct Preference Optimization (TODO), a novel alignment algorithm that leverages TOBT’s ternary ranking system to improve preference alignment. In evaluations on Mistral-7B and Llama 3-8B models, TODO consistently outperforms DPO in modeling preferences across both in-distribution and out-of-distribution datasets. Additional assessments using MT Bench and benchmarks such as Piqa, ARC-c, and MMLU further demonstrate TODO’s superior alignment performance. Notably, TODO also shows strong results in binary preference alignment, highlighting its versatility and potential for broader integration into LLM alignment. The code for TODO is made publicly available.

2456Constrained Multi-Objective Optimization

[openreview] [pdf]

Abstract There is more and more attention on constrained multi-objective optimization (CMOO) problems, however, most of them are based on gradient-free methods. This paper proposes a constraint gradient-based algorithm for multi-objective optimization (MOO) problems based on multi-gradient descent algorithms. We first establish a framework for the CMOO problem. Then, we provide a Moreau envelope-based Lagrange Multiplier (MLM-CMOO) algorithm to solve the formulated CMOO problem, and the convergence analysis shows that the proposed algorithm convergence to Pareto stationary solutions with a rate of O(1T)\mathcal{O}(\frac{1}{\sqrt{T}}). Finally, the MLM-CMOO algorithm is tested on several CMOO problems and has shown superior results compared to some chosen state-of-the-art designs.

2457Enhancing Robust Fairness via Confusional Spectral Regularization

[openreview] [pdf]

Abstract Recent research has highlighted a critical issue known as ``robust fairness", where robust accuracy varies significantly across different classes, undermining the reliability of deep neural networks (DNNs). A common approach to address this has been to dynamically reweight classes during training, giving more weight to those with lower empirical robust performance. However, our findings reveal a limitation: the class with the worst robust accuracy in training set does not consistently align with that in testing set, indicating the need for a more principled solution. In this work, we derive a robust generalization bound for the worst-class robust error within the PAC-Bayesian framework, accounting for unknown data distributions. Our analysis shows that the worst-class robust error is influenced by two main factors: the spectral norm of the empirical robust confusion matrix and the information embedded in the model and training set. While the latter has been extensively studied, we propose a novel regularization technique targeting the spectral norm of the robust confusion matrix to improve worst-class robust accuracy and enhance robust fairness. We validate our approach through comprehensive experiments on various datasets and models, demonstrating its effectiveness in enhancing robust fairness.

2458MIRAI: Evaluating LLM Agents for International Event Forecasting

[openreview] [pdf]

Abstract We present MIRAI, a benchmark designed to systematically evaluate LLM agents as temporal forecasters to predict international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents’ abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. Notably, MIRAI features a dynamic data construction pipeline that supports periodically downloading recent news and events, and automatically generates the most recent test split. This allows us to evaluate any newly released model in a contamination-free manner as we can always construct a test split later than its knowledge cutoff date. MIRAI comprehensively evaluates the agents’ capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes with both domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and timespan to accurately predict future events. Through comprehensive evaluation, we establish a reliable benchmark for assessing the capabilities of LLM agents in forecasting international events and contribute to the development of more accurate and trustworthy models for international relation analysis.

2459Retraining with Predicted Hard Labels Provably Increases Model Accuracy

[openreview] [pdf]

Abstract The performance of a model trained withnoisy labelsis often improved by simplyretrainingthe model with its own predictedhardlabels (i.e., 1/0 labels). Yet, a detailed theoretical characterization of this phenomenon is lacking. In this paper, we theoretically analyze retraining in a linearly separable setting with randomly corrupted labels given to us and prove that retraining can improve the population accuracy obtained by initially training with the given (noisy) labels. To the best of our knowledge, this is the first such theoretical result. Retraining finds application in improving training with local label differential privacy (DP) which involves training with noisy labels. We empirically show that retraining selectively on the samples for which the predicted label matches the given label significantly improves label DP training atno extra privacy cost; we call thisconsensus-based retraining. As an example, when training ResNet-18 on CIFAR-100 with ϵ=3\epsilon=3 label DP, we obtain 6.4% improvement in accuracy with consensus-based retraining.

2460An Empirical Study on Enhancing LLMs’ Alignment Capabilities through Restyled In-Context Learning Demonstration Examples

[openreview] [pdf]

Abstract Alignment tuning is crucial for ensuring large language models (LLMs) behave safely, ethically, and align with human values. It bridges the gap between raw model capabilities and nuanced task requirements, such as helpfulness and user safety. Current alignment approaches, like instruction-following through supervised fine-tuning (SFT) and preference optimization (PO), require high-quality data and significant resources. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment.Leveraging the autoregressive nature of LLMs, we observed that aligned models adjust the probability distribution of early polarity tokens during decoding, influencing their response trajectory. Among polarity tokens, malicious tokens induce LLMs to positively respond to toxic queries, whereas benign tokens encourage constructive output. Based on this, we designed heuristic rules to select ICL demonstration examples that effectively influence polarity token distributions.We packaged these examples as prompts to trigger few-shot learning, improving LLM alignment. Furthermore, the style and content of ICL demonstrations critically impact few-shot learning. Rewriting examples in a unified, structured style improved LLM accuracy and helpfulness, while specific content encouraged refusal of malicious prompts, enhancing safety.Our experiments show that rewritten examples boost alignment, safety, and reasoning across various tasks. Compared to the best baseline approach, with an average score of 5.00 as the maximum, our method achieves a maximum 0.15 increase on the Alpaca-eval task (from 4.44 → 4.59), a 0.10 enhancement on the just-eval-instruct benchmark (from 4.50 → 4.60), and a maximum improvement of 0.08 (from 3.53 → 3.61) on the MT-Bench dataset. These findings underscore the need for deeper analysis and theoretical understanding of alignment for advancing future LLM research.

2461Ensembles of Low-Rank Expert Adapters

[openreview] [pdf]

Abstract The training and fine-tuning of large language models (LLMs) often involve diverse textual data from multiple sources, which poses challenges due to conflicting gradient directions, hindering optimization and specialization. These challenges can undermine model generalization across tasks, resulting in reduced downstream performance. Recent research suggests that fine-tuning LLMs on carefully selected, task-specific subsets of data can match or even surpass the performance of using the entire dataset. Building on these insights, we propose the Ensembles of Low-Rank Expert Adapters (ELREA) framework to improve the model’s capability to handle diverse tasks. ELREA clusters the training instructions based on their gradient directions, representing different areas of expertise and thereby reducing conflicts during optimization. Expert adapters are then trained on these clusters, utilizing the low-rank adaptation (LoRA) technique to ensure training efficiency and model scalability. During inference, ELREA combines predictions from the most relevant expert adapters based on the input data’s gradient similarity to the training clusters, ensuring optimal adapter selection for each task. Experiments show that our method outperforms baseline LoRA adapters trained on the full dataset and other ensemble approaches with similar training and inference complexity across a range of domain-specific tasks.

2462Enhancing Near OOD Detection in Prompt Learning: Maximum Gains, Minimal Costs

[openreview] [pdf]

Abstract Prompt learning has shown to be an efficient and effective fine-tuning method for vision-language models like CLIP. While numerous studies have focused on the generalisation of these models in few-shot classification, their capability in near out-of-distribution (OOD) detection has been overlooked. A few recent works have highlighted the promising performance of prompt learning in far OOD detection. However, the more challenging task of few-shot near OOD detection has not yet been addressed. In this study, we investigate the near OOD detection capabilities of prompt learning models and observe that commonly used OOD scores have limited performance in near OOD detection. To enhance the performance, we propose a fast and simple post-hoc method that complements existing logit-based scores and can be easily applied to any prompt learning model without change in architecture or model re-training while keeping the same classification accuracy. Our method boosts existing prompt learning methods’ near OOD detection performance in AUROC by up to 11.67% with minimal computational cost. Comprehensive empirical evaluations across 13 datasets and 8 models demonstrate the effectiveness and adaptability of our method.

2463Zero-shot Concept Bottleneck Models via Sparse Regression of Retrieved Concepts

[openreview] [pdf]

Abstract Concept bottleneck models (CBMs) are inherently interpretable neural network models, which explain their final label prediction by high-level semantic \textit{concepts} predicted in the intermediate layers. Previous works of CBMs have succeeded in achieving high-accuracy concept/label predictions without manually collected concept labels by incorporating large language models (LLMs) and vision-language models (VLMs). However, they still require training on the target dataset to learn input-to-concept and concept-to-label correspondences, incurring target dataset collections and training resource requirements. In this paper, we present \textit{zero-shot concept bottleneck models} (Z-CBMs), which are interpretable models predicting labels and concepts in a fully zero-shot manner without training neural networks. Z-CBMs utilize a large-scale concept bank, which is composed of millions of noun phrases extracted from caption datasets, to describe arbitrary input in various domains. To infer the input-to-concept correspondence, we introduce \textit{concept retrieval}, which dynamically searches input-related concepts from the concept bank on the multi-modal feature space of pre-trained VLMs. This enables Z-CBMs to handle the millions of concepts and extract appropriate concepts for each input image. In the concept-to-label inference stage, we apply \textit{concept regression} to select important concepts from the retrieved concept candidates containing noisy concepts related to each other. To this end, concept regression estimates the importance weight of concepts with sparse linear regression approximating the input image feature vectors by the weighted sum of concept feature vectors. Through extensive experiments, we confirm that our Z-CBMs achieve both high target task performance and interpretability without any additional training.

2464Transformer Training Instability of Softmax and Lipschitz-Kernel Attentions

[openreview] [pdf]

Abstract Transformers have been making significant progress across various domains, and recently, with scaling up of models like LLMs, they have achieved even greater success. Recent findings have shown that the softmax function in the self-attention used to re-weight the attention logits into probability vectors causes \emph{attention entropy collapse}, where the attention is concentrated on a single token, and it leads to unstable training. In this work, we first demonstrate that the (non-Lipschitz) softmax-based attention leads to the attention entropy collapse but the \emph{Lipschitz-kernel}-based attention does not. We show that the Lipschitzness of the attention plays an important role in keeping the attention entropy stable regardless of the variance of the attention logits. Moreover, we argue that the underlying reason why the attention entropy collapse leads to the training instability is that as the attention probabilities become more concentrated, it causes the attention matrix to gradually increase, leading to gradient exploding.

2465Multi-domain Distribution Learning for De Novo Drug Design

[openreview] [pdf]

Abstract We introduce DrugFlow, a generative model for structure-based drug design that integrates continuous flow matching with discrete Markov bridges, demonstrating state-of-the-art performance in learning chemical, geometric, and physical aspects of three-dimensional protein-ligand data. We endow DrugFlow with an uncertainty estimate that is able to detect out-of-distribution samples. To further enhance the sampling process towards distribution regions with desirable metric values, we propose a joint preference alignment scheme applicable to both flow matching and Markov bridge frameworks. Furthermore, we extend our model to also explore the conformational landscape of the protein by jointly sampling side chain angles and molecules.

2466Searching for Optimal Solutions with LLMs via Bayesian Optimization

[openreview] [pdf]

Abstract Efficient scaling of test-time compute to search for optimal solutions is an important step towards building generally-capable language models that can reason. Recent work, however, shows that tasks of varying complexity require distinct search strategies to solve optimally, thus making it challenging to design a one-size-fits-all approach. Prior solutions either attempt to predict task difficulty to select the optimal search strategy, often infeasible in practice, or use a static, pre-defined strategy, e.g., repeated parallel sampling or greedy sequential search, which is sub-optimal. In this work, we argue for an alternative view that dynamically adapts the search strategy to changing estimates of the uncertainty in the search space with each round of generation via the probabilistic framework of Bayesian optimization (BO). To this end, we introduce Bayesian-OPRO (BOPRO)---a generalization of a recent method for in-context optimization, which iteratively samples from new proposal distributions by prompting the LLM with a subset of its previous generations selected to explore different parts of the search space. We evaluate our method on a word-search task called Semantle and the joint task of hypothesis search cum program synthesis using a one-dimensional version of the challenging Abstraction and Reasoning Corpus (1D-ARC) to find that BOPRO trails a strong greedy baseline in aggregate. Our analysis of the behaviors exhibited by each method reveals, nonetheless, that BOPRO demonstrates desirable properties essential for building a general solution for search, and we conclude by identifying key areas for future research to address its current limitations.

2467Aligning Teacher with Student Preferences for Tailored Instruction Tuning Dataset Generation

[openreview] [pdf]

Abstract Enhancing the reasoning abilities of lightweight language models (LMs) for tasks like decision-making often relies on instruction-tuning, a method that trains LMs to mimic the reasoning process using labeled question-rationale pairs, known as instruction-tuning datasets, which are typically generated by more powerful teacher LMs. However, current methods for generating these instruction-tuning datasets tend to focus solely on the quality of the questions and rationales from the teacher model’s perspective, often neglecting the learning preferences of the student language model. To fill this gap, we proposeARTE(Aligning TeacheRwith StudenTPreferencEs), a novel framework that adapts the teacher LM’s outputs to the student’s preferences, inspired by “responsive teaching” in pedagogy. Our method involves three key steps: (1) generating draft question-rationale pairs from the teacher model, (2) collecting the student’s preferences on these draft pairs via one-shot in-context learning, and (3) aligning the teacher model using Direct Preference Optimization (DPO), then finally curating tailored question-rationale pairs from the aligned teacher for student training. Through extensive experiments on academic reasoning benchmarks, we demonstrate that student models fine-tuned with tailored datasets by ARTE achieve significant improvements across various reasoning tasks, outperforming existing instruction-tuning datasets. Moreover, we thoroughly investigate the generalization of ARTE, including the generalization of fine-tuned student models in reasoning ability and the generalization of aligned teacher models to generate tailored training data across tasks and students.

2468WISDOM: Progressive Curriculum Synthesis Makes LLMs Better Mathematical Reasoner

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of problem-solving tasks. Despite their success, LLMs still face significant challenges in complex reasoning, particularly with advanced mathematical problems. These problems require not only a deep understanding of task descriptions but also sophisticated logical and mathematical reasoning to determine the correct solution path, which is often lacking in the existing synthetic data. To address this gap, we introduce WISDOM, which draws inspiration from the human learning process and employs curriculum learning to gradually synthesize high-quality CoT data from easy to hard. Our goal is to guide LLM training and improve reasoning capabilities by progressively exposing models to increasingly challenging problems. Based on the synthesized data, we further fine-tune and develop the WISDOM series models, achieving significant improvements across multiple mathematical reasoning benchmarks. Notably, WISDOM-7B (DSMath) achieves a score of 62.4% on MATH, matching GPT-4’s performance with 2/30 correct answers on AIME2024. Furthermore, WISDOM-70B (Llama3) outperforms GPT-4 on AIME2024 with 3/30 correct answers, demonstrating its potential as a better mathematical reasoner. More data and models will be available athttps://anonymous.4open.science/r/Wisdom-math-377B

2469Reweighting Local Mimina with Tilted SAM

[openreview] [pdf]

Abstract Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that are flatter and that incur larger losses. TSAM is parameterized by a tilt hyperparameter tt and reduces to SAM as tt approaches infinity. We prove that (1) the TSAM objective is smoother than SAM and thus easier to optimize; and (2) TSAM explicitly favors flatter minima as tt increases. This is desirable as flatter minima could have better generalization properties for certain tasks. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks.

2470Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts

[openreview] [pdf]

Abstract The training data for many Large Language Models (LLMs) is contaminated with test data. This means that public benchmarks used to assess LLMs are compromised, suggesting a performance gap between benchmark scores and actual capabilities. Ideally, a private holdout set could be used to accurately verify scores. Unfortunately, such datasets do not exist for most benchmarks, and post-hoc construction of sufficiently similar datasets is non-trivial. To address these issues, we introduce a systematic methodology for (i) retrospectively constructing a holdout dataset for a target dataset, (ii) demonstrating the statistical indistinguishability of this retro-holdout dataset, and (iii) comparing LLMs on the two datasets to quantify the performance gap due to the dataset’s public availability. Applying these methods to TruthfulQA, we construct and release Retro-Misconceptions, on which we evaluate twenty LLMs and find that some have inflated scores by as much as 16 percentage points. Our results demonstrate that public benchmark scores do not always accurately assess model properties, and underscore the importance of improved data practices in the field.

2471CBM-zero: Concept Bottleneck Model With Zero Performance Loss

[openreview] [pdf]

Abstract Interpreting machine learning models with high-level, human-understandable \emph{concepts} has gained increasing interest. The concept bottleneck model (CBM) is a popular approach to providing interpretable models, relying on first predicting the presence of concepts in a given input, and then using these concept scores to predict a label of interest. Yet, CBMs suffer from lower accuracy compared with standard black-box models, as they use a surrogate (and thus, interpretable) predictor in lieu of the original model. In this work, we propose an approach that allows us to find a CBM in any standard black-box model via an invertible mapping from its latent space to an interpretable concept space. This method preserves the original black-box model’s prediction and thus has zero performance drop while providing human-understandable explanations. We evaluate the accuracy and interpretability of our method across various benchmarks, demonstrating state-of-the-art explainability metrics while enjoying superior accuracy.

2472Linear Bandits with Partially Observable Features

[openreview] [pdf]

Abstract We introduce a novel linear bandit problem where a subset of features is latent, resulting in partial access to reward information and spurious estimates. Without properly addressing the latent features, the regret grows linearly over the decision epoch TT while improving the regret bound is challenging because their dimension and relationship with rewards are not available. We propose a novel analysis to handle the latent features and an algorithm that achieves a regret bound sublinear in TT. The core of the algorithm lies in (i) augmenting basis vectors orthogonal to the observable feature space, and (ii) developing an efficient doubly robust estimator that further improves the regret bound. With these two ingredients, our algorithm achieves a regret bound of O~((d+dh)T)\tilde{O}(\sqrt{(d + d_h)T}), where dd is the dimension of observable features, and dhd_h is the \emph{unknown} dimension of the unobserved features that affects the reward. Crucially, our algorithm does not rely on prior knowledge of the unobserved feature space, which expands as more features become hidden. Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandits and other linear bandit algorithms.

2473Towards Auto-Regressive Next-Token Prediction: In-context Learning Emerges from Generalization

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable in-context learning (ICL) abilities. However, existing theoretical analysis of ICL primarily exhibits two limitations: \textbf{(a) Limited \textit{i.i.d.} Setting.} Most studies focus on supervised function learning tasks where prompts are constructed with \textit{i.i.d.} input-label pairs. This \textit{i.i.d.} assumption diverges significantly from real language learning scenarios where prompt tokens are interdependent. \textbf{(b) Lack of Emergence Explanation.} Most literature answers \textbf{\textit{what}} ICL does from an implicit optimization perspective but falls short in elucidating \textbf{\textit{how}} ICL emerges and the impact of pre-training phase on ICL. In our paper, to extend (a), we adopt a more practical paradigm, \textbf{\textit{auto-regressive next-token prediction (AR-NTP)}}, which closely aligns with the actual training of language models. Specifically, within AR-NTP, we emphasize prompt token-dependency, which involves predicting each subsequent token based on the preceding sequence. To address (b), we formalize a systematic pre-training and ICL framework, highlighting the layer-wise structure of sequences and topics, alongside a two-level expectation. In conclusion, we present data-dependent, topic-dependent and optimization-dependent PAC-Bayesian generalization bounds for pre-trained LLMs, investigating that \textbf{\textit{ICL emerges from the generalization of sequences and topics}}. Our theory is supported by experiments on numerical linear dynamic systems, synthetic GINC and real-world language datasets.

2474Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models

[openreview] [pdf]

Abstract Large language models (LLMs) have been widely adopted for training embedding and ranking models, with recent advancements significantly improving the performance of Information Retrieval systems. However, while instruction-following is a core capability of LLMs, their ability to handle detailed user instructions has not been thoroughly investigated in search models. This study evaluates the instruction-following capabilities of various retrieval models, including LLM-based dense retrieval and reranking models. We develop a specialized benchmark InFoSearch spanning six dimensions: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics to assess the models’ responsiveness to instructions. Our findings show that even fine-tuned retrieval models struggle with instruction-following, highlighting a key limitation in current systems and providing valuable insights for improving their instruction-aware capabilities in future research.

2475Reflection on Knowledge Graph for Large Language Models Reasoning

[openreview] [pdf]

Abstract Recent studies have highlighted the potential benefits of supplementing Large Language Models (LLMs) with information retrieved from knowledge graphs to enhance their performance. However, current approaches often introduce additional noise in the pipeline process of knowledge retrieval and reasoning, leading to the accumulation of errors, impeding LLMs from effectively combining the external knowledge in answering complex multi-hop questions. To this end, we introduce RefKG, an innovative framework specifically crafted to enhance the reasoning capabilities of LLMs through reflective engagement with knowledge graphs. In particular, RefKG autonomously conduct retrieval and reflection on knowledge graphs. Its reasoning process includes four steps: decomposing complex queries, retrieving and pruning evidence subgraphs, generating textual evidence, and evidence-enhanced reasoning. To enhance the alignment of LLMs with external knowledge, we have developed a multi-task tuning strategy that not only infuses knowledge to LLMs but also teaches them how to utilize the knowledge in answering questions, thereby significantly improving their ability to handle knowledge-intensive tasks. Experimental results on fact verification and knowledge graph question answering tasks demonstrate that RefKG outperforms previous state-of-the-art models.

2476LOB-Bench: Benchmarking Generative AI for Finance - with an Application to Limit Order Book Markets

[openreview] [pdf]

Abstract We present LOB-Bench, a benchmark designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB). We enable a rigorous and comprehensive model comparison by providing both a theoretical framework and an open-source Python package. Addressing the lack of consensus on evaluation paradigms in the literature, where qualitative comparison of stylized facts is prevalent, our work offers a crucial building block for advancing generative AI for financial data. LOB-Bench provides a standardized method to numerically assess the quality of various model classes that generate limit order book data in the widely used LOBSTER format. It provides a range of quantitative characteristics and uses a simple parametric benchmark model as a baseline. Potential model classes include autoregressive models, (C)GANs, and agent-based models. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting a flexible multivariate statistical evaluation. The benchmark features commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with adversarial scores derived from a neural network trained to differentiate between real and generated data. Additionally, LOB-Bench evaluates “market impact metrics” by computing cross-correlations and price response functions for specific events in the data. We present empirical benchmark results for a generative autoregressive state-space model and results for a (C)GAN and parametric LOB model. We find that the autoregressive GenAI approach beats traditional model classes.

2477Assessing Open-world Forgetting in Generative Image Model Customization

[openreview] [pdf]

Abstract Recent advances in diffusion models have significantly enhanced image generation capabilities. However, customizing these models with new classes often leads to unintended consequences that compromise their reliability. We introduce the concept ofopen-world forgettingto emphasize the vast scope of these unintended alterations, contrasting it with the well-studiedclosed-world forgetting, which is measurable by evaluating performance on a limited set of classes or skills. Our research presents the first comprehensive investigation into open-world forgetting in diffusion models, focusing on semantic and appearance drift of representations. We utilize zero-shot classification to analyze semantic drift, revealing that even minor model adaptations lead to unpredictable shifts affecting areas far beyond newly introduced concepts, with dramatic drops in zero-shot classification of up to 60%. Additionally, we observe significant changes in texture and color of generated content when analyzing appearance drift. To address these issues, we propose a mitigation strategy based on functional regularization, designed to preserve original capabilities while accommodating new concepts. Our study aims to raise awareness of unintended changes due to model customization and advocates for the analysis of open-world forgetting in future research on model customization and finetuning methods. Furthermore, we provide insights for developing more robust adaptation methodologies.

2478Multi-View Graph Neural Networks with Language Models for Mutli-Source Recommender Systems

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have become increasingly popular in recommender systems due to their ability to model complex user-item relationships. However, current GNN-based approaches face several challenges: They primarily rely on sparse user-item interaction data, which can lead to overfitting and limit generalization performance. Moreover, they often overlook additional valuable information sources, such as social trust and user reviews, which can provide deeper insights into user preferences and enhance recommendation accuracy. To address these limitations, we propose a multi-view GNN framework that integrates diverse information sources using contrastive learning and language models. Our method employs a lightweight Graph Convolutional Network (LightGCN) on user-item interactions to generate initial user and item representations. We use an attention mechanism for the user view to integrate social trust information with user-generated textual reviews, which are transformed into high-dimensional vectors using a pre-trained language model. Similarly, we aggregate all reviews associated with each item and use language models to generate item representations for the item view. We then construct an item graph by applying a meta-path to the user-item interactions. GCNs are applied to both the social trust network and the item graph, generating enriched embeddings for users and items. To align and unify these heterogeneous data sources, we employ a contrastive learning mechanism that ensures consistent and complementary representations across different views. Experimental results on multiple real-world datasets such as Epinions, Yelp, and Ciao demonstrate significant performance improvements over state-of-the-art methods.

2479Vertical Federated Learning with Missing Features During Training and Inference

[openreview] [pdf]

Abstract Vertical federated learning trains models from feature-partitioned datasets across multiple clients, who collaborate without sharing their local data. Standard approaches assume that all feature partitions are available during both training and inference. Yet, in practice, this assumption rarely holds, as for many samples only a subset of the clients observe their partition. However, not utilizing incomplete samples during training harms generalization, and not supporting them during inference limits the utility of the model. Moreover, if after training any client leaves the federation, its partition becomes unavailable, rendering the learned model unusable. Missing feature blocks are therefore a key challenge limiting the applicability of vertical federated learning in real-world scenarios. To address this, we propose \texttt{LASER-VFL}, a vertical federated learning method for efficient training and inference of split neural network-based models that is capable of handling arbitrary sets of partitions. Our approach is simple yet effective, relying on the strategic sharing of model parameters and on task-sampling to train a family of predictors. We show that \texttt{LASER-VFL} achieves a O(1/T)\mathcal{O}({1}/{\sqrt{T}}) convergence rate for nonconvex objectives in general, O(1/T)\mathcal{O}({1}/{T}) for sufficiently large batch sizes, and, under the Polyak-{\L}ojasiewicz inequality, linear convergence. Numerical experiments show improved performance of \texttt{LASER-VFL} over the baselines. Remarkably, this is the case even in the absence of missing features. For example, for CIFAR-100, we see an improvement in accuracy of 21.4% when each of four feature blocks is observed with a probability of 0.5 and of 12.2% when all features are observed.

2480Interpreting Emergent Planning in Model-Free Reinforcement Learning

[openreview] [pdf]

Abstract We present the first mechanistic evidence that model-free reinforcement learning agents can learn to plan. This is achieved by applying a methodology based on concept-based interpretability to a model-free agent in Sokoban -- a commonly used benchmark for studying planning. Specifically, we demonstrate that DRC, a generic model-free agent introduced byGuez et al. (2019), uses learned concept representations to internally formulate plans that both predict the long-term effects of actions on the environment and influence action selection. Our methodology involves: (1) probing for planning-relevant concepts, (2) investigating plan formation within the agent’s representations, and (3) verifying that discovered plans (in agent’s representations) have causal effect on agent’s behavior through interventions. We also show that the emergence of these plans coincides with the emergence of a planning-like property: the ability to benefit from additional test-time compute. Finally, we perform a qualitative analysis of the planning algorithm learned by the agent and discover a strong resemblance to parallelized bidirectional search. Our findings advance understanding of the internal mechanisms underlying planning behavior in agents, enabling improved diagnosis, interpretation, and control of agent planning processes.

2481Building Math Agents with Multi-Turn Iterative Preference Learning

[openreview] [pdf]

Abstract Recent studies have shown that large language models’ (LLMs) mathematical problem-solving capabilities can be enhanced by integrating external tools, such as code interpreters, and employing multi-turn Chain-of-Thought (CoT) reasoning. While current methods focus on synthetic data generation and Supervised Fine-Tuning (SFT), this paper studies the complementary direct preference learning approach to further improve model performance. However, existing direct preference learning algorithms are originally designed for the single-turn chat task, and do not fully address the complexities of multi-turn reasoning and external tool integration required for tool-integrated mathematical reasoning tasks. To fill in this gap, we introduce a multi-turn direct preference learning framework, tailored for this context, that leverages feedback from code interpreters and optimizes trajectory-level preferences. This framework includes multi-turn DPO and multi-turn KTO as specific implementations. The effectiveness of our framework is validated through training of various language models using an augmented prompt set from the GSM8K and MATH datasets. Our results demonstrate substantial improvements: a supervised fine-tuned Gemma-1.1-it-7B model’s performance increased from 77.5% to 83.9% on GSM8K and from 46.1% to 51.2% on MATH. Similarly, a Gemma-2-it-9B model improved from 84.1% to 86.3% on GSM8K and from 51.0% to 54.5% on MATH.

2482LESS IS MORE: HIGH-VALUE DATA SELECTION FOR VISUAL INSTRUCTION TUNING

[openreview] [pdf]

Abstract Visual instruction tuning is the key to building large vision language mod- els (LVLMs), which can greatly improve the task generalization and solving capa- bilities by learning a mixture of instruction data from diverse visual tasks. Previ- ous work mostly collects multiple existing visual instruction datasets via heuristic ways for training (even more than a million instructions), which may introduce data redundancy and enlarge the training cost. To investigate this issue, we con- duct a series of empirical studies, which reveal a significant redundancy within the visual instruction datasets, and show that greatly reducing the amount of instruc- tions from several tasks even do not affect the performance. Based on the findings, we propose a high-value data selection approach TIVE\textbf{TIVE}, to eliminate redundancy within the visual instruction data and reduce the training cost. In TIVE, we first estimate the instance influence score on its corresponding task, and the task dif- ficulty score, based on the gradient-based influence functions. Then, we leverage the two kinds of scores to determine the task proportion within the selected visual instruction subset, and select high-value instances for each task, respectively. Ex- periments on various LVLMs show that our approach using only about 15% data can achieve comparable average performance to the full-data fine-tuned model across eight benchmarks, even surpassing it on four of the benchmarks. Our code and data will be publicly released.

2483Understanding Contrastive Learning through Variational Analysis and Neural Network Optimization Perspectives

[openreview] [pdf]

Abstract The SimCLR method for contrastive learning of invariant visual representations has become extensively used in supervised, semi-supervised, and unsupervised settings, due to its ability to uncover patterns and structures in image data that are not directly present in the pixel representations. However, the reason for this success is not well-explained, since it is not guaranteed by invariance alone. In this paper, we conduct a mathematical analysis of the SimCLR method with the goal of better understanding the geometric properties of the learned latent distribution. Our findings reveal two things: (1) the SimCLR loss alone is not sufficient to select a “good” minimizer --- there are minimizers that give trivial latent distributions, even when the original data is highly clustered --- and (2) in order to understand the success of contrastive learning methods like SimCLR, it is necessary to analyze the neural network training dynamics induced by minimizing a contrastive learning loss. Our preliminary analysis for a one-hidden layer neural network shows that clustering structure can present itself for a substantial period of time during training, even if it eventually converges to a trivial minimizer. To substantiate our theoretical insights, we present numerical results that confirm our theoretical predictions.

2484Interpreting Language Reward Models via Contrastive Explanations

[openreview] [pdf]

Abstract Reward models (RMs) are a crucial component in the alignment of large language models’ (LLMs) outputs with human values. RMs approximate human preferences over possible LLM responses to the same prompt by predicting and comparing reward scores. However, as they are typically modified versions of LLMs with scalar output heads, RMs are large black boxes whose predictions are not explainable. More transparent RMs would enable improved trust in the alignment of LLMs. In this work, we propose to use contrastive explanations to explain any binary response comparison made by an RM. Specifically, we generate a diverse set of new comparisons similar to the original one to characterise the RM’s local behaviour. The perturbed responses forming the new comparisons are generated to explicitly modify manually specified high-level evaluation attributes, on which analyses of RM behaviour are grounded. In quantitative experiments, we validate the effectiveness of our method for finding high-quality contrastive explanations. We then showcase the qualitative usefulness of our method for investigating global sensitivity of RMs to each evaluation attribute, and demonstrate how representative examples can be automatically extracted to explain and compare behaviours of different RMs. We see our method as a flexible framework for RM explanation, providing a basis for more interpretable and trustworthy LLM alignment.

2485Overcoming Knowledge Barriers: Online Imitation Learning from Visual Observation with Pretrained World Models

[openreview] [pdf]

Abstract Pretraining and finetuning models has become increasingly popular in decision-making. But there are still serious impediments in Imitation Learning from Observation (ILfO) with pretrained models. This study identifies two primary obstacles: the Embodiment Knowledge Barrier (EKB) and the Demonstration Knowledge Barrier (DKB). The EKB emerges due to the pretrained models’ limitations in handling novel observations, which leads to inaccurate action inference. Conversely, the DKB stems from the reliance on limited demonstration datasets, restricting the model’s adaptability across diverse scenarios. We propose separate solutions to overcome each barrier and apply them to Action Inference by Maximising Evidence (AIME), a state-of-the-art algorithm. This new algorithm, AIME-NoB, integrates online interactions and a data-driven regulariser to mitigate the EKB. Additionally, it uses a surrogate reward function to broaden the policy’s supported states, addressing the DKB. Our experiments on vision-based control tasks from the DeepMind Control Suite and MetaWorld benchmarks show that AIME-NoB significantly improves sample efficiency and converged performance, presenting a robust framework for overcoming the challenges in ILfO with pretrained models.

2486Locality Sensitive Avatars From Video

[openreview] [pdf]

Abstract We present locality-sensitive avatar, a neural radiance field (NeRF) based network to learn human motions from monocular videos. To this end, we estimate a canonical representation between different frames of a video with a non-linear mapping from observation to canonical space, which we decompose into a skeletal rigid motion and a non-rigid counterpart. Our key contribution is to retain fine-grained details by modeling the non-rigid part with a graph neural network (GNN) that keeps the pose information local to neighboring body parts. Compared to former canonical representation based methods which solely operate on the coordinate space of a whole shape, our locality-sensitive motion modeling can reproduce both realistic shape contours and vivid fine-grained details. We evaluate on ZJU-MoCap, ActorsHQ, SynWild, and various outdoor videos. The experiments reveal that with the locality sensitive deformation to canonical feature space, we are the first to achieve state-of-the-art results across novel view synthesis, novel pose animation and 3D shape reconstruction simultaneously. For reproducibility, the code will be available upon publication.

2487Does Graph Prompt Work? A Data Operation Perspective with Theoretical Analysis

[openreview] [pdf]

Abstract In recent years, graph prompting has emerged as a promising research direction, enabling the learning of additional tokens or subgraphs appended to original graphs without requiring retraining of pre-trained graph models across various applications. This novel paradigm, shifting from the traditional “pre-training and fine-tuning” to “pre-training and prompting,” has shown significant empirical success in simulating graph data operations, with applications ranging from recommendation systems to biological networks and graph transferring. However, despite its potential, the theoretical underpinnings of graph prompting remain underexplored, raising critical questions about its fundamental effectiveness. The lack of rigorous theoretical proof of why and how much it works is more like a “dark cloud” over the graph prompting area for deeper research. To fill this gap, this paper introduces a theoretical framework that rigorously analyzes graph prompting from a data operation perspective. Our contributions are threefold:First, we provide a formal guarantee theorem, demonstrating graph prompts’ capacity to approximate graph transformation operators, effectively linking upstream and downstream tasks.Second, we derive upper bounds on the error of these data operations for a single graph and extend this discussion to batches of graphs, which are common in graph model training.Third, we analyze the distribution of data operation errors, extending our theoretical findings from linear graph models (e.g., GCN) to non-linear graph models (e.g., GAT). Extensive experiments support our theoretical results and confirm the practical implications of these guarantees.

2488Novel View Acoustic Parameter Estimation

[openreview] [pdf]

Abstract The task of Novel View Acoustic Synthesis (NVAS) -- generating Room Impulse Responses (RIRs) for unseen source and receiver positions in a scene -- has recently gained traction, especially given its relevance to Augmented Reality (AR) and Virtual Reality (VR) development. However, many of these efforts suffer from similar limitations: they infer RIRs in the time domain, which prove challenging to optimize; they focus on scenes with simple, single-room geometries; they infer only single-channel, directionally-independent acoustic characteristics; and they require inputs, such as 3D geometry meshes with material properties, that may be impractical to obtain for on-device applications. On the other hand, research suggests that sample-wise accuracy of RIRs is not required for perceptual plausibility in AR and VR. Standard acoustic parameters like Clarity Index (C50) or Reverberation Time (T60) have been shown to capably describe pertinent characteristics of the RIRs, especially for late reverberation. To address these gaps, this paper introduces a new task centered on estimating spatially distributed acoustic parameters that can be then used to condition a simple reverberator for arbitrary source and receiver positions. The approach is modelled as an image-to-image translation task, which translates 2D floormaps of a scene into 2D heatmaps of acoustic parameters. We introduce a new, large-scale dataset of 1000 rooms consisting of complex, multi-room apartment conditions, and show that our method outperforms statistical baselines significantly. Moreover, we show that the method also works for directionally-dependent (i.e. beamformed) parameter prediction. Finally, the proposed method operates on very limited information, requiring only a broad outline of the scene and a single RIR at inference time.

2489Challenge Me: Enhancing Conversational Consistency of LLMs by Learning with Questioning Feedback

[openreview] [pdf]

Abstract As Large Language Models (LLMs) increasingly integrate into critical decision-support systems, ensuring their conversational consistency becomes paramount for reliable and trustworthy AI-assisted services, especially in high-stakes domains such as healthcare and legal advice. In this work, we study the critical issue of conversational inconsistency in LLMs, where models provide contradictory information across multiple dialogue turns. We introduce a novel Conversationally Consistent Supervised Fine-Tuning (CC-SFT) method that explicitly accounts for two-turn conversations. Our approach combines a first-round loss, a second-round loss, and a consistency loss based on Wasserstein distance to encourage coherent responses across turns. We evaluate our method on three diverse datasets (OpenBookQA, GSM8K, and MedQA-USMLE) using three LLMs (Llama v3.1, Mistral AI, and Gemma). Experimental results demonstrate that CC-SFT significantly reduces conversational inconsistency compared to standard fine-tuning, with lower flipping rates and improved accuracy in second-round responses. We provide theoretical convergence guarantees for our method and analyze the impact of the consistency loss coefficient. Our code is publicly available at \url{https://github.com/anonymous4science/llm_conversational_consistency}.

2490Imit-Diff: Semantics Guided Diffusion Transformer with Dual Resolution Fusion for Imitation Learning

[openreview] [pdf]

Abstract Diffusion-based methods have become one of the most important paradigms in the field of imitation learning. However, even in state-of-the-art diffusion-based policies, there has been insufficient focus on semantics and fine-grained feature extraction, resulting in weaker generalization and a reliance on controlled environments. To address this issue, we propose Imit-Diff, which consists of three key components: 1) Dual Resolution Fusion for extracting fine-grained features with a manageable number of tokens by integrating high-resolution features into low-resolution visual embedding through an attention mechanism; 2) Semantics Injection to explicitly incorporate semantic information by using prior masks obtained from open vocabulary models, achieving a world-level understanding of imitation learning tasks; and 3) Consistency Policy on Diffusion Transformer to reduce the inference time of diffusion models by training a student model to implement few-step denoising on the Probability Flow ODE trajectory. Experimental results show that our method significantly outperforms state-of-the-art methods, especially in cluttered scenes, and is highly robust to task interruptions. The code will be publicly available.

2491Multimodal Situational Safety

[openreview] [pdf]

Abstract Multimodal Large Language Models (MLLMs) are rapidly evolving, demonstrating impressive capabilities as multimodal assistants that interact with both humans and their environments. However, this increased sophistication introduces significant safety concerns. In this paper, we present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety, which explores how safety considerations vary based on the specific situation in which the user or agent is engaged. We argue that for an MLLM to respond safely—whether through language or action—it often needs to assess the safety implications of a language query within its corresponding visual context. To evaluate this capability, we develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs. The dataset comprises 1,820 language query-image pairs, half of which the image context is safe, and the other half is unsafe. We also develop an evaluation framework that analyzes key safety aspects, including explicit safety reasoning, visual understanding, and, crucially, situational safety reasoning. Our findings reveal that current MLLMs struggle with this nuanced safety problem in the instruction-following setting and struggle to tackle these situational safety challenges all at once, highlighting a key area for future research. Furthermore, we develop multi-agent pipelines to coordinately solve safety challenges, which shows consistent improvement in safety over the original MLLM response.

2492iFedDR: Auto-Tuning Local Computation with Inexact Douglas-Rachford Splitting in Federated Learning

[openreview] [pdf]

Abstract Federated learning usually requires specifying the amount of local computation needed a priori. In this work, we instead propose a systematic scheme to automatically adjust and potentially reduce the local computations while preserving convergence guarantees. We focus on proximal-based methods, where we demonstrate that the proximal operator can be evaluated inexactly up to a relative error, rather than relying on a predefined sequence of vanishing errors. Our proposed method, iFedDR, is based on a novel error-corrected version of inexact Douglas-Rachford splitting. It mitigates the need for hyperparameter tuning the number of client steps, by triggering refinement on-demand. We derive iFedDR as an instance of a much more general construction, which allows us to handle minimax problem, and which is interesting in its own right. Several numerical experiments are carried out demonstrating the favorable convergence properties of iFedDR.

2493Customized Procedure Planning in Instructional Videos

[openreview] [pdf]

Abstract Generating customized procedures for task planning in instructional videos poses a unique challenge for vision-language models. In this paper, we introduce Customized Procedure Planning in Instructional Videos, a novel task that focuses on generating a sequence of detailed action steps for task completion based on user requirements and the task’s initial visual state. Existing methods often neglect customization and user directions, limiting their real-world applicability. The absence of instructional video datasets with step-level state and video-specific action plan annotations has hindered progress in this domain. To address these challenges, we introduce the Customized Procedure Planner (CPP) framework, a causal, open-vocabulary model that leverages a LlaVA-based approach to predict procedural plans based on a task’s initial visual state and user directions. To overcome the data limitation, we employ a weakly-supervised approach, using the strong vision-language model GEMINI and the large language model (LLM) GPT-4 to create detailed video-specific action plans from the benchmark instructional video datasets (COIN, CrossTask), producing pseudo-labels for training. Discussing the limitations of the existing procedure planning evaluation metrics in an open-vocabulary setting, we propose novel automatic LLM-based metrics with few-shot in-context learning to evaluate the customization and planning capabilities of our model, setting a strong baseline. Additionally, we implement an LLM-based objective function to enhance model training for improved customization. Extensive experiments, including human evaluations, demonstrate the effectiveness of our approach, establishing a strong baseline for future research in customized procedure planning.

2494Measuring LLM Confidence through Stable Explanations

[openreview] [pdf]

Abstract In many high-risk machine learning applications it is essential for a model to indicate when it is uncertain about a prediction. While large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, their overconfidence in incorrect responses is still a well-documented failure mode. Traditional methods for ML uncertainty quantification can be difficult to directly adapt to LLMs due to the computational cost of implementation and closed-source nature of many models. A variety of black-box methods have recently been proposed, but these often rely on heuristics such as self-verbalized confidence. We instead propose a framework for measuring an LLM’s uncertainty with respect to the distribution of generated explanations for an answer. While utilizing explanations is not a new idea in and of itself, by interpreting each possible model+explanation pair as a test-time classifier we can calculate a posterior answer distribution over the most likely of these classifiers. We demonstrate how a specific instance of this framework using explanation entailment as our classifier likelihood improves confidence score metrics (in particular AURC and AUROC) over baselines across five different datasets. We believe these results indicate that our framework is both a well-principled and effective way of quantifying uncertainty in LLMs.

2495Democratic Training Against Universal Adversarial Perturbations

[openreview] [pdf]

Abstract Despite their advances and success, real-world deep neural networks are known to be vulnerable to adversarial attacks. Universal adversarial perturbation, an input-agnostic attack, poses a serious threat for them to be deployed in security-sensitive systems. In this case, a single universal adversarial perturbation deceives the model on a range of clean inputs without requiring input-specific optimization, which makes it particularly threatening. In this work, we observe that universal adversarial perturbations usually lead to abnormal entropy spectrum in hidden layers, which suggests that the prediction is dominated by a small number of ``feature’’ in such cases (rather than democratically by many features). Inspired by this, we propose an efficient yet effective defense method for mitigating UAPs called Democratic Training by performing entropy-based model enhancement to suppress the effect of the universal adversarial perturbations in a given model. \emph{Democratic Training} is evaluated with 6 neural networks trained on 4 benchmark datasets and 4 types of state-of-the-art universal adversarial attack methods. The results show that it effectively reduces the attack success rate, improves model robustness and preserves the model accuracy on clean samples.

2496Derail Yourself: Multi-turn LLM Jailbreak Attack through self-discovered clues

[openreview] [pdf]

Abstract This study exposes the safety vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, where malicious users can obscure harmful intents across several queries. We introduce ActorAttack, a novel multi-turn attack method inspired by actor-network theory, which models a network of semantically linked actors as attack clues to generate diverse and effective attack paths toward harmful targets. ActorAttack addresses two main challenges in multi-turn attacks: (1) concealing harmful intents by creating an innocuous conversation topic about the actor, and (2) uncovering diverse attack paths towards the same harmful target by leveraging LLMs’ knowledge to specify the correlated actors as various attack clues. In this way, ActorAttack outperforms existing single-turn and multi-turn attack methods across advanced aligned LLMs, even for GPT-o1. We will publish a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data, generated by ActorAttack. We demonstrate that models safety-tuned using our safety dataset are more robust to multi-turn attacks.

2497FlexDrive: Toward Trajectory Flexibility in Driving Scene Reconstruction and Rendering

[openreview] [pdf]

Abstract Driving scene reconstruction and rendering have advanced significantly using the 3D Gaussian Splatting. However, most prior research has focused on the rendering quality along a pre-recorded vehicle path and struggles to generalize to out-of-path viewpoints, which is caused by the lack of high-quality supervision in those out-of-path views. To address this issue, we introduce an Inverse View Warping technique to create compact and high-quality images as supervision for the reconstruction of the out-of-path views, enabling high-quality rendering results for those views. For accurate and robust inverse view warping, a depth bootstrap strategy is proposed to obtain on-the-fly dense depth maps during the optimization process, overcoming the sparsity and incompleteness of LiDAR depth data. Our method achieves superior in-path and out-of-path reconstruction and rendering performance on the widely adopted Waymo Open dataset. In addition, a simulator-based benchmark is proposed to obtain the out-of-path ground truth and quantitatively evaluate the performance of out-of-path rendering, where our method outperforms previous methods by a significant margin.

2498Flow-based imputation of small data

[openreview] [pdf]

Abstract Many challenges in the physical sciences can be framed as small data problems, where theoretical progress is hindered by the sparsity, low-dimensionality, and/or limited sample size of available empirical data compared to a physical system’s numerous dynamical degrees of freedom. Developing trustworthy imputation methods for these datasets holds immense scientific importance. Normalizing flows are a promising model choice for imputation due to their ability to explicitly estimate sample likelihoods. However, research has shown that normalizing flows are often unreliable for out-of-distribution (OOD) detection in high-dimensional settings, which undermines their trustworthiness for imputation tasks. In contrast, low-dimensional settings provide opportunities to tractably evaluate and mitigate likelihood estimation errors, revealing strategies to reduce or eliminate specific error modes. We focus on the most stringent assumption in normalizing flows: diffeomorphism between the target and base distributions. This assumption introduces two distinct error modes, which we identify and address through a simple and effective strategy. Our approach significantly enhances the trustworthiness of normalizing flows for imputation in small data problems.

2499Harnessing Shallow Features in Pre-Trained Models for Out-of-Distribution Detection

[openreview] [pdf]

Abstract Recognizing out-of-distribution (OOD) samples is essential for deploying robust machine learning systems in the open-world environments. Conventional OOD detection approaches rely on feature representations from the final layer of neuron networks, often neglecting the rich information encapsulated in shallow layers. Leveraging the strengths of transformer-based architectures, we introduce an attention-based fusion module, which dynamically assigns importance weights to representations learned by each Transformer layer and detects OOD samples using the Mahalanobis distance. Compared to existing approaches, our method enables a lightweight fine-tuning of pre-trained models, and retains all feature representations that are beneficial to the OOD detection. We also thoroughly study various parameter-efficient fine-tuning strategies. Our experiments show the benefit of using shallow features, and demonstrate the influence of different Transformer layers. We fine-tune pre-trained models in both class-balanced and long-tailed in-distribution classification tasks, and show that our method achieves state-of-the-art OOD detection performance averaged across nine OOD datasets. The source code is provided in the supplementary material.

2500GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection

[openreview] [pdf]

Abstract Recent low-rank training methods, such as GaLore, have significantly reduced the memory required to optimize large language models (LLMs). However, these methods often suffer from time-consuming low-rank projection estimations. In particular, the singular value decomposition (SVD) in GaLore can consume more than 80% of the total training time. To address this issue, we propose GaLore++, which uses cross-head low-rank projection to reduce the substantial time consumption in estimating low-rank projections for multi-head attention. In addition, we employ randomized subspace iteration to achieve fast SVD. To further enhance performance, we propose sparsely coded residuals to reduce the errors caused by low-rank approximation on the first- and second-order moments of the optimizers and weight updates. We evaluate GaLore++ on arithmetic reasoning and natural language generation datasets. Our experiments demonstrate that GaLore++ delivers superior performance while achieving approximately 4×4\times fine-tuning speed compared to vanilla GaLore.

2501Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback

[openreview] [pdf]

Abstract Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples; or are limited to reward functions expressible by compact code, which may require source code and have difficulty capturing nuanced semantics; or require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose \oni, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent’s collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. By studying their relative tradeoffs, we shed light on questions regarding intrinsic reward design for sparse reward problems. Our approach achieves state-of-the-art performance across a range of challenging, sparse reward tasks from the NetHack Learning Environment in a simple unified process, solely using the agent’s gathered experience, without requiring external datasets nor source code. We make our code available at \url{URL}.

2502Offline Reinforcement Learning With Combinatorial Action Spaces

[openreview] [pdf]

Abstract Reinforcement learning problems often involve large action spaces arising from the simultaneous execution of multiple sub-actions, resulting in combinatorial action spaces. Learning in combinatorial action spaces is difficult due to the exponential growth in action space size with the number of sub-actions and the dependencies among these sub-actions. In offline settings, this challenge is compounded by limited and suboptimal data. Current methods for offline learning in combinatorial spaces simplify the problem by assuming sub-action independence. We propose Branch Value Estimation (BVE), which effectively captures sub-action dependencies and scales to large combinatorial spaces by learning to evaluate only a small subset of actions at each timestep. Our experiments show that BVE outperforms state-of-the-art methods across a range of action space sizes.

2503VRM: Knowledge Distillation via Virtual Relation Matching

[openreview] [pdf]

Abstract Knowledge distillation (KD) aims to transfer the knowledge of a more capable yet cumbersome teacher model to a lightweight student model. In recent years, relation-based KD methods have fallen behind, as instance-matching counterparts dominate in performance. In this paper, we revive relational KD by identifying and tackling several key issues in relational KD, including its susceptibility to overfitting and spurious responses. Specifically, we transfer novelly constructed affinity graphs that compactly encapsulate a wealth of beneficial inter-sample, inter-class, and inter-view correlations by exploiting virtual views and relations as a new kind of knowledge. As a result, the student has access to rich guidance signals and stronger regularisation throughout the distillation process. To further mitigate the adverse impact of spurious responses, we prune the affinity graphs by dynamically detaching redundant and unreliable edges. Extensive experiments on CIFAR-100, ImageNet, and MS-COCO datasets demonstrate the superior performance of the proposed virtual relation matching (VRM) method over a range of tasks, architectures, and set-ups. For instance, VRM for the first time hits 74.0% accuracy for ResNet50-to-MobileNetV2 distillation on ImageNet, and improves DeiT-Ti by 14.11% on CIFAR-100 with a ResNet56 teacher. Thorough analyses are also conducted to gauge the soundness, properties, and complexity of our designs. Code and models will be released.

2504RecLM: Recommendation Instruction Tuning with Large Language Models

[openreview] [pdf]

Abstract Recommender systems aim to deeply understand users’ complex preferences based on their past interactions. Deep collaborative filtering paradigms, leveraging advanced neural architectures like Graph Neural Networks (GNNs), excel at capturing collaborative relationships among users. However, limitations emerge when dealing with sparse data or zero-shot learning from unseen datasets, due to the design constraints of ID-based embedding functions in existing solutions. These challenges hinder robust generalization and adaptability. To address this, we propose a model-agnostic recommendation instruction-tuning paradigm that integrates large language models with collaborative filtering. Our Recommendation Language Model (RecLM) is introduced to enhance the capability of capturing user preference diversity. We design a reinforcement learning reward function to facilitate self-augmentation of our language models. Comprehensive evaluations demonstrate significant advantages of our approach across various settings. It can be integrated as a plug-and-play component with state-of-the-art recommender systems, resulting in notable performance enhancements.

2505Code diffusion models are continuous human noise operators

[openreview] [pdf]

Abstract Diffusion for code generates code by iteratively removing noise from the latent representation of a code snippet. During later steps of the diffusion process, when the code snippet has almost converged, these edits resemble last-mile repairs applied to broken or incomplete code. We evaluate the extent to which these errors are similar to those that humans are faced with and the capability of these models to perform last-mile repair. Our insight has two applications with significant impact for code repair. First, we can leverage the diffusion model for last-mile repair by adding noise to a broken code snippet and resuming the diffusion process. Second, we can leverage the diffusion model to generate an arbitrary amount of training data for other last-mile repair approaches (that are computationally more efficient) by sampling an intermediate program (input) and the final program (output) from the diffusion process. We perform experiments to evaluate both applications, as well as analyze trends in the evolution of representation through the diffusion pipeline providing insights on the reasoning observed.

2506HAL: Harmonic Learning in High-Dimensional MDPs

[openreview] [pdf]

Abstract Since the initial successes of deep reinforcement learning on learning policies purely by interacting with complex high-dimensional state representations and a decade of extensive research, deep neural policies have been applied to a striking variety of fields ranging from pharmaceuticals to foundation models. Yet, one of the strongest assumptions of reinforcement learning is to expect to receive a reward signal from the MDP. While this assumption comes in handy in certain fields, i.e. automated financial markets, it does not naturally fit in many others where the computational complexity of providing such a signal for the task at hand is larger than in fact learning one. Thus, in this paper we focus on learning policies in MDPs without this assumption, and study sequential decision making without having access to information on rewards provided by the MDP. We introduce We introduce harmonic learning, a training method in high-dimensional MDPs, and provide a theoretically well-founded algorithm that significantly improves the sample complexity of deep neural policies. The theoretical and empirical analysis reported in our paper demonstrates that harmonic learning achieves substantial improvements in sample efficient training while constructing more stable and resilient policies that can generalize to uncertain environments.

2507The Two-Hop Curse: LLMs trained on A→B, B→C fail to learn A→C

[openreview] [pdf]

Abstract While LLMs excel at answering multi-hop questions like “Who is the spouse of the performer of Imagine?” by thinking out loud (chain-of-thought), they perform surprisingly poorly when required to reason in their latent space and answer without chain-of-thought. This observation was previously referred to as the compositionality gap, implying that although language models are less reliable at two-hop latent reasoning, they still perform it sometimes. In this paper, we introduce a controlled setting for investigating the compositionality gap. We run a series of experiments finetuning a large language model (Llama-3-8B-Instruct) on synthetic facts expressed in English. We attempt to elicit two-hop reasoning in three ways: (i) fine-tune on a data mixture designed to incentivize two-hop reasoning, (ii) force facts to be stored in layers in the correct order, and (iii) use an auxiliary loss to provide activation-level supervision for two-hop reasoning. We show that LLaMA 3 8B successfully learns to answer two-hop questions about synthetic facts using CoT, but completely fails without CoT, achieving chance-level accuracy and chance-level test loss. Failures of LLMs in our controlled setting cast doubt on the purported ability of present LLMs to perform multihop latent reasoning and lead us to conjecture that, rather than a reasoning gap, current language models might exhibit a two-hop reasoning curse — a complete lack of ability rather than a relative weakness. This is the Two-Hop Curse.

2508M^3PC: Test-time Model Predictive Control using Pretrained Masked Trajectory Model

[openreview] [pdf]

Abstract Recent work in Offline Reinforcement Learning (RL) has shown that a unified transformer trained under a masked auto-encoding objective can effectively capture the relationships between different modalities (e.g., states, actions, rewards) within given trajectory datasets. However, this information has not been fully exploited during the inference phase, where the agent needs to generate an optimal policy instead of just reconstructing masked components from unmasked. Given that a pretrained trajectory model can act as both a Policy Model and a World Model with appropriate mask patterns, we propose using Model Predictive Control (MPC) at test time to leverage the model’s own predictive capacity to guide its action selection. Empirical results on D4RL and RoboMimic show that our inference-phase MPC significantly improves the decision-making performance of a pretrained trajectory model without any additional parameter training. Furthermore, our framework can be adapted to Offline to Online (O2O) RL and Goal Reaching RL, resulting in more substantial performance gains when an additional online interaction budget is given, and better generalization capabilities when different task targets are specified. Our code and models will be released.

2509INFER: A Neural-symbolic Model For Extrapolation Reasoning on Temporal Knowledge Graph

[openreview] [pdf]

Abstract Temporal Knowledge Graph(TKG) serves as an efficacious way to store dynamic facts in real-world. Extrapolation reasoning on TKGs, which aims at predicting possible future events, has attracted consistent research interest. Recently, some rule-based methods have been proposed, which are considered more interpretable compared with embedding-based methods. Existing rule-based methods apply rules through path matching or subgraph extraction, which falls short in inference ability and suffers from missing facts in TKGs. Besides, during rule application period, these methods consider the standing of facts as a binary 0 or 1 problem and ignores the validity as well as frequency of historical facts under temporal settings. In this paper, by designing a novel paradigm for rule application, we propose INFER, a neural-symbolic model for TKG extrapolation. With the introduction of Temporal Validity Function, INFER firstly considers the frequency and validity of historical facts and extends the truth value of facts into continuous real number to better adapt for temporal settings. INFER builds Temporal Weight Matrices with a pre-trained static KG embedding model to enhance its inference ability. Moreover INFER adopts a rule projection module which enables it apply rules through conducting matrices operation on GPU, which improves the efficiency of rule application. This feature also facilitates potential integration with existing embedding-based methods. Experimental results show that INFER achieves state-of-the-art performance on three datasets and significantly outperforms existing rule-based models on our modified, more sparse TKG datasets, which demonstrates the superiority of our model in inference ability.

2510Turing completeness of prompting

[openreview] [pdf]

Abstract Since the success of GPT, large language models (LLMs) have revolutionized machine learning and have initiated the so-calledLLM promptingparadigm. In the era of LLMs, people train a single general-purpose LLM and provide the LLM with differentpromptsto perform different tasks. However, such empirical success largely lacks theoretical understanding. Here, we present the first theoretical study on the LLM prompting paradigm to the best of our knowledge. In this work, we show that prompting is in fact Turing-complete: there exists a finite-size Transformer such that for any computable function, there exists a corresponding prompt following which the Transformer computes the function. Furthermore, we show that even though we use only a single finite-size Transformer, it can still achieve nearly the same complexity bounds as that of the class of all unbounded-size Transformers. Overall, our result reveals that prompting can enable a single finite-size Transformer to be efficiently universal, which establishes a theoretical underpinning for prompt engineering in practice.

2511Targeted Manipulation and Deception Emerge in LLMs Trained on User* Feedback

[openreview] [pdf]

Abstract When AI systems are trained to maximize positive feedback from humans, this creates a perverse incentive structure for the AI to resort to any available means—including harmful behaviors like sycophancy, deception, and manipulation—to ensure it receives positive human feedback, regardless of whether its actions truly merit such approval. So far, with LLM training, this drive has only been documented in the emergence of relatively mild forms of sycophancy, in which the system overly agrees with or praises the user. Our work shows that in domains of practical LLM usage, optimizinguserfeedback (as opposed toannotatorfeedback) may reliably lead to the emergence of manipulation, deception, and extreme forms of sycophancy which target the users that are most vulnerable to them. To mitigate this issue, it seems promising to leverage continued safety training or external annotator feedback to “veto” that of users. We find that while such approach can reduce or remove the emergence of harmful behaviors in some settings, it can even exacerbate them in others, making them more sophisticated and harder to detect. Our findings caution against optimizing user feedback without stringent safeguards, and constitute a cautionary tale of the fundamental risks and limitations that come along with optimizing any form of feedback, whether from humans or AI systems. Warning: this paper contains examples that may be offensive or upsetting.

2512Learning Chaotic Dynamics with Embedded Dissipativity

[openreview] [pdf]

Abstract Chaotic dynamics, commonly seen in weather systems and fluid turbulence, are characterized by their sensitivity to initial conditions, which makes accurate prediction challenging. Despite its sensitivity to initial perturbations, many chaotic systems observe dissipative behaviors and ergodicity. Therefore, recently various approaches have been proposed to develop data-driven models preserving invariant statistics over long horizons. Although these methods have shown empirical success in reducing instances of unbounded trajectory generation, many of the models are still prone to generating unbounded trajectories, leading to invalid statistics evaluation. In this paper, we propose a novel neural network architecture that simultaneously learns a dissipative dynamics emulator that guarantees to generate bounded trajectories and an energy-like function that governs the dissipative behavior. More specifically, by leveraging control-theoretic ideas, we derive algebraic conditions based on the learned energy-like function that ensure asymptotic convergence to an invariant level set. Using these algebraic conditions, our proposed model enforces dissipativity through a ReLU projection layer, which provides formal trajectory boundedness guarantees. Furthermore, the invariant level set provides an outer estimate for the strange attractor, which is known to be very difficult to characterize due to its complex geometry. We demonstrate the capability of our model in producing bounded long-horizon trajectory forecasts that preserve invariant statistics and characterizing the attractor, for chaotic dynamical systems including Lorenz 96 and a truncated Kuramoto-Sivashinsky equation.

2513Gradient Descent Converges Linearly to Flatter Minima than Gradient Flow in Shallow Linear Networks

[openreview] [pdf]

Abstract We study the gradient descent (GD) dynamics of a depth-2 linear neural network with a single input and output. We show that GD converges at an explicit linear rate to a global minimum of the training loss, even with a large stepsize--about 2/sharpness2/\textrm{sharpness}. It still converges for even larger stepsizes, but may do so very slowly. We also characterize the solution to which GD converges, which has lower norm and sharpness than the gradient flow solution. Our analysis reveals a trade off between the speed of convergence and the magnitude of implicit regularization. This sheds light on the benefits of training at the ``Edge of Stability’', which induces additional regularization by delaying convergence and may have implications for training more complex models.

2514What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis

[openreview] [pdf]

Abstract The Transformer architecture has inarguably revolutionized deep learning, overtaking classical architectures like multi-layer perceptrons (MLPs) and convolutional neural networks (CNNs). At its core, the attention block differs in form and functionality from most other architectural components in deep learning - to the extent that Transformers are often accompanied by adaptive optimizers, layer normalization, learning rate warmup, and more, in comparison to MLPs/CNNs. The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood. In this work, we bridge this gap by providing a fundamental understanding of what distinguishes the Transformer from the other architectures - grounded in a theoretical comparison of the (loss) Hessian. Concretely, for a single self-attention layer, (a) we first entirely derive the Transformer’s Hessian and express it in matrix derivatives; (b) we then characterize it in terms of data, weight, and attention moment dependencies; and (c) while doing so further highlight the important structural differences to the Hessian of classical networks. Our results suggest that various common architectural and optimization choices in Transformers can be traced back to their highly non-linear dependencies on the data and weight matrices, which vary heterogeneously across parameters. Ultimately, our findings provide a deeper understanding of the Transformer’s unique optimization landscape and the challenges it poses.

2515Adaptive Batch Size for Privately Finding Second-Order Stationary Points

[openreview] [pdf]

Abstract There is a gap between finding a first-order stationary point (FOSP) and a second-order stationary point (SOSP) under differential privacy constraints, and it remains unclear whether privately finding an SOSP is more challenging than finding an FOSP. Specifically, Ganesh et al. (2023) demonstrated that an α\alpha-SOSP can be found with \alpha=\Tilde{O}(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\epsilon})^{3/7}), where nn is the dataset size, dd is the dimension, and ϵ\epsilon is the differential privacy parameter. Building on the SpiderBoost algorithm framework, we propose a new approach that uses adaptive batch sizes and incorporates the binary tree mechanism. Our method improves the results for privately finding an SOSP, achieving \alpha=\Tilde{O}(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\epsilon})^{1/2}). This improved bound matches the state-of-the-art for finding an FOSP, suggesting that privately finding an SOSP may be achievable at no additional cost.

2516Improving Reasoning Ability of Large Language Models via Iterative Uncertainty-based Preference Optimization

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has recently emerged as an efficient and effective method for aligning large language models with human preferences. However, constructing high-quality preference datasets remains challenging, often necessitating expensive manual or powerful LM annotations. Additionally, standard DPO exhibits suboptimal performance in complex reasoning tasks, such as mathematical and code reasoning. In this paper, we introduce an approach to collect preference pairs through iterative sampling and execution feedback, tailored to the current learning state (e.g. well-learned, mis-learned, and unlearned) of the policy model. To alleviate the failures of DPO and improve its applicability in reasoning tasks, we propose IUPO, an iterative uncertainty-based preference optimization method that achieves fine-grained preference control by assessing model confidence. We validate our approach across three reasoning tasks, incorporating five established reasoning datasets and one self-curated dataset. Our experimental results demonstrate an overall improvement of 3.6% over the standard DPO method. Furthermore, our approach exhibits promising generalizability involving weak-to-strong (8B to 70B) and cross-model (Llama to Mistral) generalizations.

2517Anchored Alignment for Self-Explanations Enhancement

[openreview] [pdf]

Abstract In this work, we introduce a methodology for alignment designed to enhance the ability of large language models (LLMs) to articulate their reasoning—\textit{self-explanation}—even in the absence of annotated rationale explanations. Our alignment methodology comprises three key components: explanation quality assessment, self-instruction dataset generation, and model alignment. Additionally, we present a novel technique called \textit{Alignment with Anchor Preference Pairs}, which improves the selection of preference pairs by categorizing model outputs into three groups: consistently correct, consistently incorrect, and variable. By applying tailored strategies to each category, we enhance the effectiveness of Direct Preference Optimization (DPO). Our experimental results demonstrate that this approach significantly improves explanation quality while maintaining accuracy compared to other fine-tuning strategies.

2518Securing Equal Share: A Principled Approach for Learning Multiplayer Symmetric Games

[openreview] [pdf]

Abstract This paper examines multiplayer symmetric constant-sum games with more than two players in a competitive setting, including examples like Mahjong, Poker, and various board and video games. In contrast to two-player zero-sum games, equilibria in multiplayer games are neither unique nor non-exploitable, failing to provide meaningful guarantees when competing against opponents who play different equilibria or non-equilibrium strategies. This gives rise to a series of long-lasting fundamental questions in multiplayer games regarding suitable objectives, solution concepts, and principled algorithms. This paper takes an initial step toward addressing these challenges by focusing on the natural objective ofequal share—securing an expected payoff of C/nC/n in an nn-player symmetric game with a total payoff of CC. We rigorously identify the theoretical conditions under which achieving an equal share is tractable and design a series of efficient algorithms, inspired by no-regret learning, thatprovablyattain approximate equal share across various settings. Furthermore, we provide complementary lower bounds that justify the sharpness of our theoretical results. Our experimental results highlight worst-case scenarios where meta-algorithms from prior state-of-the-art systems for multiplayer games fail to secure an equal share, while our algorithm succeeds, demonstrating the effectiveness of our approach.

2519Realizable Abstractions: Near-Optimal Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract The main focus of Hierarchical Reinforcement Learning (HRL) is studying how large Markov Decision Processes (MDPs) can be more efficiently solved when addressed in a modular way, by combining partial solutions computed for smaller subtasks. Despite their very intuitive role for learning, most notions of MDP abstractions proposed in the HRL literature have limited expressive power or do not possess formal efficiency guarantees.This work addresses these fundamental issues by defining Realizable Abstractions, a new relation between generic low-level MDPs and their associated high-level decision processes. The notion we propose avoids non-Markovianity issues and has desirable near-optimality guarantees. Indeed, we show that any abstract policy for Realizable Abstractions can be translated into near-optimal policies for the low-level MDP, through a suitable composition of options. As demonstrated in the paper, these options can be expressed as solutions of specific constrained MDPs. Based on these findings, we propose RARL, a new HRL algorithm that returns compositional and near-optimal low-level policies, taking advantage of the Realizable Abstraction given in the input. We show that RARL is Probably Approximately Correct, it converges in a polynomial number of samples, and it is robust to inaccuracies in the abstraction.

2520Automatic Curriculum Expert Iteration for Reliable LLM Reasoning

[openreview] [pdf]

Abstract Hallucinations (i.e., generating plausible but inaccurate content) and laziness (i.e. excessive refusals or defaulting to “I don’t know”) persist as major challenges in LLM reasoning. Current efforts to reduce hallucinations primarily focus on factual errors in knowledge-grounded tasks, often neglecting hallucinations related to faulty reasoning. Meanwhile, some approaches render LLMs overly conservative, limiting their problem-solving capabilities. To mitigate hallucination and laziness in reasoning tasks, we propose Automatic Curriculum Expert Iteration (Auto-CEI) to enhance LLM reasoning and align responses to the model’s capabilities--assertively answering within its limits and declining when tasks exceed them. In our method, Expert Iteration explores the reasoning trajectories near the LLM policy, guiding incorrect paths back on track to reduce compounding errors and improve robustness; it also promotes appropriate “I don’t know” responses after sufficient reasoning attempts. The curriculum automatically adjusts rewards, incentivizing extended reasoning before acknowledging incapability, thereby pushing the limits of LLM reasoning and aligning its behaviour with these limits. We compare Auto-CEI with various SOTA baselines across logical reasoning, mathematics, and planning tasks, where Auto-CEI achieves superior alignment by effectively balancing assertiveness and conservativeness.

2521Harnessing Input-adaptive Inference for Efficient Vision-and-Language Navigation

[openreview] [pdf]

Abstract An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models take observation and history as input and predict the most appropriate action for an agent. While employing these models has significantly improved performance, the scale of these models can be a bottleneck in practical settings where computational resources are limited (e.g., in robots). In this work, we present a novel input-adaptive navigation method for efficient VLN. We first characterize the overthinking problem in VLN and show that none of the existing input-adaptive mechanisms successfully reduce overthinking without causing significant performance degradation. Our method addresses this problem by developing three adaptive algorithms deployed at different levels: (1) We develop an adaptive approach that improves spatial efficiency; we only process a subset of panoramic views at each observation of an agent. (2) We also achieve model-level efficiency by developing adaptive thresholding for the early-exit method we employ, based on the importance of each view in navigation. (3) To achieve temporal efficiency, we design a caching mechanism to avoid processing views that an agent has seen before. In evaluations with six VLN benchmark tasks, we demonstrate over a 2×\times reduction in computation across two off-the-shelf VLN agents.

2522Federated Domain Generalization with Data-free On-server Gradient Matching

[openreview] [pdf]

Abstract Domain Generalization (DG) aims to learn from multiple known source domains a model that can generalize well to unknown target domains. One of the key approaches in DG is training an encoder which generates domain-invariant representations. However, this approach is not applicable in Federated Domain Generalization (FDG), where data from various domains are distributed across different clients. In this paper, we introduce a novel approach, dubbed Federated Learning via On-server Matching Gradient (FedOMG), which can efficiently leverage domain information from distributed domains. Specifically, we utilize the local gradients as information about the distributed models to find an invariant gradient direction across all domains through gradient inner product maximization. The advantages are two-fold: 1) we can aggregate the characteristics of distributed models on the centralized server without incurring any additional communication cost, and 2) our method is orthogonal to many existing DG methods, allowing for additional performance improvements by being seamlessly integrated with them. Extensive experimental evaluations on various settings to demonstrate the robustness of FedOMG compared to other FL/FDG baselines. Our method outperforms recent SOTA baselines on four FL benchmarks (MNIST, EMNIST, CIFAR-10, CIFAR-100), and three FDG benchmarks (PACS, VLCS, OfficeHome).

2523Optimal Memorization Capacity of Transformers

[openreview] [pdf]

Abstract Recent research in the field of machine learning has increasingly focused on the memorization capacity of Transformers, but how efficient they are is not yet well understood. We demonstrate that Transformers can memorize labels with O~(N)\tilde{O}(\sqrt{N}) parameters in a next-token prediction setting for NN input sequences of length nn, which is proved to be optimal up to logarithmic factors. This indicates that Transformers can efficiently perform memorization with little influence from the input length nn owing to the benefit of parameter sharing. We also analyze the memorization capacity in the sequence-to-sequence setting, and find that O~(nN)\tilde{O}(\sqrt{nN}) parameters are not only sufficient, but also necessary at least for Transformers with hardmax. These results suggest that while self-attention mechanisms can efficiently identify input sequences, the feed-forward network becomes a bottleneck when associating a label to each token.

2524A Foundation Model for Weather and Climate

[openreview] [pdf]

Abstract Triggered by the realization that AI emulators can rival the performance of traditional numerical weather prediction models running on HPC systems, there is now an increasing number of large AI models that address use cases such as forecasting, downscaling, or nowcasting. While the parallel developments in the AI literature focus on foundation models -- models that can be effectively tuned to address multiple, different use cases -- the developments on the weather and climate side largely focus on single-use cases with particular emphasis on mid-range forecasting. We close this gap by introducing Prithvi WxC, a 2.3 billion parameter foundation model developed using 160 variables from the Modern-Era Retrospective Analysis for Research and Applications, Version 2 (MERRA-2). Prithvi WxC employs an encoder-decoder-based architecture, incorporating concepts from various recent transformer models to effectively capture both regional and global dependencies in the input data. The model has been designed to accommodate large token counts to model weather phenomena in different topologies at fine resolutions. Furthermore, it is trained with a mixed objective that combines the paradigms of masked reconstruction with forecasting. We test the model on a set of challenging downstream tasks namely: Autoregressive rollout forecasting, downscaling, gravity wave flux parameterization, and extreme events estimation.

2525Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA

[openreview] [pdf]

Abstract Multiple-choice question answering (MCQA) tasks are fundamental benchmarks for evaluating large language models (LLMs), yet models, especially smaller ones, often struggle with these tasks. In this paper, we uncover that intermediate attention heads within LLMs hold valuable insights for improving MCQA performance. We introduce a novel method that leverages query-key interactions in specific “select-and-copy” attention heads to effectively select correct answers. Building on this mechanism, we propose two option scoring methods: the QK-score and the attention score, based on the query-key representations from these heads. Our approach demonstrates consistent performance improvements across popular MCQA datasets, yielding up to a 16% increase in accuracy for LLaMA-2-7B and up to 10% for larger models on established benchmarks, with especially notable gains in zero-shot scenarios. In controlled setting of a simple synthetic dataset where the correct answer is explicitly known, accuracy improves by up to 60%. Furthermore, our method often shows better stability to option perturbations compared to existing baselines. By analyzing the decision process in these select-and-copy heads, we contribute to a deeper understanding of how LLMs process MCQA tasks, offering insights that can enhance the development of more reliable and interpretable models.

2526Exploiting Hierarchical Taxonomies in Pretrained Continual Learning

[openreview] [pdf]

Abstract Drawing inspiration from human learning behaviors, this work proposes a novel approach to mitigate catastrophic forgetting in Prompt-based Continual Learning (PCL) models by exploiting the relationships between continuously emerging class data. We find that applying human habits of organizing and connecting information can serve as an efficient strategy when training deep learning models. Specifically, by building a hierarchical tree structure based on the expanding set of labels, we gain fresh insights into the data, identifying groups of similar classes could easily cause confusion. Additionally, we delve deeper into the hidden connections between classes by exploring the original pretrained model’s behavior through an optimal transport-based approach. From these insights, we propose a novel regularization loss function that encourages models to focus more on challenging knowledge areas, thereby enhancing overall performance. Experimentally, our method demonstrated significant superiority over current state-of-the-arts on various benchmarks.

2527A Dual-Agent Adversarial Framework for Generalizable Reinforcement Learning

[openreview] [pdf]

Abstract Recently, empowered with the powerful capabilities of neural networks, reinforcement learning (RL) has successfully tackled numerous challenging tasks. However, while these models demonstrate enhanced decision-making abilities, they are increasingly prone to overfitting. For instance, a trained RL model often fails to generalize to even minor variations of the same task, such as a change in background color or other minor semantic differences. To address this issue, we propose a dual-agent adversarial policy learning framework, which allows agents to spontaneously learn the underlying semantics without introducing any human prior knowledge. Specifically, our framework involves a game process between two agents: each agent seeks to maximize the impact of perturbing on the opponent’s policy by producing representation differences for the same state, while maintaining its own stability against such perturbations. This interaction encourages agents to learn generalizable policies, capable of handling irrelevant features from the high-dimensional observations. Extensive experimental results on the Procgen benchmark demonstrate that the adversarial process significantly improves the generalization performance of both agents, while also being applied to various RL algorithms, e.g., Proximal Policy Optimization (PPO). With the adversarial framework, the RL agent outperforms the baseline methods by a significant margin, especially in hard-level tasks, marking a significant step forward in the generalization capabilities of deep reinforcement learning.

2528Rethinking Invariance in In-context Learning

[openreview] [pdf]

Abstract In-Context Learning (ICL) has emerged as a pivotal capability of auto-regressive large language models, yet it is hindered by a notable sensitivity to the ordering of context examples regardless of their mutual independence. To address this issue, recent studies have introduced several variant algorithms of ICL that achieve permutation invariance. However, many of these do not exhibit comparable performance with the standard auto-regressive ICL algorithm. In this work, we identify two crucial elements in the design of an invariant ICL algorithm: information non-leakage and context interdependence, which are not simultaneously achieved by any of the existing methods. These investigations lead us to the proposed \emph{Invariant ICL (InvICL)}, a methodology designed to achieve invariance in ICL while ensuring the two properties. Theoretically, we prove that InvICL approximates standard gradient descent, which possess the best convergence properties among all the gradient descent variants of existing ICL algorithms. Empirically, our findings reveal that InvICL surpasses previous models, both invariant and non-invariant, in most benchmark datasets, showcasing superior generalization capabilities across varying input lengths.

2529Generalized Principal-Agent Problem with a Learning Agent

[openreview] [pdf]

Abstract Generalized principal-agent problems, including Stackelberg games, contract design, and Bayesian persuasion, are a class of economic problems where an agent best responds to a principal’s committed strategy. We study repeated generalized principal-agent problems under the assumption that the principal does not have commitment power and the agent uses algorithms to learn to respond to the principal. We reduce this problem to a one-shot generalized principal-agent problem where the agent approximately best responds. Using this reduction, we show that: (1) if the agent uses contextual no-regret learning algorithms with regret Reg(T)\mathrm{Reg}(T), then the principal can guarantee utility at least UΘ(Reg(T)T)U^* - \Theta\big(\sqrt{\tfrac{\mathrm{Reg}(T)}{T}}\big), where UU^* is the principal’s optimal utility in the classic model with a best-responding agent. (2) If the agent uses contextual no-swap-regret learning algorithms with swap-regret SReg(T)\mathrm{SReg}(T), then the principal cannot obtain utility more than U+O(SReg(T)T)U^* + O(\frac{\mathrm{SReg(T)}}{T}). But (3) if the agent uses mean-based learning algorithms (which can be no-regret but not no-swap-regret), then the principal can sometimes do significantly better than UU^*. These results not only refine previous results in Stackelberg games and contract design, but also lead to new results for Bayesian persuasion with a learning agent and all generalized principal-agent problems where the agent does not have private information.

2530PSformer: Parameter-efficient Transformer with Segment Attention for Time Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting remains a critical challenge across various domains, often complicated by high-dimensional data and long-term dependencies. This paper presents a novel transformer architecture for time series forecasting, incorporating two key innovations: parameter sharing (PS) and Spatial-Temporal Segment Attention (SegAtt). We also define the time series segment as the concatenation of sequence patches from the same positions across different variables. The proposed model, PSformer, reduces the number of training parameters through the parameter sharing mechanism, thereby improving model efficiency and scalability. The introduction of SegAtt could enhance the capability of capturing local spatio-temporal dependencies by computing attention over the segments, and improve global representation by integrating information across segments. The combination of parameter sharing and SegAtt significantly improves the forecasting performance. Extensive experiments on benchmark datasets demonstrate that PSformer outperforms popular baselines and other transformer-based approaches in terms of accuracy and scalability, establishing itself as an accurate and scalable tool for time series forecasting.

2531Preference Optimization for Combinatorial Optimization Problems

[openreview] [pdf]

Abstract Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring optimal solutions. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficient learning. In this paper, we propose Preference Optimization(PO)Preference \ Optimization (PO), a novel framework that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among generated solutions. Methodologically, by reparameterizing the reward function in terms of policy probabilities and utilizing preference models like Bradley-Terry and Thurstone, we formulate an entropy-regularized optimization objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate heuristic local search techniques into the fine-tuning process to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on standard combinatorial optimization benchmarks, such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP), demonstrate that our method outperforms traditional RL algorithms, achieving superior sample efficiency and solution quality. Our work offers a simple yet efficient algorithmic advancement in neural combinatorial optimization.

2532Flow Matching for Posterior Inference with Simulator Feedback

[openreview] [pdf]

Abstract Flow-based generative modeling is a powerful tool for solving inverse problems in physical sciences that can be used for sampling and likelihood evaluation with much lower inference times than traditional methods. We propose to refine flows with additional control signals based on a simulator. Control signals can include gradients and a problem-specific cost function if the simulator is differentiable, or they can be fully learned from the simulator output. In our proposed method, we pretrain the flow network and include feedback from the simulator exclusively for finetuning, therefore requiring only a small amount of additional parameters and compute. We motivate our design choices on several benchmark problems for simulation-based inference and evaluate flow matching with simulator feedback against classical MCMC methods for modeling strong gravitational lens systems, a challenging inverse problem in astronomy. We demonstrate that including feedback from the simulator improves the accuracy by 53%, making it competitive with traditional techniques while being up to 67x faster for inference. Upon acceptance, we will make our code publicly available.

2533Does Training with Synthetic Data Truly Protect Privacy?

[openreview] [pdf]

Abstract As synthetic data becomes increasingly popular in machine learning tasks, numerous methods—without formal differential privacy guarantees—use synthetic data for training. These methods often claim, either explicitly or implicitly, to protect the privacy of the original training data. In this work, we explore four different training paradigms—coreset selection, dataset distillation, data-free knowledge distillation, and synthetic data generated from diffusion models. While all these methods utilize synthetic data for training, they lead to vastly different conclusions regarding privacy preservation. This highlights that empirical approaches to preserving data privacy require careful and rigorous evaluation; otherwise, they risk providing a false sense of privacy.

2534AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

[openreview] [pdf]

Abstract While recently Large Language Models (LLMs) have achieved remarkable successes, they are vulnerable to certainjailbreaking attacksthat lead to generation of inappropriate or harmful content. Manual red-teaming requires finding adversarial prompts that cause such jailbreaking, e.g. by appending a suffix to a given instruction, which is inefficient and time-consuming. On the other hand, automatic adversarial prompt generation often leads to semantically meaningless attacks that can easily be detected by perplexity-based filters, may require gradient information from the TargetLLM, or do not scale well due to time-consuming discrete optimization processes over the token space. In this paper, we present a novel method that uses another LLM, called theAdvPrompter, to generate human-readable adversarial prompts in seconds, 800×\sim800\times faster than existing optimization-based approaches. We train the AdvPrompter using a novel algorithm thatdoes not require gradientsof the TargetLLM. This process alternates between two steps: (1) generating high-quality target adversarial suffixes by optimizing the AdvPrompter predictions, and (2) fine-tuning of the AdvPrompter with the generated adversarial suffixes. The trained AdvPrompter generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open source TargetLLMs show state-of-the-art results on the AdvBench dataset, that also transfer to closed-source black-box LLM APIs. Further, we demonstrate that by fine-tuning on a synthetic dataset generated by AdvPrompter, LLMs can be made more robust against jailbreaking attacks while maintaining performance, i.e. high MMLU scores.

2535Can Generative AI Solve Your In-Context Learning Problem? A Martingale Perspective

[openreview] [pdf]

Abstract This work is about estimating when a conditional generative model (CGM) can solve an in-context learning (ICL) problem. An in-context learning (ICL) problem comprises a conditional generative model (CGM), a dataset, and a prediction task. For example, the CGM could be a pre-trained multi-modal foundation model; the dataset could be a collection of patient histories, test results, and recorded diagnoses; and the prediction task could be to communicate the diagnoses to a new patient. The Bayesian interpretation of ICL assumes that the CGM computes a posterior predictive distribution over an unknown Bayesian model defining a joint distribution over latent explanations and observable data. From this perspective, Bayesian model criticism is a reasonable approach to assess the suitability of a given CGM for an ICL problem. However, such approaches---like posterior predictive checks (PPCs)---often assume that we can sample from the likelihood and posterior defined by the Bayesian model, which are not explicitly given for contemporary CGMs. To address this, we show when ancestral sampling from the predictive distribution of a CGM is equivalent sampling datasets from the posterior predictive of the assumed Bayesian model. Then we develop the generative predictive pp-value, which enables PPCs and their cousins for contemporary CGMs. The generative predictive pp-value can then be used in a statistical decision procedure to determine when the model is appropriate for an ICL problem, as a metric to compare different model choices, or as a general measure of uncertainty over models. Our method only requires generating queries and responses from a CGM and evaluating its response log probability. We empirically evaluate our method on synthetic regression and natural language ICL tasks using large language models.

2536A few-shot Label Unlearning in Vertical Federated Learning

[openreview] [pdf]

Abstract This paper addresses the critical challenge of unlearning in Vertical Federated Learning (VFL), an area that has received limited attention compared to horizontal federated learning. We introduce the first approach specifically designed to tackle label unlearning in VFL, focusing on scenarios where the active party aims to mitigate the risk of label leakage. Our method leverages a limited amount of labeled data, utilizing manifold mixup to augment the forward embedding of insufficient data, followed by gradient ascent on the augmented embeddings to erase label information from the models. This combination of augmentation and gradient ascent enables high unlearning effectiveness while maintaining efficiency, completing the unlearning procedure within seconds. Extensive experiments conducted on diverse datasets, including MNIST, CIFAR10, CIFAR100, and ModelNet, validate the efficacy and scalability of our approach. This work represents a significant advancement in federated learning, addressing the unique challenges of unlearning in VFL while preserving both privacy and computational efficiency.

2537Error Broadcast and Decorrelation as a Potential Artificial and Natural Learning Mechanism

[openreview] [pdf]

Abstract We introduce the Error Broadcast and Decorrelation (EBD) algorithm, a novel learning framework that addresses the credit assignment problem in neural networks by directly broadcasting output error to individual layers. The EBD algorithm leverages the orthogonality property of the optimal minimum mean square error (MMSE) estimator, which states that estimation errors are orthogonal to any nonlinear function of the input, specifically the activations of each layer. By defining layerwise loss functions that penalize correlations between these activations and output errors, the EBD method offers a principled and efficient approach to error broadcasting. This direct error transmission eliminates the need for weight transport inherent in backpropagation. Additionally, the optimization framework of the EBD algorithm naturally leads to the emergence of the experimentally observed three-factor learning rule. We further demonstrate how EBD can be integrated with other biologically plausible learning frameworks, transforming time-contrastive approaches into single-phase, non-contrastive forms, thereby enhancing biological plausibility and performance. Numerical experiments demonstrate that EBD achieves performance comparable to or better than state-of-the-art methods on benchmark datasets. Our findings suggest that EBD offers a promising, principled direction for both artificial and natural learning paradigms, providing a biologically plausible and flexible alternative for neural network training with inherent simplicity and adaptability that could benefit future developments in neural network technologies.

2538TTVD: Towards a Geometric Framework for Test-Time Adaptation Based on Voronoi Diagram

[openreview] [pdf]

Abstract Deep learning models often struggle with generalization when deploying on real-world data, due to the common distributional shift to the training data. Test-time adaptation (TTA) is an emerging scheme used at inference time to address this issue. In TTA, models are adapted online at the same time when making predictions to test data. Neighbor-based approaches have gained attention recently, where prototype embeddings provide location information to alleviate the feature shift between training and testing data. However, due to their inherit limitation of simplicity, they often struggle to learn useful patterns and encounter performance degradation. To confront this challenge, we study the TTA problem from a geometric point of view. We first reveal that the underlying structure of neighbor-based methods aligns with the Voronoi Diagram, a classical computational geometry model for space partitioning. Building on this observation, we propose the Test-Time adjustment by Voronoi Diagram guidance (TTVD), a novel framework that leverages the benefits of this geometric property. Specifically, we explore two key structures: 1) Cluster-induced Voronoi Diagram (CIVD): This integrates the joint contribution of self-supervision and entropy-based methods to provide richer information. 2) Power Diagram (PD): A generalized version of the Voronoi Diagram that refines partitions by assigning weights to each Voronoi cell. Our experiments under rigid, peer-reviewed settings on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and ImageNet-R shows that TTVD achieves remarkable improvements compared to state-of-the-art methods. Moreover, extensive experimental results also explore the effects of batch size and class imbalance, which are two scenarios commonly encountered in real-world applications. These analyses further validate the robustness and adaptability of our proposed framework.

2539Unifying Causal Representation Learning with the Invariance Principle

[openreview] [pdf]

Abstract Causal representation learning aims at recovering latent causal variables from high-dimensional observations to solve causal downstream tasks, such as predicting the effect of new interventions or more robust classification. A plethora of methods have been developed, each tackling carefully crafted problem settings that lead to different types of identifiability. The folklore is that these different settings are important, as they are often linked to different rungs of Pearl’s causal hierarchy, although not all neatly fit. Our main contribution is to show thatmany existing causal representation learning approaches methodologically align the representation to known data symmetries. Identification of the variables is guided by equivalence classes across different “data pockets” that are not necessarily causal. This result suggests important implications, allowing us to unify many existing approaches in a single method that can mix and match different assumptions, including non-causal ones, based on the invariances relevant to our application. It also significantly benefits applicability, which we demonstrate by improving treatment effect estimation on real-world high-dimensional ecological data. Overall, this paper clarifies the role of causality assumptions in the discovery of causal variables and shifts the focus to preserving data symmetries.

2540From Reward Shaping to Q-Shaping: Achieving Unbiased Learning with LLM-Guided Knowledge

[openreview] [pdf]

Abstract Q-shaping is an extension of Q-value initialization and serves as an alternative to reward shaping for incorporating domain knowledge to accelerate agent training, thereby improving sample efficiency by directly shaping Q-values. This approach is both general and robust across diverse tasks, allowing for immediate impact assessment while guaranteeing optimality. We evaluated Q-shaping across 20 different environments using a large language model (LLM) as the heuristic provider. The results demonstrate that Q-shaping significantly enhances sample efficiency, achieving a \textbf{16.87%} improvement over the best baseline in each environment and a \textbf{253.80%} improvement compared to LLM-based reward shaping methods. These findings establish Q-shaping as a superior and unbiased alternative to conventional reward shaping in reinforcement learning.

2541Policy optimization can be memory-efficient: LLM Alignment Through Successive Policy Re-weighting (SPR)

[openreview] [pdf]

Abstract Reinforcement learning (RL) is serving as the cornerstone of aligning large language models (LLMs) to human behavior, by providing an appealing formulation and a suite of effective algorithms for learning behavior strategies through interacting with the underlying environment. Current paradigm of RL-based methods for LLM alignment, such as reinforcement learning with human feedback (RLHF) involves utilizing a reward function learned from extensive offline datasets to expediate the online training of reinforcement learning. The reward function learned is then used for policy optimization to obtain an improved policy (i.e. the LLM). Despite the success of RL approaches in aligning LLM with offline datasets, there are significant computational/limit of resources concern on applying RL-based methods for LLMs. For example, standard RLHF requires simultaneous loading of four models to the computing unit. In this paper, we develop a novel policy optimization algorithm named Successive Policy Re-weighting (SPR), matching the peak memory consumption of standard supervised fine-tune (SFT). Further, SPR can leverage both offline and online datasets to expediate online training and improve the sample efficiency. Specifically, SPR leverages a supervised learning subroutine to achieve policy improvement through re-weighting the policy according to the importance/performance of executed actions. Such simple and effective method is computationally inexpensive, requiring loading only one model at each update step, matching the computational cost of standard supervised fine-tuning procedure. Experimental results show that the proposed method can significantly outperform benchmark algorithms and accelerate the online training with available offline dataset.

2542Rethinking Pre-Training in Tabular Data: A Neighborhood Embedding Perspective

[openreview] [pdf]

Abstract Pre-training is prevalent in deep learning for vision and text data, acquiring knowledge from other datasets to improve the downstream tasks. However, when it comes to tabular data, the inherent heterogeneity in the attribute and label spaces among datasets makes it hard to learn shareable knowledge and encode it in a model. We proposeTabular dataPre-Training viaMeta-representation (TabPTM), aiming to pre-train a general tabular model over a set of heterogeneous datasets. The key is to embed data instances from any dataset into a common feature space, in which an instance is represented by its distance to a fixed number of nearest neighbors and their labels. Such a meta-representation standardizes heterogeneous tasks into homogeneous local prediction problems, enabling training a model to infer the label (or the score to each possible label) of an input instance based on its neighborhood information. As such, the pre-trained TabPTM can be directly applied to new datasets without further fine-tuning, regardless of their diverse attributes and labels. Extensive experiments on 72 tabular datasets validate TabPTM’s effectiveness (with and without fine-tuning) in both tabular classification and regression tasks.

2543Thought-Retriever: Don’t Just Retrieve Raw Data, Retrieve Thoughts

[openreview] [pdf]

Abstract Large language models (LLMs) have transformed AI research thanks to their powerful \textit{internal} capabilities and knowledge. However, existing LLMs still fail to effectively incorporate the massive \textit{external} knowledge when interacting with the world. Although retrieval-augmented LLMs are proposed to mitigate the issue, they are still fundamentally constrained by the context length of LLMs, as they can only retrieve top-K raw data chunks from the external knowledge base which often consists of millions of data chunks. Here we propose Thought-Retriever, a novel model-agnostic algorithm that helps LLMs generate output conditioned on arbitrarily long external data, without being constrained by the context length or number of retrieved data chunks. Our key insight is to let an LLM fully leverage its intermediate responses generated when solving past user queries (thoughts), filtering meaningless and redundant thoughts, organizing them in thought memory, and retrieving the relevant thoughts when addressing new queries. Besides algorithmic innovation, we further meticulously prepare a novel benchmark, AcademicEval, which requires an LLM to faithfully leverage ultra-long context to answer queries based on real-world academic papers. Extensive experiments on AcademicEval and two other public datasets validate that Thought-Retriever remarkably outperforms state-of-the-art baselines, achieving an average increase of at least 7.6% in F1 score and 16% in win rate across various tasks. More importantly, we further demonstrate two exciting findings: (1) Thought-Retriever can indeed help LLM self-evolve after solving more user queries; (2) Thought-Retriever learns to leverage deeper thoughts to answer more abstract user queries.

2544Continuous Speech Synthesis using per-token Latent Diffusion

[openreview] [pdf]

Abstract The success of autoregressive transformer models with discrete tokens has inspired quantization-based approaches for continuous modalities, though these often limit reconstruction quality. We therefore introduce SALAD, a per-token latent diffusion model for zero-shot text-to-speech, that operates on continuous representations. SALAD builds upon the recently proposed expressive diffusion head for image generation, and extends it to generate variable-length outputs. Our approach utilizes semantic tokens for providing contextual information and determining the stopping condition. We suggest three continuous variants for our method, extending popular discrete speech synthesis techniques. Additionally, we implement discrete baselines for each variant and conduct a comparative analysis of discrete versus continuous speech modeling techniques. Our results demonstrate that both continuous and discrete approaches are highly competent, and that SALAD achieves a superior intelligibility score while obtaining speech quality and speaker similarity on par with the ground-truth audio.

2545Unsupervised-to-Online Reinforcement Learning

[openreview] [pdf]

Abstract Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised offline skill-based policy pre-training and supervised online fine-tuning. Throughout our experiments in nine state-based and pixel-based environments, we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches, while being able to reuse a single pre-trained model for a number of different downstream tasks.

2546Learning Generative Judge from Preference Data

[openreview] [pdf]

Abstract Learning from preference feedback is a common practice for aligning large language models (LLMs) with human value. Conventionally, preference data is learned and encoded into a scalar reward model that connects a value head with an LLM to produce a scalar score as preference. However, scalar models lack interpretability and are known to be susceptible to biases in datasets. This paper investigates leveraging the generation capability of LLMs to address both limitations in one shot. Specifically, we prompt the pre-trained LLM to generate positive and negative judgments, both supported with rationales in natural language form. The self-generated contrastive judgment pairs are used to train the generative judge with Direct Preference Optimization (DPO). This proposal of learning the generativeJudge using self-generatedContrastive judgments (Con-J) ensures natural interpretability through the generated rationales supporting the judgments, and demonstrates higher robustness against bias compared to scalar models. Experimental results show that Con-J outperforms the scalar reward model trained on the same collection of preference data, and outperforms a series of open-source and closed-source generative LLMs. We open-source the training process and model weights of Con-J athttps://anonymous.4open.science/r/Con-J-D014/.

2547Improving Equivariant Networks with Probabilistic Symmetry Breaking

[openreview] [pdf]

Abstract Equivariance encodes known symmetries into neural networks, often enhancing generalization. However, equivariant networks cannotbreaksymmetries: the output of an equivariant network must, by definition, have at least the same self-symmetries as its input. This poses an important problem, both (1) for prediction tasks on domains where self-symmetries are common, and (2) for generative models, which must break symmetries in order to reconstruct from highly symmetric latent spaces. This fundamental limitation can in fact be addressed by consideringequivariant conditional distributions, instead of equivariant functions. We therefore present novel theoretical results that establish necessary and sufficient conditions for representing such distributions. Concretely, this representation provides a practical framework for breaking symmetries in any equivariant network via randomized canonicalization. Our method, SymPE (Symmetry-breaking Positional Encodings), admits a simple interpretation in terms of positional encodings. This approach expands the representational power of equivariant networks while retaining the inductive bias of symmetry, which we justify through generalization bounds. Experimental results demonstrate that SymPE significantly improves performance of group-equivariant and graph neural networks across diffusion models for graphs, graph autoencoders, and lattice spin system modeling.

2548Toward Domain Translation with Monolingual Domain Data Only

[openreview] [pdf]

Abstract Neural machine translation (NMT) is very sensitive to domain shifts requiring a carefully designed fine-tuning strategy to avoid catastrophic forgetting problems when adapting to a new domain. Fine-tuning usually relies on high quality in-domain data, but constructing a sufficient amount of parallel data for training poses challenges even for fine-tuning. In contrast, domain-specific monolingual resources are more accessible when compared with bilingual data. Therefore, we challenge the domain adaptation of a general NMT model using only features obtained from a small amount of monolingual data. We regard the task as an instance of domain shifts, and adopt energy-based models (EBMs) and approximate these EBMs using Conditional Distributional Policy Gradients (CDPG). Recent work has applied CDPG with a small number of EBMs for NMT models limiting the capacity for domain shifts, but we construct a large number of EBMs considering the entire domain-specific data, i.e., unigram distribution, and perform fine-tuning according to their constraints. Our results show that fine-tuning using a large number of EBMs can achieve a robust domain shift without causing catastrophic forgetting, demonstrating a robust domain shift using only a small amount of monolingual resources.

2549Counterintuitive RL: The Hidden Value of Acting Bad

[openreview] [pdf]

Abstract Learning to make sequential decisions solely from interacting with an environment without any supervision has been achieved by the initial installation of deep neural networks as function approximators to represent and learn a value function in high-dimensional MDPs. Reinforcement learning policies face exponentially growing state spaces in experience collection in high dimensional MDPs resulting in a dichotomy between computational complexity and policy success. In our paper we focus on the agent’s interaction with the environment in a high-dimensional MDP during the learning phase and we introduce a theoretically-founded novel method based on experiences obtained through extremum actions. Our analysis and method provides a theoretical basis for effective, accelerated and efficient experience collection, and further comes with zero additional computational cost while leading to significant acceleration of training in deep reinforcement learning. We conduct extensive experiments in the Arcade Learning Environment with high-dimensional state representation MDPs. We demonstrate that our technique improves the human normalized median scores of Arcade Learning Environment by 248% in the low-data regime.

2550Conformal Prediction Sets with Improved Conditional Coverage using Trust Scores

[openreview] [pdf]

Abstract Standard conformal prediction offers a marginal guarantee on coverage, but for prediction sets to be truly useful, they should ideally ensure coverage conditional on each test point. However, it is impossible to achieve exact, distribution-free conditional coverage in finite samples. In this work, we propose an alternative conformal prediction algorithm that targets coverage where it matters most---in instances where a classifier is overconfident in its incorrect predictions. We start by dissecting miscoverage events in marginally-valid conformal prediction, and show that miscoverage rates vary based on the classifier’s confidence and its deviation from the Bayes optimal classifier. Motivated by this insight, we develop a variant of conformal prediction that targets coverage conditional on a reduced set of two variables: the classifier’s confidence in a prediction and a nonparametric trust score that measures its deviation from the Bayes classifier. Empirical evaluation on multiple image datasets shows that our method generally improves conditional coverage properties compared to standard conformal prediction, including class-conditional coverage, coverage over arbitrary subgroups, and coverage over demographic groups.

2551K&L: Penetrating Backdoor Defense with Key and Locks

[openreview] [pdf]

Abstract Backdoor attacks in machine learning create hidden vulnerability by manipulating the model behaviour with specific triggers. Such attacks often remain unnoticed as the model operates as expected for normal input. Thus, it is imperative to understand the intricate mechanism of backdoor attacks. To address this challenge, in this work, we introduce three key requirements that a backdoor attack must meet. Moreover, we note that current backdoor attack algorithms, whether employing fixed or input-dependent triggers, exhibit a high binding with model parameters, rendering them easier to defend against. To tackle this issue, we propose the Key-Locks algorithm, which separates the backdoor attack process into embedding locks and employing a key for unlocking. This method enables the adjustment of unlocking levels to counteract diverse defense mechanisms. Extensive experiments are conducted to evaluate the effective of our proposed algorithm. Our code is available at:https://anonymous.4open.science/r/KeyLocks-FD85

2552Reinforcement Learning via Lazy-Agent for Environments with Random Delays

[openreview] [pdf]

Abstract Real-world reinforcement learning applications are often hampered by delayed feedback from environments, which violates the fundamental assumption of the Markovian property and introduces significant challenges. While numerous methods have been proposed for handling environments with constant delays, those with random delays remain largely unexplored owing to their inherent complexity and variability. In this study, we explored environments with random delays and demonstrated that these environments can be transformed into equivalent constant-delay environments by introducing a simple agent called the \textit{lazy-agent}. This approach overcame the challenges posed by the variability of random delays, enabling the application of conventional constant-delay approaches to random-delay environments. The empirical results reveal that our lazy-agents trained in random-delay environments performed almost comparably to the agents trained in constant-delay environments, significantly outperforming other baseline algorithms in terms of asymptotic performance and sample efficiency.

2553UnCLe: An Unlearning Framework for Continual Learning

[openreview] [pdf]

Abstract Recent advances in deep learning require models to exhibit continual learning capability, allowing them to learn new tasks and progressively accumulate knowledge without forgetting old tasks. Concurrently, there are growing concerns and regulatory requirements to meet privacy and safety by discarding some knowledge through machine unlearning. With the rapidly rising relevance of continual learning and machine unlearning, we consider them together under a unified framework in this paper. However, the conflicting nature of past data unavailability arising from continual learning makes it challenging to perform unlearning with existing methods which assume data availability. Moreover, in the proposed setup, where tasks are repeatedly learned and unlearned in a sequence, it is another challenge to maintain the stability of the tasks that need to be retained. To address these challenges, we propose UnCLe, an Unlearning Framework for Continual Learning designed to learn tasks incrementally and unlearn tasks without access to past data. To perform data-free unlearning, UnCLe leverages hypernetworks in conjunction with an unlearning objective that seeks to selectively align task-specific parameters with noise. Our experiments on popular benchmarks demonstrate UnCLe’s consistent unlearning completeness and ability to preserve task stability over long sequences.

2554Efficient Differentiable Discovery of Causal Order

[openreview] [pdf]

Abstract In the algorithm Intersort, Chevalley et al. proposed a score-based method to discover the causal order of variables in a Directed Acyclic Graph (DAG) model, leveraging interventional data to outperform existing methods. However, as a score-based method over the permutahedron, Intersort is computationally expensice and non-differentiable, limiting its ability to be utilised in problems involving large-scale datasets, such as those in genomics and climate models, or to be integrated into end-to-end gradient-based learning frameworks. We address this limitation by reformulating Intersort using differentiable sorting and ranking techniques. Our approach enables scalable and differentiable optimization of causal orderings, allowing the continuous score function to be incorporated as a regularizer in downstream tasks. Empirical results demonstrate that causal discovery algorithms benefit significantly from regularizing on the causal order, underscoring the effectiveness of our method. Our work opens the door to efficiently incorporating regularization for causal order into the training of differentiable models and thereby addresses a long-standing limitation of purely associational supervised learning.

2555Improved Video VAE for Latent Video Diffusion Model

[openreview] [pdf]

Abstract Variational Autoencoder (VAE) aims to compress pixel data into low-dimensional latent space, playing an important role in OpenAI’s Sora and other latent video diffusion generation models. While most of existing video VAEs inflate a pretrained image VAE into the 3D causal structure for temporal-spatial compression, this paper presents two astonishing findings: (1) The initialization from a well-trained image VAE with the same latent dimensions suppresses the improvement of subsequent temporal compression capabilities. (2) The adoption of causal reasoning leads to unequal information interactions and unbalanced performance between frames. To alleviate these problems, we propose a keyframe-based temporal compression (KTC) architecture and a group causal convolution (GCConv) module to further improve video VAE (IV-VAE). Specifically, the KTC architecture divides the latent space into two branches, in which one half completely inherits the compression prior of keyframes from lower-dimension image VAEs while the other half involves temporal compression into the 3D group causal convolution, reducing temporal-spatial conflicts and accelerating the convergence speed of video VAE. The GCConv in above 3D half uses standard convolution within each frame group to ensure inter-frame equivalence, and employs causal logical padding between groups to maintain flexibility in processing variable frame video. Extensive experiments on five benchmarks demonstrate the SOTA video reconstruction and generation capabilities of the proposed IV-VAE. The source code and weights will be made available to the public.

2556Policy Gradient Optimization for Markov Decision Processes with Epistemic Uncertainty and General Loss Functions

[openreview] [pdf]

Abstract Motivated by many application problems, we consider Markov decision processes (MDPs) with a general loss function and unknown parameters. To mitigate the epistemic uncertainty associated with unknown parameters, we take a Bayesian approach to estimate the parameters from data and impose a coherent risk functional (with respect to the Bayesian posterior distribution) on the general loss function. Since this formulation usually does not satisfy the interchangeability principle, it does not admit Bellman equations and cannot be solved by approaches based on dynamic programming. Therefore, we develop a policy gradient optimization approach to address this problem. We utilize the dual representation of the coherent risk measure and extend the envelope theorem to derive the policy gradient. Our extension of the envelope theorem from the discrete case to the continuous case may be of independent interest. We then show the convergence of the proposed algorithm with a convergence rate of O(1t)\mathcal{O}(\frac{1}{t}), where tt is the number of policy gradient iterations. We further extend our algorithm to an episodic setting, and establish the consistency of the extended algorithm and provide bounds on the number of iterations needed to achieve a constant error bound in each episode.

2557Can Reinforcement Learning Solve Asymmetric Combinatorial-Continuous Zero-Sum Games?

[openreview] [pdf]

Abstract There have been extensive studies on learning in zero-sum games, focusing on the analysis of the existence and algorithmic convergence of Nash equilibrium (NE). Existing studies mainly focus on symmetric games where the strategy spaces of the players are of the same type and size. For the few studies that do consider asymmetric games, they are mostly restricted to matrix games. In this paper, we define and study a new practical class of asymmetric games called two-player Asymmetric Combinatorial-Continuous zEro-Sum (ACCES) games, featuring a combinatorial action space for one player and an infinite compact space for the other. Such ACCES games have broad implications in the real world, particularly in combinatorial optimization problems (COPs) where one player optimizes a solution in a combinatorial space, and the opponent plays against it in an infinite (continuous) compact space (e.g., a nature player deciding epistemic parameters of the environmental model). Our first key contribution is to prove the existence of NE for two-player ACCES games, using the idea of essentially finite game approximation. Building on the theoretical insights and double oracle (DO)-based solutions to complex zero-sum games, our second contribution is to design the novel algorithm, Combinatorial Continuous DO (CCDO), to solve ACCES games, and prove the convergence of the proposed algorithm. Considering the NP-hardness of most COPs and recent advancements in reinforcement learning (RL)-based solutions to COPs, our third contribution is to propose a practical algorithm to solve NE in the real world, CCDORL (based on CCDO) and provide the novel convergence analysis in the ACCES game. Experimental results across diverse instances of COPs demonstrate the empirical effectiveness of our algorithms.

2558Improving Data Efficiency via Curating LLM-Driven Rating Systems

[openreview] [pdf]

Abstract Instruction tuning is critical for adapting large language models (LLMs) to downstream tasks, and recent studies have demonstrated that small amounts of human-curated data can outperform larger datasets, challenging traditional data scaling laws. While LLM-based data quality rating systems offer a cost-effective alternative to human annotation, they often suffer from inaccuracies and biases, even in powerful models like GPT-4. In this work, we introduce DS2DS^2, aDiversity-awareScore curation method forDataSelection. By systematically modeling error patterns through a score transition matrix, DS2DS^2 corrects LLM-based scores and promotes diversity in the selected data samples. Our approach shows that a curated subset (just 3.3% of the original dataset) outperforms full-scale datasets (300k samples) across various machine-alignment benchmarks, and matches or surpasses human-aligned datasets such as LIMA with the same sample size (1k samples). These findings challenge conventional data scaling assumptions, highlighting that redundant, low-quality samples can degrade performance and reaffirming that ``more can be less’'.

2559Time Series Representation Models for Multivariate Time Series Forecasting and Imputation

[openreview] [pdf]

Abstract We introduce a multilayered representation learning architecture called Time Series Representation Model (TSRM) for multivariate time series forecasting and imputation. The architecture is structured around hierarchically ordered encoding layers, each dedicated to an independent representation learning task. Each encoding layer contains a representation layer designed to capture diverse temporal patterns and an aggregation layer responsible for combining the learned representations. The architecture is fundamentally based on a Transformer encoder-like configuration, with self-attention mechanisms at its core. The TSRM architecture outperforms state-of-the-art approaches on seven established benchmark datasets for both forecasting and imputation tasks while significantly reducing complexity in the form of learnable parameters. The source code is available athttps://anonymous.4open.science/r/TSRM-D7BE.

2560Finding Equilibria in Bilinear Zero-sum Games via a Convexity-based Approach

[openreview] [pdf]

Abstract We focus on the design of algorithms for finding equilibria in 2-player zero-sum games. Although it is well known that such problems can be solved by a single linear program, there has been a surge of interest in recent years, for simpler algorithms, motivated in part by applications in machine learning. Our work proposes such a method, inspired by the observation that the duality gap (a standard metric for evaluating convergence in general min-max optimization problems) is a convex function for the case of bilinear zero-sum games. To this end, we analyze a descent-based approach, variants of which have also been used as a subroutine in a series of algorithms for approximating Nash equilibria in general non-zero-sum games.In particular, we analyze a steepest descent approach, by finding the direction that minimises the directional derivative of the duality gap function and move towards that. Our main theoretical result is that the derived algorithms achieve a geometric decrease in the duality gap and improved complexity bounds until we reach an approximate equilibrium. Finally, we complement this with an experimental evaluation. Our findings reveal that for some classes of zero-sum games, the running time of our method is comparable with standard LP solvers, even with thousands of available strategies per player.

2561Lipschitz Bandits in Optimal Space

[openreview] [pdf]

Abstract This paper considers the Lipschitz bandit problem, where the set of arms is continuous and the expected reward is a Lipschitz function over the arm space. This problem has been extensively studied. Prior algorithms need to store the reward information of all visited arms, leading to significant memory consumption. We address this issue by introducing an algorithm named Log-space Lipschitz bandits (Log-Li), which achieves an optimal (up to logarithmic factors) regret of O~(Tdz+1dz+2)\widetilde{O}\left(T^{\frac{d_z+1}{d_z+2}}\right) while only uses O(logT)O\left(\log T\right) bits of memory. Additionally, we provide a complexity analysis for this problem, demonstrating that Ω(logT)\Omega\left(\log T\right) bits of space are necessary for any algorithm to achieve the optimal regret. We also conduct numerical simulations, and the results show that our new algorithm achieves regret comparable to the state-of-the-art while reducing memory usage by orders of magnitude.

2562Skill-based Safe Reinforcement Learning with Risk Planning

[openreview] [pdf]

Abstract Safe Reinforcement Learning (Safe RL) aims to ensure safety when an RL agent conducts learning by interacting with real-world environments where improper actions can induce high costs or lead to severe consequences. In this paper, we propose a novel Safe Skill Planning (SSkP) approach to enhance effective safe RL by exploiting auxiliary offline demonstration data. SSkP involves a two-stage process. First, we employ PU learning to learn a skill risk predictor from the offline demonstration data. Then, based on the learned skill risk predictor, we develop a novel risk planning process to enhance online safe RL and learn a risk-averse safe policy efficiently through interactions with the online RL environment, while simultaneously adapting the skill risk predictor to the environment. We conduct experiments in several benchmark robotic simulation environments. The experimental results demonstrate that the proposed approach consistently outperforms previous state-of-the-art safe RL methods.

2563Causal Reinforcement Learning for Spatio-Temporal Point Processes

[openreview] [pdf]

Abstract Spatio-temporal event sequences are increasingly accessible in various domains such as earthquake forecasting, crime prediction, and healthcare management. These data sources present unique challenges, as they involve both spatial and temporal dimensions, with event sequences exhibiting intricate dependencies over time and space. Neural network-based spatio-temporal point processes offer a sophisticated framework for modeling such event data. Conventional maximum likelihood estimation (MLE) of such data may lead to inaccurate predictions due to model-misspecification and compounding prediction errors. On the other hand, reinforcement learning frameworks, which treat event generation as actions and learn a policy to mimic event generation may alleviate the training/test discrepancy issue. Current reinforcement learning of point processes may have prohibitively poor exploration efficiency. In this paper, we propose the Causal learning improved Reinforcement Learning Spatio-Temporal Point Process (CRLSTPP) framework, which can mitigate the issue of compounding prediction errors and improve exploration efficiency at the same time. Experiments on both synthetic data and real-world data validate the superiority of the proposed model.

2564Relation Augmented Preferential Bayesian Optimization via Preference Propagation

[openreview] [pdf]

Abstract In black-box optimization, when directly evaluating the function values of solutions is very costly or infeasible, access to the objective function is often limited to comparing pairs of solutions, which yields dueling black-box optimization. Dueling optimization is solely based on pairwise preferences, and thus notably reduces cost compared with function value based methods such as Bayesian optimization. However, an optimization performance gap obviously exists between dueling based and function value based methods. This is mainly due to that most existing dueling optimization methods do not make full use of the pairwise preferences collected. To fill this gap, this paper proposes relation augmented preferential Bayesian optimization (RAPBO) via preference propagation. By considering solution similarity, RAPBO aims to uncover the potential preferential relations between solutions within different preferences through the proposed preferential relation propagation technique. Specifically, RAPBO first clusters solutions using a Gaussian mixture model. After obtaining the solution set with the highest intra-cluster similarity, RAPBO utilizes a directed hypergraph to model the potential relations between solutions, thereby realizing relation augmentation. Extensive experiments are conducted on both synthetic functions and real-world tasks such as motion control and spacecraft trajectory optimization. The experimental results disclose the satisfactory accuracy of augmented preferences in RAPBO, and show the superiority of RAPBO compared with existing dueling optimization methods. Notably, it is verified that, under the same evaluation cost budget, RAPBO is competitive with or even surpass the function value based Bayesian optimization methods with respect to optimization performance. The codes can be found inhttps://anonymous.4open.science/r/RAPBO-E15F.

2565Offline Equilibrium Finding in Extensive-form Games: Datasets, Methods, and Analysis

[openreview] [pdf]

Abstract Offline reinforcement learning (Offline RL) brings new methods to tackle real-world decision-making problems by leveraging pre-collected datasets. Despite substantial progress in single-agent scenarios, the application of offline learning to multiplayer games remains largely unexplored. Therefore, we introduce a novel paradigmoffline equilibrium finding(Offline EF) in extensive-form games (EFGs), which aims at computing equilibrium strategies from offline datasets. The primary challenges of offline EF include i) the absence of a comprehensive dataset of EFGs for evaluation; ii) the inherent difficulties in computing an equilibrium strategy solely from an offline dataset, as equilibrium finding requires referencing all potential action profiles; and iii) the impact of dataset quality and completeness on the effectiveness of the derived strategies. To overcome these challenges, we make four main contributions in this work. First, we construct diverse datasets, encompassing a wide range of games, which form the foundation for the offline EF paradigm and serve as a basis for evaluating the performance of offline EF algorithms. Second, we design a novel framework, BOMB, which integrates the behavior cloning technique within a model-based method. BOMB can seamlessly integrate online equilibrium finding algorithms to the offline setting with minimal modifications. Third, we provide a comprehensive theoretical and empirical analysis of our BOMB framework, offering performance guarantees across various offline datasets. Finaly, extensive experiments have been carried out across different games under different offline datasets, and the results not only demonstrate the superiority of our approach compared to traditional offline RL algorithms but also highlight the remarkable efficiency in computing equilibrium strategies offline.

2566Time-MoE: Billion-Scale Time Series Foundation Models with Mixture of Experts

[openreview] [pdf]

Abstract Deep learning for time series forecasting has seen significant advancements over the past decades. However, despite the success of large-scale pre-training in language and vision domains, pre-trained time series models remain limited in scale and operate at a high cost, hindering the development of larger capable forecasting models in real-world applications. In response, we introduce Time-MoE, a scalable and unified architecture designed to pre-train larger, more capable forecasting foundation models while reducing inference costs. By leveraging a sparse mixture-of-experts (MoE) design, Time-MoE enhances computational efficiency by activating only a subset of networks for each prediction, reducing computational load while maintaining high model capacity. This allows Time-MoE to scale effectively without a corresponding increase in inference costs. Time-MoE comprises a family of decoder-only transformer models that operate in an auto-regressive manner and support flexible forecasting horizons with varying input context lengths. We pre-trained these models on our newly introduced large-scale data Time-300B, which spans over 9 domains and encompassing over 300 billion time points. For the first time, we scaled a time series foundation model up to 2.4 billion parameters, achieving significantly improved forecasting precision. Our results validate the applicability of scaling laws for training tokens and model size in the context of time series forecasting. Compared to dense models with the same number of activated parameters or equivalent computation budgets, our models consistently outperform them by large margin. These advancements position Time-MoE as a state-of-the-art solution for tackling real-world time series forecasting challenges with superior capability, efficiency, and flexibility.

2567Regularized Maximum Mean Discrepancy for Variable Selection

[openreview] [pdf]

Abstract In this paper, we propose a variable selection method based on maximum mean discrepancy (MMD) to effectively identify important variables that contribute to distributional differences between two samples. We begin by assigning weights to each variable and then optimizing these weights within a regularized MMD framework. The optimized weights serve as an importance measure for each variable and can be leveraged for variable selection. Additionally, using the optimized weights, we design two algorithms aimed at enhancing test power and improving classification accuracy for two-sample tests and classification problems. Our method is model-free and makes no assumptions about the underlying structure of the data. Moreover, we propose an acceleration method to improve computational efficiency. We also provide theoretical guarantees, including the consistency of the estimated weights and the convergence of our acceleration algorithms. Through numerical simulations and real-world datasets, we validate the effectiveness of the proposed method.

2568Label Informativeness-based Minority Oversampling in Graphs (LIMO)

[openreview] [pdf]

Abstract Class imbalance is a pervasive issue in many real-world datasets, particularly in graph-structured data, where certain classes are significantly underrepresented. This imbalance can severely impact the performance of Graph Neural Networks (GNNs), leading to biased learning or over-fitting. The existing oversampling techniques often overlook the intrinsic properties of graphs, such as Label Informativeness (LI), which measures the amount of information a neighbor’s label provides about a node’s label. To address this, we propose Label Informativeness-based Minority Oversampling (LIMO), a novel algorithm that strategically oversamples minority class nodes by augmenting edges to maximize LI. This technique generates a balanced, synthetic graph that enhances GNN performance without significantly increasing data volume. Our theoretical analysis shows that the effectiveness of GNNs is directly proportional to label informativeness, with mutual information as a mediator. Additionally, we provide insights into how variations in the number of inter-class edges influence the LI by analyzing its derivative. Experimental results on various homophilous and heterophilous benchmark datasets demonstrate the effectiveness of LIMO in improving the performance on node classification for different imbalance ratios, with particularly significant improvements observed in heterophilous graph datasets. Our code is available at \url{https://anonymous.4open.science/r/limo-1A36/}

2569Machine Unlearning Fails to Remove Data Poisoning Attacks

[openreview] [pdf]

Abstract We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of training on poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of evaluation settings (e.g., alleviating membership inference attacks), they fail to remove the effects of data poisoning, across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned datapoints without having to retrain, our work suggests that these methods are not yet “ready for prime time”, and currently provide limited benefit over retraining.

2570What Can We Learn from State Space Models for Machine Learning on Graphs?

[openreview] [pdf]

Abstract Machine learning on graphs has recently found extensive applications across domains. However, the commonly used Message Passing Neural Networks (MPNNs) suffer from limited expressive power and struggle to capture long-range dependencies. Graph transformers offer a strong alternative due to their global attention mechanism, but they come with great computational overheads, especially for large graphs. In recent years, State Space Models (SSMs) have emerged as a compelling approach to replace full attention in transformers to model sequential data. It blends the strengths of RNNs and CNNs, offering a) efficient computation, b) the ability to capture long-range dependencies, and c) good generalization across sequences of various lengths. However, extending SSMs to graph-structured data presents unique challenges due to the lack of canonical node ordering in graphs. In this work, we propose Graph State Space Convolution (GSSC) as a principled extension of SSMs to graph-structured data. By leveraging global permutation-equivariant set aggregation and factorizable graph kernels that rely on relative node distances as the convolution kernels, GSSC preserves all three advantages of SSMs. We demonstrate the provably stronger expressiveness of GSSC than MPNNs in counting graph substructures and show its effectiveness across 11 real-world, widely used benchmark datasets. GSSC achieves the best results on 6 out of 11 datasets with all significant improvements compared to the state-of-the-art baselines and second-best results on the other 5 datasets. Our findings highlight the potential of GSSC as a powerful and scalable model for graph machine learning. Anonymous code is available athttps://anonymous.4open.science/r/GSSC-5ED8.

2571Anomaly Detection in Dynamic Graphs via Adversarial Autoencoder

[openreview] [pdf]

Abstract Anomaly detection in dynamic graphs is a very important task that has attracted a lot of attention. Many dynamic graph anomaly detection methods are already available, but most of these efforts are based on supervised learning. In the real world, however, it is often difficult to collect large amounts of labelled anomaly data, which is not conducive to the training of these supervised methods and severely reduces their ability to be applied in different dynamic graph anomaly detection scenarios. A novel semi-supervised anomaly detection framework \textbf{AAEDY} for the detection of anomalous edges in dynamic graphs is presented in this paper, which improves reconstruction by combining adversarial based on autoencoder, and discriminates whether an edge is anomalous by comparing the original edge to the reconstructed edge in low-dimensional space. Extensive experiments have been carried out on six real-world datasets, and the experimental results show that \textbf{AAEDY} can outperform the state-of-the-art competitors in anomaly detection significantly.

2572Towards Understanding the Feasibility of Machine Unlearning

[openreview] [pdf]

Abstract In response to recent privacy protection regulations, machine unlearning has attracted great interest in the research community. However, existing studies often demonstrate their approaches’ effectiveness by measuring the overall unlearning success rate rather than evaluating the chance of unlearning specific training samples, leaving the universal feasibility of the unlearning operation unexplored. This paper proposes a novel method to quantify the difficulty of unlearning a single sample by taking into account factors such as model and data distribution. Specifically, we propose several heuristics to understand the condition of a successful unlearning operation on data points, explore difference in unlearning difficulty over training data points, and suggest a potential ranking mechanism for identifying the most challenging samples to unlearn. In particular, we note Kernelized Stein Discrepancy (KSD), a parameterized kernel function tailored to each model and dataset, is an effective heuristic to tell the difficulty of unlearning a data sample. We demonstrate our discovery by including multiple classification tasks and existing machine unlearning algorithms, highlighting the practical feasibility of unlearning operations across different scenarios.

2573Selective induction Heads: How Transformers Select Causal Structures in Context

[openreview] [pdf]

Abstract Transformers have exhibited exceptional capabilities in sequence modeling tasks, leveraging self-attention and in-context learning. Critical to this success are induction heads, attention circuits that enable copying tokens based on their previous occurrences. In this work, we introduce a novel framework that showcases transformers’ ability to dynamically handle causal structures.Existing works rely on Markov Chains to study the formation of induction heads, revealing how transformers capture causal dependencies and learn transition probabilities in-context. However, they rely on a fixed causal structure that fails to capture the complexity of natural languages, where the relationship between tokens dynamically changes with context. To this end, our framework varies the causal structure through interleaved Markov chains with different lags while keeping the transition probabilities fixed. This setting unveils the formation of Selective Induction Heads, a new circuit that endows transformers with the ability to select the correct causal structure in-context. We empirically demonstrate that transformers learn this mechanism to predict the next token by identifying the correct lag and copying the corresponding token from the past. We provide a detailed construction of a 3-layer transformer to implement the selective induction head, and a theoretical analysis proving that this mechanism asymptotically converges to the maximum likelihood solution. Our findings advance the understanding of how transformers select causal structures, providing new insights into their functioning and interpretability.

2574Minimax Based Fast-training Defense against Adversarial Policy in Two-player Competitive Games

[openreview] [pdf]

Abstract Adversarial policies have been shown to exploit vulnerabilities in agents during two-player competitive games, significantly undermining their performance. While existing approaches model the challenge of training robust policies in such environments as the search for Nash equilibrium points in the policy space, this often leads to substantial computational overhead. In this work, we propose MM-FATROL, a novel robust policy training method grounded in the Minimax Theorem, which significantly reduces computational overhead by efficiently identifying promising policy updates. We provide a formal analysis of the speedup achieved by our method. Extensive experiments demonstrate that MM-FATROL not only enhances efficiency but also surpasses the state-of-the-art method in terms of generalization and robustness. Additionally, we discuss the limitations of our approach and the challenges that remain in developing robust policies for more complex game environments.

2575Uncertainty Herding: One Active Learning Method for All Label Budgets

[openreview] [pdf]

Abstract Most active learning research has focused on methods which perform well when many labels are available, but can be dramatically worse than random selection when label budgets are small. Other methods have focused on the low-budget regime, but do poorly as label budgets increase. As the line between “low” and “high” budgets varies by problem, this is a serious issue in practice. We proposeuncertainty coverage, an objective which generalizes a variety of low- and high-budget objectives, as well as natural, hyperparameter-light methods to smoothly interpolate between low- and high-budget regimes. We call greedy optimization of the estimate Uncertainty Herding; this simple method is computationally fast, and we prove that it nearly optimizes the distribution-level coverage. In experimental validation across a variety of active learning tasks, our proposal matches or beats state-of-the-art performance in essentially all cases; it is the only method of which we are aware that reliably works well in both low- and high-budget settings.

2576Latent Variable Identifiability in Nonlinear Causal Models with Single-domain Data under Minimality Condition

[openreview] [pdf]

Abstract The identifiability of latent variables given observational data is one of the core issues in the field of disentangled representation learning. Recent progresses have been made on establishing identifiablity theories for latent causal models. However with much restrictions or unrealistic assumptions, their practicality on real applications are limited. In this paper, we propose a novel identifiablity theory for learning latent variables in nonlinear causal models, requiring only single-domain data. We prove that all latent variables in a powerset bipartite graph can be identified up to an invertible transformation, if the generation process of observable data is globally invertible, latent variables are independent, and shared latent variables entail minimal information. Experiments on synthetic data support the conclusions of our theory.

2577Loss2Net: Loss Meta-Learning for Regression with A-priori Unknown Metrics

[openreview] [pdf]

Abstract There exist many practical applications where regression tasks must cope with a generally overseen problem: the output variable to be computed, which is often a decision variable, impacts the performance metric to optimize in a manner that is not known a priori. This challenge translates into a loss-metric mismatch, which makes standard loss functions such as Mean Square Error (MSE) not suitable because they significantly hinder the final performance. While this problem is of crucial importance in, e.g., many engineering and economic applications, the literature in meta-learning of loss functions has focused on other problems, such as classification or few-shot learning tasks. In this work, we aim at closing this research gap by proposing a model that can handle common situations in real systems where the unknown prediction-metric relationship is time-correlated, non-differentiable, or depends on multiple intertwined predictions. We present a novel loss meta-learning architecture for regression, named Loss2Net, which is able to (i) jointly learn the actual regressor and the loss function that it should minimize, directly from system responses; (ii) it does so without any assumption on the loss function structure; (iii) it provides a manner to learn non-differentiable and multi-dimensional loss functions from entangled performance metrics. Detailed experiments for power grid and telecommunications infrastructure optimization, grounded on real-world measurement data, demonstrate how Loss2Net can effectively learn unidentified loss functions.

2578Unearthing Skill-level Insights for Understanding Trade-offs of Foundation Models

[openreview] [pdf]

Abstract With models getting stronger, evaluations have grown more complex, testing multiple skills in one benchmark and even in the same instance at once. However, skill-wise performance is obscured when inspecting aggregate accuracy, under-utilizing the rich signal modern benchmarks contain. We propose an automatic approach to recover the underlying skills relevant for any evaluation instance, by way of inspecting model-generated {\em rationales}. After validating the relevance of rationale-parsed skills and inferring skills for 46k instances over 12 benchmarks, we observe many skills to be common across benchmarks, resulting in the curation of hundreds of \emph{skill-slices} (i.e. sets of instances testing a common skill). Inspecting accuracy over these slices yields novel insights on model trade-offs: e.g., compared to GPT-4o and Claude 3.5 Sonnet, on average, Gemini 1.5 Pro is 1818% more accurate in \emph{computing molar mass}, but 1919% less accurate in \emph{applying constitutional law}, despite the overall accuracies of the three models differing by a mere 0.40.4%. Furthermore, we demonstrate the practical utility of our approach by showing that insights derived from skill slice analysis can generalize to held-out instances: when routing each instance to the model strongest on the relevant skills, we see a 33% accuracy improvement over our 12 dataset corpus. Our skill-slices and framework open a new avenue in model evaluation, leveraging skill-specific analyses to unlock a more granular and actionable understanding of model capabilities.

2579Functional Gradients and Generalizations for Transformer In-Context Learning

[openreview] [pdf]

Abstract We examine Transformer-based in-context learning for contextual data of the form (xi,yi)(x_i,y_i) for i=1,,Ni=1,\ldots,N, and query xN+1x_{N+1}, where xiRdx_i\in\Bbb{R}^d and yip(Yf(xi))y_i\sim p(Y|f(x_i)), with f(x)f(x) a latent function. This is analyzed from the perspective offunctionalgradient descent for latent f(x)f(x). We initially perform this analysis from the perspective of a reproducing kernel Hilbert space (RKHS), from which an alternative kernel-averaging perspective is manifested. This leads to a generalization, allowing an interpretation of softmax attention from the perspective of the Nadaraya-Watson kernel-weighted average. We show that a single attention layer may be designed to exactly implement a functional-gradient step in this setting (for RKHS latent functions), extending prior work for the special case of real-valued YY and Gaussian p(Yf(x))p(Y|f(x)). This is also generalized for softmax attention and non-RKHS underlying f(x)f(x). Though our results hold in a general setting, we focus on categorical YY with p(Yf(x))p(Y|f(x)) modeled as a generalized linear model (corresponding specifically to softmax probability). Multi-layered extensions are developed for this case, and through extensive experimentation we demonstrate that for categorical YY a single-layer model is often highly effective for such in-context learning. We also demonstrate these ideas for real-world data, considering in-context classification of ImageNet data, showing the broad applicability of our theory beyond the commonly-studied settings of synthetic regression data.

2580Reinforcement Learning from Imperfect Corrective Actions and Proxy Rewards

[openreview] [pdf]

Abstract In practice, reinforcement learning (RL) agents are often trained with a possibly imperfect proxy reward function, which may lead to a human-agent alignment issue (i.e., the learned policy either converges to non-optimal performance with low cumulative rewards, or achieves high cumulative rewards but in an undesired manner). To tackle this issue, we consider a framework where a human labeler can provide additional feedback in the form of corrective actions, which expresses the labeler’s action preferences although this feedback may possibly be imperfect as well. In this setting, to obtain a better-aligned policy guided by both learning signals, we propose a novel value-based deep RL algorithm calledIterative learning fromCorrective actions andProxy rewards (ICoPro), which cycles through three phases: (1) Solicit sparse corrective actions from a human labeler on the agent’s demonstrated trajectories; (2) Incorporate these corrective actions into the Q-function using a margin loss to enforce adherence to labeler’s preferences; (3) Train the agent with standard RL losses regularized with a margin loss to learn from proxy rewards and propagate the Q-values learned from human feedback. Moreover, another novel design in our approach is to integrate pseudo-labels from the target Q-network to reduce human labor and further stabilize training. We experimentally validate our proposition on a variety of tasks (Atari games and autonomous driving on highway). On the one hand, using proxy rewards with different levels of imperfection, our method can better align with human and is more sample-efficient than baseline methods. On the other hand, facing corrective actions with different types of imperfection, our method can overcome the non-optimality of this feedback thanks to the guidance from proxy rewards.

2581dEBORA: Efficient Bilevel Optimization-based low-Rank Adaptation

[openreview] [pdf]

Abstract Low-rank adaptation methods are a popular approach for parameter-efficient fine-tuning of large-scale neural networks. However, selecting the optimal rank for each layer remains a challenging problem that significantly affects both performance and efficiency. In this paper, we introduce a novel bilevel optimization strategy that simultaneously trains both matrix and tensor low-rank adapters, dynamically selecting the optimal rank for each layer. Our method avoids the use of implicit differentiation in the computation of the hypergradient, and integrates a stochastic away-step variant of the Frank-Wolfe algorithm, eliminating the need for projection and providing identifiability guarantees of the optimal rank structure. This results in a highly efficient and cost-effective training scheme that adaptively allocates the parameter budget across the network layers. On top of a detailed theoretical analysis of the method, we provide different numerical experiments showcasing its effectiveness.

2582LLM-as-a-Judge & Reward Model: What They Can and Cannot Do

[openreview] [pdf]

Abstract LLM-as-a-Judge and reward models are widely used alternatives of multiple-choice questions or human annotators for large language model (LLM) evaluation. Their efficacy shines in evaluating long-form responses, serving a critical role as evaluators of leaderboards and as proxies to align LLMs via reinforcement learning. However, despite their popularity, their effectiveness in diverse contexts, such as non-English prompts, factual verification, or challenging questions, remains unexplored. In this paper, we conduct a comprehensive analysis of automated evaluators, reporting several key findings on their behavior. First, we discover that English evaluation capabilities significantly influence language-specific evaluation capabilities, often more than the language proficiency itself, enabling evaluators trained in English to easily transfer their skills to other languages. Second, we identify critical shortcomings, where LLMs fail to detect and penalize errors, such as factual inaccuracies, cultural misrepresentations, and the presence of unwanted language. Finally, we find that state-of-the-art evaluators struggle with challenging prompts, in either English or Korean, underscoring their limitations in assessing or generating complex reasoning questions. We release the dataset and codes used.

2583Provable Learning for DEC-POMDPs: Factored Models and Memoryless Agents

[openreview] [pdf]

Abstract This paper studies cooperative Multi-Agent Reinforcement Learning (MARL) under the mathematical model of Decentralized Partially Observable Markov Decision Process (DEC-POMDP). Despite the empirical success of cooperative MARL, its theoretical foundation, particularly in the realm of provable learning of DEC-POMDPs, remains limited. In this paper, we first present a hardness result in theory demonstrating that, without additional structural assumptions, learning DEC-POMDPs requires several samples that grows exponentially with the number of agents in the worst case, which is also known as the curse of multiagency. This motivates us to explore important subclasses of DEC-POMDPs for which efficient solutions can be found. Specifically, we propose new algorithms and establish sample-efficiency guarantees that break the curse of multiagency, for finding both local and global optima in two important scenarios: (1) when agents employ memoryless policies, selecting actions based solely on their current observations; and (2) when a factored structure is present, which enables key properties similar to value decomposition in VDN or Qmix.

2584ADePT: Adaptive Decomposed Prompt Tuning for Parameter-Efficient Fine-tuning

[openreview] [pdf]

Abstract Prompt Tuning (PT) enables the adaptation of Pre-trained Large Language Models (PLMs) to downstream tasks by optimizing a small amount of soft virtual tokens, which are prepended to the input token embeddings. Recently, Decomposed Prompt Tuning (DePT) has demonstrated superior adaptation capabilities by decomposing the soft prompt into a shorter soft prompt and a pair of low-rank matrices. The product of the pair of low-rank matrices is added to the input token embeddings to offset them. Additionally, DePT achieves faster inference compared to PT due to the shorter soft prompt. However, in this paper, we find that the fixed token embedding offsets of DePT restricts its ability to generalize across diverse model inputs, and that the shared embedding offsets across many token embeddings result in sub-optimization. To tackle these issues, we introduce \textbf{A}daptive \textbf{De}composed \textbf{P}rompt \textbf{T}uning (ADePT), which is composed of a short soft prompt and a shallow token-shared feed-forward neural network. ADePT utilizes the token-shared feed-forward neural network to learn the embedding offsets for each token, enabling adaptive embedding offsets that varies according to the model input and a better optimization of token embedding offsets. This enables ADePT achieve superior adaptation performance without requiring more inference time or additional trainable parameters compared to vanilla PT and its variants. In comprehensive experiments across 22 natural language processing (NLP) tasks and two typical PLMs of different scales, we show that ADePT consistently surpasses the leading parameter-efficient fine-tuning (PEFT) methods, and even outperforms the full fine-tuning baseline in certain scenarios. The code can be available in the supplementary materials.

2585Towards Hierarchical Rectified Flow

[openreview] [pdf]

Abstract We formulate a hierarchical rectified flow to model data distributions. It hierarchically couples multiple ordinary differential equations (ODEs) and defines a time-differentiable stochastic process that generates a data distribution from a known source distribution. Each ODE resembles the ODE that is solved in a classic rectified flow, but differs in its domain, i.e., location, velocity, acceleration, etc. Unlike the classic rectified flow formulation, which formulates a single ODE in the location domain and only captures the expected velocity field (sufficient to capture a multi-modal data distribution), the hierarchical rectified flow formulation models the multi-modal random velocity field, acceleration field, etc., in their entirety. This more faithful modeling of the random velocity field enables integration paths to intersect when the underlying ODE is solved during data generation. Intersecting paths in turn lead to integration trajectories that are more straight than those obtained in the classic rectified flow formulation, where integration paths cannot intersect. This leads to modeling of data distributions with fewer neural function evaluations. We empirically verify this on synthetic 1D and 2D data as well as MNIST and CIFAR10 data. We will release our code.

2586Narcissus: Leveraging Early Training Dynamics for Unsupervised Anomaly Detection

[openreview] [pdf]

Abstract Anomaly detection is a critical learning task with many significant and diverse applications. Currently, semi-supervised methods provide the state-of-the-art accuracy performance but require labeled normal data for training. Unsupervised approaches, on the other hand, do not have this requirement but can only offer inferior anomaly detection performance. In this paper, we introduce NARCISSUS, a novel unsupervised anomaly detection method that achieves accuracy comparable to semi-supervised approaches. Our key insight is that a learning model when training with a mix of normal and sparse anomalous data converges first on normal data. Leveraging this insight, NARCISSUS employs a tailored early stopping scheme, eliminating the need for pseudo labels and costly label generation interactions. It also offers systematic solutions to minimize the influence of model uncertainty, ensuring robust detection. NARCISSUS is model-agnostic and can therefore make use of even a semi-supervised anomaly detection model underneath, thereby turning it into an unsupervised one. Comprehensive evaluations using time series, image and graph datasets show that NARCISSUS provides similar or better detection performance compared to best-performing semi-supervised methods while not requiring labeled data.

2587Path-Tracing Distillation: Enhancing Stability in Text-to-3D Generation by Mitigating Out-of-Distribution Issues

[openreview] [pdf]

Abstract Text-to-3D generation techniques signify a pivotal advancement in creating 3D models from textual descriptions. Contemporary state-of-the-art methods utilize score distillation processes, leveraging 2D priors to generate 3D assets. However, these approaches frequently encounter instability during the initial generation phases, primarily due to an distribution divergence between the pretrained score prediction network and the nascent 3D model. Specifically, raw rendered images of an initial 3D model lie outside the distribution (OOD) of the pretrained score prediction network, which is trained on high-fidelity realistic images. To address this OOD issue, we introduce an innovative Path-Tracing Distillation (PTD) technique that refines the distillation process. Our method sequentially optimizes the 3D model using intermediate score networks that exhibit closer distributional alignment, thereby accelerating the convergence during the early stages of training. This approach not only ensures a more stable increase in CLIP similarity initially but also preserves the visual quality and diversity of the generated models. Experiments demonstrate that PTD significantly enhances both the stability and quality of text-to-3D generation, outperforming existing baselines. PTD can also be generalized to other score distillation methods.

2588HiBO: Hierarchical Bayesian Optimization via Adaptive Search Space Partitioning

[openreview] [pdf]

Abstract Optimizing black-box functions in high-dimensional search spaces has been known to be challenging for traditional Bayesian Optimization (BO). In this paper, we introduce HiBO, a novel hierarchical algorithm integrating global-level search space partitioning information into the acquisition strategy of a local BO-based optimizer. HiBO employs a search-tree-based global-level navigator to adaptively split the search space into partitions with different sampling potential. The local optimizer then utilizes this global-level information to guide its acquisition strategy towards the most promising regions within the search space. A comprehensive set of evaluations demonstrates that HiBO outperforms state-of-the-art methods in high-dimensional synthetic benchmarks and presents significant practical effectiveness in the real-world task of tuning configurations of database management systems (DBMSs).

2589Tackling Decision Processes with Non-Cumulative Objectives using Reinforcement Learning

[openreview] [pdf]

Abstract Markov decision processes (MDPs) are used to model a wide variety of applications ranging from game playing over robotics to finance. Their optimal policy typically maximizes the expected sum of rewards given at each step of the decision process. However, a large class of problems does not fit straightforwardly into this framework: Non-cumulative Markov decision processes (NCMDPs), where instead of the expected sum of rewards, the expected value of an arbitrary function of the rewards is maximized. Example functions include the maximum of the rewards or their mean divided by their standard deviation. In this work, we introduce a general mapping of NCMDPs to standard MDPs. This allows all techniques developed to find optimal policies for MDPs, such as reinforcement learning or dynamic programming, to be directly applied to the larger class of NCMDPs. Focusing on reinforcement learning, we show applications in a diverse set of tasks, including classical control, portfolio optimization in finance, and discrete optimization problems. Given our approach, we can improve both final performance and training time compared to relying on standard MDPs.

2590The Fair Language Model Paradox

[openreview] [pdf]

Abstract Large Language Models (LLMs) are widely deployed in real-world applications, yet little is known about their training dynamics at the token level. Evaluation typically relies on aggregated training loss, measured at the batch level, which overlooks subtle per-token biases arising from (i) varying token-level dynamics and (ii) structural biases introduced by hyperparameters. While weight decay is commonly used to stabilize training, we reveal that it silently introduces performance biases detectable only at the token level. In fact, we empirically show across different dataset sizes, model architectures and sizes ranging from 270M to 3B parameters that as weight decay increases, low-frequency tokens are disproportionately depreciated. This is particularly concerning, as these neglected low-frequency tokens represent the vast majority of the token distribution in most languages, calling for novel regularization techniques that ensure fairness across all available tokens.

2591Manifold Constraint Reduces Exposure Bias in Accelerated Diffusion Sampling

[openreview] [pdf]

Abstract Diffusion models have demonstrated significant potential for generating high-quality images, audio, and videos. However, their iterative inference process entails substantial computational costs, limiting practical applications. Recently, researchers have introduced accelerated sampling methods that enable diffusion models to generate samples with far fewer timesteps than those used during training. Nonetheless, as the number of sampling steps decreases, the prediction errors significantly degrade the quality of generated outputs. Additionally, the inherent exposure bias in diffusion models causes errors to propagate and amplify, further introducing non-negligible inaccuracies in inference. To address these challenges, we leverage a manifold hypothesis to explore the exposure bias problem in depth. Based on this geometric perspective, we propose a manifold constraint that effectively reduces exposure bias during accelerated sampling of diffusion models. Notably, our method involves no additional training and requires only minimal hyperparameter tuning. Extensive experiments on high-resolution datasets demonstrate the effectiveness of our approach, achieving a FID score of 15.60 with 10-step SDXL on MS-COCO, surpassing the baseline by a reduction of 2.57 in FID.

2592On the Adversarial Risk of Test Time Adaptation: An Investigation into Realistic Test-Time Data Poisoning

[openreview] [pdf]

Abstract Test-time adaptation (TTA) updates the model weights during the inference stage using testing data to enhance generalization. However, this practice exposes TTA to adversarial risks. Existing studies have shown that when TTA is updated with crafted adversarial test samples, also known as test-time poisoned data, the performance on benign samples can deteriorate. Nonetheless, the perceived adversarial risk may be overstated if the poisoned data is generated under overly strong assumptions. In this work, we first review realistic assumptions for test-time data poisoning, including white-box versus grey-box attacks, access to benign data, attack budget, and more. We then propose an effective and realistic attack method that better produces poisoned samples without access to benign samples, and derive an effective in-distribution attack objective. We also design two TTA-aware attack objectives. Our benchmarks of existing attack methods reveal that the TTA methods are more robust than previously believed. In addition, we analyze effective defense strategies to help develop adversarially robust TTA methods.

2593TAGExplainer: Narrating Graph Explanations for Text-Attributed Graph Learning Models

[openreview] [pdf]

Abstract Representation learning of Text-Attributed Graphs (TAGs) has garnered significant attention due to its applications in various domains, including recommendation systems and social networks. Despite advancements in TAG learning methodologies, challenges remain in explainability due to the black-box nature of existing TAG representation learning models. This paper presents TAGExplainer, the first method designed to generate natural language explanations for TAG learning. TAGExplainer employs a generative language model that maps input-output pairs to explanations reflecting the model’s decision-making process. To address the lack of annotated ground truth explanations in real-world scenarios, we propose first generating pseudo-labels that capture the model’s decisions from saliency-based explanations, then the pseudo-label generator is iteratively trained based on three training objectives focusing on faithfulness and brevity via Expert Iteration, to improve the quality of generated pseudo-labels. The high-quality pseudo-labels are finally utilized to train an end-to-end explanation generator model. Extensive experiments are conducted to demonstrate the effectiveness of TAGExplainer in producing faithful and concise natural language explanations.

2594BlockFound: Customized blockchain foundation model for anomaly detection

[openreview] [pdf]

Abstract We propose BlockFound, a customized foundation model for anomaly blockchain transaction detection. Unlike existing methods that rely on rule-based systems or directly apply off-the-shelf large language models, BlockFound introduces a series of customized designs to model the unique data structure of blockchain transactions. First, a blockchain transaction is multi-modal, containing blockchain-specific tokens, texts, and numbers. We design a modularized tokenizer to handle these multi-modal inputs, balancing the information across different modalities. Second, we design a customized mask language learning mechanism for pretraining with RoPE embedding and FlashAttention for handling longer sequences. After training the foundation model, we further design a novel detection method for anomaly detection. Extensive evaluations on Ethereum and Solana transactions demonstrate BlockFound’s exceptional capability in anomaly detection while maintaining a low false positive rate. Remarkably, BlockFound is the only method that successfully detects anomalous transactions on Solana with high accuracy, whereas all other approaches achieved very low or zero detection recall scores. This work not only provides new foundation models for blockchain but also sets a new benchmark for applying LLMs in blockchain data.

2595Recurrent Action Transformer with Memory

[openreview] [pdf]

Abstract Recently, the use of transformers in offline reinforcement learning has become a rapidly developing area. This is due to their ability to treat the agent’s trajectory in the environment as a sequence, thereby reducing the policy learning problem to sequence modeling. In environments where the agent’s decisions depend on past events (POMDPs), it is essential to capture both the event itself and the decision point in the context of the model. However, the quadratic complexity of the attention mechanism limits the potential for context expansion. One solution to this problem is to extend transformers with memory mechanisms. This paper proposes a Recurrent Action Transformer with Memory (RATE), a novel model architecture that incorporates a recurrent memory mechanism designed to regulate information retention. To evaluate our model, we conducted extensive experiments on memory-intensive environments (ViZDoom-Two-Colors, T-Maze, Memory Maze, Minigrid-Memory), classic Atari games, and MuJoCo control environments. The results show that using memory can significantly improve performance in memory-intensive environments, while maintaining or improving results in classic environments. We believe that our results will stimulate research on memory mechanisms for transformers applicable to offline reinforcement learning. The code is open-sourced and can be found in thehttps://anonymous.4open.science/r/RATE-BAD3.

2596Towards Generalizable Reinforcement Learning via Causality-Guided Self-Adaptive Representations

[openreview] [pdf]

Abstract General intelligence requires quick adaption across tasks. While existing reinforcement learning (RL) methods have made progress in generalization, they typically assume only distribution changes between source and target domains. In this paper, we explore a wider range of scenarios where not only the distribution but also the environment spaces may change. For example, in the CoinRun environment, we train agents from easy levels and generalize them to difficulty levels where there could be new enemies that have never occurred before. To address this challenging setting, we introduce a causality-guided self-adaptive representation-based approach, called CSR, that equips the agent to generalize effectively across tasks with evolving dynamics. Specifically, we employ causal representation learning to characterize the latent causal variables within the RL system. Such compact causal representations uncover the structural relationships among variables, enabling the agent to autonomously determine whether changes in the environment stem from distribution shifts or variations in space, and to precisely locate these changes. We then devise a three-step strategy to fine-tune the causal model under different scenarios accordingly. Empirical experiments show that CSR efficiently adapts to the target domains with only a few samples and outperforms state-of-the-art baselines on a wide range of scenarios, including our simulated environments, CartPole, CoinRun and Atari games.

2597Collaborative and Efficient Personalization with Mixtures of Adaptors

[openreview] [pdf]

Abstract Non-iid data is prevalent in real-world federated learning problems. Data heterogeneity can come in different types in terms of distribution shifts. In this work, we are interested in the heterogeneity that comes from concept shifts, i.e., shifts in the prediction across clients. In particular, we consider multi-task learning, where we want the model to adapt to the task of the client. We propose a parameter-efficient framework to tackle this issue, where each client learns to mix between parameter-efficient adaptors according to its task. We use Low-Rank Adaptors (LoRAs) as the backbone and extend its concept to other types of layers. We call our framework Federated Low-Rank Adaptive Learning (FLoRAL). This framework is not an algorithm but rather a model parameterization for a multi-task learning objective, so it can work on top of any algorithm that optimizes this objective, which includes many algorithms from the literature. FLoRAL is memory-efficient, and clients are personalized with small states (e.g., one number per adaptor) as the adaptors themselves are federated. Hence, personalization is--in this sense--federated as well. Even though clients can personalize more freely by training an adaptor locally, we show that collaborative and efficient training of adaptors is possible and performs better. We also show that FLoRAL can outperform an ensemble of full models with optimal cluster assignment, which demonstrates the benefits of federated personalization and the robustness of FLoRAL to overfitting. We show promising experimental results on synthetic datasets, real-world federated multi-task problems such as MNIST, CIFAR-10, and CIFAR-100. We also provide a theoretical analysis of local SGD on a relaxed objective and discuss the effects of aggregation mismatch on convergence.

2598Unveiling Causal Relationships Among Candidate Output Tokens in Large Language Models: Towards Interpretability and Control

[openreview] [pdf]

Abstract Understanding how large language models (LLMs) generate tokens is crucial for enhancing their performance and interpretability. We hypothesize that cause-effect relationships exist among candidate output tokens during next token prediction in LLMs. Specifically, we propose that certain candidate output tokens---termed “effect tokens”---are causally influenced by other candidate tokens activated in earlier layers, referred to as “cause tokens”. To test this hypothesis, we develop a causal analysis methodology that uncovers these relationships within open-source LLMs. We find that while cause tokens are essential for generating effect tokens, including them in the final output can degrade model performance.Building on these findings, we introduce a decoding algorithm that employs two heuristics: Critical Layer Ablation (CLA), which approximates causal relationships by selectively removing transformer layers and observing their impact on token generation, and Causally-Informed Decoding (CID), which uses the relationships identified by CLA to adjust token probabilities. Specifically, CID increases the probability of selecting effect tokens while decreasing that of cause tokens during generation. Our method achieves measurable accuracy improvements across various benchmark datasets, demonstrating its potential to enhance both the controllability and performance of LLM-generated text.

2599TeaserGen: Generating Teasers for Long Documentaries

[openreview] [pdf]

Abstract Teasers are an effective tool for promoting content in entertainment, commercial and educational fields. However, creating an effective teaser for long videos is challenging for it requires long-range multimodal modeling capability for the input videos, while necessitating maintaining audiovisual alignments, managing scene transitions and preserving factual accuracy for the output teasers. Due to the lack of a publicly-available dataset, progress along this research direction has been hindered. In this work, we present DocumentaryNet, a collection of 1,269 documentaries paired with their teasers, featuring multimodal data streams of video, speech, music, sound effects and narrations. With DocumentaryNet, we propose a new two-stage system for generating teasers from long documentaries. The proposed TeaserGen system first generates the teaser narration from the transcribed narration from the documentary using a pretrained large language model, and then selects the most relevant visual content to accompany the generated narration through language-vision models. For narration-video matching, we explore two approaches: a pretraining-based model using pretrained contrastive language-vision models and a deep sequential model that learns the mapping between the narrations and visuals. Our experimental results show that the pretraining-based approach is more effective at identifying relevant visual content than directly trained deep autoregressive models.

2600Positional Encoder Graph Quantile Neural Networks for Geographic Data

[openreview] [pdf]

Abstract Positional Encoder Graph Neural Networks (PE-GNNs) are a leading approach for modeling continuous spatial data. However, they often fail to produce calibrated predictive distributions, limiting their effectiveness for uncertainty quantification. We introduce the Positional Encoder Graph Quantile Neural Network (PE-GQNN), a novel method that integrates PE-GNNs, Quantile Neural Networks, and recalibration techniques in a fully nonparametric framework, requiring minimal assumptions about the predictive distributions. We propose a new network architecture that, when combined with a quantile-based loss function, yields accurate and reliable probabilistic models without increasing computational complexity. Our approach provides a flexible, robust framework for conditional density estimation, applicable beyond spatial data contexts. We further introduce a structured method for incorporating a KNN predictor into the model while avoiding data leakage through the GNN layer operation. Experiments on benchmark datasets demonstrate that PE-GQNN significantly outperforms existing state-of-the-art methods in both predictive accuracy and uncertainty quantification.

2601KV-Distill: Nearly Lossless Context Compression for Transformers

[openreview] [pdf]

Abstract Sequence-to-sequence natural language tasks often benefit greatly from long contexts, but the quadratic complexity of self-attention renders usage of long contexts non-trivial. In particular, during generation, temporary representations (stored in the KV cache) account for a large portion of GPU memory usage, and scale linearly with context length. In this work, we introduce KV-Distill, a flexible compression framework for large language models (LLMs) that distills long context KV caches into significantly shorter representations. KV-Distill can be trained as a parameter-efficient adaptor for pre-trained models, and enables the compression of arbitrary spans of a context while preserving the pre-trained model’s capabilities, including instruction-tuning. We do this by treating a compressed-uncompressed cache as a student-teacher pairing and applying a KL-type divergence to match the generated outputs. Our experiments show that KV-Distill outperforms other compression techniques in worst-case extractive tasks, and approaches uncompressed performance in long context question answering and summarization. Furthermore, KV-Distill can be fine-tuned on domain-specific contexts to reduce context lengths by up 95% while preserving downstream task performance. We demonstrate the generalizability of KV-Distill across various model sizes and architectures.

2602Conditional Diffusion on Web-Scale Image Pairs leads to Diverse Image Variations

[openreview] [pdf]

Abstract Generating image variations, where a model produces variations of an input image while preserving the semantic context has gained increasing attention. Current image variation techniques involve adapting a text-to-image model to reconstruct an input image conditioned on the same image. We first demonstrate that a diffusion model trained to reconstruct an input image from frozen embeddings can reconstruct the image with minor variations. Second, inspired by how text-to-image models learn from web-scale text-image pairs, we explore a new pretraining strategy to generate image variations using a large collection of image pairs. Our diffusion model \textit{Semantica} receives a random (encoded) image from a webpage as conditional input and denoises another noisy random image from the same webpage. We carefully examine various design choices for the image encoder, given its crucial role in extracting relevant context from the input image. Once trained, \textit{Semantica} can adaptively generate new images from a dataset by simply using images from that dataset as input. Finally, we identify limitations in standard image consistency metrics for evaluating image variations and propose alternative metrics based on few-shot generation.

2603The Path-Driven Independence Testing (PIT) Algorithm

[openreview] [pdf]

Abstract PC is an efficient constraint-based algorithm for learning the structure of a Bayesian network. However, the required number of conditional independent (CI) tests can make the algorithm practically infeasible or slow for large graphs. We developed a constrained-based algorithm, called the Path-Driven Independence Testing (PIT) Algorithm, which during the learning process, utilizes the information of the partially learned network to reduce the number of CI tests. The idea is that for each pair of variables XX and YY, instead of checking independence conditioned on every subset of all the neighbors of XX (resp. YY) as in PC, the search is restricted to only the common neighbors of XX and YY and to neighbors connected to YY (resp. XX) by a path. Also, paths connecting XX and YY without a descendant of a common neighbor can be blocked by observing two consecutive nodes on the path. Compared to PC, PIT is proven to conduct at most the same number of CI tests, and experimentally shown to be significantly (up to 7 times) faster and more accurate.

2604DIPPER: Direct Preference Optimization for Primitive-Enabled Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract Hierarchical reinforcement learning (HRL) is an elegant framework for learning efficient control policies to perform complex robotic tasks, especially in sparse reward settings. However, concurrently learning policies at multiple hierarchical levels often suffers from training instability due to non-stationary behavior of lower-level primitives. In this work, we introduce DIPPER, an efficient hierarchical framework that leverages Direct Preference Optimization (DPO) to mitigate non-stationarity at the higher level, while using reinforcement learning to train the corresponding primitives at the lower level. We observe that directly applying DPO to the higher level in HRL is ineffective and leads to infeasible subgoal generation issues. To address this, we develop a novel, principled framework based on lower-level primitive regularization of upper-level policy learning. We provide a theoretical justification for the proposed framework utilizing bi-level optimization. The application of DPO also necessitates the development of a novel reference policy formulation for feasible subgoal generation. To validate our approach, we conduct extensive experimental analyses on a variety of challenging, sparse-reward robotic navigation and manipulation tasks. Our results demonstrate that DIPPER shows impressive performance and demonstrates an improvement of up to 40% over the baselines in complex sparse robotic control tasks.

2605Learning Successor Features with Distributed Hebbian Temporal Memory

[openreview] [pdf]

Abstract This paper presents a novel approach to address the challenge of online temporal memory learning for decision-making under uncertainty in non-stationary, partially observable environments. The proposed algorithm, Distributed Hebbian Temporal Memory (DHTM), is based on factor graph formalism and a multicomponent neuron model. DHTM aims to capture sequential data relationships and make cumulative predictions about future observations, forming Successor Features (SF). Inspired by neurophysiological models of the neocortex, the algorithm utilizes distributed representations, sparse transition matrices, and local Hebbian-like learning rules to overcome the instability and slow learning process of traditional temporal memory algorithms like RNN and HMM. Experimental results demonstrate that DHTM outperforms LSTM and a biologically inspired HMM-like algorithm, CSCG, in the case of non-stationary datasets. Our findings suggest that DHTM is a promising approach for addressing the challenges of online sequence learning and planning in dynamic environments.

2606Provable optimal transport with transformers: The essence of depth and prompt engineering

[openreview] [pdf]

Abstract Can we establish provable guarantees for transformer performance? Providing such theoretical guarantees is a milestone in developing trustworthy generative AI. In this paper, we take a step toward addressing this question by focusing on optimal transport, a fundamental problem at the intersection of combinatorial and continuous optimization. Leveraging the computational power of attention layers, we prove that a transformer with fixed parameters can effectively solve the optimal transport problem (in Wasserstein-2 with entropic regularization) for an arbitrary number of points. Consequently, the transformer can sort lists of arbitrary size up to an approximation factor. Our results rely on an engineered prompt that enables the transformer to implement gradient descent with adaptive step sizes on the dual optimal transport. Combining the convergence analysis of gradient descent with Sinkhorn dynamics, we establish an explicit approximation bound for optimal transport with transformers, which improves with increasing depth. Our findings provide novel insights into the essence of prompt engineering and depth for transformers.

2607Learning Semantic-Enhanced Dual Temporal Adjacent Maps for Video Moment Retrieval

[openreview] [pdf]

Abstract Retrieving a specific moment from an untrimmed video via a text description is a central problem in vision-language learning. It is a challenging task due to the sophisticated temporal dependency among moments. Existing methods fail to deal with this issue well since they establish temporal relations of moments in a way that visual content and semantics are coupled. This paper studies temporal dependence schemes that decouple content and semantic information, establishing semantic-enhanced Dual Temporal Adjacent Maps for video moment retrieval, conferred as DTAM. Specifically, DTAM designs two branches to encode visual appearance and semantic knowledge from video clips respectively, where knowledge from the appearance branch is distilled into the semantic branch to help DTAM distinguish features with the same visual content but different semantics with a well-designed semantic-aware contrastive loss. Besides, we also develop a moment-aware mechanism to assist temporal adjacent maps’ learning for better video grounding. Finally, extensive experimental results and analysis demonstrate the superiority of the proposed DTAM over existing state-of-the-art approaches on three challenging video moment retrieval benchmarks, i.e., TACoS, Charades-STA, and ActivityNet Captions.

2608Can Information-Theoretic Generalization Bound Explain the Generalization of Pre-trained Language Model?

[openreview] [pdf]

Abstract Although language models exhibit exceptional generalization capabilities in downstream tasks after extensive text pre-training, the underlying causes behind this generalization remain unclear. Existing studies on information-theoretic generalization bounds suggest that the compression of information stored in the weights (IIW) is a crucial factor influencing a model’s ability to generalize, with some experiments indicating a correlation between lower IIW and improved generalization. However, it remains uncertain whether IIW is applicable to pre-trained language models. In this work, we find that using IIW can explain why the pre-trained language models have better generalization compared to non-pre-trained language models. Unfortunately, we also discover that IIW does not consistently reflect the degree of generalization when applying IIW to study the fine-tuning process of pre-trained language models. We revisit existing IIW estimation methods, highlighting their limitations in accurately estimating IIW based on theoretical and empirical evidence. Our findings suggest that current information-theoretic generalization bounds, constrained by the limitations of IIW estimation methodologies, fail to accurately capture the generalisation performance of pre-trained language models.

2609Spread Preference Annotation: Direct Preference Judgment for Efficient LLM Alignment

[openreview] [pdf]

Abstract Aligning large language models (LLMs) with human preferences becomes a key component to obtaining state-of-the-art performance, but it yields a huge cost to construct a large human-annotated preference dataset. To tackle this problem, we propose a new framework, Spread Preference Annotation with direct preference judgment (SPA), that boosts the alignment of LLMs using only a very small amount of human-annotated preference data. Our key idea is leveraging the human prior knowledge within the small (seed) data and progressively improving the alignment of LLM, by iteratively generating the responses and learning from them with the self-annotated preference data. To be specific, we propose to derive the preference label from the logits of LLM to explicitly extract the model’s inherent preference. Compared to the previous approaches using external reward models or implicit in-context learning, we observe that the proposed approach is significantly more effective. In addition, we introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data. Our experimental results demonstrate that the proposed framework significantly boosts the alignment of LLMs. For example, we achieve superior alignment performance on AlpacaEval 2.0 with only 3.3% of the ground-truth preference labels in the Ultrafeedback data compared to the cases using the entire data or state-of-the-art baselines.

2610Zero-shot Generalist Graph Anomaly Detection with Unified Neighborhood Prompts

[openreview] [pdf]

Abstract Graph anomaly detection (GAD), which aims to identify nodes in a graph that significantly deviate from normal patterns, plays a crucial role in broad application domains. Existing GAD methods, whether supervised or unsupervised, are one-model-for-one-dataset approaches, i.e., training a separate model for each graph dataset. This limits their applicability in real-world scenarios where training on the target graph data is not possible due to issues like data privacy. To overcome this limitation, we propose a novel zero-shot generalist GAD approachUNPromptthat trains a one-for-all detection model, requiring the training of one GAD model on a single graph dataset and then effectively generalizing to detect anomalies in other graph datasets without any retraining or fine-tuning. The key insight in UNPrompt is that i) the predictability of latent node attributes can serve as a generalized anomaly measure and ii) highly generalized normal and abnormal graph patterns can be learned via latent node attribute prediction in a properly normalized node attribute space. UNPrompt achieves generalist GAD through two main modules: one module aligns the dimensionality and semantics of node attributes across different graphs via coordinate-wise normalization in a projected space, while another module learns generalized neighborhood prompts that support the use of latent node attribute predictability as an anomaly score across different datasets. Extensive experiments on real-world GAD datasets show that UNPrompt significantly outperforms diverse competing methods under the generalist GAD setting, and it also has strong superiority under the one-model-for-one-dataset setting.

2611Aligning With Human Values Without Revealing Human Judgements

[openreview] [pdf]

Abstract With the increasing ubiquity of large language models it has become crucial to ensure guarantees for models trained to be aligned with human values to avoid leaking information on the human judgements that have been provided to the algorithm. To target this issue we focus on the problem of alignment via reinforcement learning from human preference rankings, subject to the constraint of not revealing any information on the human data used to align the model. To achieve this, we analyze (ϵ,δ)(\epsilon,\delta)-DP for both the Bradley-Terry-Luce (BTL) model and the Plackett-Luce (PL) model. We introduce a theoretically founded algorithm for learning rewards from human rankings that achieves this objective without leaking the human rankings. We further demonstrate that the privately learned rewards can be used to train policies achieving statistical performance guarantees that asymptotically match the best known algorithms in the non-private setting, which are in some cases minimax optimal. Strikingly, our analysis and our results reveal that it is possible to obtain the same model performance without any trade-off on the protection of the human judgments, and our paper provides the first algorithms that can achieve provable privacy of human judgements, while still producing aligned models with optimal performance.

2612RetroInText: A Multimodal Large Language Model Enhanced Framework for Retrosynthetic Planning via In-Context Representation Learning

[openreview] [pdf]

Abstract Development of robust and effective strategies for retrosynthetic planning requires a deep understanding of the synthesis process. A critical step in achieving this goal is accurately identifying synthetic intermediates. Current machine learning-based methods often overlook the valuable context from the overall route, focusing only on predicting reactants from the product, requiring cost annotations for every reaction step, and ignoring the multi-faced nature of molecular, resulting in inaccurate synthetic route predictions. Therefore, we introduce RetroInText, an advanced end-to-end framework based on a multimodal Large Language Model (LLM), featuring in-context learning with TEXT descriptions of synthetic routes. First, RetroInText including ChatGPT presents detailed descriptions of the reaction procedure. It learns the distinct compound representations in parallel with corresponding molecule encoders to extract multi-modal representations including 3D features. Subsequently, we propose an attention-based mechanism that offers a fusion module to complement these multi-modal representations with in-context learning and a fine-tuned LLM for a single-step model. As a result, RetroInText accurately represents and effectively captures the complex relationship between molecules and the synthetic route. In experiments on the USPTO pathways dataset RetroBench, RetroInText outperformed state-of-the-art methods, achieving up to a 5% improvement in Top-1 test accuracy, particularly for long synthetic routes. These results demonstrate the superiority of RetroInText by integrating with context information over routes. They also demonstrate its potential for advancing pathway design and facilitating the development of organic chemistry.

2613When and how are modular networks better?

[openreview] [pdf]

Abstract Many real-world learning tasks have an underlying hierarchical modular structure, composed of smaller sub-functions. Traditional neural networks (NNs), however, often ignore this structure, leading to inefficiencies in learning and generalization. Leveraging known structural information can enhance performance by aligning the network architecture with the task’s inherent modularity. In this work, we investigate how modular NNs can outperform traditional dense networks by systematically varying the degree of structural knowledge incorporated. We compare architectures ranging from monolithic dense NNs, which assume no prior knowledge, to hierarchically modular NNs with shared modules, which leverage sparsity, modularity, and module reusability. Our experiments demonstrate that incorporating structural knowledge, particularly through module reuse and fixed connectivity, significantly improves learning efficiency and generalization. Hierarchically modular NNs excel in data-scarce scenarios by promoting functional specialization within the modules and reducing redundancy. These findings suggest that task-specific architectural biases can lead to more efficient, interpretable, and effective learning systems.

2614Language Model Merging in Iterative Preference Learning

[openreview] [pdf]

Abstract Learning from preferences has become a scalable paradigm for training high-capacity language models, as it is not limited to human-produced data, allowing models to surpass human performance. Advanced feedback learning algorithms are typically online or iterative for high sample efficiency. Among these, iterative preference optimization is popular due to its simplicity, efficiency, and robustness. However, in iterative preference optimization, models do not necessarily achieve optimal performance since they sequentially learn data from different distributions. A simple way to bridge the gap is model ensemble, which incurs excessive inference costs. Inspired by the theoretical analysis for preference learning, we propose a simple model merging strategy that approximates model ensemble without additional training and inference costs, leading to Pareto-superior models.

2615DOMAIN GENERALIZATION VIA PARETO OPTIMAL GRADIENT MATCHING

[openreview] [pdf]

Abstract In this study, we address the gradient-based domain generalization problem, where predictors aim for consistent gradient directions across different domains. Existing methods have two main challenges. First, minimization of gradient empirical distance or gradient inner products (GIP) leads to gradient fluctuations and magnitude elimination among domains, thereby hindering straightforward learning. Second, the direct application of gradient learning to joint loss function can incur high computation overheads due to second-order derivative approximation. To tackle these challenges, we propose a new Pareto Optimality Gradient Matching (POGM) method. In contrast to existing methods that add gradient matching as regularization, we leverage gradient trajectories as collected data and apply independent training at the meta-learner. In the meta-update, we maximize GIP while limiting the learned gradient from deviating too far from the empirical risk minimization gradient trajectory. By doing so, the aggregate gradient can incorporate knowledge from all domains without suffering gradient magnitude elimination or fluctuation towards any particular domain. Experimental evaluations on datasets from DomainBed demonstrate competitive results yielded by POGM against other baselines while achieving computational efficiency.

2616Dueling in the Dark: An Efficient and OptimalO(T)Mirror Descent Approach for Competing against Adversarial Preferences

[openreview] [pdf]

Abstract Recent developments in Large Language Models (LLMs) have sparked significant attention in Reinforcement Learning from Human Feedback (RLHF), which uses reinforcement learning techniques to optimize a model’s performance through human-provided feedback. A simple, widely used, and cost-effective method for gathering human feedback is through relative queries based on human preferences, often modeled using sigmoid utility models. Despite the popularity of sigmoid model-based RLHF algorithms, their theoretical foundations remain underdeveloped as existing algorithms often lack performance guarantees or are limited to small-scale problems due to computationally intractable steps. We address the challenge of developing no-regret learning algorithms for training optimal policy RLHF, and develop the first efficient gradient descent-based algorithm with near-optimal regret guarantees. More technically, we consider the adversarial online convex optimization problem with preference feedback and propose a mirror descent method to obtain a regret of O(T)O(\sqrt{T}) over TT rounds. The main challenge we are required to solve lies in finding a suitable `gradient-approximation’ of the underlying utility functions solely from a binary preference feedback. Following this we extend our results to policy optimization in the RLHF framework with trajectory preferences and design no-regret RL policies using a variant of mirror descent. We also extend our methods beyond pairwise preferences --- to multi-way (batched pairwise) feedback and ranking feedback --- and analyze the trade-off between learning rate with increasing subset size. Our contribution lays the groundwork for a practical gradient descent-based algorithm in RLHF with human preferences. Supported by robust theoretical guarantees, our approach holds promise in the current landscape of developing efficient algorithms for LLMs and addressing human-AI alignment challenges. Empirical evaluations validate our theoretical findings.

2617OpenPRM: Building Open-domain Process-based Reward Models with Preference Trees

[openreview] [pdf]

Abstract Scaling inference-time computation is increasingly seen as the next frontier in scaling laws for large language models. Previous work in mathematics and coding has demonstrated the remarkable potential for inference-time scaling. During such scaling, fine-grained supervision through process-based reward models (PRMs) is essential for enhancement. However, exploration of inference-time scaling and PRMs in open-domain problems remains limited, where lacking exact answers and obtaining process supervision prove challenging. In this paper, we explore the construction of PRMs for open-domain tasks, specifically for instruction-following tasks. Utilizing existing outcome-based reward models (ORMs), we develop sentence-level preference trees based on the prefix similarity of parallel sampled candidates from datasets like UltraFeedback. This setup allows us to derive weak supervision for processes via back-propagation from outcome-level rewards. Subsequently, we integrate ORMs and PRMs under the same pairwise ranking objectives, resulting in our newly developed reward models, named OpenPRM. This approach significantly enhances the scalability of process-level supervision in open domains at minimal cost. We assess the performance of OpenPRM across various reward benchmarks, demonstrating its competitive edge over traditional ORMs in open domains and PRMs in specialized domains. Additionally, we investigate the scalability of inference-time computation for open-domain instructions. Our results highlight the limitations of ORMs’ scalability, while OpenPRM shows superior performance in scaled settings. Despite these advances, achieving automatic fine-grained supervision for open-domain inference-time scaling remains a substantial challenge. We hope these findings will spur further development of process supervision reward models in open-domain scenarios.

2618Certified Training with Branch-and-Bound: A Case Study on Lyapunov-stable Neural Control

[openreview] [pdf]

Abstract Certified training techniques aim to produce neural networks (NNs) with formally verifiable guarantees by optimizing their verification bounds during training. Existing work on certified training mainly focused on the local adversarial robustness of NNs. We consider certified training in a more challenging setting beyond adversarial robustness: we want to obtain NNs with global output guarantees for any input within an entire region-of-interest. As a case study, we particularly focus on a task about learning Lyapunov-stable neural controllers which provably satisfy the Lyapunov stability condition with a region-of-attraction. Compared to previous works which commonly used counterexample guided training on this task, we develop a new certified training framework and optimize for differentiable verified bounds, to produce verification-friendly models. In order to handle the relatively large region-of-interest, we propose a novel framework of training-time branch-and-bound to dynamically maintain a training dataset of subregions throughout training, such that the hardest subregions are iteratively split into smaller ones whose verified bounds can be computed more tightly to ease the training. We demonstrate that our new training framework can produce models which can be more efficiently verified at test time. On the largest 2D quadrotor dynamical system, verification for our model is more than 5X faster compared to the baseline, while our size of region-of-attraction is 16X larger than the baseline.

2619On Orchestrating Personalized LLMs

[openreview] [pdf]

Abstract This paper presents a novel approach to aligning large language models (LLMs) with individual human preferences, sometimes referred to as Reinforcement Learning fromPersonalizedHuman Feedback (RLPHF). Given stated preferences along multiple dimensions, such as helpfulness, conciseness, or humor, the goal is to create an LLM -- without completely re-training -- that best adheres to this specification. Starting from specialized expert LLMs, each trained for one such particular preference dimension, we propose a black-box method that merges their outputs on a per-token level. We train a lightweight Preference Control Model (PCM) that dynamically translates the preference description and current context into next-token prediction weights. By combining the expert models’ outputs at the token level, our approach dynamically generates text that optimizes the given preference. Empirical tests show that our method matches or surpasses existing preference merging techniques, providing a scalable, efficient alternative to fine-tuning LLMs for individual personalization.

2620KD-HGRL: Knowledge Distillation for Multi-Task Heterogeneous Graph Representation Learning

[openreview] [pdf]

Abstract Heterogeneous graphs capture complex relationships in real-world systems like social networks and knowledge graphs. Heterogeneous Graph Neural Networks (HGNNs) have been proposed for processing these graphs. Still, they often require extensive labeled data, especially for training on large networks, which is impractical or even impossible to obtain. The key question is whether we can train heterogeneous graphs using self-supervised learning and then transfer this knowledge to downstream tasks like node classification and link prediction. To address this, we propose a novel framework, KD-HGRL, which leverages \textbf{K}nowledge \textbf{D}istillation for multi-task \textbf{H}eterogenous \textbf{G}raph \textbf{R}epresentation \textbf{L}earning. Our approach consists of two main phases: teacher and student. The teacher phase generates rich node embeddings from two views—the semantic view and the topological view. A self-supervised learning strategy trains the teacher model without needing labeled data. The student model uses a light graph neural network with a multi-layer perception to predict the task output. A novel knowledge distillation strategy transfers the wealthy, learned knowledge, which reflects the graph’s topological structure and semantic information, from the pre-trained teacher to the student model. This allows the student model to understand the downstream tasks graph better. The results from several experiments performed on real-world benchmarks demonstrate the superiority of the proposed method over state-of-the-art approaches.

2621Scalable Influence and Fact Tracing for Large Language Model Pretraining

[openreview] [pdf]

Abstract Training data attribution (TDA) methods aim to attribute model outputs back to specific training examples, and the application of these methods to large language model (LLM) outputs could significantly advance model transparency and data curation. However, it has been challenging to date to apply these methods to the full scale of LLM pretraining. In this paper, we introduce a gradient-based method that works effectively at scale, allowing us to retrieve influential examples for an 8B-parameter language model from a pretraining corpus of over 160B tokens with no need for subsampling or pre-filtering. Our method combines several techniques, including optimizer state correction, a task-specific Hessian approximation, and normalized encodings, which we find to be critical for performance at scale. Our method performs best at identifying examples thatinfluencemodel predictions, but classical, model-agnostic retrieval methods such as BM25 still perform better at finding passages which explicitly contain relevant facts. These results demonstrate a misalignment between factualattributionand causalinfluence. With increasing model size and training tokens, we find that influence more closely aligns with attribution. Finally, we examine different types of examples identified as influential by our method, finding that while many directly entail a particular fact, others support the same output by reinforcing priors on relation types, common entities, and names.

2622Planning in a recurrent neural network that plays Sokoban

[openreview] [pdf]

Abstract How a neural network (NN) generalizes to novel situations depends on whether it has learned to select actions heuristically or via a planning process. Guez et al., (2019, “An investigation of model-free planning”) found that recurrent NN (RNN) trained to play Sokoban appears to plan, with extra computation steps improving the RNN’s success rate. We replicate and expand on their behavioral analysis, finding the RNN learns to give itself extra computation steps in complex situations by “pacing” in cycles. Moreover, we train linear probes that predict the future actions taken by the network and find that intervening on the hidden state using these probes controls the agent’s subsequent actions. Leveraging these insights, we perform model surgery, enabling the convolutional NN to generalize beyond its 10×1010 \times 10 architectural limit to arbitrarily sized levels. The resulting model solves challenging, highly off-distribution levels. We open-source our model and code, and believe its small size (1.29M parameters) makes it an excellent model organism to deepen our understanding of learned planning.

2623World-Model based Hierarchical Planning with Semantic Communications for Autonomous Driving

[openreview] [pdf]

Abstract World-model (WM) is a highly promising approach for training AI agents. However, in complex learning systems such as autonomous driving, AI agents interact with others in a dynamic environment and face significant challenges such as partial observability and non-stationarity. Inspired by how humans naturally solve complex tasks hierarchically and how drivers share their intentions by using turn signals, we introduce HANSOME, a WM-based hierarchical planning with semantic communications framework. In HANSOME, semantic information, particularly text and compressed visual data, is generated and shared to improve two-level planning. HANSOME incorporates two important designs: 1) A hierarchical planning strategy, where the higher-level policy generates intentions with text semantics, and a semantic alignment technique ensures the lower-level policy determines specific controls to achieve these intentions. 2) A cross-modal encoder-decoder to fuse and utilize the shared semantic information to enhance planning through multi-modal understanding. A key advantage of HANSOME is that the generated intentions not only enhance the lower-level policy but also can be shared and understood by humans or other AVs to improve their planning. Furthermore, we devise AdaSMO, an entropy-controlled adaptive scalarization method, to tackle the multi-objective optimization problem in hierarchical policy learning. Extensive experiments show that HANSOME outperforms state-of-the-art WM-based methods in challenging driving tasks, enhancing overall traffic safety and efficiency.

2624SafeWatch: An Efficient Safety-Policy Following Video Guardrail Model with Transparent Explanations

[openreview] [pdf]

Abstract With the wide adoption of generative AI and rapid growth of high-quality video generation, video guardrails have become more crucial than ever to ensure safety and security across platforms. Current video guardrails, however, are either overly simplistic, relying on pure classification models trained on simple policies with limited number of unsafe categories, which lack detailed explanations, or prompting multimodal large language models (MLLMs) with long safety guidelines, resulting in inefficient and impractical guardrails for real-world content. To bridge this gap, we propose SAFEWATCH, an efficient MLLM-based video guardrail model designed to follow customized safety policies and provide multi-label video guardrail outputs with content-specific explanations in a zero-shot manner. In particular, unlike traditional guardrails that encode entire policies autoregressive, causing inefficiency and bias, SAFEWATCH uniquely encodes each policy trunk in parallel and eliminates their position bias such that all policies are attended simultaneously with equal importance. In addition, to improve efficiency and accuracy, SafeWatch incorporates a policy-aware visual token pruning algorithm that adaptively selects the most relevant video tokens for each policy, discarding noisy or irrelevant information. This allows for more focused, policy-compliant guardrail with significantly reduced computational overhead. Considering the limitations of existing video guardrail benchmarks, we propose SafeWatch-Bench, a large-scale video guardrail benchmark comprising over 2M videos spanning six safety categories which covers over 30 tasks to ensure a comprehensive coverage of all potential safety scenarios. We have conducted extensive experiments, showing that SafeWatch outperforms all SOTA video guardrails on SafeWatch-Bench by 19.6% and 15.4% on existing benchmarks, while reducing inference cost by 25% on average. SafeWatch also demonstrates strong policy-following abilities and outperforms baselines by 20% in zero-shot adaptability to new policies. Additionally, both LLM-as-a-judge and human evaluators confirm the high quality of the explanations provided by SafeWatch.

2625ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation

[openreview] [pdf]

Abstract We introduce a new benchmark, ChartMimic, aimed at assessing the visually-grounded code generation capabilities of large multimodal models (LMMs). ChartMimic utilizes information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. ChartMimic includes 4,8004,800 human-curated (figure, instruction, code) triplets, which represent the authentic chart use cases found in scientific papers across various domains (e.g., Physics, Computer Science, Economics, etc). These charts span 18 regular types and 4 advanced types, diversifying into 201 subcategories. Furthermore, we propose multi-level evaluation metrics to provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing code generation benchmarks, ChartMimic places emphasis on evaluating LMMs’ capacity to harmonize a blend of cognitive capabilities, encompassing visual understanding, code generation, and cross-modal reasoning. The evaluation of 3 proprietary models and 14 open-weight models highlights the substantial challenges posed by ChartMimic. Even the advanced GPT-4o, InternVL2-Llama3-76B only achieve an average score of 82.2 and 61.6, respectively, indicating significant room for improvement. We anticipate that ChartMimic will inspire the development of LMMs, advancing the pursuit of artificial general intelligence.

2626Recursive Abstractive Processing for Retrieval in Dynamic Datasets

[openreview] [pdf]

Abstract Recent retrieval-augmented models enhance basic methods by building a hierarchical structure over retrieved text chunks through recursive embedding, clustering, and summarization. The most relevant information is then retrieved from both the original text and generated summaries. However, such approaches face limitations with dynamic datasets, where adding or removing documents over time complicates the updating of hierarchical representations formed through clustering. We propose a new algorithm to efficiently maintain the recursive-abstractive tree structure in dynamic datasets, without compromising performance. Additionally, we introduce a novel post-retrieval method that applies query-focused recursive abstractive processing to substantially improve context quality. Our method overcomes the limitations of other approaches by functioning as a black-box post-retrieval layer compatible with any retrieval algorithm. Both algorithms are validated through extensive experiments on real-world datasets, demonstrating their effectiveness in handling dynamic data and improving retrieval performance.

2627Heterogeneous Federated Learning: A Dual Matching Dataset Distillation Approach

[openreview] [pdf]

Abstract Federated Learning (FL) often struggles with error accumulation during local training, particularly on heterogeneous data, which hampers overall performance and convergence. While dataset distillation is commonly introduced to FL to enhance efficiency, our work finds that communicating distilled data instead of models can completely get rid of the error accumulation issue, albeit at the cost of exacerbating data heterogeneity across clients. To address the amplified heterogeneity due to distilled data, we propose a novel FL algorithm termed FedDualMatch, which performs dual matching in the way that local distribution matching captures client data distributions while global gradient matching aligns gradients on the server. This dual approach enriches feature representations and enhances convergence stability. It proves effective for FL due to a bounded difference in the testing loss between optimal models trained on the aggregation of either distilled or original data across clients. At the same time, it converges faster than FedAvg in a single communication round while preserving (ϵ, δ)-differential privacy via adding Gaussian noise. Experiments on controlled heterogeneous dataset MNIST/CIFAR10 and naturally heterogeneous dataset Digital-Five/Office-Home demonstrate its advantages over the state-of-the-art methods that communicate either model or distilled data, in terms of accuracy and convergence. Notably, it maintains accuracy even when data heterogeneity significantly increases, underscoring its potential for practical applications.

2628Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs

[openreview] [pdf]

Abstract In the study of LLMs, sycophancy represents a prevalent hallucination that poses significant challenges to these models. Specifically, LLMs often fail to adhere to original correct responses, instead blindly agreeing with users’ opinions, even when those opinions are incorrect or malicious. However, research on sycophancy in visual language models (VLMs) has been scarce. In this work, we extend the exploration of sycophancy from LLMs to VLMs, introducing the MM-SY benchmark to evaluate this phenomenon. We present evaluation results from multiple representative models, addressing the gap in sycophancy research for VLMs. To mitigate sycophancy, we propose a synthetic dataset for training and employ methods based on prompts, supervised fine-tuning, and DPO. Our experiments demonstrate that these methods effectively alleviate sycophancy in VLMs. Additionally, we probe VLMs to assess the semantic impact of sycophancy and analyze the attention distribution of visual tokens. Our findings indicate that the ability to prevent sycophancy is predominantly observed in higher layers of the model. The lack of attention to image knowledge in these higher layers may contribute to sycophancy, and enhancing image attention at high layers proves beneficial in mitigating this issue.

2629Lambda-Skip Connections: the architectural component that prevents Rank Collapse

[openreview] [pdf]

Abstract Rank collapse, a phenomenon where embedding vectors in sequence models rapidly converge to a uniform token or equilibrium state, has recently gained at- tention in the deep learning literature. This phenomenon leads to reduced expres- sivity and potential training instabilities due to vanishing gradients. Empirical ev- idence suggests that architectural components like skip connections, LayerNorm, and MultiLayer Perceptrons (MLPs) play critical roles in mitigating rank collapse. While this issue is well-documented for transformers, alternative sequence mod- els, such as State Space Models (SSMs), which have recently gained prominence, have not been thoroughly examined for similar vulnerabilities. This paper extends the theory of rank collapse from transformers to SSMs using a unifying frame- work that captures both architectures. We introduce a modification in the skip connection component, termed lambda-skip connections, that provides guaran- tees for rank collapse prevention. We present, via analytical results, a sufficient condition to achieve the guarantee for all of the aforementioned architectures. We also study the necessity of this condition via ablation studies and analytical exam- ples. To our knowledge, this is the first study that provides a general guarantee to prevent rank collapse, and that investigates rank collapse in the context of SSMs, offering valuable understanding for both theoreticians and practitioners. Finally, we validate our findings with experiments demonstrating the crucial role of archi- tectural components in preventing rank collapse.

[openreview] [pdf]

Abstract As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Considering labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains.

2631Observability of Latent States in Generative AI Models

[openreview] [pdf]

Abstract We tackle the question of whether Large Language Models (LLMs), viewed as dynamical systems with state evolving in the embedding space of symbolic tokens, are observable. That is, whether there exist multiple ‘mental’ state trajectories that yield the same sequence of generated tokens, or sequences that belong to the same Nerode equivalence class (‘meaning’). If not observable, mental state trajectories evoked by an input (‘percepts’) or by feedback from the model’s own state could remain self-contained and evolve unbeknownst to the user while being potentially accessible to the model provider. Curiously, “self-contained experiences evoked by perception or thought” are essentially what the American Psychological Association (APA) defines as ‘feelings’. Lexical curiosity aside, we show that current LLMs implemented by autoregressive Transformers are observable: The set of state trajectories indistinguishable from the tokenized output is a singleton. But if there are ‘system prompts’ not visible to the user, then the set of indistinguishable trajectories becomes non-trivial, and there can be multiple state trajectories that yield the same verbalized output. We prove these claims analytically, and show examples of modifications to standard LLMs that engender unobservable behaviors. Our analysis sheds light on possible designs that would enable a model to perform non-trivial computation that is not visible to the user, as well as on controls that the provider of services using the model could take to prevent unintended behavior.

2632Learning Gain Map for Inverse Tone Mapping

[openreview] [pdf]

Abstract For a more compatible and consistent high dynamic range (HDR) viewing experience, a new image format with a double-layer structure has been developed recently, which incorporates an auxiliary Gain Map (GM) within a standard dynamic range (SDR) image for adaptive HDR display. This new format motivates us to introduce a new task termed Gain Map-based Inverse Tone Mapping (GM-ITM), which focuses on learning the corresponding GM of an SDR image instead of directly estimating its HDR counterpart, thereby enabling a more effective up-conversion by leveraging the advantages of GM. The main challenge in this task, however, is to accurately estimate regional intensity variation with the fluctuating peak value. To this end, we propose a dual-branch network named GMNet, consisting of a Local Contrast Restoration (LCR) branch and a Global Luminance Estimation (GLE) branch to capture pixel-wise and image-wise information for GM estimation. Moreover, to facilitate the future research of the GM-ITM task, we build both synthetic and real-world datasets for comprehensive evaluations: synthetic SDR-GM pairs are generated from existing HDR resources, and real-world SDR-GM pairs are captured by mobile devices. Extensive experiments on these datasets demonstrate the superiority of our proposed GMNet over existing HDR-related methods both quantitatively and qualitatively.

2633Adversarial Robustness of In-Context Learning in Transformers for Linear Regression

[openreview] [pdf]

Abstract Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustness of these learned algorithms remains unexplored. This work investigates the vulnerability of in-context learning in transformers tohijacking attacksfocusing on the setting of linear regression tasks. Hijacking attacks are prompt-manipulation attacks in which the adversary’s goal is to manipulate the prompt to force the transformer to generate a specific output. We first prove that single-layer linear transformers, known to implement gradient descent in-context, are non-robust and can be manipulated to output arbitrary predictions by perturbing a single example in the in-context training set. While our experiments show these attacks succeed on linear transformers, we find they do not transfer to more complex transformers with GPT-2 architectures. Nonetheless, we show that these transformers can be hijacked using gradient-based adversarial attacks. We then demonstrate that adversarial training enhances transformers’ robustness against hijacking attacks, even when just applied during finetuning. Additionally, we find that in some settings, adversarial training against a weaker attack model can lead to robustness to a stronger attack model. Lastly, we find that hijacking attacks against one transformer can only transfer to other transformers when they are small-scale, while attacks against larger transformers do not transfer even against transformers of the same architecture but trained with different random seeds.

2634PERSONALIZED FEDERATED PARTIAL LABEL LEARNING

[openreview] [pdf]

Abstract Partial Label Learning (PLL) is known as a valuable learning technique that trains Machine Learning (ML) models on partial label datasets, where the ground truth label is concealed within the candidate label set of each data instance. It learns label correlation based on a single centralized dataset to predict the latent true label. When data is non-independent and identically distributed (non-i.i.d.) among workers in Federated Learning (FL), the label correlation interference problem occurs. To address the issue, in this paper, we propose pFedPLL, a personalized federated partial label learning algorithm with two new designs. In Label Correlation Isolation (LCI), we first develop a twin-module architecture, where a feature-level correlation matrix layer for each worker is isolated locally to prevent it from being interfered with by others. In Label Correlation Personalization (LCP), we then propose a bi-directional calibration loss to identify a more accurate learning direction, where the positive calibration aligns the prediction result with the latent true label, and the negative calibration pushes away the prediction result that falls into the non-candidate label set. We provide a convergence analysis of pFedPLL with a rate of O(1T)O\left(\sqrt{\frac{1}{T}}\right) for smooth non-convex problems. Experiment results demonstrate that pFedPLL outperforms SOTA federated PLL algorithms and the federated version of centralized PLL algorithms across nine datasets.

2635Optimizing Inference-Time Reasoning in LLMs via Retrieval-Augmented Reflection

[openreview] [pdf]

Abstract Empowering LLMs to improve their performance through increased inference-time computation is a crucial step in developing self-improving agents capable of operating in open-ended natural language contexts. In this paper, we explore how iterative revising a chain of thoughts with the help of information retrieval significantly improves large language models’ reasoning ability in challenging tasks, while hugely mitigating hallucination. In particular, the proposed method --- \emph{retrieval-augmented reflection} (RaR) --- revises the generation tokens step one by one with multiple retrieved information relevant to the instruction. Applying RaR during inference-time to a various set of language models substantially improves their performances on various reasoning tasks; on relatively increasing scores by 13.63% on code generation, 16.96% on mathematical reasoning, and 42.78% on embodied task planning. Moreover, we find that with more inference-time computation given to the LLM for multi-times retrieval-augmented reflection, the LLM can continuously improve on various reasoning benchmarks. With lower inference-time computation (FLOPs), a small LM can surpass the performance of the LM with more than 10 times the parameters.

2636Forward-Backward Feature Transfer for Industrial Anomaly Detection and Segmentation

[openreview] [pdf]

Abstract Motivated by efficiency requirements, most industrial anomaly detection and segmentation (IADS) methods process low-resolution images, e.g., 224×224224\times 224 pixels, obtained by downsampling the original input images. In this setting, downsampling is typically applied also to the provided ground-truth defect masks. Yet, as numerous industrial applications demand the identification of both large and small defects, this downsampling procedure may fail to reflect the actual performance achievable by current methods. In this work, we propose a fast approach based on a novel Teacher-Student paradigm. This paradigm relies on two shallow Student MLPs that learn to transfer patch features across the layers of a frozen Teacher Vision Transformer. Our framework can spot anomalies from high-resolution images faster than other methods, even when they process low-resolution images, achieving state-of-the-art overall performance on MVTec AD and segmentation results on VisA. We also propose novel evaluation metrics that capture robustness regarding defect size, i.e., the ability of a method to preserve good localization from large anomalies to tiny ones, focusing on segmentation performance as a function of anomaly size. Evaluating our method with these metrics reveals its stable performance in detecting anomalies of any size.

2637Scalable Preference Learning for Large Language Models via Convex Optimization

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) for alignment with human preferences have become a key factor in the success of models like ChatGPT and Gemini, which are now integral to mainstream use. Many effective techniques are based on Reinforcement Learning from Human Feedback (RLHF), yet are complex, unstable, and expensive to implement. Recently, Direct Preference Optimization (DPO) offers an accessible alternative by simplifying the objective and training a policy model using a frozen, copied reference model to provide a stable training benchmark. In this paper, we develop an even more lightweight DPO based algorithm that operates on a single GPU. The key to achieving this is leveraging the convex optimization reformulation of neural networks, and reducing the dependence on copying the reference model. Our aim is to provide faster convergence to solutions of better optimality, and higher interpretability of the underlying optimization landscape for generative language tasks. We use the Alternating Direction Method of Multipliers (ADMM) to solve this optimization problem in order to increase parallelization efficiency, and implement our methods in JAX to lift the memory constraints across experiments. We experiment on three datasets, including one synthetically generated educational dataset, to demonstrate the efficacy of our novel algorithm in a real world setting. Our method is comparable in user preference generation to DPO when tested on 17 human volunteers, despite being trained on one single RTX-4090 GPU using a smaller dataset.

2638GC4NC: A Benchmark Framework for Graph Condensation on Node Classification with New Insights

[openreview] [pdf]

Abstract Graph condensation (GC) is an emerging technique designed to learn a significantly smaller graph that retains the essential information of the original graph. This condensed graph has shown promise in accelerating graph neural networks while preserving performance comparable to those achieved with the original, larger graphs. Additionally, this technique facilitates downstream applications like neural architecture search and deepens our understanding of redundancies in large graphs. Despite the rapid development of GC methods, particularly for node classification, a unified evaluation framework is still lacking to systematically compare different GC methods or clarify key design choices for improving their effectiveness. To bridge these gaps, we introduceGC4NC, a comprehensive framework for evaluating diverse GC methods on node classification across multiple dimensions including performance, efficiency, privacy preservation, denoising ability, NAS effectiveness, and transferability. Our systematic evaluation offers novel insights into how condensed graphs behave and the critical design choices that drive their success. These findings pave the way for future advancements in GC methods, enhancing both performance and expanding their real-world applications. The code is available athttps://anonymous.4open.science/r/GC4NC-1620.

2639Reducing Complexity of Force-Directed Graph Embedding

[openreview] [pdf]

Abstract Graph embedding is a critical pre-processing step that maps elements of a graph network, such as its nodes or edges, to coordinates in a dd-dimensional space. The primary goal of the embedding process is to capture and preserve various features of the graph network, including its topology and node attributes, in the generated embedding. Maintaining these graph features in the embedding can significantly enhance the performance of the downstream machine learning tasks. In this work, we introduce a novel family of graph embedding methods that leverage kinematics principles within a spring model and nn-body simulation framework to generate the graph embedding. The proposed method differs substantially from state-of-the-art (SOTA) methods, as it does not attempt to fit a model (such as neural networks) and eliminates the need for functions such as message passing or back-propagation. Instead, it aims to position the nodes in the embedding space such that the total net force of the system is reduced to a minimal threshold, resulting in the system reaching an equilibrium state. The spring model is designed as a linear summation of non-linear force functions, with the shortest-path distance serving as the adjusting parameter for the force factor between each node pair, and therefore, inducing the graph topology in the force functions. In this work, we attempted to reduce the complexity of the original algorithm from log(n2)\log(n^2) to nlog(n)n\log(n), while maintaining the performance metrics at a competitive level. The proposed method is intuitive, parallelizable, and highly scalable. While the primary focus of this work is on the feasibility of the Force-Directed approach, the results in unsupervised graph embeddings are comparable to or better than SOTA methods, demonstrating its potential for practical applications.

2640MamBEV: Enabling State Space Models to Learn Birds-Eye-View Representations

[openreview] [pdf]

Abstract 3D visual perception tasks, such as 3D detection from multi-camera images, are essential components of autonomous driving and assistance systems. However, designing computationally efficient methods remains a significant challenge. In this paper, we propose a Mamba-based framework called MamBEV, which learns unified Bird’s Eye View (BEV) representations using linear spatio-temporal SSM-based attention. This approach supports multiple 3D perception tasks with significantly improved computational and memory efficiency. Furthermore, we introduce SSM based cross-attention, analogous to standard cross attention, where BEV query representations can interact with relevant image features. Extensive experiments demonstrate MamBEV’s promising performance across diverse visual perception metrics, highlighting its advantages in input scaling efficiency compared to existing benchmark models.

2641Towards Realistic Mechanisms That Incentivize Federated Participation and Contribution

[openreview] [pdf]

Abstract Edge device participation in federating learning (FL) is typically studied through the lens of device-server communication (e.g., device dropout) and assumes an undying desire from edge devices to participate in FL. As a result, current FL frameworks are flawed when implemented in realistic settings, with many encountering the free-rider dilemma. In a step to push FL towards realistic settings, we propose RealFM: the first federated mechanism that (1) realistically models device utility, (2) incentivizes data contribution and device participation, (3) provably removes the free-rider dilemma, and (4) relaxes assumptions on data homogeneity and data sharing. Compared to previous FL mechanisms, RealFM allows for a non-linear relationship between model accuracy and utility, which improves the utility gained by the server and participating devices. On real-world data, RealFM improves device and server utility, as well as data contribution, by over 3 and 4 magnitudes respectively compared to baselines.

2642Safeguarding System Prompts for LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly utilized in applications where system prompts, which guide model outputs, play a crucial role. These prompts often contain business logic and sensitive information, making their protection essential. However, adversarial and even regular user queries can exploit LLM vulnerabilities to expose these hidden prompts. To address this issue, we present PromptKeeper, a novel defense mechanism for system prompt privacy. By reliably detecting worst-case leakage and regenerating outputs without the system prompt when necessary, PromptKeeper ensures robust protection against prompt extraction attacks via either adversarial or regular queries, while preserving conversational capability and runtime efficiency during benign user interactions.

2643A Conditional Independence Test in the Presence of Discretization

[openreview] [pdf]

Abstract Testing conditional independence (CI) has many important applications, such as Bayesian network learning and causal discovery. Although several approaches have been developed for learning CI structures for observed variables, those existing methods generally fail to work when the variables of interest can not be directly observed and only discretized values of those variables are available. For example, if X1X_1, X~2\tilde{X}_2 and X3X_3 are the observed variables, where X~2\tilde{X}_2 is a discretization of the latent variable X2X_2, applying the existing methods to the observations of X1X_1, X~2\tilde{X}_2 and X3X_3 would lead to a false conclusion about the underlying CI of variables X1X_1, X2X_2 and X3X_3. Motivated by this, we propose a CI test specifically designed to accommodate the presence of discretization. To achieve this, a bridge equation and nodewise regression are used to recover the precision coefficients reflecting the conditional dependence of the latent continuous variables under the nonparanormal model. An appropriate test statistic has been proposed, and its asymptotic distribution under the null hypothesis of CI has been derived. Theoretical analysis, along with empirical validation on various datasets, rigorously demonstrates the effectiveness of our testing methods.

2644Improving Uncertainty Estimation through Semantically Diverse Language Generation

[openreview] [pdf]

Abstract Large language models (LLMs) can suffer from hallucinations when generating text. These hallucinations impede various applications in society and industry by making LLMs untrustworthy. Current LLMs generate text in an autoregressive fashion by predicting and appending text tokens. When an LLM is uncertain about the semantic meaning of the next tokens to generate, it is likely to start hallucinating. Thus, it has been suggested that predictive uncertainty is one of the main causes of hallucinations. We introduce Semantically Diverse Language Generation (SDLG) to quantify predictive uncertainty in LLMs. SDLG steers the LLM to generate semantically diverse yet likely alternatives for an initially generated text. This approach provides a precise measure of aleatoric semantic uncertainty, detecting whether the initial text is likely to be hallucinated. Experiments on question-answering tasks demonstrate that SDLG consistently outperforms existing methods while being the most computationally efficient, setting a new standard for uncertainty estimation in LLMs.

2645Efficient Sequential Policy Optimization via Off-Policy Correction in Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Although trust region policy optimization methods have achieved a lot of success in cooperative multi-agent tasks, most of them face a non-stationarity problem during the learning process. Recently, sequential trust region methods that update policies agent-by-agent have shed light on alleviating the non-stationarity problem. However, these methods are still less sample-efficient when compared to their counterparts (i.e., PPO) in a single-agent setting. To narrow this efficiency gap, we propose the Off-Policyness-aware Sequential Policy Optimization (OPSPO) method, which explicitly manages the off-policyness that arises from the sequential policy update process among multiple agents. We prove that our OPSPO has the tightness of the monotonic improvement bound compared with other trust region multi-agent learning methods. Finally, we demonstrate that our OPSPO consistently outperforms strong baselines under challenging multi-agent benchmarks, including StarCraftII micromanagement tasks, Multi-agent MuJoCo, and Google Research Football full game scenarios.

2646RB-Modulation: Training-Free Personalization using Stochastic Optimal Control

[openreview] [pdf]

Abstract We propose Reference-Based Modulation (RB-Modulation), a new plug-and-play solution for training-free personalization of diffusion models. Existing training-free approaches exhibit difficulties in (a) style extraction from reference images in the absence of additional style or content text descriptions, (b) unwanted content leakage from reference style images, and (c) effective composition of style and content. RB-Modulation is built on a novel stochastic optimal controller where a style descriptor encodes the desired attributes through a terminal cost. The resulting drift not only overcomes the difficulties above, but also ensures high fidelity to the reference style and adheres to the given text prompt. We also introduce a cross-attention-based feature aggregation scheme that allows RB-Modulation to decouple content and style from the reference image. With theoretical justification and empirical evidence, our framework demonstrates precise extraction and control ofcontentandstylein a training-free manner. Additionally, our method allows a seamless composition of content and style, which marks a departure from the dependency on external adapters or ControlNets

2647Learning Linear Utility Functions From Pairwise Comparison Queries

[openreview] [pdf]

Abstract There is increasingly widespread use of reward model learning from human preferences to align AI systems with human values, with applications including large language models, recommendation systems, and robotic control. Nevertheless, a fundamental understanding of our ability to successfully learn utility functions in this model remains limited. We initiate this line of work by studying learnability of linear utility functions from pairwise comparison queries. In particular, we consider two learning objectives. The first objective is to predict out-of-sample responses to pairwise comparisons, whereas the second is to approximately recover the true parameters of the utility function. We show that in the passive learning setting, linear utilities are efficiently learnable with respect to the first objective, both when query responses are uncorrupted by noise, and under Tsybakov noise when the distributions are sufficiently “nice”. In contrast, we show that utility parameters are not learnable for a large set of data distributions without strong modeling assumptions, even when query responses are noise-free. Next, we proceed to analyze the learning problem in an active learning setting. In this case, we show that even the second objective is efficiently learnable, and present algorithms for both the noise-free and noisy query response settings. This qualitative learnability gap between passive and active learning from pairwise comparisons suggests that the tendency of conventional alignment practices to simply annotate a fixed set of queries may fail to yield effective reward model estimates, an issue that can be remedied through more deliberate query selection.

2648Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

[openreview] [pdf]

Abstract The availability of high-quality data is one of the most important factors in improving the reasoning capability of LLMs. Existing works have demonstrated the effectiveness of creating more instruction data from seed questions or knowledge bases. Recent research indicates that continually scaling up data synthesis from strong models (e.g., GPT-4) can further elicit reasoning performance. Though promising, the open-sourced community still lacks high-quality data at scale and scalable data synthesis methods with affordable costs. To address this, we introduce ScaleQuest, a scalable and novel data synthesis method that utilizes ``small-size’’ (e.g., 7B) open-source models to generate questions from scratch without the need for seed data with complex augmentation constraints. With the efficient ScaleQuest, we automatically constructed a mathematical reasoning dataset consisting of 1 million problem-solution pairs, which are more effective than existing open-sourced datasets. It can universally increase the performance of mainstream open-source models (i.e., Mistral, Llama3, DeepSeekMath, and Qwen2-Math) by achieving 29.2% to 46.4% gains on MATH. Notably, simply fine-tuning the Qwen2-Math-7B-Base model with our dataset can even surpass Qwen2-Math-7B-Instruct, a strong and well-aligned model on closed-source data, and proprietary models such as GPT-4-Turbo and Claude-3.5 Sonnet.

2649Training Open-ended Policies to follow Video-prompt Instructions with Reinforcement Learning

[openreview] [pdf]

Abstract In recent years, online reinforcement learning(RL) training methods like PPO have shone in important works such as Instruct GPT. However, unlike the success achieved in the language domain, online RL methods often struggle to generalize to untrained tasks in open-world environments like Minecraft, due to issues like overfitting. This has become a significant obstacle in using online methods to build a generalist agent. In this work, we notice the modality differences between natural language environments and embodied environments such as the Minecraft environment, which inspired us to use video instructions instead of text instructions to enhance the model’s understanding of the relationship between the environment and instructions. We also introduce a new attention layer in the base model’s encoder-decoder architecture to establish a semantic and visual dual-path information interaction channel, further strengthening this generalization capability. After training our model on a small set of tasks, it demonstrated excellent zero-shot generalization on new tasks, outperforming almost all other models in the Minecraft environment on our benchmark. Our approach takes a solid and important step toward unleashing the potential of online RL in building generalist agents. zero-shot generalization on new tasks, outperforming almost all other models in the Minecraft environment on our benchmark. Our approach takes a solid and important step toward unleashing the potential of online RL in building generalist agents.

2650Embedding Safety into RL: A New Take on Trust Region Methods

[openreview] [pdf]

Abstract Reinforcement Learning (RL) agents are able to solve a wide variety of tasks but are prone to producing unsafe behaviors. Constrained Markov Decision Processes (CMDPs) provide a popular framework for incorporating safety constraints. However, common solution methods often compromise reward maximization by being overly conservative or allow unsafe behavior during training. We propose Constrained Trust Region Policy Optimization (C-TRPO), a novel approach that modifies the geometry of the policy space based on the safety constraints and yields trust regions composed exclusively of safe policies, ensuring constraint satisfaction throughout training. We theoretically study the convergence and update properties of C-TRPO and highlight connections to TRPO, Natural Policy Gradient (NPG), and Constrained Policy Optimization (CPO). Finally, we demonstrate experimentally that C-TRPO significantly reduces constraint violations while achieving competitive reward maximization compared to state-of-the-art CMDP algorithms.

2651Learning a Bi-directional Driving Data Generator via Large Multi-modal Model Tuning

[openreview] [pdf]

Abstract Understanding human driving behaviors is crucial for developing a reliable vehicle and transportation system. Yet, data for learning these behaviors is scarce and must be carefully labeled with events, causes, and consequences. Such data may be more difficult to obtain in rare driving domains, such as in high-performance multi-car racing. While large language models (LLMs) show promise in interpreting driving behaviors, the integration of multi-modal inputs (e.g., language, trajectory, and more) and generation of multi-modal output in low-data regimes remains under-explored. In this paper, we introduce Bi-Gen: a Bi-directional Driving Data Generator, Bi-Gen is a bi-directional multi-modal model that connects a trained encoder-decoder architecture with a pre-trained LLM, enabling both auto-annotation and generation of human driving behaviors. Our experiments show that Bi-Gen, despite its smaller size, matches the performance of much larger models like GPT-4o in annotating driving data. Additionally, Bi-Gen generates diverse, human-like driving behaviors, offering a valuable tool for synthetic data generation in resource-constrained settings. Taken together, our experiments are a significant step towards applying LLMs to complex, multi-agent driving data.

2652Open-Set Learning for Addressing Label Skews in One-Shot Federated Learning

[openreview] [pdf]

Abstract Federated learning (FL) is crucial for collaborative model training, yet it faces significant challenges from data heterogeneity, particularly label skews across clients, where some classes may be underrepresented or absent entirely. In one-shot FL, where clients only communicate with the server once, this problem becomes even more challenging. Recent solutions propose incorporating open-set learning (OSL) to tackle this issue by detecting unknown samples during inference, but current methods like FedOV lack adaptability to varying client data distributions. In this paper, we provide a theoretical analysis proving that improving OSL algorithms can effectively address label skews in one-shot FL, since one-shot FL is learnable through good OSL algorithms regardless of label skews. We also empirically evaluate state-of-the-art OSL algorithms and identify their limitations. Based on these insights, we propose FedAdav, an adaptive algorithm that combines OSL signals to significantly improve ensemble accuracy in one-shot FL under label skews. Through extensive experiments, we demonstrate that exploring better OSL is key to overcoming label skew challenges in federated learning.

2653Transformer Meets Twicing: Harnessing Unattended Residual Information

[openreview] [pdf]

Abstract Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks. While the self-attention mechanism, a core component of transformers, has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers, thereby hurting its overall performance. In this work, we leverage the connection between self-attention computations and low-pass non-local means (NLM) smoothing filters and propose the Twicing Attention, a novel attention mechanism that useskernel twicing procedurein nonparametric regression to alleviate the low-pass behavior of associated NLM smoothing with compelling theoretical guarantees. This approach enables the extraction and reuse of meaningful information retained in the residuals following the imperfect smoothing operation at each layer. Our proposed method offers two key advantages over standard self-attention: 1) a provably slower decay of representational capacity and 2) improved accuracy across various data modalities and tasks. We empirically demonstrate the performance gains of our model over baseline transformers on multiple tasks and benchmarks, including image classification and language modeling, on both clean and corrupted data.

2654Regularity explains emergence

[openreview] [pdf]

Abstract We investigate the mechanisms behind emergence in large language models from the viewpoint of the regularity of the optimal response function ff^* on the space of prompt tokens. Based on theoretical justification, we provide an interpretation that the derivatives of ff^* are in general unbounded and the model gives up reasoning in regions where the derivatives are large. In such regions, instead of predicting ff^*, the model predicts a smoothified version obtained via an averaging operator. The threshold on the norm of derivatives for regions that are given up increases together with the number of parameters NN, causing emergence. The relation between regularity and emergence is supported by experiments on arithmetic tasks such as multiplication and summation and other tasks. Our interpretation also shed light on why fine-tuning and Chain-of-Thought can significantly improves LLM performance.

2655Robust Graph Attention for Graph Adversarial Attacks: An Information Bottleneck Inspired Approach

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have shown exceptional performance in learning node representations for node-level tasks such as node classification. However, traditional message-passing mechanisms solely based on graph structure in GNNs make them vulnerable to adversarial attacks. Attention-based GNNs have been utilized to improve the robustness of GNNs due to their capabilities to selectively emphasize informative signals over noisy or less relevant ones. However, existing works on robust graph attention methods do not realize the correlation between improved robustness and better adherence to the IB principle of attention-based GNNs. In this work, we find that the IB loss of attention-based GNNs is a strong indicator of their robustness against variant graph adversarial attacks. Attention-based GNNs with lower IB loss learn node representations that correlate less to the input training data while aligning better with the target outputs. Due to better adhering to the IB principle, attention-based GNNs with lower IB loss usually show stronger robustness against graph adversarial attacks. Inspired by such observation, we propose a novel graph attention method termed Robust Graph Attention inspired by Information Bottleneck, or RGA-IB, which explicitly minimizes the IB loss of a multi-layer GNN through a carefully designed graph attention mechanism. Extensive experiment results on semi-supervised node classification under variant graph adversarial attacks show that GNNs equipped with RGA-IB exhibit lower IB loss, which indicates better adherence to the IB principle, and show significantly improved node classification accuracy under graph adversarial attacks compared to existing robust GNNs. The code of RGA-IB is available at \url{https://anonymous.4open.science/r/RGA-IB-A47F/}.

2656Sample Efficient Multiple-policy Evaluation in Reinforcement Learning

[openreview] [pdf]

Abstract We study the multiple-policy evaluation problem where we are given a set of KK policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy ϵ\epsilon with probability at least 1δ1-\delta. We propose a sample-efficient algorithm named \CAESAR for this problem. Our approach is based on computing an approximate optimal offline sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. \CAESAR has two phases. In the first we produce coarse estimates of the visitation distributions of the target policies at a low order sample complexity rate that scales with O~(1ϵ)\tilde{O}(\frac{1}{\epsilon}). In the second phase, we approximate the optimal offline sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE \citep{nachum2019dualdice} objective. Up to low order and logarithmic terms \CAESAR achieves a sample complexity O~(H4ϵ2h=1Hminμhmaxk[K]s,a(dhπk(s,a))2μh(s,a))\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\min_{\mu_h}\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu_h(s,a)}\right), where dπd^{\pi} is the visitation distribution of policy π\pi, μ\mu is the sampling distribution, and HH is the horizon.

2657RLSF: Reinforcement Learning from Self-feedback for improved logical reasoning

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated impressive capabilities in generating coherent and contextually relevant text. These models arguably lack the ability to logically reason, an essential skill required to solving mathematical problems and programming tasks. While step-by-step prompting approaches show some promise, they often depend on finding a suitable prompt tailored to the specific model and task. In this work, we propose a simple, yet an effective approach to enhance reasoning capabilities by leveraging reinforcement learning (RL) and the confidence scores of a well-calibrated LLM. It involves optimising an implicit reward derived from the model’s confidence levels in the answer to the reasoning task at hand. We generate preference data and fine-tune the LLM in a similar spirit to reinforcement learning from human feedback (RLHF), but without needing any human provided labels or preferences. Our results show that resulting reasoning abilities of an LLM improve and are transferable to other reasoning tasks. This warrants further investigation of RL as a facilitator for solving complex language tasks.

2658EFFICIENT JAILBREAK ATTACK SEQUENCES ON LARGE LANGUAGE MODELS VIA MULTI-ARMED BANDIT-BASED CONTEXT SWITCHING

[openreview] [pdf]

Abstract Content warning: This paper contains examples of harmful language and content. As the capabilities of large language models (LLMs) continue to expand, the risk of these models being manipulated or “jailbroken” by malicious users increases significantly. Traditional AI safety measures primarily focus on algorithmic defenses, but there is a growing need to explore more sophisticated attack strategies that consider the dynamic interactions between human users and LLMs. This paper introduces a novel approach to jailbreaking LLMs through the use of “Sequence of Contexts” (SoC) attacks, wherein sequences of context-switching queries (CSQs) are leveraged to gradually alter the context remembered by the model and steer it towards generating harmful responses. We employ a multi armed bandit (MAB) framework to automate the SoC attack by balancing exploration and exploitation of different CSQs to maximize the likelihood of a successful jailbreak. We achieve an Attack Success Rate (ASR) of over 95%, with our ASRs growing with the increase in the attack sequence lengths. Furthermore, this research provides rigorous theoretical foundations for the proposed method by deriving key bounds on the expected sequence length until the optimal CSQ category that successfully jailbreaks the LLM is identified. This paper also presents a theoretical analysis of total reward convergence in jailbreaking LLMs using CSQ categories. The key contributions of this paper are: (i) the creation of a dataset of CSQs, (ii) the proposition of a novel strategy to automate SoC-based jailbreaking attacks on LLMs, utilizing the MAB framework, and (iii) an in-depth theoretical analysis of the upper bounds on the expected sequence length for identifying optimal attack strategies and the convergence of the total reward.

2659Flow Matching with Gaussian Process Priors for Probabilistic Time Series Forecasting

[openreview] [pdf]

Abstract Recent advancements in generative modeling, particularly diffusion models, have opened new directions for time series modeling, achieving state-of-the-art performance in forecasting and synthesis. However, the reliance of diffusion-based models on a simple, fixed prior complicates the generative process since the data and prior distributions differ significantly. We introduce TSFlow, a conditional flow matching (CFM) model for time series that simplifies the generative problem by combining Gaussian processes, optimal transport paths, and data-dependent prior distributions. By incorporating (conditional) Gaussian processes, TSFlow aligns the prior distribution more closely with the temporal structure of the data, enhancing both unconditional and conditional generation. Furthermore, we propose conditional prior sampling to enable probabilistic forecasting with an unconditionally trained model. In our experimental evaluation on eight real-world datasets, we demonstrate the generative capabilities of TSFlow, producing high-quality unconditional samples. Finally, we show that both conditionally and unconditionally trained models achieve competitive results in forecasting benchmarks, surpassing other methods on 6 out of 8 datasets.

2660Edge Prompt Tuning for Graph Neural Networks

[openreview] [pdf]

Abstract Pre-training powerful Graph Neural Networks (GNNs) with unlabeled graph data in a self-supervised manner has emerged as a prominent technique in recent years. However, inevitable objective gaps often exist between pre-training and downstream tasks. To bridge this gap, graph prompt tuning techniques design and learn graph prompts by manipulating input graphs or reframing downstream tasks as pre-training tasks without fine-tuning the pre-trained GNN models. While recent graph prompt tuning methods have proven effective in adapting pre-trained GNN models for downstream tasks, they overlook the crucial role of edges in graph prompt design, which can significantly affect the quality of graph representations for downstream tasks. In this study, we propose EdgePrompt, a simple yet effective graph prompt tuning method from the perspective of edges. Unlike previous studies that design prompt vectors on node features, EdgePrompt manipulates input graphs by learning additional prompt vectors for edges and incorporates the edge prompts through message passing in the pre-trained GNN models to better embed graph structural information for downstream tasks. Our method is compatible with prevalent GNN architectures pre-trained under various pre-training strategies and is universal for different downstream tasks. We provide comprehensive theoretical analyses of our method regarding its capability of handling node classification and graph classification as downstream tasks. Extensive experiments on ten graph datasets under four pre-training strategies demonstrate the superiority of our proposed method against six baselines. Our code is available athttps://anonymous.4open.science/r/EdgePrompt-4905.

2661Provable Benefit of Annealed Langevin Monte Carlo for Non-log-concave Sampling

[openreview] [pdf]

Abstract We consider the outstanding problem of sampling from an unnormalized density that may be non-log-concave and multimodal. To enhance the performance of simple Markov chain Monte Carlo (MCMC) methods, techniques of annealing type have been widely used. However, quantitative theoretical guarantees of these techniques are under-explored. This study takes a first step toward providing a non-asymptotic analysis of annealed MCMC. Specifically, we establish, for the first time, an oracle complexity of O~(dβ2A2ε6)\widetilde{O}\left(\frac{d\beta^2{\cal A}^2}{\varepsilon^6}\right) for the simple annealed Langevin Monte Carlo algorithm to achieve ε2\varepsilon^2 accuracy in Kullback-Leibler divergence to the target distribution πeV\pi\propto{\rm e}^{-V} on Rd\mathbb{R}^d with β\beta-smooth potential VV. Here, A{\cal A} represents the action of a curve of probability measures interpolating the target distribution π\pi and a readily sampleable distribution.

2662EXPLORING RESPONSE UNCERTAINTY IN MLLMS: AN EMPIRICAL EVALUATION UNDER MISLEADING SCENARIOS

[openreview] [pdf]

Abstract Ensuring that Multimodal Large Language Models (MLLMs) maintain consistency in their responses is essential for developing trustworthy multimodal intelligence. However, existing benchmarks include many samples where all MLLMs exhibit high response uncertainty when encountering misleading information, requiring even 5-15 response attempts per sample to effectively assess uncertainty. Therefore, we propose a two-stage pipeline: first, we collect MLLMs’ responses without misleading information, and then gather misleading ones via specific misleading instructions. By calculating the misleading rate, and capturing both correct-to-incorrect and incorrect-to-correct shifts between the two sets of responses, we can effectively metric the model’s response uncertainty. Eventually, we establish a Multimodal Uncertainty Benchmark (MUB) that employs both explicit and implicit misleading instructions to comprehensively assess the vulnerability of MLLMs across diverse domains. Our experiments reveal that all open-source and close-source MLLMs are highly susceptible to misleading instructions, with an average misleading rate exceeding 86%. To enhance the robustness of MLLMs, we further fine-tune all open-source MLLMs by incorporating explicit and implicit misleading data, which demonstrates a significant reduction in misleading rates

2663Asymmetric Embedding Models for Hierarchical Retrieval: Provable Constructions and a Pretrain-Finetune Recipe

[openreview] [pdf]

Abstract Dual encoder (DE) models, where a pair of matching query and document are embedded into similar vector representations, are widely used in information retrieval due to their efficiency and scalability. However, DEs are known to have a limited expressive power due to the Euclidean geometry of the embedding space, which may compromise their quality. This paper investigate such limitations in the context of \emph{hierarchical retrieval}, the task where the document set has a hierarchical structure and the matching keywords for a query are all of its ancestor nodes. We first prove the feasibility of representing hierarchical structures within the Euclidean embedding space by providing a constructive algorithm for generating effective embeddings from a given hierarchy. Then we delve into the learning of DEs when the hierarchy is unknown, which is a practical assumption since usually only samples of matching query and document pairs are available during training. Our experiments reveal a “lost in the long distance” phenomenon, where retrieval accuracy degrades for documents further away in the hierarchy. To address this, we introduce a pretrain-finetune approach that significantly improves long-distance retrieval without sacrificing performance on closer documents. Finally, we validate our findings on a realistic hierarchy from WordNet, demonstrating the effectiveness of our approach in retrieving documents at various levels of abstraction.

2664Robust Barycenter Estimation using Semi-Unbalanced Neural Optimal Transport

[openreview] [pdf]

Abstract A common challenge in aggregating data from multiple sources can be formalized as anOptimal Transport(OT) barycenter problem, which seeks to compute the average of probability distributions with respect to OT discrepancies. However, the presence of outliers and noise in the data measures can significantly hinder the performance of traditional statistical methods for estimating OT barycenters. To address this issue, we propose a novel, scalable approach for estimating therobustcontinuous barycenter, leveraging the dual formulation of the(semi-)unbalancedOT problem. To the best of our knowledge, this paper is the first attempt to develop an algorithm for robust barycenters under the continuous distribution setup. Our method is framed as a min\min-max\max optimization problem and is adaptable togeneralcost function. We rigorously establish the theoretical underpinnings of the proposed method and demonstrate its robustness to outliers and class imbalance through a number of illustrative experiments.

2665Long-time asymptotics of noisy SVGD outside the population limit

[openreview] [pdf]

Abstract Stein Variational Gradient Descent (SVGD) is a widely used sampling algorithm that has been successfully applied in several areas of Machine Learning. SVGD operates by iteratively moving a set of nn interacting particles (which represent the samples) to approximate the target distribution. Despite recent studies on the complexity of SVGD and its variants, their long-time asymptotic behavior (i.e., after numerous iterations kk) is still not understood in the finite number of particles regime. We study the long-time asymptotic behavior of a noisy variant of SVGD. First, we establish that the limit set of noisy SVGD for large kk is well-defined. We then characterize this limit set, showing that it approaches the target distribution as nn increases. In particular, noisy SVGD provably avoids the variance collapse observed for SVGD. Our approach involves demonstrating that the trajectories of noisy SVGD closely resemble those described by a McKean-Vlasov process.

2666Memorization and the Orders of Loss: A Learning Dynamics Perspective

[openreview] [pdf]

Abstract Deep learning has become the de facto approach in nearly all learning tasks. It has been observed that deep models tend to memorize and sometimes overfit data, which can lead to compromises in performance, privacy, and other critical metrics. In this paper, we explore the theoretical foundations that connect memorization to various orders of sample loss, i.e., sample loss, sample loss gradient, and sample loss curvature, focusing on learning dynamics to understand what and how these models memorize. To this end, we introduce two proxies for memorization: Cumulative Sample Loss (CSL) and Cumulative Sample Gradient (CSG). CSL represents the accumulated loss of a sample throughout training, while CSG is the gradient with respect to the input, aggregated over the training process. CSL and CSG exhibit remarkable similarity to stability-based memorization, as evidenced by considerably high cosine similarity scores. We delve into the theory behind these results, demonstrating that CSL and CSG represent the bounds for stability-based memorization and learning time. Additionally, we extend this framework to include sample loss curvature and connect the three orders, namely, sample loss, sample loss gradient, and sample loss curvature, to learning time and memorization. The proposed proxy, CSL, is four orders of magnitude less computationally expensive than the stability-based method and can be obtained with zero additional overhead during training. We demonstrate the practical utility of the proposed proxies in identifying mislabeled samples and detecting duplicates where our metric achieves state-of-the-art performance. Thus, this paper provides a new tool for analyzing data as it scales in size, making it an important resource in practical applications.

2667Language Model Non-Myopic Generation for Reasoning and Planning

[openreview] [pdf]

Abstract Large Language Models have demonstrated remarkable abilities in reasoning and planning by breaking down complex problems into sequential steps. Despite their success in various domains like mathematical problem-solving and coding, LLMs face challenges in ensuring reliable and optimal planning due to their inherent myopic nature of autoregressive decoding. This paper revisits LLM reasoning from an optimal-control perspective, proposing a novel method, Predictive-Decoding, that leverages Model Predictive Control to enhance planning accuracy. By re-weighting LLM distributions based on foresight trajectories, Predictive-Decoding aims to mitigate early errors and promote non-myopic planning. Our experiments show significant improvements in a wide range of tasks for math, coding, and agents. Furthermore, Predictive-Decoding demonstrates computational efficiency, outperforming search baselines with reduced computational resources. This study provides insights into optimizing LLM planning capabilities.

2668Action abstractions for amortized sampling

[openreview] [pdf]

Abstract As trajectories sampled by policies used by reinforcement learning (RL) and generative flow networks (GFlowNets) grow longer, credit assignment and exploration become more challenging, and the long planning horizon hinders mode discovery and generalization. The challenge is particularly pronounced in entropy-seeking RL methods, such as generative flow networks, where the agent must learn to sample from a structured distribution and discover multiple high-reward states, each of which take many steps to reach. To tackle this challenge, we propose an approach to incorporate the discovery of action abstractions, or high-level actions, into the policy optimization process. Our approach involves iteratively extracting action subsequences commonly used across many high-reward trajectories and `chunking’ them into a single action that is added to the action space. In empirical evaluation on synthetic and real-world environments, our approach demonstrates improved sample efficiency performance in discovering diverse high-reward objects, especially on harder exploration problems. We also observe that the abstracted high-order actions are interpretable, capturing the latent structure of the reward landscape of the action space. This work provides a cognitively motivated approach to action abstraction in RL and is the first demonstration of hierarchical planning in amortized sequential sampling.

2669Few for Many: Tchebycheff Set Scalarization for Many-Objective Optimization

[openreview] [pdf]

Abstract Multi-objective optimization can be found in many real-world applications where some conflicting objectives can not be optimized by a single solution. Existing optimization methods often focus on finding a set of Pareto solutions with different optimal trade-offs among the objectives. However, the required number of solutions to well approximate the whole Pareto optimal set could be exponentially large with respect to the number of objectives, which makes these methods unsuitable for handling many optimization objectives. In this work, instead of finding a dense set of Pareto solutions, we propose a novel Tchebycheff set scalarization method to find a few representative solutions (e.g., 5) to cover a large number of objectives (e.g., >100>100) in a collaborative and complementary manner. In this way, each objective can be well addressed by at least one solution in the small solution set. In addition, we further develop a smooth Tchebycheff set scalarization approach for efficient optimization with good theoretical guarantees. Experimental studies on different problems with many optimization objectives demonstrate the effectiveness of our proposed method.

2670On-the-fly Preference Alignment via Principle-Guided Decoding

[openreview] [pdf]

Abstract With the rapidly expanding landscape of large language models, aligning model generations with human values and preferences is becoming increasingly important. Popular alignment methods, such as Reinforcement Learning from Human Feedback, have shown significant success in guiding models with greater control. However, these methods require considerable computational resources, which is inefficient, and substantial collection of training data to accommodate the diverse and pluralistic nature of human preferences, which is impractical. These limitations significantly constrain the scope and efficacy of both task-specific and general preference alignment methods. In this work, we introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to directly align model outputs with human preferences during inference, eliminating the need for fine-tuning. Our approach involves first curating a surrogate solution to an otherwise infeasible optimization problem and then designing a principle-guided reward function based on this surrogate. The final decoding policy is derived by maximizing this customized reward, which exploits the discrepancy between the constrained policy and its unconstrained counterpart. OPAD directly modifies the model’s predictions during inference, ensuring principle adherence without incurring the computational overhead of retraining or fine-tuning. Experiments show that OPAD achieves competitive or superior performance in both general and personalized alignment tasks, demonstrating its efficiency and effectiveness compared to state-of-the-art baselines.

2671An efficient algorithm for entropic optimal transport under martingale-type constraints

[openreview] [pdf]

Abstract This work introduces novel computational methods for entropic optimal transport (OT) problems under martingale-type conditions. The problems can map to a prevalent class of OT problems with structural constraints, encompassing the discrete martingale optimal transport (MOT) problem, as the (super-)martingale conditions are equivalent to row-wise (in-)equality constraints on the coupling matrix. Inspired by the recent empirical success of Sinkhorn-type algorithms, we propose an entropic formulation for the MOT problem and introduce Sinkhorn-type algorithms with sparse Newton iterations that utilize the (approximate) sparsity of the Hessian matrix of the dual objective. As exact martingale conditions are typically infeasible, we adopt entropic regularization to find an approximate constraint satisfied solution. We show that in practice the proposed algorithms enjoy both super-exponential convergence and robustness with controllable thresholds for total constraint violations.

2672Grokking at the Edge of Numerical Stability

[openreview] [pdf]

Abstract Grokking, or sudden generalization that occurs after prolonged overfitting, is a surprising phenomenon that has challenged our understanding of deep learning. While a lot of progress has been made in understanding grokking, it is still not clear why generalization is delayed and why grokking often does not happen without regularization. In this work we argue that without regularization, grokking tasks push models to the edge of numerical stability, introducing floating point errors in the Softmax that we refer to asSoftmax Collapse(SC). We show that SC prevents grokking and that mitigating SC leads to grokkingwithoutregularization. Investigating the root cause of SC, we find that beyond the point of overfitting, the gradients strongly align with what we call thenaïve loss minimization(NLM) direction. This component of the gradient does not change the predictions of the model but decreases the loss by scaling the logits, usually through the scaling of the weights along their current direction. We show that this scaling of the logits explains the delay in generalization characteristic of grokking, and eventually leads to SC, stopping learning altogether. To validate these hypotheses, we introduce two key contributions that mitigate the issues faced in grokking tasks: (i) StableMax\mathrm{StableMax}, a new activation function that prevents SC and enables grokking without regularization, and (ii) Grad\perp\mathrm{Grad}, a training algorithm that leads to quick generalization in grokking tasks by preventing NLM altogether. These contributions provide new insights into grokking, shedding light on its delayed generalization, reliance on regularization, and the effectiveness of known grokking-inducing methods.

2673PABBO: Preferential Amortized Black-Box Optimization

[openreview] [pdf]

Abstract Preferential Bayesian Optimization (PBO) is a sample-efficient method to learn latent user utilities from preferential feedback over a pair of designs. It relies on a statistical surrogate model for the latent function, usually a Gaussian process, and an acquisition strategy to select the next candidate pair to get user feedback on. Due to the non-conjugacy of the associated likelihood, every PBO step requires a significant amount of computations with various approximate inference techniques. This computational overhead is incompatible with the way humans interact with computers, hindering the use of PBO in real-world cases. Building on the recent advances of amortized BO, we propose to circumvent this issue by fully amortizing PBO, meta-learning both the surrogate and the acquisition function. Our method comprises a novel transformer neural process architecture, trained using reinforcement learning and tailored auxiliary losses. On a benchmark composed of synthetic and real-world datasets, our method is several orders of magnitude faster than the usual Gaussian process-based strategies and often outperforms them in accuracy.

2674Commute Graph Neural Networks

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have shown remarkable success in learning from graph-structured data. However, their application to directed graphs (digraphs) presents unique challenges, primarily due to the inherent asymmetry in node relationships. Traditional GNNs are adept at capturing unidirectional relations but fall short in encoding the mutual path dependencies between nodes, such as asymmetrical shortest paths typically found in digraphs. Recognizing this gap, we introduce Commute Graph Neural Networks (CGNN), an approach that seamlessly integrates node-wise commute time into the message passing scheme. The cornerstone of CGNN is an efficient method for computing commute time using a newly formulated digraph Laplacian. Commute time is then integrated into the neighborhood aggregation process, with neighbor contributions weighted according to their respective commute time to the central node in each layer. It enables CGNN to directly capture the mutual, asymmetric relationships in digraphs. Extensive experiments confirm the superior performance of CGNN. Source code of CGNN is anonymously available here.

2675FedPCE: Federated Personalized Client Embeddings

[openreview] [pdf]

Abstract Despite recent efforts, federated learning (FL) still faces performance challenges due to non-IID data distributions among clients. This distribution shift complicates the addition of new clients and the transfer of federally learned models to unseen data. Inspired by the adaptation ability of normalization layer parameters, we first demonstrate the effectiveness of models trained using FedBN when being adapted to so far unseen data. Specifically, we extend the adaptation method based on a visual analysis of the normalization layer feature vectors. We introduce Federated Personalized Client Embeddings (FedPCE), which utilizes local embeddings to capture the underlying structure of the normalization feature vectors and, by extension, the dataset. Our results show that FedPCE performs comparably to other common FL algorithms during both training and adaptation. Notably, FedPCE achieves this performance using only a fraction of the parameters during fine-tuning (32 parameters in our experiments) compared to other methods.

2676Data-Centric AI Governance: Addressing the Limitations of Model-Focused Policies

[openreview] [pdf]

Abstract Current regulations on powerful AI capabilities are narrowly focused on “foundation” or “frontier” models. However, these terms are vague and inconsistently defined, leading to an unstable foundation for governance efforts. Critically, policy debates often fail to consider the data used with these models, despite the clear link between data and model performance. Even (relatively) “small” models that fall outside the typical definitions of foundation and frontier models can achieve equivalent outcomes when exposed to sufficiently specific datasets. In this work, we illustrate the importance of considering dataset size and content as essential factors in assessing the risks posed by models both today and in the future. More broadly, we emphasize the risk posed by over-regulating reactively and provide a path towards careful, quantitative evaluation of capabilities that can lead to a simplified regulatory environment.

2677Trained Transformer Classifiers Generalize and Exhibit Benign Overfitting In-Context

[openreview] [pdf]

Abstract Transformers have the capacity to act as supervised learning algorithms: by properly encoding a set of labeled training (‘‘in-context’’) examples and an unlabeled test example into an input sequence of vectors of the same dimension, the forward pass of the transformer can produce predictions for that unlabeled test example. A line of recent work has shown that when linear transformers are pre-trained on random instances for linear regression tasks, these trained transformers make predictions using an algorithm similar to that of ordinary least squares. In this work, we investigate the behavior of linear transformers trained on random linear classification tasks. Via an analysis of the implicit regularization of gradient descent, we characterize how many pre-training tasks and in-context examples are needed for the trained transformer to generalize well at test-time. We further show that in some settings, these trained transformers can exhibit ‘‘benign overfitting in-context’’: when in-context examples are corrupted by label flipping noise, the transformer memorizes all of its in-context examples (including those with noisy labels) yet still generalizes near-optimally for clean test examples.

2678Input Compensation for Pruned Models

[openreview] [pdf]

Abstract Though foundation models are powerful, they are large and require substantial memory and computation resources for serving. To tackle this issue, many pruning methods have been proposed to reduce the model size, thereby achieving memory and computational efficiency. These methods either identify and retrain the important weights or \textit{adjust the unpruned weights} to compensate for the removed weights. In this paper, we propose a novel approach called input compensation (IC) to boost the performance of pruned models, i.e., \textit{adjust the input} to compensate for the removed weights. We learn a compensation pool to construct input-dependent compensation to reduce the error caused by pruning. Different from existing pruning methods, which are designed in the parameter space, the proposed IC is designed in the input space. Hence, IC is complementary to existing methods and can be integrated with them. Extensive experiments on various tasks, including image classification, language modeling, and image generation, demonstrate that IC is effective in improving the performance of pruned models.

2679Honey: Harmonizing Progressive Federated Learning via Elastic Synergy across Different Training Blocks

[openreview] [pdf]

Abstract Memory limitation is becoming the prevailing challenge that hinders the deployment of Federated Learning on mobile/IoT devices in real-world cases. Progressive training offers a promising alternative to surpass memory constraints. Instead of updating the full model in each training round, progressive training divides the model into multiple blocks and iteratively updates each block until the full model is converged. However, existing progressive training approaches suffer from prominent accuracy degradation as training each block in isolation drives it to prioritize features that are only beneficial to its specific needs, neglecting the overall learning objective. To address this issue, we present \texttt{\textbf{Honey}}, a synergistic progressive training approach that integrates the holistic view and block-wise feedback to facilitate the training of each block. Specifically, the holistic view broadens the learning scope of each block, ensuring that it operates in harmony with the global objective and benefits the training of the whole model. Simultaneously, block-wise feedback heightens each block’s awareness of its role and position within the full model, empowering it to make real-time adjustments based on insights from downstream blocks and facilitating a smooth and consistent information flow. Furthermore, to fully harness the heterogeneous memory resources of participating devices, we develop an elastic resource harmonization protocol. This protocol authorizes each device to adaptively train specific layers according to their memory capacity, optimizing resource utilization, sparking cross-block communication, and accelerating model convergence. Comprehensive experiments on benchmark datasets and models demonstrate that \texttt{\textbf{Honey}} outperforms state-of-the-art approaches, delivering an exceptional average accuracy improvement of up to 43.9%. Moreover, \texttt{\textbf{Honey}} achieves comparable performance even with a reduction in peak memory usage of up to 49%.

2680Decoupled Subgraph Federated Learning

[openreview] [pdf]

Abstract We address the challenge of federated learning on graph-structured data distributed across multiple clients. Specifically, we focus on the prevalent scenario of interconnected subgraphs, where inter-connections between different clients play a critical role. We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies. To uphold privacy, unlike existing methods, FedStruct eliminates the necessity of sharing or generating sensitive node features or embeddings among clients. Instead, it leverages explicit global graph structure information to capture inter-node dependencies. We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification, showcasing performance close to the centralized approach across various scenarios, including different data partitioning methods, varying levels of label availability, and number of clients.

2681Gradient Storm: Stronger Backdoor Attacks Through Expanded Parameter Space Coverage

[openreview] [pdf]

Abstract Targeted data poisoning poses a critical adversarial threat to machine learning systems by enabling attackers to manipulate training data to induce specific, harmful misclassifications. Among these threats, backdoor attacks are particularly pernicious, embedding hidden triggers in the data that lead models to misclassify only those inputs containing the trigger, while maintaining high accuracy on benign samples. In this paper, we propose Gradient Storm, a novel technique that facilitates the simultaneous execution of multiple backdoor attacks, while necessitating only minimal modification to the training dataset. Our contributions are twofold: First, we introduce a method for designing adversarial poisons in modular components, each tailored based on a distinct region of the model’s parameter space. Second, we present a framework for conducting multi-trigger attacks, where each trigger causes misclassification from a specific source class to a distinct target class. We evaluate the efficacy of Gradient Storm across multiple neural network architectures and two benchmark datasets, demonstrating its robustness against eight different poisoning defense mechanisms. Additionally, we show that poisons crafted for one model can be effectively transferred to other models, demonstrating that our attack remains effective even in black-box settings.

2682Delay Neural Networks (DeNN) for exploiting temporal information in event-based datasets

[openreview] [pdf]

Abstract In Deep Neural Networks (DNN) and Spiking Neural Networks (SNN), the information of a neuron is computed based on the sum of the amplitudes (weights) of the electrical potentials received in input from other neurons. We propose here a new class of neural networks, namely Delay Neural Networks (DeNN), where the information of a neuron is computed based on the sum of its input synaptic delays and on the spike times of the electrical potentials received from other neurons. This way, DeNN are designed to explicitly use exact continuous temporal information of spikes in both forward and backward passes, without approximation. (Deep) DeNN are applied here to images and event-based (audio and visual) data sets. Good performances are obtained, especially for datasets where temporal information is important, with much less parameters than other models.

2683Release the Powers of Prompt Tuning: Cross-Modality Prompt Transfer

[openreview] [pdf]

Abstract Prompt Tuning adapts frozen models to new tasks by prepending a few learnable embeddings to the input. However, it struggles with tasks that suffer from data scarcity. To address this, we explore Cross-Modality Prompt Transfer, leveraging prompts pretrained on a data-rich modality to improve performance on data-scarce tasks in another modality. As a pioneering study, we first verify the feasibility of cross-modality prompt transfer by directly applying frozen source prompts (trained on the source modality) to the target modality task. To empirically study cross-modality prompt transferability, we train a linear layer to adapt source prompts to the target modality, thereby boosting performance and providing ground-truth transfer results. Regarding estimating prompt transferability, existing methods show ineffectiveness in cross-modality scenarios where the gap between source and target tasks is larger. We address this by decomposing the gap into the modality gap and the task gap, which we measure separately to estimate the prompt transferability more accurately. Additionally, we propose Attention Transfer to further reduce the gaps by injecting target knowledge into the prompt and reorganizing a top-transferable source prompt using an attention block. We conduct extensive experiments involving prompt transfer from 13 source language tasks to 19 target vision tasks under three settings. Our findings demonstrate that: (i) cross-modality prompt transfer is feasible, supported by in-depth analysis; (ii) measuring both the modality and task gaps is crucial for accurate prompt transferability estimation, a factor overlooked by previous studies; (iii) cross-modality prompt transfer can significantly release the powers of prompt tuning on data-scarce tasks, as evidenced by comparisons with a newly released prompt-based benchmark.

2684Adaptive Self-Supervised Learning Strategies for Dynamic On-Device LLM Personalization

[openreview] [pdf]

Abstract Large language models (LLMs) have revolutionized how we interact with technology, but their personalization to individual user preferences remains a significant challenge, particularly in on-device applications. Traditional methods often depend heavily on labeled datasets and can be resource-intensive. To address these issues, we present Adaptive Self-Supervised Learning Strategies (ASLS), which utilizes self-supervised learning techniques to personalize LLMs dynamically. The framework comprises a user profiling layer for collecting interaction data and a neural adaptation layer for real-time model fine-tuning. This innovative approach enables continuous learning from user feedback, allowing the model to generate responses that align closely with user-specific contexts. The adaptive mechanisms of ASLS minimize computational demands and enhance personalization efficiency. Experimental results across various user scenarios illustrate the superior performance of ASLS in boosting user engagement and satisfaction, highlighting its potential to redefine LLMs as highly responsive and context-aware systems on-device.

2685RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation

[openreview] [pdf]

Abstract Generative AI systems like foundation models (FMs) must align well with human values to ensure their behavior is helpful and trustworthy. While Reinforcement Learning from Human Feedback (RLHF) has shown promise for optimizing model performance using human judgments, existing RLHF pipelines predominantly rely onimmediatefeedback, which can fail to reflect the true downstream impact of an interaction on users’ utility. We demonstrate that this shortsighted feedback can, by itself, result in misaligned behaviors like sycophancy and deception, and we propose to alleviate this by refocusing RLHF ondownstream consequences. Our theoretical analysis reveals that the hindsight gained by simply delaying human feedback mitigates misalignment and improves expected human utility. To leverage this insight in a practical alignment algorithm, we introduce Reinforcement Learning from Hindsight Simulation (RLHS), which first simulates plausible consequences and then elicits feedback to assess what behaviors were genuinely beneficial in hindsight. We apply RLHS to two widely-employed online and offline preference optimization methods---Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO)---and show empirically that misalignment is significantly reduced with both methods. Through an online human user study, we show that RLHS consistently outperforms RLHF in helping users achieve their goals and earns higher satisfaction ratings, despite being trained solely with simulated hindsight feedback. These results underscore the importance of focusing on long-term consequences, even simulated ones, to mitigate misalignment in RLHF.

2686SHIELD: Multi-task Multi-distribution Vehicle Routing Solver with Sparsity & Hierarchy in Efficiently Layered Decoder

[openreview] [pdf]

Abstract Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity. This improves both efficiency and generalization by allowing the model to dynamically choose whether to use or skip each decoder layer, providing the needed capacity to adaptively allocate computation for learning the task/distribution specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each.

2687On the Training Convergence of Transformers for In-Context Classification

[openreview] [pdf]

Abstract While transformers have demonstrated impressive capacities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism enabling transformers to perform ICL is still in its infant stage. This work aims to study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained by gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and test prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth label in context. Experimental results corroborate the theoretical findings.

2688Towards Better Benchmark Datasets for Inductive Knowledge Graph Completion

[openreview] [pdf]

Abstract Knowledge Graph Completion (KGC) attempts to predict missing facts in a Knowledge Graph (KG). Recently, there’s been an increased focus on designing KGC methods that can excel in theinductive setting, where a portion or all of the entities and relations seen in inference are unobserved during training. Numerous benchmark datasets have been proposed for inductive KGC, all of which are subsets of existing KGs used for transductive KGC. However, we find that the current procedure for constructing inductive KGC datasets inadvertently creates a shortcut that can be exploited even while disregarding the relational information. Specifically, we observe that the Personalized PageRank (PPR) score can achieve strong or near SOTA performance on most inductive datasets. In this paper, we study the root cause of this problem. Using these insights, we propose an alternative strategy for constructing inductive KGC datasets that helps mitigate the PPR shortcut. We then benchmark multiple popular methods using the newly constructed datasets and analyze their performance. The new benchmark datasets help promote a better understanding of the capabilities and challenges of inductive KGC by removing any shortcuts that obfuscate performance.

2689Learning anti-classes with one-cold cross entropy loss

[openreview] [pdf]

Abstract While softmax cross entropy loss is the standard objective for supervised classification, it primarily focuses on the ground truth classes, ignoring the relationships between the non-target, complementary classes. This leaves valuable information unexploited during optimization. In this work, we propose a novel loss function, one-cold cross entropy (OCCE) loss, that addresses this limitation by structuring the activations of these complementary classes. Specifically, for each class, we define an anti-class, which consists of everything that is not part of the target class—this includes all complementary classes as well as out-of-distribution samples, noise, or in general any instance that does not belong to the true class. By setting a uniform one-cold encoded distribution over the complementary classes as a target for each anti-class, we encourage the model to equally distribute activations across all non- target classes. This approach promotes a symmetric geometric structure of classes in the final feature space, increases the degree of neural collapse during training, addresses the independence deficit problem of neural networks and improves generalization. Our extensive evaluation shows that incorporating OCCE loss in the optimization objective consistently enhances performance across multiple settings, including classification, open-set recognition, and out-of-distribution detection.

2690Model aggregation: minimizing empirical variance outperforms minimizing empirical error

[openreview] [pdf]

Abstract Whether deterministic or stochastic, models can be viewed as functions designed to approximate a specific quantity of interest. We introduce a data-driven framework that integrates predictions from various models, enhancing overall accuracy by leveraging the individual strengths of each. This non-intrusive, model-agnostic approach treats the contributing models as black boxes and accommodates outputs from diverse methodologies, including machine learning algorithms and traditional numerical solvers. We advocate for a point-wise linear aggregation process and propose two methods for optimizing this aggregate: Minimal Error Aggregation (MEA), which minimizes the prediction error, and Minimal Variance Aggregation (MVA), which focuses on reducing variance. While MEA is inherently more accurate when correlations between models and the target quantity are perfectly known, Minimal Empirical Variance Aggregation (MEVA), an empirical version of MVA, consistently outperforms Minimal Empirical Error Aggregation (MEEA), the empirical counterpart of MEA, when these correlations must be estimated from data. The key difference is that MEVA constructs an aggregate by estimating model errors, while MEEA treats the models as features for direct interpolation of the quantity of interest. This makes MEEA more susceptible to overfitting and poor generalization, where the aggregate may underperform individual models during testing. We demonstrate the versatility and effectiveness of our framework across various applications, including data science and partial differential equations, illustrating its ability to significantly enhance both robustness and accuracy.

2691TabDPT: Scaling Tabular Foundation Models

[openreview] [pdf]

Abstract The challenges faced by neural networks on tabular data are well-documented and have hampered the progress of tabular foundation models. Techniques leveraging in-context learning (ICL) have shown promise here, allowing for dynamic adaptation to unseen data. ICL can provide predictions for entirely new datasets without further training or hyperparameter tuning, therefore providing very fast inference when encountering a novel task. However, scaling ICL for tabular data remains an issue: approaches based on large language models cannot efficiently process numeric tables, and tabular-specific techniques have not been able to effectively harness the power of real data to improve performance and generalization. We are able to overcome these challenges by training tabular-specific ICL-based architectures on real data with self-supervised learning and retrieval, combining the best of both worlds. Our resulting model -- the Tabular Discriminative Pre-trained Transformer (TabDPT) -- achieves state-of-the-art performance on the CC18 (classification) and CTR23 (regression) benchmarks with no task-specific fine-tuning, demonstrating the adapatability and speed of ICL once the model is pre-trained. TabDPT also demonstrates strong scaling as both model size and amount of available data increase, pointing towards future improvements simply through the curation of larger tabular pre-training datasets and training larger models.

2692GAP: Scalable Driving with Generative Aided Planner

[openreview] [pdf]

Abstract The primary challenge in end-to-end autonomous driving lines in how to establish robust environmental perception and representations. While most methods improve these capabilities by introducing auxiliary perception tasks, the process of obtaining precise large-scale annotations in this paradigm is both time-consuming and laborious, thereby limiting the scalability and practical application. To address this, we propose an architecture based on the Generative Aided Planner (GAP), which integrates scene generation and planning within a single framework. To compensate for the information loss in discrete image features, we design a dual-branch image encoder that fuses continuous and discrete features, improving the model’s ability to recognize traffic lights. Through the scene generation task from input tokens, our approach learns the intrinsic dependencies between tokens and environments, which in turn benefits the planning task. It is important to note that the generative model is trained in a fully self-supervised manner, requiring no perception annotations. Our model is built upon GPT-2, which exhibits scaling laws similar to those observed in other GPTs: as we increase the model size and data size, the performance shows continuous and non-saturating improvements. Experiments show that among methods using the front view as input, our approach outperforms other methods that employ multiple perception supervision in the CARLA simulator. Our method is simple yet highly effective, offering a promising direction for scalable and practical deployment of autonomous vehicles in real-world settings.

2693Simplified Mamba with Disentangled Dependency Encoding for Long-Term Time Series Forecasting

[openreview] [pdf]

Abstract Recent advances in deep learning have led to the development of numerous models for Long-term Time Series Forecasting (LTSF). However, most approaches still struggle to comprehensively capture reliable and informative dependencies inherent in time series data. In this paper, we identify and formally define three critical dependencies essential for improving forecasting accuracy: the order dependency and semantic dependency in the time dimension as well as cross-variate dependency in the variate dimension. Despite their significance, these dependencies are rarely considered holistically in existing models. Moreover, improper handling of these dependencies can introduce harmful noise that significantly impairs forecasting performance. To address these challenges, we explore the potential of Mamba for LTSF, highlighting its three key advantages to capture three dependencies, respectively. We further empirically observe that nonlinear activation functions used in vanilla Mamba are redundant for semantically sparse time series data. Therefore, we propose SAMBA, a Simplified Mamba with disentangled dependency encoding. Specifically, we first eliminate the nonlinearity of vanilla Mamba to make it more suitable for LTSF. Along this line, we propose a disentangled dependency encoding strategy to endow Mamba with efficient cross-variate dependency modeling capability while minimizing the interference between time and variate dimensions. We also provide rigorous theory as a justification for our design. Extensive experiments on nine real-world datasets demonstrate the effectiveness of SAMBA over state-of-the-art forecasting models.

2694An Auditing Test to Detect Behavioral Shift in Language Models

[openreview] [pdf]

Abstract As language models (LMs) approach human-level performance, comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model’s behavioral profile. However, subsequent fine-tuning or deployment changes may alter these behaviors in both intended and unintended ways. We present an efficient method for continual Behavioral Shift Auditing (BSA) in LMs. Building on anytime-valid hypothesis testing, our auditing test detects behavioral shifts solely through model generations. It compares outputs from a baseline model to those of the model under scrutiny, providing theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter, allowing adjustment of sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples. We hope to contribute a valuable tool for AI practitioners, enabling rapid detection of behavioral shifts in deployed LMs, with implications for safety monitoring, quality assurance, and responsible AI development.

2695Execution-guided within-prompt search for programming-by-example

[openreview] [pdf]

Abstract Soundness is an important property in programming-by-example (PBE) as any learned program is expected to be correct for at least the examples that were part of the problem statement. This allows synthesizers to perform a search over a domain-specific language (DSL) that terminates when any sound program is found. Large language models (LLMs) can generate code from examples without being limited to a DSL, but they lack soundness guarantees (generated code is not even guaranteed to execute) and the concept of search (samples are independent). In this paper, we use an LLM as a policy that generates lines of code and then join these lines of code to let the LLM implicitly estimate the value of each of these lines in its next iteration. We further guide the policy and value estimation by executing each line and annotating it with its results on the given examples. This allows us to search for programs within a prompt until a sound program is found by letting the policy reason in both the syntactic (code) and semantic (execution) space. We evaluate this approach on five benchmarks across different domains, such as string transformations, list transformations, and arbitrary Python programming problems, showing that within-prompt search and execution allows us to sample better programs more consistently. Additionally, our experiments indicate that the model does behave like a policy and value.

2696Mixing It Up: The Cocktail Effect of Multi-Task Fine-Tuning on LLM Performance - A Case Study in Finance

[openreview] [pdf]

Abstract The application of large language models (LLMs) in domain-specific contexts, including finance, has expanded rapidly. Domain-specific LLMs are typically evaluated based on their performance in various downstream tasks relevant to the domain. In this work, we present a detailed analysis of fine-tuning LLMs for such tasks. Somewhat counterintuitively, we find that in domain-specific cases, fine-tuning exclusively on the target task is not always the most effective strategy. Instead, multi-task fine-tuning - where models are trained on a cocktail of related tasks - can significantly enhance performance. We demonstrate how this approach enables a small model, such as Phi-3-Mini, to achieve state-of-the-art results, even surpassing the much larger GPT-4-o model on financial benchmarks. Our study involves a large-scale experiment, training over 200 models using several widely adopted LLMs as baselines, and empirically confirms the benefits of multi-task fine-tuning. Additionally, we explore the use of general instruction data as a form of regularization, suggesting that it helps minimize performance degradation. We also investigate the inclusion of mathematical data, finding improvements in numerical reasoning that transfer effectively to financial tasks. Finally, we note that while fine-tuning for downstream tasks leads to targeted improvements in task performance, it does not necessarily result in broader gains in domain knowledge or complex domain reasoning abilities.

2697Deriving Causal Order from Single-Variable Interventions: Guarantees & Algorithm

[openreview] [pdf]

Abstract Targeted and uniform interventions to a system are crucial for unveiling causal relationships. While several methods have been developed to leverage interventional data for causal structure learning, their practical application in real-world scenarios often remains challenging. Recent benchmark studies have highlighted these difficulties, even when large numbers of single-variable intervention samples are available. In this work, we demonstrate, both theoretically and empirically, that such datasets contain a wealth of causal information that can be effectively extracted under realistic assumptions about the data distribution. More specifically, we introduce the notion of interventional faithfulness, which relies on comparisons between the marginal distributions of each variable across observational and interventional settings, and we introduce a score on causal orders. Under this assumption, we are able to prove strong theoretical guarantees on the optimum of our score that also hold for large-scale settings. To empirically verify our theory, we introduce Intersort, an algorithm designed to infer the causal order from datasets containing large numbers of single-variable interventions by approximately optimizing our score. Intersort outperforms baselines (GIES, DCDI, PC and EASE) on almost all simulated data settings replicating common benchmarks in the field. Our proposed novel approach to modeling interventional datasets thus offers a promising avenue for advancing causal inference, highlighting significant potential for further enhancements under realistic assumptions.

2698Enhance the Transferability of Adversarial Attacks through Channel Pruning

[openreview] [pdf]

Abstract Recent studies have shown that neural networks are vulnerable to adversarial attacks, where attackers generate adversarial samples by imposing tiny noise. The tiny noise can not misguide human perception, though leading the neural networks to generate wrong predictions. Transfer-based black-box attacks play a more significant role in recent studies due to their more realistic setting and considerable progress in performance. Previous studies have shown that some different channels of the same layer in convolution neural networks (CNN) contain lots of repetitive information, and we find that existing transferable attacks tend to exploit those redundant features more, which limits their transferability. Hence, we advocate using channel pruning and knowledge distillation to conduct model augmentation. In addition, we introduce a method of regularization on the gradients of intermediate feature maps of augmented models, which further enhances the transferability of our method. Comprehensive experiments demonstrate that imposing our method of model augmentation on existing methods can significantly improve the transferability of adversarial attacks in untargeted or targeted scenarios. Furthermore, our method outperforms state-of-the-art model augmentation techniques without the usage of additional training datasets.

2699Evaluating Robustness of Reward Models for Mathematical Reasoning

[openreview] [pdf]

Abstract Reward models are key in reinforcement learning from human feedback (RLHF) systems, aligning the model behavior with human preferences. Particularly in the math domain, there have been plenty of studies using reward models to align policies for improving reasoning capabilities. Recently, as the importance of reward models has been emphasized, RewardBench is proposed to understand their behavior. However, we figure out that the math subset of RewardBench has different representations between chosen and rejected completions, and relies on a single comparison, which may lead to unreliable results as it only see an isolated case. Therefore, it fails to accurately present the robustness of reward models, leading to a misunderstanding of its performance and potentially resulting in reward hacking. In this work, we introduce a new design for reliable evaluation of reward models, and to validate this, we construct RewardMATH, a benchmark that effectively represents the robustness of reward models in mathematical reasoning tasks. We demonstrate that the scores on RewardMATH strongly correlate with the results of optimized policy and effectively estimate reward overoptimization, whereas the existing benchmark shows almost no correlation. The results underscore the potential of our design to enhance the reliability of evaluation, and represent the robustness of reward model. We make our code and data publicly available.

2700Freeze and Cluster: A simple baseline for Rehearsal-Free Continual Category Discovery

[openreview] [pdf]

Abstract This paper addresses the problem of Rehearsal-Free Continual Category Discovery (RF-CCD), which focuses on continuously identifying novel class by leveraging knowledge from labeled data. Existing methods typically train from scratch, overlooking the potential of base models, and often resort to data storage to prevent forgetting. Moreover, because RF-CCD encompasses both continual learning and novel class discovery, previous approaches have struggled to effectively integrate advanced techniques from these fields, resulting in less convincing comparisons and failing to reveal the unique challenges posed by RF-CCD. To address these challenges, we lead the way in integrating advancements from both domains and conducting extensive experiments and analyses. Our findings demonstrate that this integration can achieve state-of-the-art results, leading to the conclusion that "in the presence of pre-trained models, the representation does not improve and may even degrade with the introduction of unlabeled data.” To mitigate representation degradation, we propose a straightforward yet highly effective baseline method. This method first utilizes prior knowledge of known categories to estimate the number of novel classes. It then acquires representations using a model specifically trained on the base classes, generates high-quality pseudo-labels through k-means clustering, and trains only the classifier layer. We validate our conclusions and methods by conducting extensive experiments across multiple benchmarks, including the Stanford Cars, CUB, iNat, and Tiny-ImageNet datasets. The results clearly illustrate our findings, demonstrate the effectiveness of our baseline, and pave the way for future advancements in RF-CCD.

2701Generalized Gaussian Temporal Difference Error for Uncertainty-aware Reinforcement Learning

[openreview] [pdf]

Abstract Conventional uncertainty-aware temporal difference (TD) learning methods often rely on simplistic assumptions, typically including a zero-mean Gaussian distribution for TD errors. Such oversimplification can lead to inaccurate error representations and compromised uncertainty estimation. In this paper, we introduce a novel framework for generalized Gaussian error modeling in deep reinforcement learning, applicable to both discrete and continuous control settings. Our framework enhances the flexibility of error distribution modeling by incorporating additional higher-order moment, particularly kurtosis, thereby improving the estimation and mitigation of data-dependent noise, i.e., aleatoric uncertainty. We examine the influence of the shape parameter of the generalized Gaussian distribution (GGD) on aleatoric uncertainty and provide a closed-form expression that demonstrates an inverse relationship between uncertainty and the shape parameter. Additionally, we propose a theoretically grounded weighting scheme to fully leverage the GGD. To address epistemic uncertainty, we enhance the batch inverse variance weighting by incorporating bias reduction and kurtosis considerations, resulting in improved robustness. Extensive experimental evaluations using policy gradient algorithms demonstrate the consistent efficacy of our method, showcasing significant performance improvements.

2702Gaussian Splatting Lucas-Kanade

[openreview] [pdf]

Abstract Gaussian Splatting and its dynamic extensions are effective for reconstructing 3D scenes from 2D images when there is significant camera movement to facilitate motion parallax and when scene objects remain relatively static. However, in many real-world scenarios, these conditions are not met. As a consequence, data-driven semantic and geometric priors have been favored as regularizers, despite their bias toward training data and their neglect of broader movement dynamics.Departing from this practice, we propose a novel analytical approach that adapts the classical Lucas-Kanade method to dynamic Gaussian splatting. By leveraging the intrinsic properties of the forward warp field network, we derive an analytical velocity field that, through time integration, facilitates accurate scene flow computation. This enables the precise enforcement of motion constraints on warp fields, thus constraining both 2D motion and 3D positions of the Gaussians. Our method excels in reconstructing highly dynamic scenes with minimal camera movement, as demonstrated through experiments on both synthetic and real-world scenes.

2703Overcoming Catastrophic Forgetting in Federated Class-Incremental Learning via Federated Global Twin Generator

[openreview] [pdf]

Abstract Federated Class-Incremental Learning (FCIL) increasingly becomes essential in the decentralized setting, where it enables multiple participants to collaboratively train a global model to perform well on a sequence of tasks without sharing their private data. In FCIL, conventional Federated Learning algorithms such as FedAvg often suffer from catastrophic forgetting, resulting in significant performance declines on earlier tasks. Recent works based on generative models produce synthetic images to help mitigate this issue across all classes. However, these approaches’ testing accuracy in previous classes is still much lower than recent classes, i.e., having better plasticity than stability. To overcome these issues, this paper presents Federated Global Twin Generator (FedGTG), an FCIL framework that exploits generative-model training on the global side without accessing client data. Specifically, the server trains a data generator and a feature generator to create two types of information from all seen classes. Then, it sends the synthetic data to the client. The clients then use feature-direction-controlling losses to make the local models retain knowledge and learn new tasks well. We extensively analyze the robustness of FedGTG on natural images and its ability to converge to flat local minima and achieve better predicting confidence (calibration). Experimental results on CIFAR-10, CIFAR-100, and tiny-ImageNet demonstrate the improvements in accuracy and forgetting measures of FedGTG as well as the robustness of domain shifts compared to previous frameworks.

2704Descent with Misaligned Gradients and Applications to Hidden Convexity

[openreview] [pdf]

Abstract We consider the problem of minimizing a convex objective given access to an oracle that outputs “misaligned” stochastic gradients, where the expected value of the output is guaranteed to be correlated with, but not necessarily equal to the true gradient of the objective. In the case where the misalignment (or bias) of the oracle changes slowly, we obtain an optimization algorithm that achieves the optimum iteration complexity of O~(ϵ2)\tilde O(\epsilon^{-2}); for the more general case where the changes need not be slow, we obtain an algorithm with O~(ϵ3)\tilde O(\epsilon^{-3}) iteration complexity. As an application of our framework, we consider optimization problems with a “hidden convexity” property, and obtain an algorithm with O(ϵ3)O(\epsilon^{-3}) iteration complexity.

2705FedDTPT: Federated Discrete and Transferable Prompt Tuning for Black-Box Large Language Models

[openreview] [pdf]

Abstract In recent years, large language models (LLMs) have significantly advanced the field of natural language processing (NLP). By fine-tuning LLMs with data from specific scenarios, these foundation models can better adapt to various downstream tasks. However, the fine-tuning process poses privacy leakage risks, particularly in centralized data processing scenarios. To address user privacy concerns, federated learning (FL) has been introduced to mitigate the risks associated with centralized data collection from multiple sources. Nevertheless, the privacy of LLMs themselves is equally critical, as potential malicious attacks challenge their security, an issue that has received limited attention in current research. Consequently, establishing a trusted multi-party model fine-tuning environment is essential. Additionally, the local deployment of large LLMs incurs significant storage costs and high computational demands. To address these challenges, we propose for the first time a federated discrete and transferable prompt tuning, namely FedDTPT, for black-box large language models. In the client optimization phase, we adopt a token-level discrete prompt optimization method that leverages a feedback loop based on prediction accuracy to drive gradient-free prompt optimization through the MLM API. For server optimization, we employ an attention mechanism based on semantic similarity to filter all local prompt tokens, along with an embedding distance elbow detection and DBSCAN clustering strategy to enhance the filtering process. Experimental results demonstrate that, compared to state-of-the-art methods, our approach achieves higher accuracy, reduced communication overhead, and robustness to non-iid data in a black-box setting. Moreover, the optimized prompts are transferable.

2706DIESEL - Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs

[openreview] [pdf]

Abstract In recent years, conversational large language models (LLMs) have shown tremendous success in tasks such as casual conversation, question answering, and personalized dialogue, making significant advancements in domains like virtual assistance, social interaction, and online customer engagement. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety, or social norms), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM’s proposed tokens based on their similarity to predefined negative concepts in the latent space. This approach provides an efficient and effective solution for maintaining alignment with human values. Our evaluation demonstrates DIESEL’s effectiveness on state-of-the-art conversational models (e.g., Llama 3), even in challenging jailbreaking scenarios that test the limits of response safety. We further show that DIESEL can be generalized to use cases other than safety, providing a versatile solution for general-purpose response filtering with minimal computational overhead.

2707True Counterfactual Generation from Language Models

[openreview] [pdf]

Abstract Understanding and manipulating the causal mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery---e.g., model ablations or manipulation of linear subspaces tied to specific concepts---to \emph{intervene} on these models. To understand the impact of interventions precisely, it is useful to examine \emph{counterfactual} strings---e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl’s causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as Generalized Structural-equations Models (GSEMs) using the Gumbel-max trick. This allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Experiments demonstrate that our approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.

2708Modeling Abstract Style Prompts for Text-to-Speech Models

[openreview] [pdf]

Abstract A recent trend in text-to-speech synthesis (TTS) is to construct models capable of generating naturalistic speech that adheres to a textual style prompt describing the speaker’s voice and speaking style. In this paper, we propose a crisper definition of style-controlled TTS by categorizing style tags by how they can be collected (automatictags obtainable using signal processing tools e.g. low-pitched and slow;demographictags obtainable using speaker demographics e.g. male and American accent; andabstracttags which need human-annotations e.g. authoritative and awed) and what they represent (intrinsictags inherent to speaker identity e.g. gender, average pitch, texture; andsituationaltags specific to utterance-level speaking styles e.g. emotion). Compared to previous work, we expand the space of style prompts substantially by covering 47 abstract tags, 10 demographic tags and 6 automatic tags. For abstract intrinsic tags, we annotate a subset of speakers from the VoxCeleb dataset. For abstract situational tags, we leverage existing speaking-style-based datasets Expresso and EARS. We train a style-prompted TTS model based on Parler-TTS using these datasets and find that our model outperforms baselines on speech-style consistency metrics. Our collected dataset and model will be open-sourced.

2709Agent Workflow Memory

[openreview] [pdf]

Abstract Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.

2710Approximating Multiple Robust Optimization Solutions in One Pass via Proximal Point Methods

[openreview] [pdf]

Abstract Robust optimization provides a principled and unified framework to model many problems in modern operations research and computer science applications, such as risk measures minimization and adversarially robust machine learning. To use a robust solution (e.g., to implement an investment portfolio or perform robust machine learning inference), the user has to a priori decide the trade-off between efficiency (nominal performance) and robustness (worst-case performance) of the solution by choosing the uncertainty level hyperparameters. In many applications, this amounts to solving the problem many times and comparing them, each from a different hyperparameter setting. This makes robust optimization practically cumbersome or even intractable. We present a novel procedure based on the proximal point method (PPM) to efficiently approximate many Pareto efficient robust solutions at once. This effectively reduces the total compute requirement from N×TN \times T to 2×T2 \times T, where NN is the number of robust solutions to be generated, and TT is the time to obtain one robust solution. We prove this procedure can produce exact Pareto efficient robust solutions for a class of robust linear optimization problems. For more general problems, we prove that with high probability, our procedure gives a good approximation of the efficiency-robustness trade-off in random robust linear optimization instances. We conduct numerical experiments to demonstrate.

2711WIQOR: A dataset for what-if analysis of Operations Research problems

[openreview] [pdf]

Abstract We formalize the mathematical program modification (MPM) task, in which the goal is to revise a mathematical program according to an inquiry expressed in natural language. These inquiries, which we refer to as what-if questions, express a desire to understand how the optimal solution to an optimization problem changes with the addition, deletion or revision of constraints. In detail, each MPM instance is a triple consisting of: 1) a natural language specification that summarizes an optimization problem, 2) the canonical formulation of the problem, and 3) a natural language what-if question. The goal is to predict the updated canonical formulation with respect to the question. To support the study of this task, we construct WIQOR, a dataset of 1,946 MPM instances, derived from NL4OPT (Ramamonjison et al., 2023), but with the number of decision variables extended to more than 30 for some problems. In experiments, we observe that Llama 3.1 70B instruct under the in-context learning paradigm achieves 69% accuracy on the easiest test instances, but only 36% accuracy on the most complicated problems. We release WIQOR in the hopes of spurring additional study of MPM and ultimately enabling non-technical users to conduct what-if analyses without the help of technical experts.

2712Leveraging Prior Experience: An Expandable Auxiliary Knowledge Base for Text-to-SQL

[openreview] [pdf]

Abstract Large Language Models (LLMs) exhibit impressive problem-solving skills across many tasks, but they still underperform compared to humans in various downstream applications, such as text-to-SQL. On the BIRD benchmark leaderboard, human performance achieves an accuracy of 92.96%, whereas the top-performing method reaches only 72.39%. Notably, these state-of-the-art (SoTA) methods predominantly rely on in-context learning to simulate human-like reasoning. However, they overlook a critical human skill: continual learning. Inspired by the educational practice of maintaining mistake notebooks during our formative years, we propose LPE-SQL (L\underline{\textbf{L}}everaging P\underline{\textbf{P}}rior E\underline{\textbf{E}}xperience: An Expandable Auxiliary Knowledge Base for Text-to-SQL\underline{\textbf{SQL}}), a novel framework designed to augment LLMs by enabling continual learning without requiring parameter fine-tuning. LPE-SQL consists of four modules that i)\textbf{i)} retrieve relevant entries, ii)\textbf{ii)} efficient sql generation, iii)\textbf{iii)} generate the final result through a cross-consistency mechanism and iv)\textbf{iv)} log successful and failed tasks along with their reasoning processes or reflection-generated tips. Importantly, the core module of LPE-SQL is the fourth one, while the other modules employ foundational methods, allowing LPE-SQL to be easily integrated with SoTA technologies to further enhance performance. Our experimental results demonstrate that this continual learning approach yields substantial performance gains, with the smaller Llama-3.1-70B model with surpassing the performance of the larger Llama-3.1-405B model using SoTA methods.

2713TuBA: Cross-Lingual Transferability of Backdoor Attacks in LLMs with Instruction Tuning

[openreview] [pdf]

Abstract The implications of backdoor attacks on English-centric large language models (LLMs) have been widely examined — such attacks can be achieved by embedding malicious behaviors during training and activated under specific conditions that trigger malicious outputs. Despite the increasing support for multilingual capabilities in open-source and proprietary LLMs, the impact of backdoor attacks on these systems remains largely under-explored. Our research focuses on crosslingual backdoor attacks against multilingual LLMs, particularly investigating how poisoning the instruction-tuning data for one or two languages can affect the outputs for languages whose instruction-tuning data was not poisoned. Despite its simplicity, our empirical analysis reveals that our method exhibits remarkable efficacy in models like mT5 and GPT-4o, with high attack success rates, surpassing 90% in more than 7 out of 12 languages across various scenarios. Our findings also indicate that more powerful models show increased susceptibility to transferable cross-lingual backdoor attacks, which also applies to LLMs predominantly pre-trained on English data, such as Llama2, Llama3, and Gemma. Moreover, our experiments demonstrate the high transferability of the proposed attack: 1) the backdoor mechanism successfully operates in cross-lingual response scenarios across 26 languages, achieving an average attack success rate of 99%, and 2) the proposed attack remains effective even after defenses are applied. These findings expose critical security vulnerabilities in multilingual LLMs and highlight the urgent need for more robust, targeted defense strategies to address the unique challenges posed by cross-lingual backdoor transfer.

2714DistillHGNN: A Knowledge Distillation Approach for High-Speed Hypergraph Neural Networks

[openreview] [pdf]

Abstract In this paper, we propose a novel framework to significantly enhance the inference speed and memory efficiency of Hypergraph Neural Networks (HGNNs) while preserving their high accuracy. Our approach utilizes an advanced teacher-student knowledge distillation strategy. The teacher model, consisting of an HGNN and a Multi-Layer Perceptron (MLP), not only produces soft labels but also transfers structural and high-order information to a lightweight Graph Convolutional Network (GCN) known as TinyGCN. This dual transfer mechanism enables the student model to effectively capture complex dependencies while benefiting from the faster inference and lower computational cost of the lightweight GCN. The student model is trained using both labeled data and soft labels provided by the teacher, with contrastive learning further ensuring that the student retains high-order relationships. This makes the proposed method efficient and suitable for real-time applications, achieving performance comparable to traditional HGNNs but with significantly reduced resource requirements.

2715ATM: Improving Model Merging by Alternating Tuning and Merging

[openreview] [pdf]

Abstract Model merging has recently emerged as a cost-efficient paradigm for Multi-task Learning (MTL). Among merging solutions, Task Arithmetic \citep{task-vectors} stands out for its simplicity and effectiveness. In this paper, we start by motivating the effectiveness of task vectors with their relation to multi-task gradients. We show that in the single epoch scenario, task vectors are exactly equivalent to gradients obtained by performing gradient descent in a multi-task setting, and still approximate the latter with further epochs. We further strengthen the explanation by showing that task vectors work best when equality is maintained and motivate their effectiveness in the general case by showing that most of the contribution in the total update is determined by the gradient of the first epoch. Guided by this parallel, we propose viewing model merging as a single step in an iterative process that Alternates between Tuning and Merging (ATM). Acting as a midpoint between model merging and multi-task gradient descent, ATM obtains state-of-the-art results with the same data and computing requirements. We first extensively evaluate our approach under diverse settings, demonstrating state-of-the-art performance, leading by an accuracy of up to 19% in computer vision and 20% in NLP over the best baselines. We then motivate its effectiveness empirically, showing increased orthogonality between task vectors and, theoretically, proving it to minimize an upper bound to the loss obtained by finetuning jointly on all tasks.

2716Learning Extrapolative Sequence Transformations from Markov Chains

[openreview] [pdf]

Abstract Generative sequence-level models are appealing in settings where the desired outputs must adhere to global constraints. In these settings, autoregressive sampling can struggle to explore the solution space sufficiently to find the optimal solution, especially when optimal solutions involve extrapolating beyond the training data. However, searching the solution space through approximate inference methods such as Markov chain Monte Carlo (MCMC) is computationally expensive. To address this computational burden, we propose to train a separate inference network based on selected states from Markov chains. The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the learned inference network confers many of the same generalization benefits as the slow sampling process, but with the additional benefit of high sample efficiency. This is particularly true in cases where the model must extrapolate beyond the range of values seen in the training data, but our approach demonstrates success even on the anonymization task, which relies solely on interpolation. Finally, we analyze the effects of various strategies to select states from the search space.

2717Reward-Robust RLHF in LLMs

[openreview] [pdf]

Abstract As Large Language Models continue to progress toward more advanced forms of intelligence, Reinforcement Learning from Human Feedback is increasingly seen as a key pathway toward achieving Artificial General Intelligence. However, the reliance on reward-model-based alignment methods introduces significant challenges due to the inherent instability and imperfections of Reward Models (RMs), which can lead to critical issues such as reward hacking and misalignment with human intentions. In this paper, we introduce a reward-robust RLHF framework aimed at addressing these fundamental challenges, paving the way for more reliable and resilient learning in LLMs. Our approach introduces a novel optimization objective that carefully balances performance and robustness by incorporating Bayesian Reward Model Ensembles to model the uncertainty set of reward functions. This allows the framework to integrate both nominal performance and minimum reward signals, ensuring more stable learning even with imperfect RMs. Empirical results demonstrate that our framework consistently outperforms baselines across diverse benchmarks, showing improved accuracy and long-term stability. We also provide a theoretical analysis, demonstrating that reward-robust RLHF approaches the stability of constant reward settings, which proves to be acceptable even in a stochastic-case analysis. Together, these contributions highlight the framework’s potential to enhance both the performance and stability of LLM alignment.

2718Decoupled representation and policy acquisition for continual reinforcement learning

[openreview] [pdf]

Abstract This contribution proposes adiabatic reinforcement learning (ARL), a new method for continual reinforcement learning (CRL). In CRL, we assume a non-stationary environment partitioned into \textit{tasks}. To avoid catastrophic forgetting (CF), RL requires the use of large replay buffers, which leads to very slow learning and high memory requirements. To remedy this, we propose adiabatic reinforcement learning (ARL), a wake-sleep method that performs slow learning of internal representations from high-error transitions during sleep phases. Wake phases are used for the fast learning of policies, i.e., mappings from representations to actions, and to collect new high-error transitions. Representation learning is performed by \textit{adiabatic replay} (AR), a recent CL technique we adapted to the RL setting. AR uses selective, internal replay of samples that are likely to be affected by forgetting. Since this process is conditioned on incoming samples only, its has constant time-complexity w.r.t. tasks. Other benefits include fast adaptation to new tasks, and a very low memory footprint due to the complete absence of replay buffers.

2719KnowTrace: Explicit Knowledge Tracing for Structured Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Recent advances in retrieval-augmented generation (RAG) furnish large language models (LLMs) with iterative retrievals of relevant information to strengthen their capabilities in addressing complex multi-hop questions. However, these methods typically accumulate the retrieved natural language text into LLM prompts, imposing an increasing burden on the LLM to grasp the underlying knowledge structure for high-quality multi-step reasoning. Despite a few attempts to reduce this burden by restructuring all retrieved passages or even entire external corpora, these efforts are afflicted with significant restructuring overhead and potential knowledge loss. To tackle this challenge, we introduce a new structured paradigm (KnowTrace) from the perspective of explicit knowledge tracing, which treats LLM as an agent to progressively acquire desired knowledge triplets during iterative retrievals and ultimately trace out a specific knowledge graph conditioned on the input question. This paradigm clearly unveils the logical relationships behind the unstructured text and thus can directly facilitate LLM’s inference. Notably, it also naturally inspires a reflective mechanism of knowledge backtracing to identify supportive evidence and filter out useless retrievals in the correct trajectories, thus offering an effective way to stimulate LLM’s self-taught finetuning. Extensive experiments demonstrate the superiority of our paradigm over three standard multi-hop question answering benchmarks. Our code is available athttps://github.com/xxrep/SRAG.

2720Curvature Enhanced Manifold Sampling

[openreview] [pdf]

Abstract Over-parameterized deep learning models, characterized by their large number of parameters, have demonstrated remarkable performance in various tasks. Despite the potential risk of overfitting, these models often generalize well to unseen data due to effective regularization techniques, with data augmentation being one of the most prominent methods. This strategy has proven effective in classification tasks, where label-preserving transformations are applicable. However, the application of data augmentation in regression problems remains underexplored. Recently, a newmanifold learningapproach for sampling synthetic data has been introduced, and it can be viewed as utilizing a first-order approximation of the data manifold. In this work, we propose to extend this direction by providing the fundamental theory and practical tools for approximating and sampling general data manifolds. Further, we introduce the curvature enhanced manifold sampling (CEMS) data augmentation method for regression. CEMS is based on a second-order encoding of the manifold, facilitating sampling and reconstruction of new points. Through extensive evaluations on multiple datasets and in comparison to several state-of-the-art approaches, we demonstrate that CEMS is superior in in-distribution and out-of-distribution tasks, while incurring only a mild computational overhead.

2721Generalizable Origin Identification for Text-Guided Image-to-Image Diffusion Models

[openreview] [pdf]

Abstract Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused forspreading misinformation,infringing on copyrights, andevading content tracing. This motivates us to introduce the task of originIDentification for text-guidedImage-to-imageDiffusion models (ID2\mathbf{^2}), aiming to retrieve the original image of a given translated query. A straightforward solution to ID2^2 involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due tovisual discrepancyacross generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID2^2 task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset,OriPID, contains abundantOrigins and guidedPrompts, which can be used to train and test potentialIDentification models across various diffusion models. In the method section, we first prove theexistenceof a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder (VAE) embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can begeneralizedacross different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with domain generalization designs.

2722XLand-100B: A Large-Scale Multi-Task Dataset for In-Context Reinforcement Learning

[openreview] [pdf]

Abstract Following the success of the in-context learning paradigm in large-scale language and computer vision models, the recently emerging field of in-context reinforcement learning is experiencing a rapid growth. However, its development has been held back by the lack of challenging benchmarks, as all the experiments have been carried out in simple environments and on small-scale datasets. We presentXLand-100B, a large-scale dataset for in-context reinforcement learning based on the XLand-MiniGrid environment, as a first step to alleviate this problem. It contains complete learning histories for nearly 30,00030,000 different tasks, covering 100B transitions and 2.5B episodes. It took 50,00050,000 GPU hours to collect the dataset, which is beyond the reach of most academic labs. Along with the dataset, we provide the utilities to reproduce or expand it even further. We also benchmark common in-context RL baselines and show that they struggle to generalize to novel and diverse tasks. With this substantial effort, we aim to democratize research in the rapidly growing field of in-context reinforcement learning and provide a solid foundation for further scaling.

2723Conformal Training with Reduced Variance

[openreview] [pdf]

Abstract Conformal prediction (CP) is a distribution-free framework for achieving probabilistic guarantees on black-box models. {CP} is generally applied to a model post-training. Conformal training is an approach that aims to optimize the CP efficiency during training. In this direction, ConfTr (Stutz et al, 2022) is a technique that seeks to minimize the expected prediction set size of a model by simulating {CP} in-between training updates. Despite its potential, we identify a strong source of sample inefficiency in ConfTr that leads to overly noisy estimated gradients, introducing training instability and limiting practical use. To address this challenge, we propose variance-reduced conformal training (VR-ConfTr), a method that incorporates a variance reduction technique in the gradient estimation of the ConfTr objective function. Through extensive experiments on various benchmark datasets, we demonstrate that VR-ConfTr consistently achieves faster convergence and smaller prediction sets compared to baselines.

2724Interpretable Dimensionality Reduction by Feature-preserving Manifold Approximation and Projection

[openreview] [pdf]

Abstract Nonlinear dimensionality reduction often lacks interpretability due to the absence of source features in low-dimensional embedding space. We propose FeatureMAP, an interpretable method that preserves source features by tangent space embedding. The core of FeatureMAP is to use local principal component analysis (PCA) to approximate tangent spaces. By leveraging these tangent spaces, FeatureMAP computes gradients to locally reveal feature directions and importance. Additionally, FeatureMAP embeds the tangent spaces into low-dimensional space while preserving alignment between them, providing local gauges for projecting the high-dimensional data points. Unlike UMAP, FeatureMAP employs anisotropic projection to preserve both the manifold structure and the original data density. We apply FeatureMAP to interpreting digit classification, object detection and MNIST adversarial examples, where it effectively distinguishes digits and objects using feature importance and provides explanations for misclassifications in adversarial attacks. We also compare FeatureMAP with other state-of-the-art methods using both local and global metrics.

2725SwitchLoRA: Switched Low-Rank Adaptation Can Learn Full-Rank Information

[openreview] [pdf]

Abstract In the training of large language models, parameter-efficient techniques such as LoRA optimize memory usage and reduce communication overhead during the fine-tuning phase. However, applying such techniques directly during the pre-training phase results in poor performance, primarily because the premature implementation of low-rank training significantly reduces model accuracy. Existing methods like ReLoRA and GaLore have attempted to address this challenge by updating the low-rank subspace. However, they still fall short of achieving the accuracy of full-rank training because they must limit the update frequency to maintain optimizer state consistency, hindering their ability to closely approximate full-rank training behavior. In this paper, we introduce SwitchLoRA, a parameter-efficient training technique that frequently and smoothly replaces the trainable parameters of LoRA adapters with alternative parameters. SwitchLoRA updates the low-rank subspace incrementally, targeting only a few dimensions at a time to minimize the impact on optimizer states. This allows a higher update frequency, thereby enhancing accuracy by enabling the updated parameters to more closely mimic full-rank behavior during the pre-training phase. Our results demonstrate that SwitchLoRA actually surpasses full-rank training, reducing perplexity from 15.23 to 15.01 on the LLaMA 1.3B model while reducing communication overhead by 54% on the LLaMA 1.3B model. Furthermore, after full fine-tuning the SwitchLoRA pre-trained model and the full-rank pre-trained model on the GLUE benchmark, the SwitchLoRA pre-trained model showed an average accuracy gain of about 1% over the full-rank pre-trained model. This demonstrates enhanced generalization and reasoning capabilities of SwitchLoRA.

2726Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) have highlighted the importance of extending context lengths for handling complex tasks. While traditional methods for training on long contexts often use filtered long documents, these approaches lead to domain imbalances, limiting model performance. To address this, techniques like random document concatenation (Standard) and similarity-based methods (KNN, ICLM) have been developed. However, they either sacrifice semantic coherence or diversity. To balance both aspects, we introduce Quest, a query-centric data synthesis method aggregating semantically relevant yet diverse documents. Quest uses a generative model to predict potential queries for each document, grouping documents with similar queries and keywords. Extensive experiments demonstrate Quest’s superior performance on long-context tasks, achieving remarkable results with context lengths of up to 1M tokens and confirming its scalability across various model sizes.

2727Feature Level Instance Attribution

[openreview] [pdf]

Abstract Instance attribution has emerged as one of the most crucial methodologies for model explainability because it identifies training data that significantly impacts model predictions, thereby optimizing model performance and enhancing transparency and trustworthiness. The applications of instance attribution include data cleaning, where it identifies and rectifies poor-quality data to improve model outcomes, and in specific domains such as detection of harmful speech, social network graph labeling, and medical image annotation, it provides precise insights into how data influences model decisions. Specifically, current instance attribution methods facilitate the identification of causal relationships between training data and model predictions. A higher Instance-level Training Data Influence value (IL value) indicates that the training data used for the computation play a more significant role in the model’s prediction process. However, the current methods can only indicate that a training sample is important, but they do not explain why this sample is important. A feasible algorithm is urgently needed to provide an explanation for this behavior. This paper discovers that artificially manipulating the attribution score by modifying samples (e.g., changing a pixel value in image data) can significantly intervene in the importance of training samples and yield explainability results at the feature-level during the intervention process. The proposed Feature Level Instance Attribution (FLIA) algorithm assists in identifying crucial feature locations in training data that significantly impact causality. To avoid the frequent retraining of models for evaluation, we introduce an unlearning algorithm as an assessment method and provide detailed empirical evidence of our algorithm’s efficacy. To facilitate future research, we have made the code available at:https://anonymous.4open.science/r/FIIA-D60E/.

2728To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning

[openreview] [pdf]

Abstract Chain-of-thought (CoT) via prompting is the de facto method for eliciting reasoning capabilities from large language models (LLMs). But for what kinds of tasks is this extra “thinking” really helpful? To analyze this, we conducted a quantitative meta-analysis covering over 100 papers using CoT and ran our own evaluations of 20 datasets across 14 models. Our results show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other types of tasks. On MMLU, directly generating the answer without CoT leads to almost identical accuracy as CoT unless the question or model’s response contains an equals sign, indicating symbolic operations and reasoning. Following this finding, we analyze the behavior of CoT on these problems by separating planning and execution and comparing against tool-augmented LLMs. Much of CoT’s gain comes from improving symbolic execution, but it underperforms relative to using a symbolic solver. Our results indicate that CoT can be applied selectively, maintaining performance while saving inference costs. Furthermore, they suggest a need to move beyond prompt-based CoT to new paradigms that better leverage intermediate computation across the whole range of LLM applications.

2729Safe Multi-task Pretraining with Constraint Prioritized Decision Transformer

[openreview] [pdf]

Abstract Learning a safe policy from offline data without interacting with the environment is crucial for deploying reinforcement learning (RL) policies. Recent approaches leverage transformers to address tasks under various goals, demonstrating a strong generalizability for broad applications. However, these methods either completely overlook safety concerns during policy deployment or simplify safe RL as a dual-objective problem, disregarding the differing priorities between costs and rewards, as well as the additional challenge of multi-task identification caused by cost sparsity. To address these issues, we propose \textbf{S}afe \textbf{M}ulti-t\textbf{a}sk Pretraining with \textbf{Co}nstraint Prioritized Decision \textbf{T}ransformer (SMACOT), which utilizes the Decision Transformer (DT) to accommodate varying safety threshold objectives during policy deployment while ensuring scalability. It introduces a Constraint Prioritized Return-To-Go (CPRTG) token to emphasize cost priorities in the Transformer’s inference process, effectively balancing reward maximization with safety constraints. Additionally, a Constraint Prioritized Prompt Encoder is designed to leverage the sparsity of cost information for task identification. Extensive experiments on the public OSRL dataset demonstrate that SMACOT achieves exceptional safety performance in both single-task and multi-task scenarios, satisfying different safety constraints in over 2x as many environments compared with strong baselines, showcasing its superior safety capability.

2730Entailment Progressions: A Robust Approach to Evaluating Reasoning Within Larger Discourse

[openreview] [pdf]

Abstract Textual entailment, or the ability to deduce whether a proposed hypothesis is logically supported by a given premise, has historically been applied to the evaluation of language modelling efficiency in tasks like question answering and text summarization. However, we hypothesize that these zero-shot entailment evaluations can be extended to the task of evaluating discourse within larger textual narratives. In this paper, we propose a simple but effective method that sequentially evaluates changes in textual entailment between sentences within a larger text, in an approach we denote as “Entailment Progressions”. These entailment progressions aim to capture the inference relations between sentences as an underlying component capable of distinguishing texts generated from various models and procedures. Our results suggest that entailment progressions can be used to effectively distinguish between machine-generated and human-authored texts across multiple established benchmark corpora and our own EP4MGT dataset. Additionally, our method displays robustness in performance when evaluated on paraphrased texts a technique that has historically affected the performance of well-established metrics when distinguishing between machine generated and human authored texts.

2731Selecting Influential Samples for Long Context Alignment via Homologous Models’ Guidance and Contextual Awareness Measurement

[openreview] [pdf]

Abstract The expansion of large language models to effectively handle instructions with extremely long contexts has yet to be fully investigated. The primary obstacle lies in constructing a high-quality long instruction-following dataset devised for long context alignment. Existing studies have attempted to scale up the available data volume by synthesizing long instruction-following samples. However, indiscriminately increasing the quantity of data without a well-defined strategy for ensuring data quality may introduce low-quality samples and restrict the final performance. To bridge this gap, we aim to address the unique challenge of long-context alignment, i.e., modeling the long-range dependencies for handling instructions and lengthy input contexts. We propose GATEAU, a novel framework designed to identify the influential and high-quality samples enriched with long-range dependency relations by utilizing crafted Homologous Models’ Guidance (HMG) and Contextual Awareness Measurement (CAM). Specifically, HMG attempts to measure the difficulty of generating corresponding responses due to the long-range dependencies, using the perplexity scores of the response from two homologous models with different context windows. Also, the role of CAM is to measure the difficulty of understanding the long input contexts due to long-range dependencies by evaluating whether the model’s attention is focused on important segments. Built upon both proposed methods, we select the most challenging samples as the influential data to effectively frame the long-range dependencies, thereby achieving better performance of LLMs. Comprehensive experiments indicate that GATEAU effectively identifies samples enriched with long-range dependency relations and the model trained on these selected samples exhibits better instruction-following and long-context understanding capabilities.

2732Improved Sample Complexity for Global Convergence of Actor-Critic Algorithms

[openreview] [pdf]

Abstract In this paper, we establish the global convergence of the actor-critic algorithm with a significantly improved sample complexity of ( O(\epsilon^{-3}) ), advancing beyond the existing local convergence results. Previous works provide local convergence guarantees with a sample complexity of ( O(\epsilon^{-2}) ) for bounding the squared gradient of the return, which translates to a global sample complexity of ( O(\epsilon^{-4}) ) using the gradient domination lemma. In contrast to traditional methods that employ decreasing step sizes for both the actor and critic, we demonstrate that a constant step size for the critic is sufficient to ensure convergence. This key insight reveals that using a decreasing step size for the actor alone is sufficient to handle the noise for both the actor and critic. Our findings provide theoretical support for the practical success of many algorithms that rely on constant step sizes.

2733Knowing Your Target : Target-Aware Transformer Makes Better Spatio-Temporal Video Grounding

[openreview] [pdf]

Abstract Transformer has attracted increasing interest in spatio-temporal video grounding, or STVG, owing to its end-to-end pipeline and promising result. Existing Transformer-based STVG approaches often leverage a set of object queries, which are initialized simply using zeros and then gradually learn target position information via iterative interactions with multimodal features, for spatial and temporal localization. Despite simplicity, these zero object queries, due to lacking target-specific cues, are hard to learn discriminative target information from interactions with multimodal features in the complicated scenarios (e.g., with distractors or occlusion), resulting in degradation. Addressing this, we introduce a novel Target-Aware Transformer for STVG (TA-STVG), which seeks to adaptively generate object queries via exploring target-specific cues from the given video-text pair, for improving STVG. The key lies in two simple yet effective modules, comprising text-guided temporal sampling (TTS) and attribute-aware spatial activation (ASA), working in a cascade. The former focuses on selecting target-relevant temporal cues from a video utilizing holistic text information, while the latter aims at further exploiting the fine-grained visual attribute information of the object from previous target-aware temporal cues, which is applied for object query initialization. Compared to existing methods leveraging zero-initialized queries, object queries in our TA-STVG, directly generated from a given video-text pair, naturally carry target-specific cues, making them adaptive and better interact with multimodal features for learning more discriminative information to improve STVG. In our experiments on three benchmarks, including HCSTVG-v1/-v2 and VidSTG, TA-STVG achieves state-of-the-art performance and largely outperforms the baseline, validating its efficacy. Code will be released.

2734Deconstructing What Makes a Good Optimizer for Autoregressive Language Models

[openreview] [pdf]

Abstract Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, Lion, and Sophia in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably both in their optimal performance and also in terms of how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification. Given our findings, we further dissect these approaches, examining two simplified versions of Adam: a) signed momentum (Signum) which we see recovers both the performance and hyperparameter stability of Adam and b) Adalayer, a layerwise variant of Adam which we introduce to study the impact on Adam’s preconditioning for different layers of the network. Examining Adalayer leads us to the conclusion that, perhaps surprisingly, adaptivity onboththe last layer and LayerNorm parameters in particular are necessary for retaining performance and stability to learning rate.

2735Explaining Black-box Model Predictions via Two-level Nested Feature Attributions with Consistency Property

[openreview] [pdf]

Abstract Techniques that explain the predictions of black-box machine learning models are crucial to make the models transparent, thereby increasing trust in AI systems. The input features to the models often have a nested structure that consists of high- and low-level features, and each high-level feature is decomposed into multiple low-level features. For such inputs, both high-level feature attributions (HiFAs) and low-level feature attributions (LoFAs) are important for better understanding the model’s decision. In this paper, we propose a model-agnostic local explanation method that effectively exploits the nested structure of the input to estimate the two-level feature attributions simultaneously. A key idea of the proposed method is to introduce the consistency property that should exist between the HiFAs and LoFAs, thereby bridging the separate optimization problems for estimating them. Thanks to this consistency property, the proposed method can produce HiFAs and LoFAs that are both faithful to the black-box models and consistent with each other, using a smaller number of queries to the models. In experiments on image classification in multiple instance learning and text classification using language models, we demonstrate that the HiFAs and LoFAs estimated by the proposed method are accurate, faithful to the behaviors of the black-box models, and provide consistent explanations.

2736VAE-Var: Variational Autoencoder-Enhanced Variational Methods for Data Assimilation in Meteorology

[openreview] [pdf]

Abstract Data assimilation (DA) is an essential statistical technique for generating accurate estimates of a physical system’s states by combining prior model predictions with observational data, especially in the realm of weather forecasting. Effectively modeling the prior distribution while adapting to diverse observational sources presents significant challenges for both traditional and neural network-based DA algorithms. This paper introduces VAE-Var, a novel neural network-based data assimilation algorithm aimed at 1) enhancing accuracy by capturing the non-Gaussian characteristics of the conditional background distribution p(xxb)p(\mathbf{x}|\mathbf{x}_b), and 2) efficiently assimilating real-world observational data. VAE-Var utilizes a variational autoencoder to learn the background error distribution, with its decoder creating a variational cost function to optimize the analysis states. The advantages of VAE-Var include: 1) it maintains the framework of traditional variational assimilation, enabling it to accommodate various observation operators, particularly irregular observations; 2) it lessens the dependence on expert knowledge for constructing the background distribution, allowing for improved modeling of non-Gaussian structures; and 3) experimental results indicate that, when applied to the FengWu weather forecasting model, VAE-Var outperforms DiffDA and two traditional algorithms (interpolation and 3DVar) in terms of assimilation accuracy in sparse observational contexts, and is capable of assimilating real-world GDAS prepbufr observations over a year.

2737ReAcTree: Hierarchical Task Planning with Dynamic Tree Expansion using LLM Agent Nodes

[openreview] [pdf]

Abstract Recent advancements in task planning using large language models (LLMs) have made remarkable progress. However, most existing methods, such as ReAct, face limitations when handling complex, long-horizon tasks due to inefficiencies in processing entire tasks through a single sequential decision-making process. To address these challenges, we propose ReAcTree, a hierarchical task planning method that automatically decomposes complex tasks into manageable subgoals within a tree structure. This tree consists of control flow nodes, which manage the execution order of agent nodes, and agent nodes that reason, act, and expand nodes into subgoals to achieve their goals. To further enhance performance, we introduce memory systems: each agent node retrieves goal-specific, agent-level experiences from episodic memory to use as in-context examples, and all agent nodes share and recall information obtained during task execution via working memory. Experiments on the WAH-NL dataset demonstrate that ReAcTree consistently outperforms ReAct across various LLMs and model sizes. For example, when using Qwen2.5 72B, ReAcTree achieves a goal success rate of 63%, significantly surpassing ReAct’s 24%.

2738Representation Learning for Long Tail Recognition via Feature Space Re-Construction

[openreview] [pdf]

Abstract Deep learning has achieved significant success on balanced datasets. However, real-world data often exhibit a long-tailed distribution. Empirical results show that long-tailed data skews representations where head classes dominate the feature space. Many methods have been proposed to empirically correct the skewed representations. However, a clear theoretical understanding of the underlying causes and extent of this skew remains lacking. In this work, we provide a comprehensive theoretical analysis to elucidate how long-tailed data affects representations, deriving the conditions under which the centers of the tail classes shrink together or even collapse into a single point. This results in overlapping feature distributions of tail classes, making features in the overlapping regions inseparable. Moreover, we demonstrate that merely empirically correcting the skewed representations of training data is insufficient to separate the overlapping features, due to distribution shifts between training and real data. To address these challenges, we propose a novel long-tailed representation learning method, FeatRecon. It reconstructs the feature space so that features of all classes are arranged into symmetrical and linearly separable regions. Thereby, it enhances model robustness to long-tailed data. We validate the effectiveness of our method through extensive experiments on the CIFAR-10-LT, CIFAR-100-LT, ImageNet-LT, and iNaturalist 2018 datasets.

2739Algorithmic Phases of In-Context Learning

[openreview] [pdf]

Abstract In-Context Learning (ICL) has significantly expanded the general-purpose nature of large language models, allowing them to adapt to new tasks using merely the input context. While a series of papers analyzing synthetic domains have established a rich phenomenology of ICL, the use of relatively distinct setups makes it unclear how general the reported insights are. To address this, we propose a synthetic sequence modeling task defined by a finite set of Markov chains that simultaneously captures most well-known results on ICL, e.g., the task retrieval vs. learning dichotomy and emergence of induction heads, hence enabling a unified framework for studying the concept. As we show, the proposed task offers several new insights, such as an explanation for ICL’s transient nature, and demonstrates subtleties in ICL’s known phenomenology. For example, we find varying experimental conditions (e.g., data diversity) drives transitions between distinct algorithmic solutions, such as unigram vs. bigram models and Bayesian vs. non-Bayesian approaches, implying ICL is best thought of as a mixture of different algorithms, each with its own peculiarities, instead of a monolithic capability.

2740Rethinking Fair Representation Learning for Performance-Sensitive Tasks

[openreview] [pdf]

Abstract We investigate the prominent class of fair representation learning methods for bias mitigation. Using causal reasoning to define and formalise different sources of dataset bias, we reveal important implicit assumptions inherent to these methods. We prove fundamental limitations on fair representation learning when evaluation data is drawn from the same distribution as training data and run experiments across a range of medical modalities to examine the performance of fair representation learning under distribution shifts. Our results explain apparent contradictions in the existing literature and reveal how rarely considered causal and statistical aspects of the underlying data affect the validity of fair representation learning. We raise doubts about current evaluation practices and the applicability of fair representation learning methods in performance-sensitive settings. We argue that fine-grained analysis of dataset biases should play a key role in the field moving forward.

2741Positional Attention: Out-of-Distribution Generalization and Expressivity for Neural Algorithmic Reasoning

[openreview] [pdf]

Abstract There has been a growing interest in the ability of neural networks to solve algorithmic tasks, such as arithmetic, summary statistics, and sorting. While state-of-the-art models like Transformers have demonstrated good generalization performance on in-distribution tasks, their out-of-distribution (OOD) performance is poor when trained end-to-end. In this paper, we focus on value generalization, a common instance of OOD generalization where the test distribution has the same input sequence length as the training distribution, but the value ranges in the training and test distributions do not necessarily overlap. To address this issue, we propose that using fixed positional encodings to determine attention weights—referred to as positional attention—enhances empirical OOD performance while maintaining expressivity. We support our claim about expressivity by proving that Transformers with positional attention can effectively simulate parallel algorithms.

2742Advancing Graph Generation through Beta Diffusion

[openreview] [pdf]

Abstract Diffusion models have excelled in generating natural images and are now being adapted to a variety of data types, including graphs. However, conventional models often rely on Gaussian or categorical diffusion processes, which can struggle to accommodate the mixed discrete and continuous components characteristic of graph data. Graphs typically feature discrete structures and continuous node attributes that often exhibit rich statistical patterns, including sparsity, bounded ranges, skewed distributions, and long-tailed behavior. To address these challenges, we introduce Graph Beta Diffusion (GBD), a generative model specifically designed to handle the diverse nature of graph data. GBD leverages a beta diffusion process, effectively modeling both continuous and discrete elements. Additionally, we propose a modulation technique that enhances the realism of generated graphs by stabilizing critical graph topology while maintaining flexibility for other components. GBD competes strongly with existing models across multiple general and biochemical graph benchmarks, showcasing its ability to capture the intricate balance between discrete and continuous features inherent in real-world graph data.

2743Improve Code Generation with Feedback

[openreview] [pdf]

Abstract As advancements in Large Language Models (LLMs) continue to accelerate, an increasing number of researchers are exploring the potential of these models to assist in everyday tasks. Despite their remarkable achievements in various downstream applications, several challenges must be addressed. This paper delves into applying LLMs in coding tasks, such as ChatGPT and LLama. Initial observations suggest that directly employing these LLMs does not yield optimal results. However, we have identified that LLMs demonstrate enhanced performance when given appropriate feedback. This includes providing information on the accuracy of the code generated, supplying test cases relevant to the task, and indicating the correct or incorrect outputs for these test cases. Furthermore, we have developed an innovative architecture miming human debugging. This approach supplies local variable information to the LLM while executing the generated code. Our architecture facilitates providing feedback to the LLM and simulates the human debugging experience, thereby significantly improving the LLM’s code generation capabilities. Utilizing our proposed architecture, our model surpasses the current benchmarks of state-of-the-art models in the MBPP and Humaneval datasets. We also present comprehensive analyses and ablation studies to substantiate the efficacy of our methods. These findings open new avenues for enhancing the utility of LLMs in coding tasks, offering a more interactive and practical approach to leveraging these advanced technologies.

2744Can a Single Tree Outperform an Entire Forest?

[openreview] [pdf]

Abstract The prevailing mindset is that a single decision tree underperforms random forests in testing accuracy, despite its advantages in interpretability and lightweight structure. This study challenges such a mindset by significantly improving the testing accuracy of an oblique regression tree through our gradient-based entire tree optimization framework, making its performance comparable to random forests. Our approach reformulates tree training as a differentiable unconstrained optimization task, employing a scaled sigmoid approximation strategy. To ameliorate numerical instability, we propose an algorithmic scheme that solves a sequence of increasingly accurate approximations. Additionally, a subtree polish strategy is implemented to reduce approximation errors accumulated across the tree. Extensive experiments on 16 datasets demonstrate that our optimized tree outperforms random forests by an average of 2.03% improvements in testing accuracy.

2745Lexical Diversity-aware Relevance Assessment for Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Despite their extensive applications, large language models trained on vast historical datasets still struggle with hallucination issues, particularly when addressing open-ended, factual, and commonsense questions. In contrast, Retrieval-Augmented Generation (RAG) methods have proven effective in enhancing large language models’ responses to such inquiries, making them a focal point of research. However, previous RAG approaches overlook the lexical diversity of queries, hindering their ability to achieve a granular relevance assessment between queries and retrieved documents, resulting in suboptimal performance. In this paper, we introduce a Lexical Diversity-aware RAG (DRAG) model, comprising a Diversity-sensitive Relevance Analyzer (DRA) and a Contrastive Relevance Calibration Module (CRC). Specifically, DRA decouples and assesses the relevance of different query components (words, phrases) based on their levels of lexical diversity, ensuring precise and comprehensive document retrieval. According to the DRA assessment, CRC further emphasizes the pertinent knowledge of the retrieved relevant documents through contrastively eliminating the adverse effects of irrelevant contents. By integrating DRA and CRC, the proposed method effectively retrieves relevant documents and leverages their pertinent knowledge to refine the original results and generate meaningful outcomes. Extensive experiments on widely-used benchmarks demonstrate the efficacy of our approach, yielding a 12.5% accuracy improvement on HotpotQA.

2746A Continual Learning Perspective to Entropy Regularized Deep Reinforcement Learning

[openreview] [pdf]

Abstract Research on Continual Learning (CL) tackles learning with non-stationary data distributions. The non-stationary nature of data is also one of the challenges of deep Reinforcement Learning (RL), and as a consequence, both CL and deep RL rely on similar approaches to stabilize learning, from the use of replay buffers to the choice of regularization terms. However, while dynamic neural architectures that grow in size to learn new tasks without forgetting older ones are well researched in CL, it remains a largely understudied research direction in RL. In this paper, we argue that Policy Mirror Descent (PMD), a regularized policy iteration RL algorithm, would naturally benefit from dynamic neural architectures as the current policy is a function of the sum of all past Q-functions. To avoid indefinitely increasing the neural architecture, we study PMD-like algorithms that only keep in memory the last MM Q-functions, and show that a convergent algorithm can be derived if MM is large enough. This theoretical analysis provides insights on how to utilise a fixed budget of Q-functions to reduce catastrophic forgetting in the policy. We implement this algorithm using a new neural architecture that stacks the last MM Q-functions as 3-dimensional tensors to allow for fast GPU computations. StaQ, the resulting algorithm, is competitive with state-of-the-art deep RL baselines and typically exhibits lower variance in performance. Beyond its performance, we argue that the simplicity and strong theoretical guarantees of StaQ’s policy update makes it an ideal research tool over which we can further build a fully stable deep RL algorithm.

2747Can LLMs Really Learn to Translate a Low-Resource Language from One Grammar Book?

[openreview] [pdf]

Abstract Extremely low-resource (XLR) languages lack substantial corpora for training NLP models, motivating the use of all available resources such as dictionaries and grammar books. Machine Translation from One Book (Tanzer et al., 2024) suggests prompting long-context LLMs with one grammar book enables English–Kalamang translation, an unseen XLR language—a noteworthy case of linguistic knowledge helping an NLP task. We investigate whether the book’s grammatical explanations or its parallel examples are most effective for learning XLR translation, finding almost all improvement stems from the parallel examples. Further, we find similar results for Nepali, a seen low-resource language, and achieve performance comparable to an LLM with a grammar book by simply fine-tuning an encoder-decoder translation model. We then investigate where grammar books help by testing two linguistic tasks, grammaticality judgment and gloss prediction, and we explore what kind of grammatical knowledge helps by introducing a typological feature prompt that achieves leading results on these more relevant tasks. We thus emphasise the importance of task-appropriate data for XLR languages: parallel examples for translation, and grammatical data for linguistic tasks. As we find no evidence that long-context LLMs can make effective use of grammatical explanations for XLR translation, we suggest data collection for multilingual XLR tasks such as translation is best focused on parallel data over linguistic description.

2748Recurrent Diffusion for Large-Scale Parameter Generation

[openreview] [pdf]

Abstract Parameter generation has struggled to scale up for a long time, significantly lim- iting its range of applications. In this study, we introduce Recurrent diffusion for large-scale Parameter Generation, called RPG. We first divide the trained parame- ters into non-overlapping parts, after which a recurrent model is proposed to learn their relationships. The recurrent model’s outputs, as conditions, are then fed into a diffusion model to generate the neural network parameters. Using only a sin- gle GPU, recurrent diffusion enables us to generate popular vision and language models such as ConvNeXt-L and LoRA parameters of LLaMA-7B. Meanwhile, across various architectures and tasks, the generated parameters consistently per- form comparable results over trained networks. Notably, our approach also shows the potential to generate models for handling unseen tasks. This suggests that recurrent diffusion largely increases the practicality of parameter generation

2749VinePPO: Unlocking RL Potential For LLM Reasoning Through Refined Credit Assignment

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a state-of-the-art reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, value networks face challenges in predicting the expected cumulative rewards accurately in complex reasoning tasks, often leading to high-variance updates and suboptimal performance. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they barely outperform a random baseline when comparing alternative steps. To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates, bypassing the need for large value networks. Our method consistently outperforms PPO and other RL-free baselines across MATH and GSM8K datasets with fewer gradient updates (up to 9x), less wall-clock time (up to 3.0x). These results emphasize the importance of accurate credit assignment in RL finetuning of LLM and demonstrate VinePPO’s potential as a superior alternative.

2750Benchmarking LLMs’ Judgments with No Gold Standard

[openreview] [pdf]

Abstract We introduce the GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by large language models (LLMs), particularly in generating informative judgments, without the need for a gold standard reference. GEM broadens the scenarios where we can benchmark LLM generation performance-from traditional ones, like machine translation and summarization, where gold standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review.GEM uses a generative model to estimate mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on two human-annotated datasets, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally, GEM is more robust against strategic manipulation, such as rephrasing or elongation, which can artificially inflate scores under a GPT-4o Examiner.We also present GRE-bench (Generating Review Evaluation Benchmark) which evaluates LLMs based on how well they can generate high-quality peer reviews for academic research papers. Because GRE-bench is based upon GEM, it inherits its robustness properties. Additionally, GRE-bench circumvents data contamination problems (or data leakage) by using the continuous influx of new open-access research papers and peer reviews each year. We show GRE-bench results of various popular LLMs on their peer review capabilities using the ICLR2023 dataset.

2751Learning Splitting Heuristics in Divide-and-Conquer SAT Solvers with Reinforcement Learning

[openreview] [pdf]

Abstract We propose RDC-SAT, a novel approach to optimize splitting heuristics in Divide-and-Conquer SAT solvers using deep reinforcement learning. Our method dynamically extracts features from the current solving state whenever a split is required. These features, such as learned clauses, variable activity scores, and clause LBD (Literal Block Distance) values, are represented as a graph. A GNN integrated with an Actor-Critic model processes this graph to determine the optimal split variable. Unlike traditional linear state transitions characterized by Markov processes, divide-and-conquer challenges involve tree-like state transitions. To address this, we developed a reinforcement learning environment based on the Painless framework that efficiently handles these transitions. Additionally, we designed different discounted reward functions for satisfiable and unsatisfiable SAT problems, capable of handling tree-like state transitions. We trained our model using the Decentralized Proximal Policy Optimization (DPPO) algorithm on phase transition random 3-SAT problems and implemented the RDC-SAT solver, which operates in both GPU-accelerated and non-GPU modes. Evaluations show that RDC-SAT significantly improves the performance of D&C solvers on phase transition random 3-SAT datasets and generalizes well to the SAT Competition 2023 dataset, substantially outperforming traditional splitting heuristics.

2752DM-Tune: Quantizing Diffusion Models with Mixture-of-Gaussian Guided Noise Tuning

[openreview] [pdf]

Abstract Diffusion models have become essential generative tools for tasks such as image generation, video creation, and inpainting, but their high computational and memory demands pose challenges for efficient deployment. Contrary to the traditional belief that full-precision computation ensures optimal image quality, we demonstrate that a fine-grained mixed-precision strategy can surpass full-precision models in terms of image quality, diversity, and text-to-image alignment. However, directly implementing such strategies can lead to increased complexity and reduced runtime performance due to the overheads of managing multiple precision formats and casting operations. To address this, we introduce DM-Tune, which replaces complex mixed-precision quantization with a unified low-precision format, supplemented by noise-tuning, to improve both image generation quality and runtime efficiency. The proposed noise-tuning mechanism is a type of fine-tuning that reconstructs the mixed-precision output by learning adjustable noise through a parameterized nonlinear function consisting of Gaussian and linear components. Key steps in our framework include identifying sensitive layers for quantization, modeling quantization noise, and optimizing runtime with custom low-precision GPU kernels that support efficient noise-tuning. Experimental results across various diffusion models and datasets demonstrate that DM-Tune not only significantly improves runtime but also enhances diversity, quality, and text-to-image alignment compared to FP32, FP8, and state-of-the-art mixed-precision methods. Our approach is broadly applicable and lays a solid foundation for simplifying complex mixed-precision strategies at minimal cost.

2753Asymptotic Convergence of SGD in Non-Convex Problems: A Stopping Times Method with Relaxed Step-size Conditions

[openreview] [pdf]

Abstract Stochastic Gradient Descent (SGD) is widely used in machine learning research. In previous research, the convergence analyses of SGD under vanishing step-size settings typically assumed that the step sizes satisfied the Robbins-Monro conditions, which is to say, the sum of the step sizes was infinite, while the sum of the squares of the step sizes was finite. In practical applications, a wider variety of step sizes is often used, but these may not meet the Robbins-Monro step-size conditions, thus lacking theoretical guarantees of convergence. To bridge the gap between theory and practical application, this paper introduces a novel analytical method—the stopping time method based on probability theory—to explore the asymptotic convergence of SGD under more relaxed step-size conditions. In the non-convex setting, we prove that the almost sure convergence of the sequence of iterates generated by SGD when step sizes satisfy (\sum_{t=1}^{+\infty} \epsilon_t = +\infty) and (\sum_{t=1}^{+\infty} \epsilon_t^p < +\infty) for some (p > 2). Compared to previous works, our analysis eliminates the need to assume global Lipschitz continuity of the loss function, and it also relaxes the requirement of global boundedness of the high-order moments of the stochastic gradient to local boundedness. Additionally, we prove (L_2) convergence without the need for assuming global boundedness of loss functions or their gradients. The assumptions required for this work are the weakest among studies with the same conclusions, thereby extending the applicability of SGD in various practical scenarios where traditional assumptions may not hold.

2754Annealed Implicit Q-learning in Online Reinforcement Learning

[openreview] [pdf]

Abstract In continuous action online reinforcement learning, actor-critic methods are predominantly used. However, compared to Q-learning-based discrete action algorithms that model the optimal Q-value, continuous action algorithms that model the Q-value for the current policy and perform policy improvement solely through policy updates suffer from low sample efficiency. This study investigates whether an algorithm that implicitly estimates the optimal Q-value, typically used in offline RL, is also effective in online RL. It is demonstrated that a loss function aimed at achieving optimality distorts the distribution of Q-values, leading to overestimation bias, and that this distortion and bias increase as learning progresses. To address this issue, we propose a simple algorithm that anneals optimality. Our method significantly outperforms widely used methods such as SAC and TD3 in online DM Control tasks. Additionally, we demonstrate that annealing improves performance and enhances robustness to the hyperparameter related to the optimality.

2755PDETime: Rethinking Long-term Multivariate Time Series Forecasting from the Perspective of Partial Differential Equations

[openreview] [pdf]

Abstract Recent advancements in deep learning have led to the development of various approaches for long-term multivariate time-series forecasting (LMTF). Most of these approaches can be categorized as either historical-value-based methods, which rely on discretely sampled past observations, or time-index-based methods that model time indices directly as input variables. However, real-world dynamical systems often exhibit nonstationarity and suffer from insufficient sampling frequency, posing challenges such as spurious correlations between time steps and difficulties in modeling complex temporal dependencies. In this paper, we treat multivariate time series as data sampled from a continuous dynamical system governed by partial differential equations (PDEs) and propose a new model called PDETime. Instead of predicting future values directly, PDETime employs an encoding-integration-decoding architecture: it predicts the partial derivative of the system with respect to time (i.e., the first-order difference) in the latent space and then integrates this information to forecast future series. This approach enhances both performance and stability, especially in scenarios with extremely long forecasting windows. Extensive experiments on seven diverse real-world LMTF datasets demonstrate that PDETime not only adapts effectively to the intrinsic spatiotemporal nature of the data but also sets new benchmarks by achieving state-of-the-art results.

2756Training-free Long Video Generation with Chain of Diffusion Model Experts

[openreview] [pdf]

Abstract Video generation models hold substantial potential in areas such as filmmaking. However, current video diffusion models need high computational costs and produce suboptimal results due to high complexity of video generation task. In this paper, we propose \textbf{ConFiner}, an efficient high-quality video generation framework that decouples video generation into easier subtasks: structure \textbf{con}trol and spatial-temporal re\textbf{fine}ment. It can generate high-quality videos with chain of off-the-shelf diffusion model experts, each expert responsible for a decoupled subtask. During the refinement, we introduce coordinated denoising, which can merge multiple diffusion experts’ capabilities into a single sampling. Furthermore, we design ConFiner-Long framework, which can generate long coherent video with three constraint strategies on ConFiner. Experimental results indicate that with only 10% of the inference cost, our ConFiner surpasses representative models like Lavie and Modelscope across all objective and subjective metrics. And ConFiner-Long can generate high-quality and coherent videos with up to 600 frames.

2757Generalizable autoregressive modeling of time series through functional narratives

[openreview] [pdf]

Abstract Time series data are inherently functions of time, yet current transformers often learn time series by modeling them as mere concatenations of time periods, overlooking their functional properties. In this work, we propose a novel objective for transformers that learn time series by re-interpreting them as temporal functions. We build an alternative sequence of time series by constructing degradation operators of different intensity in the functional space, creating augmented variants of the original sample that are abstracted or simplified to different degrees. Based on the new set of generated sequence, we train an autoregressive transformer that progressively recovers the original sample from the most simplified variant. Analogous to the next word prediction task in languages that learns narratives by connecting different words, our autoregressive transformer aims to learn the Narratives of Time Series (NoTS) by connecting different functions in time. Theoretically, we justify the construction of the alternative sequence through its advantages in approximating functions. When learning time series data with transformers, constructing sequences of temporal functions allows for a broader class of approximable functions (e.g., differentiation) compared to sequences of time periods, leading to a 26% performance improvement in synthetic feature regression experiments. Experimentally, we validate NoTS in 3 different tasks across 22 real-world datasets, where we show that NoTS significantly outperforms other pre-training methods by up to 6%. Additionally, combining NoTS on top of existing transformer architectures can consistently boost the performance. Our results demonstrate the potential of NoTS as a general-purpose dynamic learner, offering a viable alternative for developing foundation models for time series analysis.

2758Dynamic SVD-Enhanced Approach for Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL) has emerged as a promising paradigm for collaborative machine learning while preserving data privacy. However, existing FL approaches face challenges in balancing model generalization among heterogeneous clients and resistance to malicious attacks. This paper introduces Dynamic SVD-driven Federated Learning (DSVD-FL), a novel approach that addresses these challenges simultaneously. DSVD-FL dynamically adjusts the contribution of each client using Singular Value Decomposition (SVD), introducing an adaptive weighting mechanism based on singular value contributions and vector alignments. Theoretical analysis demonstrates the convergence properties and computational efficiency of our approach. Experimental results on both IID and non-IID datasets show that DSVD-FL outperforms state-of-the-art FL approaches in terms of model accuracy, robustness against various attack scenarios, while maintaining competitive computational efficiency. We perform an ablation study to explore the key components of SVD that impact the federated learning performance.

2759TreeDQN: Sample-Efficient Off-Policy Reinforcement Learning for Combinatorial Optimization

[openreview] [pdf]

Abstract A convenient approach to optimally solving combinatorial optimization tasks is Branch-and-Bound method. The branching heuristic in this method can be learned to solve a large set of similar tasks. The promising results here are achieved by the recently appeared on-policy reinforcement learning (RL) method based on the tree Markov Decision Process (tMDP). To overcome its main disadvantages, namely, very large training time and unstable training, we propose TreeDQN, a sample-efficient off-policy RL method that is trained by optimizing the geometric mean of expected return. To theoretically support the training procedure for our method, we prove the contraction property of the Bellman operator for the tree MDP. As a result, our method requires up to 10 times less training data, performs faster than known on-policy methods on synthetic tasks. Moreover, TreeDQN significantly outperforms the state-of-the-art techniques on a challenging practical task from the ML4CO competition.

2760Constraint-Conditioned Actor-Critic for Offline Safe Reinforcement Learning

[openreview] [pdf]

Abstract Offline safe reinforcement learning (OSRL) aims to learn policies with high rewards while satisfying safety constraints solely from data collected offline. However, the learned policies often struggle to handle states and actions that are not present or out-of-distribution (OOD) from the offline dataset, which can result in violation of the safety constraints or overly conservative behaviors during their online deployment. Moreover, many existing methods are unable to learn policies that can adapt to varying constraint thresholds. To address these challenges, we propose constraint-conditioned actor-critic (CCAC), a novel OSRL method that models the relationship between state-action distributions and safety constraints, and leverages this relationship to regularize critics and policy learning. CCAC learns policies that can effectively handle OOD data and adapt to varying constraint thresholds. Empirical evaluations on the DSRL\texttt{DSRL} benchmarks show that CCAC significantly outperforms existing methods for learning adaptive, safe, and high-reward policies.

2761Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient

[openreview] [pdf]

Abstract Model-based reinforcement learning (RL) offers a solution to the data inefficiency that plagues most model-free RL algorithms. However, learning a robust world model often demands complex and deep architectures, which are expensive to compute and train. Within the world model, dynamics models are particularly crucial for accurate predictions, and various dynamics-model architectures have been explored, each with its own set of challenges. Currently, recurrent neural network (RNN) based world models face issues such as vanishing gradients and difficulty in capturing long-term dependencies effectively. In contrast, use of transformers suffers from the well-known issues of self-attention mechanisms, where both memory and computational complexity scale as O(n2)O(n^2), with nn representing the sequence length.To address these challenges we propose a state space model (SSM) based world model, specifically based on Mamba, that achieves O(n)O(n) memory and computational complexity while effectively capturing long-term dependencies and facilitating the use of longer training sequences efficiently. We also introduce a novel sampling method to mitigate the suboptimality caused by an incorrect world model in the early stages of training, combining it with the aforementioned technique to achieve a normalised score comparable to other state-of-the-art model-based RL algorithms using only a 7 million trainable parameter world model. This model is accessible and can be trained on an off-the-shelf laptop.

2762Efficient Adversarial Detection and Purification with Diffusion Models

[openreview] [pdf]

Abstract Adversarial training and adversarial purification are two effective and practical defense methods to enhance a model’s robustness against adversarial attacks. However, adversarial training necessitates additional training, while adversarial purification suffers from low time efficiency. More critically, current defenses are designed under the perturbation-based adversarial threat model, which is ineffective against the recently proposed unrestricted adversarial attacks. In this paper, we propose an effective and efficient adversarial defense method that counters both perturbation-based and unrestricted adversarial attacks. Our defense is inspired by the observation that adversarial attacks are typically located near the decision boundary and are sensitive to pixel changes. To address this, we introduce adversarial anti-aliasing to mitigate adversarial modifications. Additionally, we propose adversarial super-resolution, which leverages prior knowledge from clean datasets to benignly recover images. These approaches do not require additional training and are computationally efficient. Extensive experiments against both perturbation-based and unrestricted adversarial attacks demonstrate that our defense method outperforms state-of-the-art adversarial purification methods.

2763An Online Learning Approach to Prompt-based Selection of Generative Models

[openreview] [pdf]

Abstract Selecting a sample generation scheme from multiple text-based generative models is typically addressed by choosing the model that maximizes an averaged evaluation score. However, this score-based selection overlooks the possibility that different models achieve the best generation performance for different types of text prompts. An online identification of the best generation model for various input prompts can reduce the costs associated with querying sub-optimal models. In this work, we explore the possibility of varying rankings of text-based generative models for different text prompts and propose an online learning framework to predict the best data generation model for a given input prompt. The proposed framework adapts the kernelized contextual bandit (CB) methodology to a CB setting with shared context variables across arms, utilizing the generated data to update a kernel-based function that predicts which model will achieve the highest score for unseen text prompts. Additionally, we apply random Fourier features (RFF) to the kernelized CB algorithm to accelerate the online learning process and establish a O~(T)\widetilde{\mathcal{O}}(\sqrt{T}) regret bound for the proposed RFF-based CB algorithm over T iterations. Our numerical experiments on real and simulated text-to-image and image-to-text generative models show RFF-UCB performs successfully in identifying the best generation model across different sample types.

2764Generalization Aware Minimization

[openreview] [pdf]

Abstract Sharpness-Aware Minimization (SAM) algorithms have effectively improved neural network generalization by steering model parameters away from sharp regions of the training loss landscape, which tend to generalize poorly. However, the underlying mechanisms of SAM are not fully understood, and recent studies question whether its bias toward flatter regions is why it improves generalization. In this work, we introduce Generalization-Aware Minimization (GAM), a generalized version of SAM that employs multiple perturbation steps instead of SAM’s single-step perturbations. This allows GAM to directly guide model parameters toward areas of the landscape that generalize better. We show that the expected true (test) loss landscape is a rescaled version of the observed training loss landscape and demonstrate how GAM’s multiple perturbative updates can be designed to optimize this expected true loss. Finally, we present a practical online algorithm that adapts GAM’s perturbative steps during training to improve generalization, and we empirically validate its superior performance over SAM on benchmark datasets. We believe GAM sheds light on the generalization improvements of sharpness-based algorithms and can inspire the development of optimizers with even better generalization.

2765Koopman Embedded Equivariant Control

[openreview] [pdf]

Abstract An efficient way to control systems with unknown nonlinear dynamics is to find an appropriate embedding or representation for simplified approximation (e.g. linearization), which facilitates system identification and control synthesis. Nevertheless, there has been a lack of embedding methods that can guarantee (i) embedding the dynamical system comprehensively, including the vector fields (ODE form) of the dynamics, and (ii) preserving the consistency of control effect between the original and latent space. To address these challenges, we propose Koopman Embedded Equivariant Control (KEEC) to learn an embedding of the states and vector fields such that a Koopman operator is approximated as the latent dynamics. Due to the Koopman operator’s linearity, learning the latent vector fields of the dynamics becomes simply solving linear equations. Thus in KEEC, the analytical form of the greedy control policy, which is dependent on the learned differential information of the dynamics and value function, is also simplified. Meanwhile, KEEC preserves the effectiveness of the control policy in the latent space by preserving the metric in two spaces. Our algorithm achieves superior performances in the experiments conducted on various control domains, including the image-based Pendulum, Lorenz-63 and the wave equation.

2766Unlocking Full Dynamic Optimization of District Energy Systems through State-Space Model Learning

[openreview] [pdf]

Abstract Predictive control enables the operation of physical systems along an optimal trajectory based on forecasts and dynamic simulations. However, the complexity of system dynamics and high computational cost of optimization typically restrict the optimization window to short horizons. Thus, any potential benefits from mid- and long-term rewards are withdrawn. This is particularly relevant for optimization of district energy systems using various low-environmental-impact sources. To address this, we present an end-to-end methodological framework for learning state-space representations of such systems that significantly reduce computational load. The proposed approach leverages the implicit graph structure of such systems to develop and train a physics-informed spatio-temporal graph neural network. This methodology is evaluated on a real-world district heating system incorporating thermal solar panels, storage, biomass and natural gas boilers. Through historical time-series data augmentation and hyperparameter optimization, the learned model demonstrates strong generalization ability and high accuracy in predicting system dynamics. Our method reduces simulation time by four orders of magnitude, cutting optimization time from several days to mere minutes, while also lowering operational costs by up to 25%.

2767Leave-One-Out Stable Conformal Prediction

[openreview] [pdf]

Abstract Conformal prediction (CP) is an important tool for distribution-free predictive uncertainty quantification. Yet, a major challenge is to balance computational efficiency and prediction accuracy, particularly for multiple predictions. We propose {\bf L}eave-{\bf O}ne-{\bf O}ut {\bf Stab}le {\bf C}onformal {\bf P}rediction (\texttt{LOO-StabCP}), a novel method to speed up full conformal using algorithmic stability without sample splitting. By leveraging \emph{leave-one-out} stability, our method is much faster in handling a large number of prediction requests compared to existing method {\tt RO-StabCP} based on \emph{replace-one} stability. We derived stability bounds for two popular machine learning tools: regularized risk minimization (RLM) and stochastic gradient descent (SGD). Our method is theoretically justified and demonstrates superior numerical performance on synthetic and real-world data. We applied our method to a screening problem, where its effective exploitation of training data led to improved test power compared to state-of-the-art method based on split conformal.

2768PDE-GAN for solving PDE optimal control problems more accurately and efficiently

[openreview] [pdf]

Abstract PDE optimal control (PDEOC) problems aim to optimize the performance of physical systems constrained by partial differential equations (PDEs) to achieve desired characteristics. Such problems frequently appear in scientific discoveries and are of huge engineering importance. Physics-informed neural networks (PINNs) are recently proposed to solve PDEOC problems, but it may fail to balance the different competing loss terms in such problems. Our work proposes PDE-GAN, a novel approach that puts PINNs in the framework of generative adversarial networks (GANs) “learn the loss function” to address the trade-off between the different competing loss terms effectively. We conducted detailed and comprehensive experiments to compare PDE-GANs with vanilla PINNs in solving four typical and representative PDEOC problems, namely, (1) boundary control on Laplace Equation, (2) time-dependent distributed control on Inviscous Burgers’ Equation, (3) initial value control on Burgers’ Equation with Viscosity, and (4) time-space-dependent distributed control on Burgers’ Equation with Viscosity. Strong numerical evidence supports the PDE-GAN that it achieves the highest accuracy and shortest computation time without the need of line search which is necessary for vanilla PINNs.

2769IntersectionZoo: Eco-driving for Benchmarking Multi-Agent Contextual Reinforcement Learning

[openreview] [pdf]

Abstract Despite the popularity of multi-agent reinforcement learning (RL) in simulated and two-player applications, its success in messy real-world applications has been limited. A key challenge lies in its generalizability across problem variations, a common necessity for many real-world problems. Contextual reinforcement learning (CRL) formalizes learning policies that generalize across problem variations. However, the lack of standardized benchmarks for multi-agent CRL has hindered progress in the field. Such benchmarks are desired to be based on real-world applications to naturally capture the many open challenges of real-world problems that affect generalization. To bridge this gap, we propose IntersectionZoo, a comprehensive benchmark suite for multi-agent CRL through the real-world application of cooperative eco-driving in urban road networks. The task of cooperative eco-driving is to control a fleet of vehicles to reduce fleet-level vehicular emissions. By grounding IntersectionZoo in a real-world application, we naturally capture real-world problem characteristics, such as partial observability and multiple competing objectives. IntersectionZoo is built on data-informed simulations of 16,334 signalized intersections derived from 10 major US cities, modeled in an open-source industry-grade microscopic traffic simulator. By modeling factors affecting vehicular exhaust emissions (e.g., temperature, road conditions, travel demand), IntersectionZoo provides one million data-driven traffic scenarios. Using these traffic scenarios, we benchmark popular multi-agent RL and human-like driving algorithms and demonstrate that the popular multi-agent RL algorithms struggle to generalize in CRL settings.

2770Can Model Randomization Offer Robustness Against Query-Based Black-Box Attacks?

[openreview] [pdf]

Abstract Deep neural networks are misguided by simple-to-craft, imperceptible adversarial perturbations to inputs. Now, it is possible to craft such perturbations solely using model outputs and black-box attack algorithms. These algorithms compute adversarial examples by iteratively querying a model and inspecting responses. Attacks success in near information vacuums pose a significant challenge for developing mitigations. We investigate a new idea for a defense driven by a fundamental insight—to compute an adversarial example, attacks depend on the relationship between successive responses to queries to optimize a perturbation. Therefore, to obfuscate this relationship, we investigate randomly sampling a model from a set to generate a response to a query. Effectively, this model randomization violates the attacker’s expectation of the unknown parameters of a model to remain static between queries to extract information to guide the search toward an adversarial example. It is not immediately clear if model randomization can lead to sufficient obfuscation to confuse query-based black-box attacks or how such a method could be built. Our theoretical analysis proves model randomization always increases resilience to query-based black-box attacks. We demonstrate with extensive empirical studies using 6 state-of-the-art attacks under all three perturbation objectives (l,l2,l0l_\infty, l_2, l_0) and adaptive attacks, our proposed method injects sufficient uncertainty through obfuscation to yield a highly effective defense.

2771AnyBimanual: Transferring Single-arm Policy for General Bimanual Manipulation

[openreview] [pdf]

Abstract Performing language-conditioned bimanual manipulation tasks is of great importance for many applications ranging from household service to industrial assembly. However, teleoperating dual-arm demonstrations is expensive due to the high-dimensional action space, which poses challenges for conventional methods to handle general bimanual manipulation tasks. In contrast, single-arm policy has recently demonstrated impressive generalizability across a wide range of tasks because of scaled model parameters and training data, which can provide sharable manipulation knowledge for dual-arm systems. To this end, we propose a plug-and-play method named AnyBimanual, which transfers pretrained single-arm policy to multi-task bimanual manipulation policy with limited bimanual demonstrations. Specifically, we first introduce a skill manager to dynamically schedule the discovered skill primitives from pretrained single-arm policy for bimanual manipulation tasks, which combines skill primitives with embodiment-specific compensation. To mitigate the observation discrepancy between single-arm and dual-arm systems, we present a voxel editor to generate spatial soft masks for visual embedding of the workspace, which aims to align visual input of single-arm policy model for each arm with those during pretraining stage. Extensive results on 13 simulated and real-world tasks indicate the superiority of AnyBimanual with an improvement of 12.67% on average success rate compared with previous state-of-the-art methods.

2772AgentQuest: Benchmarking LLM and VLM Agents on Long-Horizon Interactive Tasks

[openreview] [pdf]

Abstract Large Language Models (LLMs) and Vision Language Models (VLMs) possess extensive knowledge and exhibit promising reasoning abilities, however, they still struggle to perform well in complex, dynamic environments. Real-world tasks require handling intricate interactions, advanced spatial reasoning, long-term planning, and continuous exploration of new strategies—areas in which we lack effective methodologies for comprehensively evaluating these capabilities. To address this gap, we introduce AgentQuest, a novel benchmark designed to assess the agentic capabilities of LLMs and VLMs through a diverse set of challenging games. Our benchmark incorporates a range of existing reinforcement learning environments with varying levels of difficulty, including tasks that are solvable by non-expert humans in seconds to extremely challenging ones that may take years to master (e.g., the NetHack Learning Environment). We devise fine-grained metrics to measure performance and conduct an extensive evaluation of several popular open-source and closed-source LLMs and VLMs. Our findings indicate that while current models achieve partial success in the easier games, they struggle significantly with more challenging tasks. Notably, we observe severe deficiencies in vision-based decision-making, as models perform worse when visual representations of the environments are provided. We release AgentQuest as an open and user-friendly benchmark to facilitate future research and development in the agentic community.

2773Counterfactual Causal Inference in Natural Language with Large Language Models

[openreview] [pdf]

Abstract Causal structure discovery methods are commonly applied to structured data where the causal variables are known and where statistical testing can be used to assess the causal relationships. By contrast, recovering a causal structure from unstructured natural language data such as news articles contains numerous challenges due to the absence of known variables or counterfactual data to estimate the causal links. Large Language Models (LLMs) have shown promising results in this direction but also exhibit limitations. This work investigates LLM’s abilities to build causal graphs from text documents and perform counterfactual causal inference. We propose an end-to-end causal structure discovery and causal inference method from natural language: we first use an LLM to extract the instantiated causal variables from text data and build a causal graph. We merge causal graphs from multiple data sources to represent the most exhaustive set of causes possible. We then conduct counterfactual inference on the estimated graph. The causal graph conditioning allows reduction of LLM biases and better represents the causal estimands. We use our method to show that the limitations in the counterfactual causal reasoning abilities come from prediction errors and propose directions to mitigate them. We demonstrate the applicability of our method on real-world news articles.

2774FLOPS: Forward Learning with OPtimal Sampling

[openreview] [pdf]

Abstract Given the limitations of backpropagation, perturbation-based gradient computation methods have recently gained focus for learning with only forward passes, also referred to as queries. Conventional forward learning consumes enormous queries on each data point for accurate gradient estimation through Monte Carlo sampling, which hinders the scalability of those algorithms. However, not all data points deserve equal queries for gradient estimation. In this paper, we study the problem of improving the forward learning efficiency from a novel perspective: how to reduce the gradient estimation variance with minimum cost? For this, we propose to allocate the optimal number of queries over each data in one batch during training to achieve a good balance between estimation accuracy and computational efficiency. Specifically, with a simplified proxy objective and a reparameterization technique, we derive a novel plug-and-play query allocator with minimal parameters. Theoretical results are carried out to verify its optimality. We conduct extensive experiments for fine-tuning Vision Transformers on various datasets and further deploy the allocator to two black-box applications: prompt tuning and multimodal alignment for foundation models. All findings demonstrate that our proposed allocator significantly enhances the scalability of forward-learning algorithms, paving the way for real-world applications.

[openreview] [pdf]

Abstract Lean is an advanced proof assistant designed to facilitate formal theorem proving by providing a variety of interactive feedback. In this paper, we explore methodologies to leverage proof assistant feedback to augment the capabilities of large language models in constructing formal proofs. First, we deploy online reinforcement learning using Lean verification outcomes as the reward signal to improve the proof completion policy. This straightforward approach shows great promise in enhancing the model’s alignment with the formal verification system. In addition, we propose RMaxTS, a variant of Monte-Carlo tree search that employs an intrinsic-reward-driven exploration strategy to generate diverse proof paths. The tree structure is organized to represent the transitions of intermediate tactic states, extracted from the compilation messages given by Lean’s tactic mode. The intrinsic reward is constructed to incentivize the discovery of novel tactic states, which helps to to mitigate the sparse-reward problem inherent in proof search. These techniques lead to a more efficient planning scheme for formal proof generation, achieving new state-of-the-art results on both miniF2F and ProofNet benchmarks.

2776Learn from Known Unknowns: A Unified Empirical Bayesian Framework for Improving Group Robustness

[openreview] [pdf]

Abstract The lack of group robustness has emerged as a critical concern in machine learning, as conventional methods like Empirical Risk Minimization (ERM) can achieve high overall accuracy while yielding low worst-group accuracy in minority groups. This issue often stems from spurious correlations—non-essential features that models exploit as shortcuts—which can compromise deep learning models in high-stakes applications. Previous works have found that simply retraining classifiers with reweighted datasets or rebalanced samples could significantly improve robustness. However, existing methods lack a unified framework, as they often exhibit inconsistent performance across datasets, and sometimes rely heavily on hyperparameter tuning, making them impractical for real-world datasets. In this work, we first argue that existing methods can be unified as one Empirical Bayesian framework, where a priori of group information is not specified. We then propose our method \textit{Learn from Known Unknowns} under this framework by quantifying the epistemic uncertainty of biased ERM models and introducing a selective reweighting technique for retraining. Our empirical results demonstrate that this approach improves group robustness across diverse datasets and reduces reliance on hyperparameter tuning, offering a more efficient and scalable solution to spurious correlations.

2777Model Cautiousness: Towards Safer Deployment in Critical Domains

[openreview] [pdf]

Abstract In this paper, we introduce the concept of model cautiousness, which stresses the importance of aligning a model’s confidence with its accuracy in in-distribution (ID) scenarios while adopting a more uncertain approach in out-of-distribution (OoD) contexts. Model cautiousness is framed as a spectrum between justified confidence and complete ignorance, induced by the inability to clearly define a model’s domain of expertise. We propose a rigorous post-hoc approach to obtain a cautious model that merges the confidence scores of the primary confidence model and a model discriminating between ID and OoD inputs. A metric to measure the cautiousness error of a confidence model is introduced. We further present a simple method for discriminating ID from OoD inputs and providing a meaningful confidence estimate that an input is OoD. Finally, we benchmark our approach across 12 question-answering and 37 vision datasets, demonstrating its effectiveness in enhancing model cautiousness compared to standard calibration procedures.

2778Markovian Transformers for Informative Language Modeling

[openreview] [pdf]

Abstract Chain-of-Thought (CoT) reasoning holds great promise for explaining the outputs of language models, but recent studies have highlighted significant challenges in its practical application for interpretability. We propose to address this issue via two key components: a technique to factor next-token prediction through intermediate CoT text, ensuring the CoT is causally load-bearing, and a reinforcement learning approach to train CoT to predict future tokens independently of other context. This results in “Markovian” language models, where CoT serves as a fixed-size state for future token prediction. Our approach optimizes for “informativeness” – the improvement in next-token predictions using a trained CoT compared to a baseline. We demonstrate our method’s effectiveness using Proximal Policy Optimization (PPO) on arithmetic problems and achieve an 11% performance boost on the GSM8K benchmark using Mistral 7B Inst V2. The increased sensitivity of model performance to CoT perturbations provides strong evidence of CoT reliance. This work advances the development of more transparent and interpretable language models, potentially enabling their extension to arbitrarily long contexts and enhancing AI reasoning capabilities across various domains.

2779ExPLoRA: Parameter-Efficient Extended Pre-Training to Adapt Vision Transformers under Domain Shifts

[openreview] [pdf]

Abstract Parameter-efficient fine-tuning (PEFT) techniques such as low-rank adaptation (LoRA) can effectively adapt large pre-trained foundation models to downstream tasks using only a small fraction (0.1%-10%) of the original trainable weights. An under-explored question of PEFT is in extending the pre-training phase without supervised labels; that is, can we adapt a pre-trained foundation model to a new domain via efficient self-supervised pre-training on this new domain? In this work, we introduce ExPLoRA, a highly effective technique to improve transfer learning of pre-trained vision transformers (ViTs) under domain shifts. Initializing a ViT with pre-trained weights on large, natural-image datasets such as from DinoV2 or MAE, ExPLoRA continues the unsupervised pre-training objective on a new domain, unfreezing 1-2 pre-trained ViT blocks and tuning all other layers with LoRA. We then fine-tune the resulting model only with LoRA on this new domain for supervised learning. Our experiments demonstrate state-of-the-art results on satellite imagery, even outperforming fully pre-training and fine-tuning ViTs. Using the DinoV2 training objective, we demonstrate up to 7.5% improvement in linear probing top-1 accuracy on downstream tasks while using <10% of the number of parameters that are used in prior fully-tuned state-of-the art approaches. Our ablation studies confirm the efficacy of our approach over other baselines, including PEFT and unfreezing more ViT blocks.

2780Hallucination in LVLMs: Fictitious Presupposition Questions, Benchmark, and Solution

[openreview] [pdf]

Abstract Large Vision-Language Models (LVLMs) have achieved impressive performance across various vision-language tasks. However, hallucinations, i.e., generating counterfactual responses, remain a significant challenge. Although recent models have mitigated hallucinations in tasks such as object existence and image description, they primarily address hallucinations in response generation while overlooking the task question itself. This paper highlights the vulnerability of LVLMs in solving fictitious presupposition questions (FPQs), where the models are prone to accept the presuppositions of non-existent objects and produce severe hallucinatory responses. To this end, we first introduce a novel benchmark, VFP-Bench, to evaluate LVLMs’ capability to discriminate fictitious presuppositions and generate factual responses. Moreover, we introduce Antidote, a universal, synthetic data-driven self-correction solution for alleviating hallucination in FPQs and conventional tasks. It leverages synthetic data to incorporate factual priors into questions/queries to achieve self-correction, decoupling hallucination alleviation into a preference optimization problem. Applied to the LLaVA series, it enhances performance on VFP-Bench by over 50%, POPE by 1.8–3.3%, and CHAIR & SHR by 30–50%, without relying on external supervision from stronger LVLMs or human feedback and introducing noticeable catastrophic forgetting issues.

2781Learning through experience:Episodic memory representation for cognitive agents

[openreview] [pdf]

Abstract As the demand for intelligent robots and cognitive agents rises, the ability to retain and utilize past experiences through episodic memory has become crucial, especially for social companion robots that rely on previous interactions for task execution. To address this, we introduce Episodic Memory for Cognitive Agents (EMCA), a novel framework that advances knowledge representation by integrating real-world interactions. EMCA enables agents to adapt to complex environments by learning from tasks, interacting with humans, and processing multimodal data—such as speech, vision, and non-verbal cues—without pre-training on specific scenarios. EMCA models episodic memory through a graph-based structure , allowing for incremental storage and retrieval of experiences. Each interaction or event enriches the memory graph, supporting continuous learning and adaptation without extensive retraining. This human-like memory formation optimizes the agent’s ability to retrieve relevant information for tasks like localization, planning, and reasoning based on prior experiences.Unlike conventional models relying on temporal markers or recurrent patterns, EMCA encodes data like human memory, allowing reasoning across diverse scenarios regardless of temporal patterns. The framework dynamically builds a memory graph with semantic and temporal connections based on the agent’s experiences, promoting flexible temporal reasoning. It also introduces mechanisms for clustering new memories and a dynamic retrieval policy that adjusts based on context or query type, ensuring robustness even in unpredictable scenarios. Empirical tests show EMCA adapts effectively to real-world data, offering reliability and flexibility in dynamic environments.

2782Unsupervised Zero-Shot Reinforcement Learning via Dual-Value Forward-Backward Representation

[openreview] [pdf]

Abstract Online unsupervised reinforcement learning (URL) can discover diverse skills via reward-free pre-training and exhibits impressive downstream task adaptation abilities through further fine-tuning. However, online URL methods face challenges in achieving zero-shot generalization, i.e., directly applying pre-trained policies to downstream tasks without additional planning or learning. In this paper, we propose a novel Dual-Value Forward-Backward representation (DVFB) framework with a contrastive entropy intrinsic reward to achieve both zero-shot generalization and fine-tuning adaptation in online URL. On the one hand, we demonstrate that poor exploration in forward-backward representations can lead to limited data diversity in online URL, impairing successor measures, and ultimately constraining generalization ability. To address this issue, the DVFB framework learns successor measures through a skill value function while promoting data diversity through an exploration value function, thus enabling zero-shot generalization. On the other hand, and somewhat surprisingly, by employing a straightforward dual-value fine-tuning scheme combined with a reward mapping technique, the pre-trained policy further enhances its performance through fine-tuning on downstream tasks, building on its zero-shot performance. Through extensive multi-task generalization experiments, DVFB demonstrates both superior zero-shot generalization (outperforming on all 12 tasks) and fine-tuning adaptation (leading on 10 out of 12 tasks) abilities, surpassing state-of-the-art URL methods.

2783A Provable Quantile Regression Adapter via Transfer Learning

[openreview] [pdf]

Abstract Adapter-tuning strategy is an efficient method in machine learning that introduces lightweight and sparse trainable parameters into a pretrained model without altering the original parameters (e.g., low-rank adaptation of large language models). Nevertheless, most existing adapter-tuning approaches are developed for risk-neutral task objectives and the study on the adaptation of risk-sensitive tasks is limited. In this paper, we propose a transfer learning-based quantile regression adapter to improve the estimation of quantile-related risks by leveraging existing pretrained models. We also establish a theoretical analysis to quantify the efficacy of our quantile regression adapter. Particularly, we introduce a transferability measure that characterizes the intrinsic similarity between the pretrained model and downstream task in order to explain when transferring knowledge can improve downstream learning. Under appropriate transferability and structural assumptions, we establish error bounds for the estimation and out-of-sample prediction quality by our quantile regression adapter. Compared to vanilla approaches without transfer learning, our method is provably more sample efficient. Extensive numerical simulations are conducted to demonstrate the superiority and robustness of our method empirically.

2784Collective Model Intelligence Requires Compatible Specialization

[openreview] [pdf]

Abstract In this work, we explore the limitations of combining models by averaging intermediate features, referred to as model merging\textit{model merging}, and propose a new direction for achieving collective model intelligence through what we call compatible specialization\textit{compatible specialization}. Current methods for model merging, such as parameter and feature averaging, struggle to effectively combine specialized models due to representational divergence during fine-tuning. As models specialize to their individual domains, their internal feature representations become increasingly incompatible, leading to poor performance when attempting to merge them for new tasks. We analyze this phenomenon using centered kernel alignment (CKA) and show that as models specialize, the similarity in their feature space structure diminishes, hindering their capacity for collective use. To address these challenges, we investigate routing-based merging strategies, which offer more flexible methods for combining specialized models by dynamically routing across different layers. This allows us to improve on existing methods by combining features from multiple layers rather than relying on fixed, layer-wise combinations. However, we find that these approaches still face limitations when layers within models are representationally incompatible. Our findings highlight the importance of designing new approaches for model merging that operate on well-defined input and output spaces, similar to how humans communicate through language rather than intermediate neural activations.

2785Activating More Advantageous Neurons Can Improve Adversarial Transferability

[openreview] [pdf]

Abstract Deep Neural Networks (DNNs) are vulnerable to unseen noise, lighting the need to identify the deficiencies of DNNs to mitigate this vulnerability. In the field of adversarial attacks, existing works investigate the deficiencies causing the vulnerability of DNNs, quantifying the vulnerability of DNNs and demonstrating the transferability of adversarial examples where adversarial examples crafted for one model can deceive another. Among the related works, adversarial transferability attracts much attention since transferable adversarial examples enable black-box attacks and raise concerns about DNNs. Although various novel adversarial attacks are presented to improve the adversarial transferability, the property of DNNs that leads to the improvements remains unidentified. This work delves into this issue and reveals that different benign input with different features activates mostly different neurons in a model, and the model may be viewed as an ensemble including different submodels capturing different features. Therefore, an adversarial attack can activate more neurons to generate the adversarial examples, thus probably making the examples applicable to diverse models to enhance the adversarial transferability. Also, data transformation can help exclude wrong answers to boost the adversarial example. The extensive experiments demonstrate the soundness and superiority of our work.

2786LEVERAGING LEARNING RATE GRADIENTS FOR AUTOMATIC LEARNING RATE SELECTION

[openreview] [pdf]

Abstract Selecting an optimal learning rate (LR) is crucial for training deep neural networks, significantly affecting both convergence speed and final model performance. Determining this optimal LR typically involves two key challenges: choosing an appropriate initial LR and selecting an LR scheduler for adjusting the LR during training. This paper focuses on the former challenge—selecting the initial LR. Traditionally, this task relies on manual tuning or heuristic methods, often involving extensive trial-and-error or computationally expensive search strategies like grid search or random search. We propose an algorithm, Automatic Learning Rate Selection (ALRS), to find the initial LR without the need for manual intervention. ALRS leverages the gradient of the LR itself — a less explored approach in the field. ALRS is a computationally lightweight pre-training process that automatically selects the initial LR by iterative refinements using the LR gradient, specifically analyzing its sign information, combined with suitable search algorithms. This approach efficiently converges to the optimal LR in a stable and robust manner across various optimizers and network architectures.We evaluate our technique on standard deep learning benchmarks, including MNIST with a CNN and CIFAR-10 and CIFAR-100 with ResNet-18, using both SGD and Adam optimizers. Our experiments demonstrate that the automatically determined LRs achieve performance comparable to manually tuned LRs and state-of-the-art results.

2787Going Beyond Feature Similarity: Effective Dataset distillation based on Class-aware Conditional Mutual Information

[openreview] [pdf]

Abstract Dataset distillation (DD) aims to minimize the time and memory consumption needed for training deep neural networks on large datasets, by creating a smaller synthetic dataset that has similar performance to that of the full real dataset. However, current dataset distillation methods often result in synthetic datasets that are excessively difficult for networks to learn from, due to the compression of a substantial amount of information from the original data through metrics measuring feature similarity, e,g., distribution matching (DM). In this work, we introduce conditional mutual information (CMI) to assess the class-aware complexity of a dataset and propose a novel method by minimizing CMI. Specifically, we minimize the distillation loss while constraining the class-aware complexity of the synthetic dataset by minimizing its empirical CMI from the feature space of pre-trained networks, simultaneously. Conducting on a thorough set of experiments, we show that our method can serve as a general regularization method to existing DD methods and improve the performance and training efficiency.

2788Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces

[openreview] [pdf]

Abstract In human cognition theory, human thinking is governed by two systems: the fast and intuitive System 1 and the slower but more deliberative System 2. Recent studies have shown that incorporating System 2 process into Transformers including large language models (LLMs), significantly enhances their reasoning capabilities. Nevertheless, models that purely resemble System 2 thinking require substantially higher computational costs and are much slower to respond. To address this challenge, we present \dualformer, a single Transformer model that seamlessly integrates both the fast and slow reasoning modes. \dualformer is obtained by training on data with randomized reasoning traces, where different parts of the traces are dropped during training. The dropping strategies are specifically tailored according to the trace structure, analogous to analyzing our thinking process and creating shortcuts with patterns. At inference time, our model can be configured to output only the solutions (\emph{fast mode}) or both the reasoning chain and the final solution (\emph{slow mode}), or automatically decide which mode to engage (\emph{auto mode}). In all cases, \dualformer outperforms the corresponding baseline models in both performance and computational efficiency: \textbf{(1)} in slow mode, \dualformer optimally solves unseen 30×3030 \times 30 maze navigation tasks 97.697.6% of the time, surpassing the \searchformer (trained on data with complete reasoning traces) baseline performance of 93.3%, while only using 45.545.5% fewer reasoning steps; \textbf{(2)} in fast mode, \dualformer completes those tasks with an 8080% optimal rate, significantly outperforming the Solution-Only model (trained on solution-only data), which has an optimal rate of only 30%; \textbf{(3)} when operating in auto mode, \dualformer achieves an optimal rate of 96.6% while utilizing 59.959.9% fewer reasoning steps compared to \searchformer. For math problems, our techniques have also achieved improved performance with LLM fine-tuning, showing its generalization beyond task-specific models.

2789Less is More: Masking Elements in Image Condition Features Avoids Content Leakages in Style Transfer Diffusion Models

[openreview] [pdf]

Abstract Given a style-reference image as the additional image condition, text-to-image diffusion models have demonstrated impressive capabilities in generating images that possess the content of text prompts while adopting the visual style of the reference image. However, current state-of-the-art methods often struggle to disentangle content and style from style-reference images, leading to issues such as content leakages. To address this issue, we propose a masking-based method that efficiently decouples content from style without the need of tuning any model parameters. By simply masking specific elements in the style reference’s image features, we uncover a critical yet under-explored principle: guiding with appropriately-selected fewer conditions (e.g., dropping several image feature elements) can efficiently avoid unwanted content flowing into the diffusion models, enhancing the style transfer performances of text-to-image diffusion models. In this paper, we validate this finding both theoretically and experimentally. Extensive experiments across various styles demonstrate the effectiveness of our masking-based method and support our theoretical results.

2790IDEATOR: Jailbreaking VLMs Using VLMs

[openreview] [pdf]

Abstract As large Vision-Language Models (VLMs) continue to gain prominence, ensuring their safety deployment in real-world applications has become a critical concern. Recently, significant research efforts have focused on evaluating the robustness of VLMs against jailbreak attacks. Due to challenges in obtaining multi-modal data, current studies often assess VLM robustness by generating adversarial or query-relevant images based on harmful text datasets. However, the jailbreak images generated this way exhibit certain limitations. Adversarial images require white-box access to the target VLM and are relatively easy to defend against, while query-relevant images must be linked to the target harmful content, limiting their diversity and effectiveness. In this paper, we propose a novel jailbreak method named IDEATOR, which autonomously generates malicious image-text pairs for black-box jailbreak attacks. IDEATOR is a VLM-based approach inspired by our conjecture that a VLM itself might be a powerful red team model for generating jailbreak prompts. Specifically, IDEATOR employs a VLM to generate jailbreak texts while leveraging a state-of-the-art diffusion model to create corresponding jailbreak images. Extensive experiments demonstrate the high effectiveness and transferability of IDEATOR. It successfully jailbreaks MiniGPT-4 with a 94% success rate and transfers seamlessly to LLVA and InstructBLIP, achieving high success rates of 82% and 88%, respectively. IDEATOR uncovers previously unrecognized vulnerabilities in VLMs, calling for advanced safety mechanisms.

2791Is Synthetic Data Ready for Improving Visual Grounding?

[openreview] [pdf]

Abstract This paper extensively investigates the effectiveness of synthetic training data to improve the capabilities of vision-and-language models for grounding textual descriptions to image regions. We explore various strategies to best generate image-text pairs and image-text-box triplets using a series of pretrained models under different settings and varying degrees of reliance on real data. Through comparative analyses with synthetic, real, and web-crawled data, we identify factors that contribute to performance differences, and propose SynGround, an effective pipeline for generating useful synthetic data for visual grounding. Our findings show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models and offers the potential for infinite data generation. Particularly, SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively, across the RefCOCO+ and the Flickr30k benchmarks.

2792Optimizing Attention with Mirror Descent: Generalized Max-Margin Token Selection

[openreview] [pdf]

Abstract Attention mechanisms have revolutionized numerous domains of artificial intelligence, including natural language processing and computer vision, by enabling models to selectively focus on relevant parts of the input data. Building on recent results characterizing the optimization dynamics of gradient descent (GD) and the structural properties of its preferred solutions in attention-based models, this paper explores the convergence properties and implicit bias of a family of mirror descent (MD) algorithms designed for softmax attention mechanisms, with the potential function chosen as the pp-th power of the p\ell_p-norm. Specifically, we show the directional convergence of these algorithms to a generalized hard-margin SVM with an p\ell_p-norm objective when applied to a classification problem using a one-layer softmax attention model. Our theoretical results demonstrate that these algorithms not only converge directionally to the generalized max-margin solutions but also do so at a rate comparable to that of traditional GD in simpler models, despite the highly nonlinear and nonconvex nature of the present problem. Additionally, we delve into the joint optimization dynamics of the key-query matrix and the decoder, establishing conditions under which this complex joint optimization converges to their respective hard-margin SVM solutions.

2793Generalizability of Neural Networks Minimizing Empirical Risk Based on Expressive Power

[openreview] [pdf]

Abstract The primary objective of learning methods is generalization. Classic generalization bounds, based on VC-dimension or Rademacher complexity, are uniformly applicable to all networks in the hypothesis space. On the other hand, algorithm-dependent generalization bounds, like stability bounds, address more practical scenarios and provide generalization conditions for neural networks trained using SGD. However, these bounds often rely on strict assumptions, such as the NTK hypothesis or convexity of the empirical loss, which are typically not met by neural networks. In order to establish generalizability under less stringent assumptions, this paper investigates generalizability of neural networks that minimize the empirical risk. A lower bound for population accuracy is established based on the expressiveness of these networks, which indicates that with adequately large training sample and network sizes, these networks can generalize effectively. Additionally, we provide a lower bound necessary for generalization, demonstrating that, for certain data distributions, the quantity of data required to ensure generalization exceeds the network size needed to represent that distribution. Finally, we provide theoretical insights into several phenomena in deep learning, including robust overfitting, importance of over-parameterization networks, and effects of loss functions.

2794Hierarchical Classification via Diffusion on Manifolds

[openreview] [pdf]

Abstract Hierarchical classification, the problem of classifying images according to a predefined hierarchical taxonomy, has practical significance owing to the principle of ``making better mistakes’', i.e., better to predict correct coarse labels than incorrect fine labels. Yet, it is insufficiently studied in literature, presumably because simply finetuning a pretrained deep neural network using the cross-entropy loss on leaf classes already leads to good performance w.r.t not only the popular top-1 accuracy but also hierarchical metrics. Despite the empirical effectiveness of finetuning pretrained models, we argue that hierarchical classification could be better addressed by explicitly regularizing finetuning w.r.t the predefined hierarchical taxonomy. Intuitively, with a pretrained model, data lies in hierarchical manifolds in the feature space. Hence, we propose a hierarchical multimodal contrastive finetuning method to leverage taxonomic hierarchy to finetune a pretrained model for better hierarchical classification. Moreover, the hierarchical manifolds motivate a graph diffusion-based method to adjust posteriors at hierarchical levels altogether in inference. This distinguishes our method from the existing ones, including top-down approaches (using coarse-class predictions to adjust fine-class predictions) and bottom-up approaches (processing fine-class predictions towards coarse-label predictions). We validate our method on two large-scale datasets, iNat18 and iNat21. Extensive experiments demonstrate that our method significantly outperforms prior arts w.r.t both top-1 accuracy and established hierarchical metrics, thanks to our new multi-modal hierarchical contrastive training and graph-diffusion-based inference.

2795Understanding Virtual Nodes: Oversquashing and Node Heterogeneity

[openreview] [pdf]

Abstract While message passing neural networks (MPNNs) have convincing success in a range of applications, they exhibit limitations such as the oversquashing problem and their inability to capture long-range interactions. Augmenting MPNNs with a virtual node (VN) removes the locality constraint of the layer aggregation and has been found to improve performance on a range of benchmarks. We provide a comprehensive theoretical analysis of the role of VNs and benefits thereof, through the lenses of oversquashing and sensitivity analysis. First, we characterize, precisely, how the improvement afforded by VNs on the mixing abilities of the network and hence in mitigating oversquashing, depends on the underlying topology. We then highlight that, unlike Graph-Transformers (GTs), classical instantiations of the VN are often constrained to assign uniform importance to different nodes. Consequently, we propose a variant of VN with the same computational complexity, which can have different sensitivity to nodes based on the graph structure. We show that this is an extremely effective and computationally efficient baseline for graph-level tasks.

2796Replicate and Quantize: A Plug-and-Play Strategy for Load Balancing in Sparse Mixture-of-Experts LLMs

[openreview] [pdf]

Abstract While the rapid increase in the number of model parameters poses significant benefits to the development of large language models (LLMs), computational costs are also raised. In order to tackle this difficulty, the sparse mixture-of-experts(SMoE) model was introduced to tackle LLM scaling by activating a subset of experts per input. Therefore, how to leverage the knowledge of multiple experts will be an important topic. Normally, in the most extreme scenario, employing a balanced expert allocation system will result in a time-saving of nn times compared to utilizing only a single expert. Thus, in this paper we (1) systematically analyzed the performance and functionality of each expert. (2) Introduced a metric to fill the blank of evaluating load balance for the sparse mixture-of-experts(SMoE) model, based on the observation. (3) Proposed a dynamic plug-and-play strategy that is both trainingless and near-lossless, effectively resolving the load balancing problem, in contrast to previous works that focused on training strategies.

2797Do LLMs Recognize Your Preferences? Evaluating Personalized Preference Following in LLMs

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly deployed as chatbots, yet their ability to personalize responses to user preferences remains limited. We introduce PrefEval, a benchmark for evaluating LLMs’ ability to infer, memorize and adhere to user preferences in long-context conversational setting. PrefEval comprises 3,000 manually curated user preference and query pairs spanning 20 topics. PrefEval contains user personalization or preference information in both explicit and implicit preference forms, and evaluates LLM performance using a generation and a classification task. With PrefEval, we have evaluated 10 open-sourced and proprietary LLMs in multi-session conversations with varying context lengths up to 100k tokens. We benchmark with various prompting, iterative feedback, and retrieval-augmented generation methods. Our benchmarking effort reveals that state-of-the-art LLMs face significant challenges in following users’ preference during conversations. In particular, in zero-shot settings, preference following accuracy falls below 10% at merely 10 turns (~3k tokens) across most evaluated models. Even with advanced prompting and retrieval methods, preference following still deteriorates in long-context conversations. We also find that multiple stated preferences within a conversation improve adherence and models are not affected by conflicting preferences. Furthermore, we show that fine-tuning on PrefEval significantly improves performance. We believe PrefEval serves as a valuable resource for measuring, understanding, and enhancing LLMs’ proactive preference following abilities, paving the way for personalized conversational agents.

2798Expanding Expressivity in Transformer Models with MöbiusAttention

[openreview] [pdf]

Abstract Attention mechanisms and Transformer architectures have revolutionized Natural Language Processing (NLP) by enabling exceptional modeling of long-range dependencies and capturing intricate linguistic patterns. However, their inherent reliance on linear operations in the form of matrix multiplications limits their ability to fully capture inter-token relationships on their own. We propose MöbiusAttention, a novel approach that integrates Möbius transformations within the attention mechanism of Transformer-based models. Möbius transformations are non-linear operations in spaces over complex numbers with the ability to map between various geometries. By incorporating these properties, MöbiusAttention empowers models to learn more intricate geometric relationships between tokens and capture a wider range of information through complex-valued weight vectors. We build and pre-train a BERT and a RoFormer version enhanced with MöbiusAttention, which we then finetune on the GLUE benchmark. We evaluate empirically our approach against the baseline BERT and RoFormer models on a range of downstream tasks. Our approach compares favorably against the baseline models, even with smaller number of parameters suggesting the enhanced expressivity of MöbiusAttention. This research paves the way for exploring the potential of Möbius transformations in the complex projective space to enhance the expressivity and performance of foundation models.

2799Decoupled SGDA for Games with Intermittent Strategy Communication

[openreview] [pdf]

Abstract We focus on reducing communication overhead in multiplayer games, where frequently exchanging strategies between players is not feasible and players have noisy or outdated strategies of the other players. We propose \textit{Decoupled SGDA}, an extension of Stochastic Gradient Descent Ascent (SGDA), where players perform independent updates using outdated strategies of opponents, with periodic strategy synchronization. For Strongly-Convex-Strongly-Concave (SCSC) games, we demonstrate that Decoupled SGDA achieves near-optimal communication complexity comparable to the best-known GDA rates. For \emph{weakly coupled} games where the interaction between players is lower relative to non-interactive part of the game, Decoupled SGDA significantly reduces communication costs compared to standard SGDA. Our findings extend to multi-player games. To provide insights into the effect of communication frequency and convergence, we extensively study the convergence of Decoupled SGDA for quadratic minimax problems. Lastly, in settings where the noise over the players is imbalanced, Decoupled SGDA significantly outperforms federated minimax methods.

2800Generalized Probabilistic Attention Mechanism in Transformers

[openreview] [pdf]

Abstract The Transformer architecture has become widely adopted due to its demonstrated success, attributed to the attention mechanism at its core. Despite these successes, the attention mechanism of Transformers is associated with two well-known issues: rank-collapse and gradient vanishing. In this paper, we present a theoretical analysis that it is inherently difficult to address both issues simultaneously in the conventional attention mechanism. To handle these issues, we introduce a novel class of attention mechanism, referred to as generalized probabilistic attention mechanism (GPAM), and its dual-attention implementation within the Transformer architecture. Unlike conventional attention mechanisms, GPAM allows for negative attention scores while preserving a fixed total sum. We provide theoretical evidence that the proposed dual-attention GPAM (daGPAM) effectively mitigates both the rank-collapse and gradient vanishing issues which are difficult to resolve simultaneously with the conventional attention mechanisms. Furthermore, we empirically validate this theoretical evidence, demonstrating the superiority of daGPAM compared to other alternative attention mechanisms that were proposed to address the same issues. Additionally, we demonstrate the practical benefits of GPAM in natural language processing tasks, such as language modeling and neural machine translation.

2801Unifying Unsupervised Graph-Level Anomaly Detection and Out-of-Distribution Detection: A Benchmark

[openreview] [pdf]

Abstract To build safe and reliable graph machine learning systems, unsupervised graph-level anomaly detection (GLAD) and unsupervised graph-level out-of-distribution (OOD) detection (GLOD) have received significant attention in recent years. Though these two lines of research share the same objective, they have been studied independently in the community due to distinct evaluation setups, creating a gap that hinders the application and evaluation of methods from one to the other. To bridge the gap, in this work, we present a Unified Benchmark for unsupervised Graph-level OOD and anomaly Detection (UB-GOLD), a comprehensive evaluation framework that unifies GLAD and GLOD under the concept of generalized graph-level OOD detection. Our benchmark encompasses 35 datasets spanning four practical anomaly and OOD detection scenarios, facilitating the comparison of 18 representative GLAD/GLOD methods. We conduct multi-dimensional analyses to explore the effectiveness, generalizability, robustness, and efficiency of existing methods, shedding light on their strengths and limitations. Furthermore, we provide an open-source codebase of UB-GOLD to foster reproducible research and outline potential directions for future investigations based on our insights.

2802Can Transformers Perform PCA ?

[openreview] [pdf]

Abstract Transformers demonstrate significant advantage as the building block of Large Language Models. Recent efforts are devoted to understanding the learning capacities of transformers at a fundamental level. This work attempts to understand the intrinsic capacity of transformers in performing dimension reduction from complex data. Theoretically, our results rigorously show that transformers can perform Principle Component Analysis (PCA) similar to the Power Method, given a supervised pre-training phase. Moreover, we show the generalization error of transformers decays by n1/5n^{-1/5} in L2L_2. Empirically, our extensive experiments on the simulated and real world high dimensional datasets justify that a pre-trained transformer can successfully perform PCA by simultaneously estimating the first kk eigenvectors and eigenvalues. These findings demonstrate that transformers can efficiently extract low dimensional patterns from high dimensional data, shedding light on the potential benefits of using pre-trained LLM to perform inference on high dimensional data.

2803ContraSim: Contrastive Similarity Space Learning for Financial Market Predictions

[openreview] [pdf]

Abstract We introduce the Contrastive Similarity Space (ContraSim) paradigm that is able to form global semantic understanding between how daily financial headlines can affect market movement. ContraSim consists of two steps. 1) Weighted Headline Augmentation: We propose a method of augmenting financial headlines to create new headlines with known semantic distances to the original. 2) Weighted-Self Supervised Contrastive Learning (WSSCL): An extension of classical binary contrastive learning algorithms, WSSCL leverages the known distances between anchor and augmented prompts to generate finely grained embedding space that optimizes for similar news to be clumped together. We measure how well ContraSim is able to learn global financial information by parsing whether or not it inherently groups newslines of homogeneous market movement directions together, using a novel information density metric Info-kNN. We find that incorporating features from ContraSim into financial forecasting tasks has a 7% increase in classification accuracy. Additionally, we highlight that ContraSim can be used to find historic news-days that most resemble pertinent financial headlines of the day to help analysts to make better decisions for predicting market movement.

2804Contextually Guided Transformers via Low-Rank Adaptation

[openreview] [pdf]

Abstract Large Language Models (LLMs) based on Transformers excel at text processing, but their reliance on prompts for specialized behavior introduces computational overhead. We propose a modification to a Transformer architecture that eliminates the need for explicit prompts by learning to encode context into the model’s weights. Our Contextually Guided Transformer (CGT) model maintains a contextual summary at each sequence position, allowing it to update the weights on the fly based on the preceding context. This approach enables the model to self-specialize, effectively creating a tailored model for processing information following a given prefix. We demonstrate the effectiveness of our method on synthetic in-context learning tasks and language modeling benchmarks. Furthermore, we introduce techniques for enhancing the interpretability of the learned contextual representations, drawing connections to Variational Autoencoders and promoting smoother, more consistent context encoding. This work offers a novel direction for efficient and adaptable language modeling by integrating context directly into the model’s architecture.

2805Negative-Prompt-driven Alignment for Generative Language Model

[openreview] [pdf]

Abstract Large language models have achieved remarkable capabilities, but aligning their outputs with human values and preferences remains a significant challenge. Existing alignment methods primarily focus on positive examples while overlooking the importance of negative responses in guiding models away from undesirable behaviors. For instance, the widely-used alignment datasets reveals a scarcity of explicit negative examples that contradict human values, hindering its ability to discourage harmful or biased outputs during training. To address this limitation, we propose NEAT, i.e., NEgative-prompt-driven AlignmenT, to introduce negative prompts to generate undesirable responses alongside positive examples during the optimization process. NEAT explicitly penalizes the model for producing harmful outputs, guiding it not only toward desirable behaviors but also steering it away from generating undesirable, biased responses. This dual feedback mechanism enables better alignment with human preferences, crucial in contexts where avoiding harm is paramount. Starting from a pre-trained language model, NEAT performs online alignment by incorporating a ranking loss derived from an expanded preference dataset containing both positive and negative examples. Extensive experiments validate NEAT’s effectiveness in significantly enhancing language models’ alignment with human values and preferences.

2806HuRi : Humanoid Robots Adaptive Risk-ware Distributional Reinforcement Learning for Robust Control

[openreview] [pdf]

Abstract Due to the high complexity of bipedal locomotion, the locomotion control of humanoid robots requires precise adjustment of the balance system to adapt to the changing environment. This high dependence on balance makes the robot very sensitive to risky environments. Therefore, any slight change in the state of the environment may cause the robot to lose balance, thereby increasing the risk of falling or damage. In the past, few studies have explicitly incorporated risk factors into robot policy training, and have failed to adaptively adjust the risk perception level for different risky environmental states, which will affect the agent’s exploration during training and thus fail to select the optimal action in the risky environment. We propose an adaptive risk-aware control policy(HuRi) based on value distributional reinforcement learning. This algorithm does not require additional modules, but only uses the environmental input and the calculated probability distribution, using IQR to measure the intrinsic uncertainty of the environment and RND to evaluate the parameter uncertainty of the environmental state. Combining these two uncertainties, the risk perception level of the strategy is adjusted by adjusting the scalar risk parameter of the distortion function. With this algorithm, the agent can explore safely and efficiently in the risky environment, adaptively adjust the risk sensitivity level of the agent by controlling different distortion measures of the reward distribution, and select the optimal action in the dynamic risky environment. We conducted simulation and actual deployment on the Zerith robot to verify the robustness of HuRi.

2807Cognitive Insights and Stable Coalition Matching for Fostering Multi-Agent Cooperation

[openreview] [pdf]

Abstract Cognitive abilities, such as Theory of Mind (ToM), play a vital role in facilitating cooperation in human social interactions. However, Large Language Model (LLM) agents with higher ToM abilities do not necessarily exhibit better cooperative behavior compared to those with lower ToM abilities, highlighting the complexity of translating human cognitive processes to artificial intelligent agents. To address this challenge, we propose a novel matching coalition mechanism that leverages the strengths of agents with different ToM levels by explicitly considering belief alignment and specialized abilities when forming coalitions. Our proposed matching algorithm seeks to find stable coalitions that maximize the potential for cooperative behavior and ensure long-term viability. By incorporating cognitive insights into the design of multi-agent systems, our work demonstrates the potential of leveraging ToM to create more sophisticated and human-like coordination strategies that foster cooperation and improve overall system performance.

2808PLS-based approach for Fair Representation Learning

[openreview] [pdf]

Abstract We revisit the problem of fair representation learning by proposing Fair Partial Least Squares (PLS) components. PLS is widely used in statistics to efficiently reduce the dimension of the data by providing representation tailored for the prediction. We propose a novel method to incorporate fairness constraints in the construction of PLS components. This new algorithm provides a feasible way to construct such features both in the linear and the non linear case using kernel embeddings. The efficiency of our method is evaluated on different datasets, and we prove its superiority with respect to standard fair PCA method.

2809Designing a Conditional Prior Distribution for Flow-Based Generative Models

[openreview] [pdf]

Abstract Flow-based generative models have recently shown impressive performance for conditional generation tasks, such as text-to-image generation. However, current methods transform a general noise distribution to a specific mode of the target data distribution. As such, every point in the initial source distribution can be mapped to every point in the target distribution, resulting in a long average path. To this end, in this work, we tap into a non-utilized property of conditional flow-based models: the ability to design a non-trivial prior distribution. Given an input condition, such as a text prompt, we first map it to a point lying in data space, representing an “average” data point of the minimal average distance to all data points of the same conditional mode (e.g., class). We then utilize the flow matching formulation to map samples from a Gaussian centered around this point to the conditional target distribution. Experimentally, our method significantly improves training times and generation quality (FID, KID and CLIP alignment scores) compared to baselines, producing high quality samples using smaller number of sampling steps.

2810Supervised Batch Normalization

[openreview] [pdf]

Abstract Batch Normalization (BN), a widely-used technique in neural networks, enhances generalization and expedites training by normalizing each mini-batch to the same mean and variance. However, its effectiveness diminishes when confronted with diverse data distributions. To address this challenge, we propose Supervised Batch Normalization (SBN), a pioneering approach. We expand normalization beyond traditional single mean and variance parameters, enabling the identification of data modes prior to training. This ensures effective normalization for samples sharing common features. We define contexts as modes, categorizing data with similar characteristics. These contexts are explicitly defined, such as domains in domain adaptation or modalities in multimodal systems, or implicitly defined through clustering algorithms based on data similarity. We illustrate the superiority of our approach over BN and other commonly employed normalization techniques through various experiments on both single and multi-task datasets. Integrating SBN with Vision Transformer results in a remarkable 15.13% accuracy enhancement on CIFAR-100. Additionally, in domain adaptation scenarios, employing AdaMatch demonstrates an impressive 22.25% accuracy improvement on MNIST and SVHN compared to BN.

2811Generalist Policy for k-Server Problem on Graphs using Deep Reinforcement Learning with Action-Value Decomposition

[openreview] [pdf]

Abstract The online kk-server problem on graphs is a fundamental computational problem that can model a wide range of practical problems, such as dispatching ambulances to serve accidents or dispatching taxis to serve ride requests. While most prior work on the kk-server problem focused on online algorithms, reinforcement learning promises policies that require low computational effort during execution, which is critical in time-sensitive applications, such as ambulance dispatch. However, there exists no scalable reinforcement-learning approach for the kk-server problem. To address this gap, we introduce a scalable computational approach for learning generalist policies. Besides scalability, the advantage of generalist policies is transferability: a generalist policy can be applied to an entire class of graphs without the need for retraining, which is crucial for practical applications, e.g., in ambulance dispatch problems where road conditions or demand distributions may change over time. We achieve scalability and transferability by introducing a novel architecture that decomposes the action-value into a global and a local term, estimated from a shared graph-convolution backbone. We evaluate our approach on a variety of graph classes, comparing to well-established baselines, demonstrating the performance and transferability of our generalist policies.

2812Beyond Forecasting: Compositional Time Series Reasoning for End-to-End Task Execution

[openreview] [pdf]

Abstract In recent decades, there have been substantial advances in time series models and benchmarks across various individual tasks, such as time series forecasting, classification, and anomaly detection. Meanwhile, compositional reasoning in time series prevalent in real-world applications (e.g., decision-making and compositional question answering) is in great demand. Unlike simple tasks that primarily focus on predictive accuracy, compositional reasoning emphasizes the synthesis of diverse information from both time series data and various domain knowledge, making it distinct and extremely more challenging. In this paper, we introduce Compositional Time Series Reasoning, a new task of handling intricate multistep reasoning tasks from time series data. Specifically, this new task focuses on various question instances requiring structural and compositional reasoning abilities on time series data, such as decision-making and compositional question answering. As an initial attempt to tackle this novel task, we developed TS Reasoner, a program-aided approach that utilizes large language model (LLM) to decompose a complex task into steps of programs that leverage existing time series models and numerical subroutines. Unlike existing reasoning work which only calls off-the-shelf modules, TS Reasoner allows for the creation of custom modules and provides greater flexibility to incorporate domain knowledge as well as user-specified constraints. We demonstrate the effectiveness of our method through a comprehensive set of experiments. These promising results indicate potential opportunities in the new task of time series reasoning and highlight the need for further research.

2813Streaming Algorithms ForℓpFlows andℓpRegression

[openreview] [pdf]

Abstract We initiate the study of one-pass streaming algorithms for underdetermined p\ell_p linear regression problems of the form

minAx=bxp,,where ARn×d with nd,,\min_{\mathbf A\mathbf x = \mathbf b} \lVert\mathbf x\rVert_p ,, \qquad \text{where } \mathbf A \in \mathbb R^{n \times d} \text{ with } n \ll d ,,

which generalizes basis pursuit (p=1p = 1) and least squares solutions to underdetermined linear systems (p=2p = 2). We study the column-arrival streaming model, in which the columns of A\mathbf A are presented one by one in a stream. When A\mathbf A is the incidence matrix of a graph, this corresponds to an edge insertion graph stream, and the regression problem captures p\ell_p flows which includes transshipment (p=1p = 1), electrical flows (p=2p = 2), and max flow (p=p = \infty) on undirected graphs as special cases. Our goal is to design algorithms which use space much less than the entire stream, which has a length of dd.For the task of estimating the cost of the p\ell_p regression problem for p[2,]p\in[2,\infty], we show a streaming algorithm which constructs a sparse instance supported on O~(ε2n)\tilde O(\varepsilon^{-2}n) columns of A\mathbf A which approximates the cost up to a (1±ε)(1\pm\varepsilon) factor, which corresponds to O~(ε2n2)\tilde O(\varepsilon^{-2}n^2) bits of space in general and an O~(ε2n)\tilde O(\varepsilon^{-2}n) space semi-streaming algorithm for constructing p\ell_p flow sparsifiers on graphs. This extends to p(1,2)p\in(1, 2) with O~(ε2nq/2)\tilde O(\varepsilon^{2}n^{q/2}) columns, where qq is the H"older conjugate exponent of pp. For p=2p = 2, we show that Ω(n2)\Omega(n^2) bits of space are required in general even for outputting a constant factor solution. For p=1p = 1, we show that the cost cannot be estimated even to an o(n)o(\sqrt n) factor in poly(n)\mathrm{poly}(n) space.On the other hand, if we are interested in outputting a solution x\mathbf x, then we show that (1+ε)(1+\varepsilon)-approximations require Ω(d)\Omega(d) space for p>1p > 1, and in general, κ\kappa-approximations require Ω~(d/κ2q)\tilde\Omega(d/\kappa^{2q}) space for p>1p > 1. We complement these lower bounds with the first sublinear space upper bounds for this problem, showing that we can output a κ\kappa-approximation using space only poly(n)O~(d/κq)\mathrm{poly}(n) \cdot \tilde O(d/\kappa^q) for p>1p > 1, as well as a n\sqrt n-approximation using poly(n,logd)\mathrm{poly}(n, \log d) space for p=1p = 1.

2814Refine Knowledge of Large Language Models via Adaptive Contrastive Learning

[openreview] [pdf]

Abstract How to alleviate the hallucinations of Large Language Models (LLMs) has always been the fundamental goal pursued by the LLMs research community. Looking through numerous hallucination-related studies, a mainstream category of methods is to reduce hallucinations by optimizing the knowledge representation of LLMs to change their output. Considering that the core focus of these works is the knowledge acquired by models, and knowledge has long been a central theme in human societal progress, we believe that the process of models refining knowledge can greatly benefit from the way humans learn. In our work, by imitating the human learning process, we design an Adaptive Contrastive Learning strategy. Our method flexibly constructs different positive and negative samples for contrastive learning based on LLMs’ actual mastery of knowledge. This strategy helps LLMs consolidate the correct knowledge they already possess, deepen their understanding of the correct knowledge they have encountered but not fully grasped, forget the incorrect knowledge they previously learned, and honestly acknowledge the knowledge they lack. Extensive experiments and detailed analyses on widely used datasets demonstrate the effectiveness and competitiveness of our method.

2815A Geometric Approach to Personalized Recommendation with Set-Theoretic Constraints Using Box Embeddings

[openreview] [pdf]

Abstract Personalized item recommendation typically suffers from data sparsity, which is most often addressed by learning vector representations of users and items via low-rank matrix factorization. While this effectively densifies the matrix by assuming users and movies can be represented by linearly dependent latent features, it does not capture more complicated interactions. For example, vector representations struggle with set-theoretic relationships, such as negation and intersection, e.g. recommending a movie that is “comedy and action, but not romance”. In this work, we formulate the problem of personalized item recommendation as matrix completion where rows are set-theoretically dependent. To capture this set-theoretic dependence we represent each user and attribute by a hyperrectangle or box (i.e. a Cartesian product of intervals). Box embeddings can intuitively be understood as trainable Venn diagrams, and thus not only inherently represent similarity (via the Jaccard index), but also naturally and faithfully support arbitrary set-theoretic relationships. Queries involving set-theoretic constraints can be efficiently computed directly on the embedding space by performing geometric operations on the representations. We empirically demonstrate the superiority of box embeddings over vector-based neural methods on both simple and complex item recommendation queries by up to 30% overall.

2816Certified Defense on the Fairness of Graph Neural Networks

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have emerged as a prominent graph learning model in various graph-based tasks over the years. Nevertheless, due to the vulnerabilities of GNNs, it has been empirically proved that malicious attackers could easily corrupt the fairness level of their predictions by adding perturbations to the input graph data. In this paper, we take crucial steps to study a novel problem of certifiable defense on the fairness level of GNNs. Specifically, we propose a principled framework named ELEGANT and present a detailed theoretical certification analysis for the fairness of GNNs. ELEGANT takes any GNNs as its backbone, and the fairness level of such a backbone is theoretically impossible to be corrupted under certain perturbation budgets for attackers. Notably, ELEGANT does not have any assumption over the GNN structure or parameters, and does not require re-training the GNNs to realize certification. Hence it can serve as a plug-and-play framework for any optimized GNNs ready to be deployed. We verify the satisfactory effectiveness of ELEGANT in practice through extensive experiments on real-world datasets across different backbones of GNNs, where ELEGANT is also demonstrated to be beneficial for GNN debiasing.

2817Choose Before You Label: Efficient Node and Data Selection in Distributed Learning

[openreview] [pdf]

Abstract We consider one of the most relevant problems of distributed learning, i.e., the selection of the learning nodes to include in the training process as well as the selection of the samples from each of the learning nodes’ local datasets, so as to make learning sustainable. Traditional approaches rely on pursuing a balanced label distribution, which requires label statistics from all datasets, including those not selected for learning. This may be costly and may raise privacy concerns. To cope with this issue, we aim at selecting few and small datasets. To this end, we propose a new metric, called loneliness, which is defined on unlabelled training samples. First, through both a theoretical and an experimental analysis, we show that loneliness is strongly linked with learning performance (i.e., test accuracy). Then, we propose a new node- and data-selection procedure, called Goldilocks, that uses loneliness to make its decisions. Our performance evaluation, including three state-of-the-art datasets and both centralized and federated learning, demonstrates that Goldilocks outperforms approaches based upon a balanced label distribution by providing over 70% accuracy improvement, in spite of using information that is both less sensitive privacy-wise and less onerous to obtain.

2818Convex is back:\Solving Belief MDPs via Convexity-Informed Deep Reinforcement Learning

[openreview] [pdf]

Abstract We present a novel method for Deep Reinforcement Learning (DRL), incorporating the convex property of the value function over the belief space in Partially Observable Markov Decision Processes (POMDPs). We thus introduce hard- and soft-enforced convexity as two different approaches, and compare their performance against standard DRL on two well-known POMDP environments, namely the Tiger and FieldVisionRockSample problems. Our findings show that including the convexity feature can substantially increase the median and/or maximum performance of the agents, especially when testing on out-of-distribution domains.

2819Comparing and Contrasting Deep Learning Weather Prediction Backbones on Navier-Stokes and Atmospheric Dynamics

[openreview] [pdf]

Abstract Remarkable progress in the development of Deep Learning Weather Prediction (DLWP) models positions them to become competitive with traditional numerical weather prediction (NWP) models. Indeed, a wide number of DLWP architectures---based on various backbones, including U-Net, Transformer, Graph Neural Network (GNN), and Fourier Neural Operator (FNO)---have demonstrated their potential at forecasting atmospheric states. However, due to differences in training protocols, forecast horizons, and data choices, it remains unclear which (if any) of these methods and architectures are most suitable for weather forecasting and for future model development. Here, we step back and provide a detailed empirical analysis, under controlled conditions, comparing and contrasting the most prominent DLWP models, along with their backbones. We accomplish this by predicting synthetic two-dimensional incompressible Navier-Stokes and real-world global weather dynamics. In terms of accuracy, memory consumption, and runtime, our results illustrate various tradeoffs. For example, on synthetic data, we observe favorable performance of FNO; and on the real-world WeatherBench dataset, our results demonstrate the suitability of ConvLSTM and SwinTransformer for short-to-mid-ranged forecasts. For long-ranged weather rollouts of up to 365 days, we observe superior stability and physical soundness in architectures that formulate a spherical data representation, i.e., GraphCast and Spherical FNO. In addition, we observe that all of these model backbones ``saturate,‘’ i.e., none of them exhibit so-called neural scaling, which highlights an important direction for future work on these and related models. The code is available at \url{https://anonymous.4open.science/r/dlwp-benchmark-F88C}.

2820Integral Performance Approximation for Continuous-Time Reinforcement Learning Control

[openreview] [pdf]

Abstract We introduce integral performance approximation (IPA), a new continuous-time reinforcement learning (CT-RL) control method. It leverages an affine nonlinear dynamic model, which partially captures the dynamics of the physical environment, alongside state-action trajectory data to enable optimal control with great data efficiency and robust control performance. Utilizing Kleinman algorithm structures allows IPA to provide theoretical guarantees of learning convergence, solution optimality, and closed-loop stability. Furthermore, we demonstrate the effectiveness of IPA on three CT-RL environments including hypersonic vehicle (HSV) control, which has additional challenges caused by unstable and nonminimum phase dynamics. As a result, we demonstrate that the IPA method leads to new, SOTA control design and performance in CT-RL.

2821HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

[openreview] [pdf]

Abstract Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.

2822Diff-In: Data Influence Estimation with Differential Approximation

[openreview] [pdf]

Abstract In this paper, we introduce a new formulation to approximate a sample’s influence by accumulating the differences in influence between consecutive learning steps, which we term Diff-In. Specifically, we formulate the sample-wise influence as the cumulative sum of its changes/differences across successive training iterations. By employing second-order approximations, we approximate these difference terms with high accuracy while eliminating the need for model convexity required by existing methods. Despite being a second-order method, Diff-In maintains computational complexity comparable to that of first-order methods and remains scalable. This efficiency is achieved by computing the product of the Hessian and gradient, which can be efficiently approximated using finite differences of first-order gradients. We assess the approximation accuracy of Diff-In both theoretically and empirically. Our theoretical analysis demonstrates that Diff-In achieves significantly lower approximation error compared to existing influence estimators. Extensive experiments further confirm its superior performance across multiple benchmark datasets in three data-centric tasks: data cleaning, data deletion, and coreset selection. Notably, our experiments on data pruning for large-scale vision-language pre-training show that Diff-In can scale to millions of data points and outperforms strong baselines.

2823Drift2Matrix: Kernel-Induced Self Representation for Concept Drift Adaptation in Co-evolving Time Series

[openreview] [pdf]

Abstract In the realm of time series analysis, tackling the phenomenon of concept drift poses a significant challenge. Concept drift -- characterized by the evolving statistical properties of time series data, affects the reliability and accuracy of conventional analysis models. This is particularly evident in co-evolving scenarios where interactions among variables are crucial. This paper presents Drift2Matrix, a novel framework that leverages kernel-induced self-representation for adaptive responses to concept drift in time series. Drift2Matrix employs a kernel-based learning mechanism to generate a representation matrix, encapsulating the inherent dynamics of co-evolving time series. This matrix serves as a key tool for identification and adaptation to concept drift by observing its temporal variations. Furthermore, Drift2Matrix effectively identifies prevailing patterns and offers insights into emerging trends through pattern evolution analysis. Our empirical evaluation of Drift2Matrix across various datasets demonstrates its effectiveness in handling the complexities of concept drift. This approach introduces a novel perspective in the theoretical domain of co-evolving time series analysis, enhancing adaptability and accuracy in the face of dynamic data environments. Code is available at GitHub.

2824From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step

[openreview] [pdf]

Abstract When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate if models can be taught to internalize these CoT steps. To this end, we propose a simple yet effective method for internalizing CoT steps: starting with a model trained for explicit CoT reasoning, we gradually remove the intermediate steps and finetune the model. This process allows the model to internalize the intermediate reasoning steps, thus simplifying the reasoning process while maintaining high performance. Our approach enables training a GPT-2 Small model to solve 20-by-20 multiplication with 99.5% accuracy while being 26 times faster than explicit CoT, whereas standard training cannot solve beyond 4-by-4 multiplication. Furthermore, our method proves effective on larger language models, such as Mistral 7B, achieving over 50% accuracy on GSM8K without producing any intermediate steps.

2825How to Train Long-Context Language Models (Effectively)

[openreview] [pdf]

Abstract We study the problem of adapting a language model (LM) to make effective use of long-context information. We first establish a reliable evaluation protocol to guide model development---instead of perplexity, we use a broad set of long-context tasks, and we evaluate models after supervised fine-tuning (SFT) with instruction data as this better reveals long-context abilities. Supported by our robust evaluations, we run thorough experiments to decide the data mix for continued pre-training, the instruction tuning dataset, and other design choices such as position extrapolation. We find that (1) code repositories and books are excellent sources of long data, but it is crucial to combine them with high-quality short data; (2) training with a sequence length beyond the evaluation length boosts long-context performance; (3) for SFT, using only short instruction datasets yields strong performance on long-context tasks. Our final model, ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K, outperforming Llama-3.1-8B on the majority of long-context tasks despite having seen 5% as many tokens during long-context training. Additionally, ProLong can effectively process up to 512K tokens, one of the longest context windows of publicly available LMs.

2826Integration Flow Models

[openreview] [pdf]

Abstract Recently, ordinary differential equation (ODE) based generative models have emerged as a cutting-edge method for producing high-quality samples in many applications. Generally, these methods typically involve learning continuous transformation trajectories that map a simple initial distribution (i.e., Gaussian noise) to the target data distribution (i.e., images) by multiple steps of solving different ODE functions in inference to obtain high-quality results. However, the ODE-based methods either suffer the discretization error of numerical solvers of ODE, which restricts the quality of samples when only a few NFEs are used, or struggle with training instability. In this paper, we proposed Integration Flow, which learns the results of ODE-based trajectory paths directly without solving the ODE functions. Moreover, Integration Flow explicitly incorporates the target state x0\mathbf{x}_0 as the anchor state in guiding the reverse-time dynamics and we have theoretically proven this can contribute to both stability and accuracy. To the best of our knowledge, Integration Flow is the first model with the unified structure to estimate ODE-based generative models. Through theoretical analysis and empirical evaluations, we show that Integration Flows achieve improved performance when it is applied to existing ODE-based model, such as diffusion models, Rectified Flows, and PFGM++. Specifically, Integration Flow achieves one-step generation on CIFAR10 with FID of 2.63 for Variance Exploding (VE) diffusion model, 3.4 for Rectified Flow without relflow and 2.96 for PFGM++. By extending the sampling to 1000 steps, we further reduce FID score to 1.71 for VE, setting state-of-the-art performance.

2827Residual Kernel Policy Network: Enhancing Stability and Robustness in RKHS-Based Reinforcement Learning

[openreview] [pdf]

Abstract Achieving optimal performance in reinforcement learning requires robust policies supported by training processes that ensure both sample efficiency and stability. Modeling the policy in reproducing kernel Hilbert space (RKHS) enables efficient exploration of local optimal solutions. However, the stability of existing RKHS-based methods is hindered by significant variance in gradients, while the robustness of the learned policies is often compromised due to the sensitivity of hyperparameters. In this work, we conduct a comprehensive analysis of the significant instability in RKHS policies and reveal that the variance of the policy gradient increases substantially when a wide-bandwidth kernel is employed. To address these challenges, we propose a novel RKHS policy learning method integrated with representation learning to dynamically process observations in complex environments, enhancing the robustness of RKHS policies. Furthermore, inspired by the advantage functions, we introduce a residual layer that further stabilizes the training process by significantly reducing gradient variance in RKHS. Our novel algorithm, the Residual Kernel Policy Network (ResKPN), demonstrates state-of-the-art performance, achieving a 30% improvement in episodic rewards across complex environments.

2828YOLO-RD: Introducing Relevant and Compact Explicit Knowledge to YOLO by Retriever-Dictionary

[openreview] [pdf]

Abstract Identifying and localizing objects within images is a fundamental challenge, and numerous efforts have been made to enhance model accuracy by experimenting with diverse architectures and refining training strategies. Nevertheless, a prevalent limitation in existing models is overemphasizing the current input while ignoring the information from the entire dataset. We introduce an innovative RetrieverDictionary\textbf{R}etriever-\textbf{D}ictionary (RD) module to address this issue. This architecture enables YOLO-based models to efficiently retrieve features from a Dictionary that contains the insight of the dataset, which is built by the knowledge from Visual Models (VM), Large Language Models (LLM), or Visual Language Models (VLM). The flexible RD enables the model to incorporate such explicit knowledge that enhances the ability to benefit multiple tasks, specifically, segmentation, detection, and classification, from pixel to image level. The experiments show that using the RD significantly improves model performance, achieving more than a 3% increase in mean Average Precision for object detection with less than a 1% increase in model parameters. Beyond 1-stage object detection models, the RD module improves the effectiveness of 2-stage models and DETR-based architectures, such as Faster R-CNN and Deformable DETR.

2829Provable Faster Zeroth-order Method for Bilevel Optimization with Optimal Dependency on Error and Dimension

[openreview] [pdf]

Abstract In this paper, we study and analyze zeroth-order stochastic approximation algorithms for solving black-box bilevel optimization problems, where only the upper and lower function values can be obtained. \citep{Saeed2024} proposed the first full zeroth-order bilevel method that utilizes Gaussian smoothing to estimate the first- and second-order partial derivatives of functions with two independent blocks of variables. However, this method suffers from a high dimensional dependency of O((d1+d2)4)\mathcal{O}((d_{1}+d_{2})^{4}), where d1d_{1} and d2d_{2} are the dimensions of the outer and inner problems, respectively. They left an open question: can this dimension dependency be improved? To answer this question, we propose a single-loop accelerated zeroth-order bilevel algorithm, which achieves a dimension dependency of O(d1+d2)\mathcal{O}(d_{1}+d_{2}) by incorporating coordinate-wise smoothing gradient estimators (coord). We develop a new theoretical analysis for the proposed algorithm, which converges to a stationary point of Φ(x)\Phi(x) with a complexity of O((d1+d2)ϵ3)\mathcal{O}((d_{1}+d_{2})\epsilon^{-3}) in expectation settings and O((d1+d2)nϵ2)\mathcal{O}((d_{1}+d_{2})\sqrt{n}\epsilon^{-2}) in finite sum settings. These complexities are both optimal with respect to dimension and error ϵ\epsilon. We also provide experiment to validate the effectiveness of the proposed algorithm.

2830TopInG: Topologically Interpretable Graph Learning via Persistent Rationale Filtration

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have shown remarkable performance in various scientific domains, but their lack of interpretability limits their applicability in critical decision-making processes. Recently, intrinsic interpretable GNNs have been studied to provide insights into model predictions by identifying rationale substructures in graphs. However, existing methods face challenges when the underlying rationale subgraphs are complicated and variable. To address this challenge, we propose TopIng, a novel topological framework to interpretable GNNs that leverages persistent homology to identify persistent rationale subgraphs. Our method introduces a rationale filtration learning technique that models the generating procedure of rationale subgraphs, and enforces the persistence of topological gap between rationale subgraphs and complement random graphs by a novel self-adjusted topological constraint, topological discrepancy. We show that our topological discrepancy is a lower bound of a Wasserstein distance on graph distributions with Gromov-Hausdorff metric. We provide theoretical guarantees showing that our loss is uniquely optimized by the ground truth under certain conditions. Through extensive experiments on varaious synthetic and real datasets, we demonstrate that TopIng effectively addresses key challenges in interpretable GNNs including handling variiform rationale subgraphs, balancing performance with interpretability, and avoiding spurious correlations. Experimental results show that our approach improves state-of-the-art methods up to 20%+ on both predictive accuracy and interpretation quality.

2831On Choice of Loss Functions For Neural Control Barrier Certificates

[openreview] [pdf]

Abstract The design of controllers with correctness guarantees is a primary concern for safety-critical control systems. A Control Barrier Certificate (CBC) is a real-valued function over the state space of the system that provides an inductive proof of the existence of a safe controller. Recently, neural networks have been successfully deployed for data-driven learning of control barrier certificates. These approaches encode the conditions for the existence of a CBC using a rectified linear unit (ReLU) loss function. The resulting encoding, while sound, tends to be conservative, which results in slower training and limits scalability to large, complex systems. Can altering the loss function alleviate some of the problems associated with ReLU loss and lead to faster learning?This paper proposes a novel encoding with a Mean Squared Error (MSE) loss function, which allows for more scalable and efficient training, while addressing some of the theoretical limitations of previous methods. The proposed approach derives a validity condition based on Lipschitz continuity to formally characterize safety guarantees, eliminating the need for a post-hoc verification. The effectiveness of the proposed loss functions is demonstrated through six case studies curated from the existing state of the art. Our results provide a compelling argument for exploring alternative loss function choices as a novel approach to optimizing the design of control barrier certificates.

2832Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts

[openreview] [pdf]

Abstract In the field of large language models (LLMs), aligning models with the diverse preferences of users is a critical challenge. Direct Preference Optimization (DPO) has played a key role in this area. It works by using pairs of preferences derived from the same prompts, and it functions without needing an additional reward model. However, DPO does not fully reflect the complex nature of human learning, which often involves understanding contrasting responses to not only identical but also similar questions. To overcome this shortfall, we propose Relative Preference Optimization (RPO). RPO is designed to discern between more and less preferred responses derived from both identical and related prompts. It introduces a contrastive weighting mechanism, enabling the tuning of LLMs using a broader range of preference data, including both paired and unpaired sets. This approach expands the learning capabilities of the model, allowing it to leverage insights from a more varied set of prompts. Experiments in both paired and unpaired dataset settings, including tasks like dialogue, summarization, and general evaluation benchmarks, demonstrate RPO’s superior ability to align LLMs with user preferences and enhance adaptability during training.

2833Active partitioning: inverting the paradigm of active learning

[openreview] [pdf]

Abstract Datasets often incorporate various functional patterns related to different aspects or regimes, which are typically not equally present throughout the dataset. We propose a novel, general-purpose partitioning algorithm that utilizes competition between models to detect and separate these functional patterns. This competition is induced by multiple models iteratively submitting their predictions for the dataset, with the best prediction for each data point being rewarded with training on that data point. This reward mechanism amplifies each model’s strengths and encourages specialization in different patterns. The specializations can then be translated into a partitioning scheme. The amplification of each model’s strengths inverts the active learning paradigm: while active learning typically focuses the training of models on their weaknesses to minimize the number of required training data points, our concept reinforces the strengths of each model, thus specializing them. We validate our concept -- called active partitioning -- with various datasets with clearly distinct functional patterns, such as mechanical stress and strain data in a porous structure. The active partitioning algorithm produces valuable insights into the datasets’ structure, which can serve various further applications. As a demonstration of one exemplary usage, we set up modular models consisting of multiple expert models, each learning a single partition, and compare their performance on more than twenty popular regression problems with single models learning all partitions simultaneously. Our results show significant improvements, with up to 54% loss reduction, confirming our partitioning algorithm’s utility.

2834KrwEmd: Revising the Imperfect Recall Abstraction from Forgetting Everything

[openreview] [pdf]

Abstract Excessive abstraction is a serious issue in solving games with ordered signals—a subset of imperfect information games, caused by extreme implementations of imperfect recall, which discard all historical information and, as a result, negatively impact AI performance. This paper presents KrwEmd, the first practical algorithm designed to address this issue. We first introduce the k-recall winrate feature, which not only qualitatively distinguishes signal infosets by leveraging future and, more importantly, historical game information, but also quantitatively reflects their similarity. We then build on this by developing the KrwEmd algorithm, which cluster signal infosets using Earth Mover’s Distance to assess discrepancies between their features. Experimental results demonstrate that KrwEmd significantly enhances AI gameplay performance compared to existing algorithms.

2835Information Subtraction: Learning Representations for Conditional Entropy

[openreview] [pdf]

Abstract The representations of conditional entropy and conditional mutual information are significant in explaining the unique effects among variables. The previous works based on conditional contrastive sampling have successfully eliminated information about discrete sensitive variables, but have not yet addressed continuous cases. This paper introduces a framework of Information Subtraction capable of representing arbitrary information components between continuous variables. We implement a generative-based architecture that outputs such representations by simultaneously maximizing an information term and minimizing another. The results highlight the representations’ ability to provide semantic features of conditional entropy. By subtracting sensitive and domain-specific information, our framework effectively enhances fair learning and domain generalization.

2836Online Continual Graph Learning

[openreview] [pdf]

Abstract The aim of Continual Learning (CL) is to learn new tasks incrementally while avoiding catastrophic forgetting. Online Continual Learning (OCL) specifically focuses on learning efficiently from a continuous stream of data with shifting distribution. While recent studies explore Continual Learning on graphs exploiting Graph Neural Networks (GNNs), only few of them focus on a streaming setting. Many real-world graphs evolve over time and timely (online) predictions could be required. However, current approaches are not well aligned with the standard OCL literature, partly due to the lack of a clear definition of online continual learning on graphs. In this work, we propose a general formulation for online continual learning on graphs, emphasizing the efficiency of batch processing while accounting for graph topology, providing a grounded setting to analyze different methods. We present a set of benchmark datasets for online continual graph learning, together with the results of several methods in CL literature, adapted to our setting. Additionally, we address the challenge of GNN memory usage, as considering multiple hops of neighborhood aggregation can require access to the entire growing graph, resulting in prohibitive costs for the setting. We thus propose solutions to maintain bounded complexity for efficient online learning.

2837Rethinking Visual Counterfactual Explanations Through Region Constraint

[openreview] [pdf]

Abstract Visual counterfactual explanations (VCEs) have recently gained immense popularity as a tool for clarifying the decision-making process of image classifiers. This trend is largely motivated by what these explanations promise to deliver -- indicate semantically meaningful factors that change the classifier’s decision. However, we argue that current state-of-the-art approaches lack a crucial component -- the region constraint -- whose absence prevents from drawing explicit conclusions, and may even lead to faulty reasoning due to phenomenons like confirmation bias. To address the issue of previous methods, which modify images in a very entangled and widely dispersed manner, we propose region-constrained VCEs (RVCEs), which assume that only a predefined image region can be modified to influence the model’s prediction. To effectively sample from this subclass of VCEs, we propose Region-Constrained Counterfactual Schrödinger Bridge (RCSB), an adaptation of a tractable subclass of Schrödinger Bridges to the problem of conditional inpainting, where the conditioning signal originates from the classifier of interest. In addition to setting a new state-of-the-art by a large margin, we extend RCSB to allow for exact counterfactual reasoning, where the predefined region contains only the factor of interest, and incorporating the user to actively interact with the RVCE by predefining the regions manually.

2838Bridging General and Personalized Federated Learning through Selective Model Integration

[openreview] [pdf]

Abstract Personalized federated learning (PFL) achieves high performance by assuming clients only meet test data locally, which does not meet many generic federated learning (GFL) scenarios. In this work, we theoretically show that PMs can be used to enhance GFL with a new learning problem named Selective FL (SFL), which involves optimizing PFL and model selection. However, storing and selecting whole models requires impractical computation and communication costs. To practically solve SFL, inspired by model components that attempt to edit a sub-model for specific purposes, we design an efficient and effective framework named Hot-Pluggable Federated Learning (HPFL). Specifically, clients individually train personalized plug-in modules based on a shared backbone, and upload them with a plug-in marker on the server modular store. In inference stage, an accurate selection algorithm allows clients to identify and retrieve suitable plug-in modules from the modular store to enhance their generalization performance on the target data distribution. Furthermore, we provide differential privacy protection during the selection with theoretical guarantee. Our comprehensive experiments and ablation studies demonstrate that HPFL significantly outperforms state-of-the-art GFL and PFL algorithms. Additionally, we empirically show HPFL’s remarkable potential to resolve other practical FL problems such as continual federated learning and discuss its possible applications in one-shot FL, anarchic FL, and FL plug-in market. Our work is the first attempt towards improving GFL performance through a selecting mechanism with personalized plug-ins.

2839Late Chunking: Contextual Chunk Embeddings Using Long-Context Embedding Models

[openreview] [pdf]

Abstract Many use cases require retrieving smaller portions of text, and dense vector-based retrieval systems often perform better with shorter text segments, as the semantics are less likely to be “over-compressed” in the embeddings. Consequently, practitioners often split text documents into smaller chunks and encode them separately. However, chunk embeddings created in this way can lose contextual information from surrounding chunks, resulting in sub-optimal representations. In this paper, we introduce a novel method called "late chunking, which leverages long context embedding models to first embed all tokens of the long text, with chunking applied after the transformer model and just before mean pooling - hence the term “late” in its naming. The resulting chunk embeddings capture the full contextual information, leading to superior results across various retrieval tasks. The method is generic enough to be applied to a wide range of long-context embedding models and works without additional training. To further increase the effectiveness of late chunking, we propose a dedicated fine-tuning approach for embedding models.

2840Optimal Causal Representations and the Causal Information Bottleneck

[openreview] [pdf]

Abstract To effectively study complex causal systems, it is often useful to construct representations that simplify parts of the system by discarding irrelevant details while preserving key features. The Information Bottleneck (IB) method is a widely used approach in representation learning that compresses random variables while retaining information about a target variable. Traditional methods like IB are purely statistical and ignore underlying causal structures, making them ill-suited for causal tasks. We propose the Causal Information Bottleneck (CIB), a causal extension of the IB, which compresses a set of chosen variables while maintaining causal control over a target variable. This method produces representations which are causally interpretable, and which can be used when reasoning about interventions. We present experimental results demonstrating that the learned representations accurately capture causality as intended.

2841The GECo algorithm for Graph Neural Networks Explanation

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are powerful models that manage complex data sources and their interconnection links. One of GNNs’ main drawbacks is their lack of interpretability, which limits their applicability in sensitive cases. In this paper, we introduce a new methodology involving graph communities to address the interpretability of graph classification problems. The proposal, called GECo (Graph Explanation by COmmunities), exploits the idea that a community, i.e., a subset of graph nodes densely connected, should play a crucial role in graph classification. This assumption is reasonable considering the message-passing mechanism, the core of GNNs. GECo analyzes the contribution to the classification result of the community graphs, building a mask that highlights graph-relevant structures. It first uses the trained GNN one wants to explain to classify the entire graph. Then, it detects the different communities; for each community, a smaller subgraph, including the community nodes’ is created, and the trained GNN is run to see how likely the subgraph alone supports the predicted class. After evaluating all the subgraph communities, an average probability is calculated and set as a threshold. Finally, any subgraph community with a probability value higher than the threshold is assessed as necessary for the model’s decision. The collection of these key communities is the basis for the final explanation since they allow the highlighting of the most relevant parts of the graph leading to the classification. GECo has been tested on GNN employing Graph Convolutional Networks layers, using six artificial and four real-world graph datasets. The six synthetic datasets were generated by adding some artificial motifs (e.g., house, cycle, etc.) to Erdos-Renyi and Barabasi-Albert graphs. The real-world datasets contain molecule structures. Both categories of datasets are adopted in the experimental part of the state-of-the-art proposals for graph explainability. GECo has been compared with a random baseline explainer and four state-of-the-art approaches: PGExplainer, PGMExplainer, GNNExplainer, and SubgraphX. We chose these methods for their different strengths, specifically PGExplainer for its efficiency and generalization capability through a learned explanation model, PGMExplainer for its probabilistic approach based on causal graphs, GNNExplainer for its detailed subgraph and feature-level explanations, and SubgraphX for its theoretically grounded subgraph selection by Shapley values. These choices ensure a comprehensive evaluation of our approach against a wide range of robust techniques. We assessed GECo’s performance using four evaluation criteria that leverage predicted and ground-truth explanations and use user-controlled parameters, such as the probability distribution obtained by the GNN. The results obtained by GECo consistently outperform state-of-the-art techniques across multiple metrics for synthetic and most real-world datasets. In addition, GECo is significantly faster than its competitors in terms of computational efficiency, making it an ideal solution for large-scale data analysis and practical applications. These strengths solidify GECo’s role in generating accurate, efficient, and interpretable explanations in graph-based classification tasks.

2842ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models

[openreview] [pdf]

Abstract Recent advancements in Large Language Models (LLMs) have spurred numerous attempts to apply these technologies to embodied tasks, particularly focusing on high-level task planning and task decomposition. To further explore this area, we introduce a new embodied task planning benchmark, ET-Plan-Bench, which specifically targets embodied task planning using LLMs. It features a controllable and diverse set of embodied tasks varying in different levels of difficulties and complexities, and is designed to evaluate two critical dimensions of LLMs’ application in embodied task understanding: spatial (relation constraint, occlusion for target objects) and temporal & causal understanding of the sequence of actions in the environment. By using multi-source simulators as the backend simulator, it can provide immediate environment feedback to LLMs, which enables LLMs to interact dynamically with the environment and re-plan as necessary. We evaluated the state-of-the-art open source and closed source foundation models, including GPT-4, LLAMA and Mistral on our proposed benchmark. While they perform adequately well on simple navigation tasks, their performance can significantly deteriorate when faced with tasks that require a deeper understanding of spatial, temporal, and causal relationships. Thus, our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework that presents a significant challenge to the latest foundation models. We hope it can spark and drive further research in embodied task planning using foundation models.

2843Evaluating the Instruction-following Abilities of Language Models using Knowledge Tasks

[openreview] [pdf]

Abstract In this work, we focus our attention on developing a benchmark for instruction-following where it is easy to verify both task performance as well as instruction-following capabilities. We adapt existing knowledge benchmarks and augment them with instructions that are a) conditional on correctly answering the knowledge task or b) use the space of candidate options in multiple-choice knowledge-answering tasks. This allows us to study model characteristics, such as their change in performance on the knowledge tasks in the presence of answer-modifying instructions and distractor instructions. In contrast to existing benchmarks for instruction following, we not only measure instruction-following capabilities but also use LLM-free methods to study task performance. We study a series of openly available large language models of varying parameter sizes (1B-405B) and closed source models namely GPT-4o-mini, GPT-4o. We find that even large-scale instruction-tuned LLMs fail to follow simple instructions in zero-shot settings. We release our dataset, the benchmark, code, and results for future work.

2844Distributed Gradient Descent with Many Local Steps in Overparameterized Models

[openreview] [pdf]

Abstract In distributed training of machine learning models, gradient descent with local iterative steps is a very popular method, variants of which are commonly known as Local-SGD or the Federated Averaging (FedAvg). In this method, gradient steps based on local datasets are taken independently in distributed compute nodes to update the local models, which are then aggregated intermittently. Although the existing convergence analysis suggests that with heterogeneous data, FedAvg encounters quick performance degradation as the number of local steps increases, it is shown to work quite well in practice. In this work we try to explain this good performance from a viewpoint of implicit bias in Local Gradient Descent (Local-GD) with a large number of local steps. In overparameterized regime, the gradient descent at each compute node would lead the model to a specific direction locally. We characterize the dynamics of the aggregated global model and compare it to the centralized model trained with all of the data in one place. In particular, we analyze the implicit bias of gradient descent on linear models, for both regression and classification tasks. Our analysis shows that the aggregated global model converges exactly to the centralized model for regression tasks, and converges (in direction) to the same feasible set as centralized model for classification tasks. We further propose a Modified Local-GD with a refined aggregation and theoretically show it converges to the centralized model in direction for linear classification. We empirically verified our theoretical findings in linear models and also conducted experiments on distributed fine-tuning of pretrained neural networks to further apply our theory.

2845Agent-Oriented Planning in Multi-Agent Systems

[openreview] [pdf]

Abstract Through the collaboration of multiple agents possessing diverse expertise and tools, multi-agent systems achieve impressive progress in solving real-world problems. Given the user queries, the meta-agents, serving as the brain within these systems, are required to decompose the queries into multiple sub-tasks that can be allocated to suitable agents capable of solving them, so-called agent-oriented planning. In this study, we identify three critical design principles of agent-oriented planning, including solvability, completeness, and non-redundancy, to ensure that each sub-task is effectively resolved, leading to satisfactory responses to the original queries. These principles further inspire us to propose a novel framework for agent-oriented planning in multi-agent systems, leveraging a fast task decomposition and allocation process followed by an effective and efficient evaluation via a reward model. During the planning process, the meta-agent is also responsible for evaluating the performance of the expert agents, making timely adjustments to the sub-tasks and scheduling as necessary. Besides, we integrate a feedback loop into the proposed framework to further enhance the effectiveness and robustness of such a problem-solving process. Extensive experiments demonstrate the advancement of the proposed framework in solving real-world problems compared to both single-agent systems and existing planning strategies for multi-agent systems.

2846Action Sequence Planner: An Alternative For Offline Reinforcement Learning

[openreview] [pdf]

Abstract Offline reinforcement learning methods, which typically train agents that make decisions step by step, are known to suffer from instability due to bootstrapping and function approximation, especially when applied to tasks requiring long-horizon planning. To alleviate these issues, in this paper, we propose a novel policy gradient approach by planning an action sequence in a high-dimensional space.This design implicitly models temporal dependencies, excelling in long-horizon and horizon-critical tasks. Furthermore, we discover that replacing maximum likelihood with cross-entropy loss in policy gradient methods significantly stabilizes training gradients, leading to substantial performance improvements in long-horizon tasks. The proposed neural network-based solution features a simple architecture that not only facilitates ease of training and convergence but also demonstrates high efficiency and effective performance. Extensive experimental results reveal that our method exhibits strong performance across a variety of tasks.

2847MTMC: Generalized Category Discovery via Maximum Token Manifold Capacity

[openreview] [pdf]

Abstract Identifying previously unseen data is crucial for enhancing the robustness of deep learning models in the open world. Generalized category discovery (GCD) is a representative problem that requires clustering unlabeled data that includes known and novel categories. Current GCD methods mostly focus on minimizing intra-cluster variations, often at the cost of manifold capacity, thus limiting the richness of within-class representations. In this paper, we introduce a novel GCD approach that emphasizes maximizing the token manifold capacity (MTMC) within class tokens, thereby preserving the diversity and complexity of the data’s intrinsic structure. Specifically, MTMC’s efficacy is fundamentally rooted in its ability to leverage the nuclear norm of the singular values as a quantitative measure of the manifold capacity. MTMC enforces a richer and more informative representation within the manifolds of different patches constituting the same sample. MTMC ensures that, for each cluster, the representations of different patches of the same sample are compact and lie in a low-dimensional space, thereby enhancing discriminability. By doing so, the model could capture each class’s nuanced semantic details and prevent the loss of critical information during the clustering process. MTMC promotes a comprehensive, non-collapsed representation that improves inter-class separability without adding excessive complexity.

2848vTune: Verifiable Fine-Tuning for LLMs Through Backdooring

[openreview] [pdf]

Abstract As fine-tuning large language models (LLMs) becomes increasingly prevalent, users often rely on third-party services with limited visibility into their fine-tuning processes. This lack of transparency raises the question:how do consumers verify that fine-tuning services are performed correctly? For instance, a service provider could claim to fine-tune a model for each user, yet simply send all users back the same base model. To address this issue, we propose vTune, a simple method that uses a small number of \textit{backdoor} data points added to the training data to provide a statistical test for verifying that a provider fine-tuned a custom model on a particular user’s dataset. Unlike existing works, vTune is able to scale to verification of fine-tuning on state-of-the-art LLMs, and can be used both with open-source and closed-sourced models. We test our approach across several model families and sizes as well as across multiple instruction-tuning datasets, and find that the statistical test is satisfied with p-values on the order of 10e40\sim 10e^{-40}, with no negative impact on downstream task performance. Further, we explore several attacks that attempt to subvert vTune and demonstrate the method’s robustness to these attacks.

2849Harmonic Machine Learning Models are Robust

[openreview] [pdf]

Abstract We introduce Harmonic Robustness, a powerful and intuitive method to test the robustness of any machine-learning model either during training or in black-box real-time inference monitoring without ground-truth labels. It is based on functional deviation from the harmonic mean-value property, indicating instability and lack of explainability. We show implementation examples in low-dimensional trees and feedforward NNs, where the method reliably identifies overfitting, as well as in more complex high-dimensional models such as ResNet-50 and Vision Transformer where it efficiently measures adversarial vulnerability across image classes.

2850Dynamic Mixture-of-Experts for Incremental Graph Learning

[openreview] [pdf]

Abstract Graph incremental learning is a learning paradigm that aims to adapt models trained on previous data to continuously incremented data or tasks over time without the need for retraining on the full dataset. However, regular graph machine learning methods suffer from catastrophic forgetting when applied to incremental learning settings, where previously learned knowledge is overridden by new knowledge. Previous approaches have tried to address this by treating the previously trained model as an inseparable unit and using regularization, experience replay, and parameter isolation to maintain old behaviors while learning new knowledge. These approaches, however, do not account for the fact that not all previously acquired knowledge is equally beneficial for learning new tasks, and maintaining all previous knowledge and the latest knowledge in a single model is ineffective. Some prior patterns can be transferred to help learn new data, while others may deviate from the new data distribution and be detrimental. To address this, we propose a dynamic mixture-of-experts (DyMoE) approach for incremental learning. Specifically, a DyMoE GNN layer adds new expert networks specialized in modeling the incoming data blocks. We design a customized regularization loss that utilizes data sequence information so existing experts can maintain their ability to solve old tasks while helping the new expert learn the new data effectively. As the number of data blocks grows over time, the computational cost of the full mixture-of-experts (MoE) model increases. To address this, we introduce a sparse MoE approach, where only the top-kk most relevant experts make predictions, significantly reducing the computation time. Our model achieved 5.47% relative accuracy increase compared to the best baselines on class incremental learning with minimal computation increase, showing the model’s exceptional power.

2851Anomaly Detection by Estimating Gradients of the Tabular Data Distribution

[openreview] [pdf]

Abstract Detecting anomalies in tabular data from various domains has become increasingly important in deep learning research. Simultaneously, the development of generative models has advanced, offering powerful mechanisms for detecting anomalies by modeling normal data. In this paper, we propose a novel method for anomaly detection using a noise conditional score network (NSCN). NSCNs, which can learn the gradients of log probability density functions over many noise-perturbed data distributions, are known for their diverse sampling even in low-density regions of the training data. This effect can also be utilized, and thus, the NSCN can be used directly as an anomaly indicator with an anomaly score derived from a simplified loss function. This effect will be analyzed in detail. Our method is trained on normal behavior data, enabling it to differentiate between normal and anomalous behaviors in test scenarios. To evaluate our approach extensively, we created the world’s largest benchmark for anomaly detection in tabular data with 49 baseline methods consisting of the ADBench benchmark and several more datasets from the literature. Overall, our approach shows state-of-the-art performance across the benchmark.

2852Mixed Hierarchical Oracle and Multi-Agent Benchmark in Two-player Zero-sum Games

[openreview] [pdf]

Abstract Self-play methods have achieved remarkable success in two-player zero-sum games, attaining superhuman performance in many complex game domains. Parallelizing learners is a feasible approach to handling large-scale games. However, parallelizing learners often leads to suboptimal exploitation of computational resources, resulting in inefficiencies. In this study, we introduce the Mixed Hierarchical Oracle (MHO), designed to enhance computational efficiency and performance in large-scale two-player zero-sum games. MHO enables the parallelization of reinforcement learning tasks through a hierarchical pipeline that balances exploration and exploitation across oracle levels. It also avoids cold-start issues by using a “model soup” initialization strategy. Additionally, we present MiniStar, an open-source environment focused on small-scale combat scenarios, developed to facilitate research in self-play algorithms. Through extensive experiments on matrix games and the MiniStar environment, we demonstrate that MHO outperforms existing methods in terms of computational efficiency and performance.

2853AnyView: Few Shot Personalized View Transfer

[openreview] [pdf]

Abstract Fine-tuning generative models for concept driven personalization have witnessed tremendous growth ever since the arrival of methods like DreamBooth, Textual Inversion etc. Particularly, such techniques have been thoroughly explored for style-driven generation. Recently, diffusion models have also demonstrated impressive capabilities in view synthesis tasks, setting the foundation for exploring view-driven generation approaches. Motivated by these advancements, we investigate the capacity of a pretrained stable diffusion model to grasp ``what constitutes a view" without relying on explicit 3D priors. Specifically, we base our method on a personalized text to image model, Dreambooth, given its strong ability to adapt to specific novel objects with a few shots. Our research reveals two interesting findings. First, we observe that Dreambooth can learn the high level concept of a view, compared to arguably more complex strategies which involve fine-tuning diffusions on large amounts of multi-view data. Second, we establish that the concept of a view can be disentangled and transferred to a novel object irrespective of the original object’s identity from which the views are learnt. Motivated by this, we introduce a learning strategy, AnyView, which inherits a specific view through only one image sample of a single scene, and transfers the knowledge to a novel object, learnt from a few shots, using low rank adapters. Through extensive experiments we demonstrate that our method, albeit simple, is efficient in generating reliable view samples for in the wild images. Code and models will be released.

2854On Provable Length and Compositional Generalization

[openreview] [pdf]

Abstract Out-of-distribution generalization capabilities of sequence-to-sequence models can be studied from the lens of two crucial forms of generalization: length generalization -- the ability to generalize to longer sequences than ones seen during training, and compositional generalization: the ability to generalize to token combinations not seen during training. In this work, we provide first provable guarantees on length and compositional generalization for common sequence-to-sequence models -- deep sets, transformers, state space models, and recurrent neural nets -- trained to minimize the prediction error. Taking a first principles perspective, we study the realizable case, i.e., the labeling function is realizable on the architecture. We show that simple limited capacity versions of these different architectures achieve both length and compositional generalization. Across different architectures, we also find that a linear relationship between the learned representation and the representation in the labeling function is necessary for length and compositional generalization.

2855Inference of Evolving Mental States from Irregular Action Events to Understand Human Behaviors

[openreview] [pdf]

Abstract Inference of latent human mental processes, such as belief, intention, or desire, is crucial for developing AI with human-like intelligence, enabling more effective and timely collaboration. In this paper, we introduce a versatile encoder-decoder model designed to infer evolving mental processes based on irregularly observed action events and predict future occurrences. The primary challenges arise from two factors: both actions and mental processes are irregular events, and the observed action data is often limited. To address the irregularity of these events, we leverage a temporal point process model within the encoder-decoder framework, effectively capturing the dynamics of both action and mental events. Additionally, we implement a backtracking mechanism in the decoder to enhance the accuracy of predicting future actions and evolving mental states. To tackle the issue of limited data, our model incorporates logic rules as priors, enabling accurate inferences from just a few observed samples. These logic rules can be refined and updated as needed, providing flexibility to the model. Overall, our approach enhances the understanding of human behavior by predicting when actions will occur and how mental processes evolve. Experiments on both synthetic and real-world datasets demonstrate the strong performance of our model in inferring mental states and predicting future actions, contributing to the development of more human-centric AI systems.

2856Content-Style Learning from Unaligned Domains: Identifiability under Unknown Latent Dimensions

[openreview] [pdf]

Abstract Understanding identifiability of latent content and style variables from unaligned multi-domain data is essential for tasks such as domain translation and data generation. Existing works on content-style identification were often developed under somewhat stringent conditions, e.g., that all latent components are mutually independent and that the dimensions of the content and style variables are known. We introduce a new analytical framework via cross-domainlatent distribution matching(LDM), which establishes content-style identifiability under substantially more relaxed conditions. Specifically, we show that restrictive assumptions such as component-wise independence of the latent variables can be removed. Most notably, we prove that prior knowledge of the content and style dimensions is not necessary for ensuring identifiability, if sparsity constraints are properly imposed onto the learned latent representations. Bypassing the knowledge of the exact latent dimension has been a longstanding aspiration in unsupervised representation learning---our analysis is the first to underpin its theoretical and practical viability. On the implementation side, we recast the LDM formulation into a regularized multi-domain GAN loss with coupled latent variables. We show that the reformulation is equivalent to LDM under mild conditions---yet requiring considerably less computational resource. Experiments corroborate with our theoretical claims.

2857Modification-Considering Value Learning for Reward Hacking Mitigation in RL

[openreview] [pdf]

Abstract Reinforcement learning (RL) agents can exploit unintended strategies to achieve high rewards without fulfilling the desired objectives, a phenomenon known as reward hacking. In this work, we examine reward hacking through the lens of General Utility RL, which generalizes RL by considering utility functions over entire trajectories rather than state-based rewards. From this perspective, many instances of reward hacking can be seen as inconsistencies between current and updated utility functions, where the behavior optimized for an updated utility function is poorly evaluated by the original one. Our main contribution is Modification-Considering Value Learning (MC-VL), a novel algorithm designed to address this inconsistency during learning. Starting with a coarse yet value-aligned initial utility function, the MC-VL agent iteratively refines this function based on past observations while considering the potential consequences of updates. This approach enables the agent to anticipate and reject modifications that may lead to undesired behavior. To empirically validate our approach, we implement an MC-VL agent based on the Double Deep Q-Network (DDQN) and demonstrate its effectiveness in preventing reward hacking across various grid-world tasks, including benchmarks from the AI Safety Gridworlds suite.

2858Strengthening Federated Learning: Surrogate Data-Guided Aggregation for Robust Backdoor Defense

[openreview] [pdf]

Abstract Backdoor attacks in federated learning (FL) have garnered significant attention due to their destructive potential. Current advanced backdoor defense strategies typically involve calculating predefined metrics related to local models and modifying the server’s aggregation rule accordingly. However, these metrics may exhibit biases due to the inclusion of malicious models in the calculation, leading to defense failures. To address this issue, we propose a novel backdoor defense method in FL named Su\textit{Su}rrogate D\textit{D}ata-guided A\textit{A}ggregation (SuDA). SuDA independently evaluates local models using surrogate data, thereby mitigating the influence of malicious models. Specifically, it constructs a surrogate dataset composed of pure noise, which is shared between the server and clients. By leveraging this shared surrogate data, clients train their models using both the shared and local data, while the server reconstructs potential triggers for each local model to identify backdoors, facilitating the filtering of backdoored models before aggregation. To ensure the generalizability of local models across both local and surrogate data, SuDA aligns local data with surrogate data in the representation space, supported by theoretical analysis. Comprehensive experiments demonstrate the substantial superiority of SuDA over previous works.

2859SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF

[openreview] [pdf]

Abstract In Large Language Model (LLM) development, Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning models with human values and preferences. RLHF traditionally relies on the Kullback-Leibler (KL) divergence between the current policy and a frozen initial policy as a reference, which is added as a penalty in policy optimization algorithms like Proximal Policy Optimization (PPO). While this constraint prevents models from deviating too far from the initial checkpoint, it limits exploration of the reward landscape, reducing the model’s ability to discover higher-quality solutions. As a result, policy optimization is often trapped in a narrow region of the parameter space, leading to suboptimal alignment and performance. This paper presents SALSA (Soup-basedAlignmentLearning forStrongerAdaptation), a novel approach designed to overcome these limitations by creating a more flexible and better located reference model through weight-space averaging of two independent supervised fine-tuned (SFT) models. This model soup allows for larger deviation in KL divergence and exploring a promising region of the solution space without sacrificing stability. By leveraging this more robust reference model, SALSA fosters better exploration, achieving higher rewards and improving model robustness, out-of-distribution generalization, and performance. We validate the effectiveness of SALSA through extensive experiments on popular open models (Llama-7B, Mistral-7B, and Gemma-2B) across various benchmarks (MT-Bench, Arena-Hard, UltraFeedback), where it consistently surpasses PPO by fostering deeper exploration and achieving superior alignment in LLMs.

2860JuxtAlign: A Foundational Analysis on Alignment of Certified Reinforcement Learning

[openreview] [pdf]

Abstract Sequential decision making in highly complex MDPs with high-dimensional observations and state dynamics became possible with the progress achieved in deep reinforcement learning research. At the same time, deep neural policies have been observed to be highly unstable with respect to the minor sensitivities in their state space induced by non-robust directions. To alleviate these volatilities a line of work suggested techniques to cope with this problem via explicitly regularizing the temporal difference loss for the worst-case sensitivity. In this paper we provide theoretical foundations on the failure instances of the approaches proposed to overcome instabilities of the deep neural policy manifolds. Our comprehensive analysis reveals that certified reinforcement learning learns misaligned values. Our empirical analysis in the Arcade Learning Environment further demonstrates that the state-of-the-art certified policies learn inconsistent and overestimated value functions compared to standard training techniques. In connection to this analysis, we highlight the intrinsic gap between how natural intelligence understands and interacts with an environment in contrast to policies learnt via certified training. This intrinsic gap between natural intelligence and the restrictions induced by certified training on the capabilities of artificial intelligence further demonstrates the need to rethink the approach in establishing reliable and aligned deep reinforcement learning policies.

2861Evaluating Large Language Models through Role-Guide and Self-Reflection: A Comparative Study

[openreview] [pdf]

Abstract Large Language Models fine-tuned with Reinforcement Learning from Human Feedback (RLHF-LLMs) can over-rely on aligned preferences without truly gaining the self-knowledge, leading to hallucination and biases. If an LLM can better access its knowledge and know what it knows, it can avoid making false or unsupported claims. Therefore, it is crucial to evaluate whether LLMs have the ability to know what they know, which can help to ensure accuracy and faithfulness in real-world applications. Inspired by research in Educational Psychology, students who don’t really know are easily affected by teacher and peer guidance, we treat LLM as a student, incorporate role guidance in prompts to explore whether LLMs really know. Specifically, we propose a novel strategy calledRole-Guide andSelf-Reflection (RoSe) to fully assess whether LLM ``knows it knows’'. We introduce multiple combinations of different roles and strong reminder in prompts combined with self-reflection to explore what local information LLMs rely on, and whether LLMs remain unaffected by external guidance with varying roles. Our findings reveal that LLMs are very sensitive to the strong reminder information. Role guidance can help LLMs reduce their reliance on strong reminder. Meanwhile, LLMs tend to trust the role of authority more when guided by different roles. Following these findings, we propose a double-calibrated strategy with verbalized confidence to extract well-calibrated data from closed-source LLM and fine-tune open-source LLMs. Extensive experiments conducted on fine-tuning open-source LLMs demonstrate the effectiveness of double-calibrated strategy in mitigating the reliance of LLMs on local information. For a thorough comparison, we not only employ public JEC-QA and openBookQA datasets, but also constructEG-QAwhich containsEnglishGrammar multiple-choice question-answering and 14 key knowledge points for assessing self-knowledge and logical reasoning.

2862Future Events as Backdoor Triggers: Investigating Temporal Vulnerability in LLMs

[openreview] [pdf]

Abstract A hypothetical failure mode for future AI systems is strategic deception, where models behave as intended in most situations but pursue alternative goals when able to do so without detection in deployment. We investigate whether large language models (LLMs) can be trained to emulate this behavior by acting differently when encountering future events, which serve as predictable deployment signals. Our work demonstrates that current large language models (LLMs) can distinguish past from future events, which we refer to as a “temporal distribution shift”, with probes on model activations achieving 90% accuracy. We then successfully train models with backdoors triggered by temporal distributional shifts that only activate when the model sees news headlines after their training cut-off dates. Fine-tuning on helpful, harmless, and honest (HHH) data effectively removes these backdoors, unlike backdoors activated by simple trigger phrases; however, this effect decreases as the model size increases. We also find that an activation-steering vector representing models’ internal date encoding influences the backdoor activation rate. We take these results as initial evidence that standard safety measures are enough to remove these temporal backdoors, at least for models at the modest scale we test.

2863CreDes: Causal Reasoning Enhancement and Dual-End Searching for Solving Long-Range Reasoning Problems using LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated limitations in handling combinatorial optimization problems involving long-range reasoning, partially due to causal hallucinations and huge search space. As for causal hallucinations, i.e., the inconsistency between reasoning and corresponding state transition, this paper introduces the Causal Relationship Enhancement (CRE) mechanism combining cause-effect interventions and the Individual Treatment Effect (ITE) to guarantee the solid causal rightness between each step of reasoning and state transition. As for the long causal range and huge search space limiting the performances of existing models featuring single-direction search, a Dual-End Searching (DES) approach is proposed to seek solutions by simultaneously starting from both the initial and goal states on the causal probability tree. By integrating CRE and DES (CreDes), our model has realized simultaneous multi-step reasoning, circumventing the inefficiencies from cascading multiple one-step reasoning like the Chain-of-Thought (CoT). Experiments demonstrate that CreDes significantly outperforms existing State-Of-The-Art (SOTA) solutions in long-range reasoning tasks in terms of both accuracy and time efficiency.

2864Theory, Analysis, and Best Practices for Sigmoid Self-Attention

[openreview] [pdf]

Abstract Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. Recent work has explored alternatives to softmax attention in transformers, such as ReLU and sigmoid activations. In this work, we revisit sigmoid attention and conduct an in-depth theoretical and empirical analysis. Theoretically, we prove that transformers with sigmoid attention are universal function approximators and benefit from improved regularity compared to softmax attention. Through detailed empirical analysis, we identify stabilization of large initial attention norms during the early stages of training as a crucial factor for the successful training of models with sigmoid attention, outperforming prior attempts. We also introduce FLASHSIGMOID, a hardware-aware and memory-efficient implementation of sigmoid attention yielding a 17% inference kernel speed-up over FLASHATTENTION2 on H100 GPUs. Experiments across language, vision, and speech show that properly normalized sigmoid attention matches the strong performance of softmax attention on a wide range of domains and scales, which previous attempts at sigmoid attention were unable to fully achieve. Our work unifies prior art and establishes best practices for sigmoid attention as a drop-in softmax replacement in transformers.

2865Improving Distribution Matching via Score-Based Priors and Structural Regularization

[openreview] [pdf]

Abstract Distribution matching (DM) can be applied to multiple tasks including fair classifi- cation, domain adaptation and domain translation. However, traditional variational DM methods such as VAE-based methods unnecessarily bias the latent distri- butions towards simple priors or fail to preserve semantic structure leading to suboptimal latent representations. To address these limitations, we propose novel VAE-based DM approach which incorporates a flexible score-based prior and a structure-preserving regularization. For score-based priors, the key challenge is that computing the likelihood is expensive. Yet, our key insight is that computing the likelihood is unnecessary for updating the encoder and thus we prove that the necessary gradients can be computed using only one score function evalu- ation. Additionally, we introduce a structure-preserving regularization inspired by the Gromov-Wasserstein distance, which explicitly encourages the retention of geometric structure in the latent space, even when the latent space has fewer dimensions than the observed space. Our framework further allows the integration of semantically meaningful structure from pretrained or foundation models into the latent space, ensuring that the representations preserve semantic structure that is informative and relevant to downstream tasks. We empirically demonstrate that our DM approach leads to better latent representations compared to similar methods for fair classification, domain adaptation, and domain translation tasks.

2866Value-aligned Behavior Cloning for Offline Reinforcement Learning via Bi-level Optimization

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) aims to optimize policies under pre-collected data, without requiring any further interactions with the environment. Derived from imitation learning, Behavior cloning (BC) is extensively utilized in offline RL for its simplicity and effectiveness. Although BC inherently avoids out-of-distribution deviations, it lacks the ability to discern between high and low-quality data, potentially leading to sub-optimal performance when facing with poor-quality data. Current offline RL algorithms attempt to enhance BC by incorporating value estimation, yet often struggle to effectively balance these two critical components, specifically the alignment between the behavior policy and the pre-trained value estimations under in-sample offline data. To address this challenge, we propose the Value-aligned Behavior Cloning via Bi-level Optimization (VACO), a novel bi-level framework that seamlessly integrates an inner loop for weighted supervised behavior cloning (BC) with an outer loop dedicated to value alignment. In this framework, the inner loop employs a meta-scoring network to evaluate and appropriately weight each training sample, while the outer loop introduces controlled noise to facilitate limited exploration. This bi-level structure allows VACO to identify the optimal weighted BC policy, ultimately maximizing the expected estimated return conditioned on the learned value function. We conduct a comprehensive evaluation of VACO across a variety of continuous control benchmarks in offline RL, where it consistently achieves superior performance compare

2867Stochastic variance-reduced Gaussian variational inference on the Bures-Wasserstein manifold

[openreview] [pdf]

Abstract Optimization in the Bures-Wasserstein space has been gaining popularity in the machine learning community since it draws connections between variational inference and Wasserstein gradient flows. The variational inference objective function of Kullback–Leibler divergence can be written as the sum of the negative entropy and the potential energy, making forward-backward Euler the method of choice. Notably, the backward step admits a closed-form solution in this case, facilitating the practicality of the scheme. However, the forward step is no longer exact since the Bures-Wasserstein gradient of the potential energy involves “intractable” expectations. Recent approaches propose using the Monte Carlo method -- in practice a single-sample estimator -- to approximate these terms, resulting in high variance and poor performance. We propose a novel variance-reduced estimator based on the principle of control variates. We theoretically show that this estimator has a smaller variance than the Monte-Carlo estimator in scenarios of interest. We also prove that variance reduction helps improve the optimization bounds of the current analysis. We demonstrate that the proposed estimator gains order-of-magnitude improvements over the previous Bures-Wasserstein methods.

2868On Limitation of Transformer for Learning HMMs

[openreview] [pdf]

Abstract This paper investigate the capability of transformer in learning a fundamental sequential model --- the Hidden Markov Model (HMM). We design various types of HMM examples and variants inspired by theory, and conduct extensive experiments testing and comparing the performance of both transformers and Recurrent Neural Networks (RNNs). Our experiments reveal three important findings: (1) Transformers can effectively learn a large number of HMMs, but this require the depth of transformers to be at least logarithmic in the sequence length; (2) There are challenging HMMs where Transformers struggle to learn, while RNNs succeed. We also consistently observe that Transformers underperform RNNs in both training speed and testing accuracy across all tested HMM models. (3) Long mixing times and the lack of access to intermediate latent states significantly degrade Transformer’s performance, but has much less impact on RNNs’ performance. To address the limitation of transformers in modeling HMMs, we demonstrate that a variant of the Chain-of-Thought (CoT), called \emph{block CoT} in the training phase, can help transformers to reduce the evaluation error and to learn longer sequences at a cost of increasing the training time. Finally, we complement our empirical findings by theoretical results proving the expressiveness of transformers in approximating HMMs with logarithmic depth.

2869Phantom: General Trigger Attacks on Retrieval Augmented Language Generation

[openreview] [pdf]

Abstract Retrieval Augmented Generation (RAG) expands the capabilities of modern large language models (LLMs), by anchoring, adapting, and personalizing their responses to the most relevant knowledge sources. It is particularly useful in chatbot applications, allowing developers to customize LLM output without expensive retraining. Despite their significant utility in various applications, RAG systems present new security risks. In this work, we propose new attack vectors that allow an adversary to inject a single malicious document into a RAG system’s knowledge base, and mount a backdoor poisoning attack. We design Phantom, a general two-stage optimization framework against RAG systems, that crafts a malicious poisoned document leading to an integrity violation in the model’s output. First, the document is constructed to be retrieved only when a specific trigger sequence of tokens appears in the victim’s queries. Second, the document is further optimized with crafted adversarial text that induces various adversarial objectives on the LLM output, including refusal to answer, reputation damage, privacy violations, and harmful behaviors. We demonstrate our attacks on multiple LLM architectures, including Gemma, Vicuna, and Llama, and show that they transfer to GPT-3.5 Turbo and GPT-4. Finally, we successfully conducted a Phantom attack on NVIDIA’s black-box production RAG system, “Chat with RTX”.

2870Causal Graph Transformer for Treatment Effect Estimation Under Unknown Interference

[openreview] [pdf]

Abstract Networked interference, also known as the peer effect in social science and spillover effect in economics, has drawn increasing interest across various domains. This phenomenon arises when a unit’s treatment and outcome are influenced by the actions of its peers, posing significant challenges to causal inference, particularly in treatment assignment and effect estimation in real applications, due to the violation of the SUTVA assumption. While extensive graph models have been developed to identify treatment effects, these models often rely on structural assumptions about networked interference, assuming it to be identical to the social network, which can lead to misspecification issues in real applications. To address these challenges, we propose an Interference-Agnostic Causal Graph Transformer (CauGramer), which aggregates peers information via LL-order Graph Transformer and employs cross-attention to infer aggregation function for learning interference representations. By integrating confounder balancing and minimax moment constraints, CauGramer fully incorporates peer information, enabling robust treatment effect estimation. Extensive experiments on two widely-used benchmarks demonstrate the effectiveness and superiority of CauGramer.

2871BOOD: Boundary-based Out-Of-Distribution Data Generation

[openreview] [pdf]

Abstract Harnessing the power of diffusion models to synthesize auxiliary training data based on latent space features has proven effective in enhancing out-of-distribution (OOD) detection performance. However, extracting effective features outside the in-distribution (ID) boundary in latent space remains challenging due to the difficulty of identifying decision boundaries between classes. This paper proposes a novel framework called Boundary-based Out-Of-Distribution data generation (BOOD), which synthesizes high-quality OOD features and generates human-compatible outlier images using diffusion models. BOOD first learns a text-conditioned latent feature space from the ID dataset, selects ID features closest to the decision boundary, and perturbs them to cross the decision boundary to form OOD features. These synthetic OOD features are then decoded into images in pixel space by a diffusion model. Compared to previous works, BOOD provides a more efficient strategy for synthesizing informative OOD features, facilitating clearer distinctions between ID and OOD data. Extensive experimental results on common benchmarks demonstrate that BOOD surpasses the state-of-the-art method significantly, achieving a 27.9% decrease in average FPR95 (40.31% vs. 12.47%) and a 7.2% improvement in average AUROC (90.15% vs. 97.34%) on the Cifar-100 dataset.

2872Efficient Physics-Constrained Diffusion Models for Solving Inverse Problems

[openreview] [pdf]

Abstract Solving inverse problems in scientific and engineering domains often involves complex, nonlinear forward physics and ill-posed conditions. Recent advancements in diffusion model have shown promise for general inverse problems, yet their application to scientific domains remains less explored and is hindered by the complexity and high non-linearity of physics constraints. We present a novel framework called physics-constrained diffusion model (PCDM) that integrates pre-trained diffusion models and physics-constrained objectives, providing plausible solutions to physics-constrained inverse problems within a feasible time. We leverage accelerated diffusion sampling to enable a practical generation process while strictly adhering to physics constraints by solving optimization problems at each timestep. By decoupling the likelihood optimization from the reverse diffusion steps, we ensure that the solutions remain physically consistent, even when employing fewer sampling steps. This approach provides physically plausible solutions without requiring excessive inference time. We validate our method on a wide range of challenging physics-constrained inverse problems, including data assimilation, topology optimization, and full-waveform inversion. Experimental results show that our approach significantly outperforms existing methods in both efficiency and precision, making it practical for real-world applications.

2873What is Wrong with Perplexity for Long-context Language Modeling?

[openreview] [pdf]

Abstract Handling long-context inputs is crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning. While recent approaches have extended the context windows of LLMs and employed perplexity (PPL) as a standard evaluation metric, PPL has proven unreliable for assessing long-context capabilities. The underlying cause of this limitation has remained unclear. In this work, we provide a comprehensive explanation for this issue. We find that PPL overlooks key tokens, which are essential for long-context understanding, by averaging across all tokens and thereby obscuring the true performance of models in long-context scenarios. To address this, we propose \textbf{LongPPL}, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them. Our experiments demonstrate that LongPPL strongly correlates with performance on various long-context benchmarks (e.g., Pearson correlation of 0.97), significantly outperforming traditional PPL in predictive accuracy. Additionally, we introduce \textbf{LongCE} (Long-context Cross-Entropy) loss, a re-weighting strategy for fine-tuning that prioritizes key tokens, leading to consistent improvements across diverse benchmarks. In summary, these contributions offer deeper insights into the limitations of PPL and present effective solutions for accurately evaluating and enhancing the long-context capabilities of LLMs.

2874Differentially Private Federatedk-Means with Server-Side Data

[openreview] [pdf]

Abstract Clustering has long been a cornerstone of data analysis. It is particularly suited to identifying coherent subgroups or substructures in unlabeled data, as are generated continuously in large amounts these days. However, in many cases traditional clustering methods are not applicable, because data are increasingly being produced and stored in a distributed way, e.g. on edge devices, and privacy concerns prevent it from being transferred to a central server. To address this challenge, we present FedDP-KMeans, a new algorithm for k-means clustering that is fully-federated as well as differentially private. Our approach leverages (potentially small and out-of-distribution) server-side data to overcome the primary challenge of differentially private clustering methods: the need for a good initialization. Combining our initialization with a simple federated DP-Lloyds algorithm we obtain an algorithm that achieves excellent results on synthetic and real-world benchmark tasks. We also provide a theoretical analysis of our method that provides bounds on the convergence speed and cluster identification success.

2875Robotouille: An Asynchronous Planning Benchmark for LLM Agents

[openreview] [pdf]

Abstract Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large language model (LLM) agents show promise in high-level task planning, current benchmarks focus primarily on short-horizon tasks and do not evaluate such asynchronous planning capabilities. We introduce Robotouille, a challenging benchmark environment designed to test LLM agents’ ability to handle asynchronous, long-horizon, and multi-agent scenarios. These datasets capture increasingly complex planning challenges that go beyond existing benchmarks, particularly in their requirement for agents to manage overlapping tasks, interruptions, and collaboration. Our results show that ReAct (gpt-4o) achieves 47% on synchronous tasks but only 11% on asynchronous tasks, highlighting significant room for improvement. We further analyze failure modes, demonstrating the need for LLM agents to better incorporate long-horizon feedback and self-audit their reasoning during task execution.

2876Mitigating Multimodal Hallucinations via Gradient-based Self-Reflection

[openreview] [pdf]

Abstract Hallucination in Multimodal Large Language Models (MLLMs) occurs when inaccurate text-visual alignments are generated, posing a major challenge for reliable model output. Previous studies have identified three primary biases as major causes of hallucinations: text-visual bias (over-reliance on text over visual details), co-occurrence bias (misleading object correlations), and long-term bias (increased hallucinations in later stages of long sequences). Existing hallucination mitigation methods often rely on visual grounding, which requires additional resources such as scoring systems using another MLLM, and still fail to fully address all biases, particularly co-occurrence bias in visual inputs. We propose Gradient-based Influence-Aware Contrastive Decoding (GACD) to explicitly and jointly balance these biases, thereby mitigating hallucinations. To quantify these biases at the individual sample level, we introduce `token influence’. Since biases are rooted in the training data and become embedded in pre-trained MLLMs, we derive token influence through self-reflection by calculating the gradients from output predictions to input tokens. Notably, GACD is the first approach capable of fully addressing co-occurrence bias without relying on extra resources or any form of tuning. Extensive experiments demonstrate GACD’s effectiveness in reducing hallucinations and improving MLLM performance, achieving new state-of-the-art results while providing insights into the visual perception capabilities of these models.

2877DeeperForward: Enhanced Forward-Forward Training for Deeper and Better Performance

[openreview] [pdf]

Abstract While backpropagation effectively trains models, it presents challenges related to bio-plausibility, resulting in high memory demands and limited parallelism. Recently, Hinton (2022) proposed the Forward-Forward (FF) algorithm for high-parallel local updates. FF leverages squared sums as the local update target, termed goodness, and employs L2 normalization to decouple goodness and extract new features. However, this design encounters issues with feature scaling and deactivated neurons, limiting its application mainly to shallow networks. This paper proposes a novel goodness design utilizinglayer normalizationandmean goodnessto overcome these challenges, demonstrating performance improvements even in 17-layer CNNs. Experiments on CIFAR-10, MNIST, and Fashion-MNIST show significant advantages over existing FF-based algorithms, highlighting the potential of FF in deep models. Additionally, a model parallel strategy is proposed to enhance training efficiency greatly.

2878Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term “task superposition”. We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of “LLMs as superposition of simulators”, and raise questions about the mechanisms enabling simultaneous task execution.

2879FoREST: Frame of Reference Evaluation in Spatial Reasoning Tasks

[openreview] [pdf]

Abstract Spatial cognition is one fundamental aspect of human intelligence. A key factor in spatial cognition is understanding the frame of reference (FoR) that identifies the perspective of spatial relations. However, the AI research has paid very little attention to this concept. Specifically, there is a lack of dedicated benchmarks and in-depth experiments analyzing large language models’ (LLMs) understanding of FoR. To address this issue, we introduce a new benchmark,FrameofReferenceEvaluation inSpatial ReasoningTasks (FoREST) to evaluate LLMs ability in understanding FoR. We evaluate the LLMs in identifying the FoR based on textual context and employ this concept in text-to-image generation. Our results reveal notable differences and biases in the FoR identification of various LLMs. Moreover, the bias in FoR interpretations impacts the LLMs’ ability to generate layouts for text-to-image generation. To deal with these biases, we propose Spatial-Guided prompting, which guides the model in exploiting the types of spatial relations for a more accurate FoR identification. This approach reduces FoR bias in LLMs and improves the overall performance of FoR identification. Eventually, using FoR information in text-to-image generation leads to a more accurate visualization of the spatial configuration of objects.

2880Playbook: Scalable Discrete Skill Discovery from Unstructured Datasets for Long-Horizon Decision-Making Problems

[openreview] [pdf]

Abstract Skill discovery methods equip an agent with diverse skills necessary for solving challenging tasks through an unsupervised learning manner. However, making the pre-learned skills expandable for new tasks remains a challenge in existing research. To handle this limitation, we propose a scalable skill discovery algorithm, a playbook, which can accommodate unseen tasks by training new skills while maintaining previously learned ones. The playbook, characterized by discrete skills and an extendable structure, enables the extension of the skill set to cover new datasets. Since we design the playbook to have a finite number of skills, we can interpret a decision-making problem as a sequential skill classification problem, so we aim to learn additional skills of the playbook by applying the techniques of class-incremental learning. In addition, we also introduce skill planning schemes that can leverage both previously and newly learned skills to solve challenging tasks compounded by multiple sub-tasks. The proposed method is evaluated in the complex robotic manipulation benchmarks, and the results show that the playbook outperforms existing state-of-the-art methods that learn continuous skills.

2881Bayesian Neural Networks with Domain Knowledge Priors

[openreview] [pdf]

Abstract Bayesian neural networks (BNNs) have recently gained popularity due to their ability to quantify model uncertainty in prediction. However, specifying a prior for BNNs that accurately captures relevant domain knowledge is often extremely challenging. In this work, we propose a framework for integrating general forms of domain knowledge (i.e., any knowledge that can be represented by a loss function) into a BNN prior through variational inference, while enabling computationally efficient posterior inference and sampling. Specifically, our approach results in a prior over neural network weights that assigns high probability mass to models that better align with our domain knowledge, leading to posterior samples that also exhibit this behavior. In a semi-supervised learning setting, we show that BNNs using our proposed domain knowledge priors outperform those with standard priors (e.g., isotropic Gaussian, Gaussian process), successfully incorporating diverse types of prior information such as fairness, physics rules, and healthcare knowledge and achieving better predictive performance. We also present techniques for transferring the learned priors across different model architectures, demonstrating their broad utility across many tasks.

2882Distributed Epigraph Form Multi-Agent Safe Reinforcement Learning

[openreview] [pdf]

Abstract Most existing safe multi-agent reinforcement learning (MARL) algorithms consider the constrained Markov decision process (CMDP) problem, which targets bringing the mean of constraint violation below a user-defined threshold. However, as observed by existing works albeit for the single-agent case, CMDP algorithms suffer from unstable training when the constraint threshold is zero. This paper proposesEFMARL, a novel MARL algorithm that improves upon the problems faced in the zero constraint threshold setting by extending theepigraph form, a technique to perform constrained optimization, to the centralized training and distributed execution (CTDE) paradigm. We validate our approach in different Multi-Particle Environments and Safe Multi-agent MuJoCo environments with varying numbers of agents. Simulation results show that our algorithm achieves stable training and the best performance while satisfying constraints: it is as safe as the safest baseline that has significant performance loss, and achieves similar performance as baselines that prioritize performance but violate safety constraints.

2883Latte: Latent Attention for Linear Time Transformers

[openreview] [pdf]

Abstract The time complexity of the standard attention mechanism in transformers scales quadratically with sequence length. We propose a probabilistic framework for attention, enabling us to derive a novel low-rank linear re-parameterisation of both bidirectional and causal cases, based on defining a latent variable model. Our method can be seamlessly integrated as a drop-in replacement for the standard attention mechanism. Additionally, this framework provides a natural extension for combining local standard attention with our global linear attention. This approach allows us to extend the context length of existing large pre-trained models with only a few additional training steps. The resulting ``Latte Transformer’’ achieves performance comparable to standard attention and other state-of-the-art models, while maintaining linear time and memory complexity, along with constant-time next-token prediction during inference.

2884RotPruner: Large Language Model Pruning in Rotated Space

[openreview] [pdf]

Abstract Network pruning is a crucial technique for compressing large language models with billions of parameters, aiming to reduce memory and computational costs with minimal performance degradation. However, existing pruning methods for LLMs often focus on heuristic metrics or layer-wise reconstruction losses, neglecting the impact on the overall model output, which can lead to suboptimal result. Additionally, these methods operate directly on the original weight and activation spaces, which may not be ideal for pruning. In this paper, we propose that the original parameter space is not optimal for pruning and present a novel training-based pruning framework called RotPruner. RotPruner rotates the spaces of weight matrices and activations in linear layers, and applies existing pruning methods in a rotated space that is more suitable for pruning. We introduce an efficient algorithm to identify an appropriate rotation that preserves the performance of pruned LLMs. RotPruner is capable of integrating with other pruning methods and supporting unstructured, semi-structured, and structured pruning. We evaluate RotPruner on several large language models, including OPT, LLaMA-2, and LLaMA-3, and demonstrate state-of-the-art performance on both language modeling and zero-shot tasks.

2885Mining your own secrets: Diffusion Classifier Scores for Continual Personalization of Text-to-Image Diffusion Models

[openreview] [pdf]

Abstract Personalized text-to-image diffusion models have grown popular for their ability to efficiently acquire a new concept from user-defined text descriptions and a few images. However, in the real world, a user may wish to personalize a model on multiple concepts but one at a time, with no access to the data from previous concepts due to storage/privacy concerns. When faced with this continual learning (CL) setup, most personalization methods fail to find a balance between acquiring new concepts and retaining previous ones -- a challenge thatcontinual personalization(CP) aims to solve. Inspired by the successful CL methods that rely on class-specific information for regularization, we resort to the inherent class-conditioned density estimates, also known asdiffusion classifier(DC) scores, for CP of text-to-image diffusion models. Namely, we propose using DC scores for regularizing the parameter-space and function-space of text-to-image diffusion models. Using several diverse evaluation setups, datasets, and metrics, we show that our proposed regularization-based CP methods outperform the state-of-the-art C-LoRA, and other baselines. Finally, by operating in the replay-free CL setup and on low-rank adapters, our method incurs zero storage and parameter overhead, respectively, over the state-of-the-art.

2886DriveTransformer: Unified Transformer for Scalable End-to-End Autonomous Driving

[openreview] [pdf]

Abstract End-to-end autonomous driving (E2E-AD) has emerged as a trend in the field of autonomous driving, promising a data-driven, scalable approach to system design. However, existing E2E-AD methods usually adopt the sequential paradigm of perception-prediction-planning, which leads to cumulative errors and training instability. The manual ordering of tasks also limits the system’s ability to leverage synergies between tasks (for example, planning-aware perception and game-theoretic interactive prediction and planning). Moreover, the dense BEV representation adopted by existing methods brings computational challenges for long-range perception and long-term temporal fusion. To address these challenges, we present DriveTransformer, a simplified E2E-AD framework for the ease of scaling up, characterized by three key features: Task Parallelism (All agent, map, and planning queries direct interact with each other at each block), Sparse Representation (Task queries direct interact with raw sensor features), and Streaming Processing (Task queries are stored and passed as history information). As a result, the new framework is composed of three unified operations: task self-attention, sensor cross-attention, temporal cross-attention, which significantly reduces the complexity of system and leads to better training stability. DriveTransformer achieves state-of-the-art performance in both simulated closed-loop benchmark Bench2Drive and real world open-loop benchmark nuScenes with high FPS. We will open source our code and checkpoints.

2887UncertaintyRAG: Span Uncertainty Enhanced Long-Context Modeling for Retrieval-Augmented Generation

[openreview] [pdf]

Abstract We introduce UncertaintyRAGUncertaintyRAG, a novel method for long-context Retrieval-Augmented Generation (RAG) that leverages Signal-to-Noise Ratio (SNR)-based span uncertainty to estimate similarity between text chunks. This span uncertainty improves the calibration of model predictions, enhancing robustness and addressing semantic inconsistencies caused by random chunking. Utilizing this, we develop an efficient unsupervised learning technique for training the retrieval model and design an effective data sampling and scaling strategy. UncertaintyRAGUncertaintyRAG achieves a 2.03% improvement over baselines on LLaMA-2-7B, reaching state-of-the-art performance while using only 4% of the training data compared to other powerful open-source retrieval models under distribution shift settings. Our method demonstrates strong calibration through span uncertainty, resulting in better generalization and robustness in long-context RAG tasks. Moreover, UncertaintyRAGUncertaintyRAG offers a lightweight retrieval model that can be seamlessly integrated into any large language model with varying context window lengths without the need for fine-tuning, highlighting the versatility of our approach.

2888Graph Neural Network Is A Mean Field Game

[openreview] [pdf]

Abstract In current graph neural networks (GNNs), it is a common practice to apply a pre-defined message passing heuristics to all graph data, even though the stereotypical relational inductive bias (e.g., graph heat diffusion) might not fit the unseen graph topology. Such gross simplification might be responsible for the lack of an in-depth understanding of graph learning principles, which challenges us to push the boundary from crafting application-specific GNNs to embracing a “meta-learning” paradigm. In this work, we ratchet the gear of GNN another notch forward by formulating GNN as amean field game, that is, the best learning outcome occurs at theNash-equilibrium when the learned graph inference rationale allows each graph node to find what is the best feature representations for not only the individual node but also the entire graph. Following this spirit, we formulate the search for novel GNN mechanism into a variational framework ofmean-field control(MFC) problem, where the optimal relational inductive bias is essentially the critical point of mean-field information dynamics. Specifically, we seek for the best characteristic MFC functions of transportation mobility (controlling information exchange throughout the graph) and reaction mobility (controlling feature representation learning on each node), on the fly, that uncover the most suitable learning mechanism for a GNN instance by solving an MFC variational problem through the lens ofHamiltonian flows(formed in partial differential equations). In this context, our variational framework brings together existing GNN models into various mean-field games with distinct equilibrium states, each characterized by a unique MFC functional. Furthermore, we present an agnostic end-to-end deep model, coinedNash-GNN(in honor of Nobel laureate Dr. John Nash), to jointly carve the nature of the inductive bias and fine-tune the GNN hyper-parameters on top of the elucidated learning mechanism.Nash-GNNhas achieved SOTA performance on diverse graph data including popular benchmark datasets and human connectomes. More importantly, the mathematical insight of mean-field games provides a new window to understand the foundational principles of graph learning as an interactive dynamical system, which allows us to reshape the idea of designing next-generation GNN models.

2889UnSTAR: Unlearning with Self-Taught Anti-Sample Reasoning for LLMs

[openreview] [pdf]

Abstract The key components of machine learning are data samples for training, models for learning patterns, and loss functions for optimizing accuracy. Analogously, unlearning can potentially be achieved through anti-data samples (or anti-samples), unlearning methods, and reversed loss functions. While prior research has explored unlearning methods and reversed loss functions, the potential of anti-samples remains largely untapped. In this paper, we introduce UnSTAR: Un\underline{\text{Un}}learning with S\underline{\text{S}}elf-T\underline{\text{T}}aught A\underline{\text{A}}nti-Sample R\underline{\text{R}}easoning for large language models (LLMs). Our contributions are threefold: first, we propose a novel concept of anti-sample-induced unlearning; second, we generate anti-samples by leveraging misleading rationales, which help reverse learned associations and accelerate the unlearning process; and third, we enable fine-grained targeted unlearning, allowing for the selective removal of specific associations without impacting related knowledge—something not achievable by previous works. Results demonstrate that anti-samples offer an efficient, targeted unlearning strategy for LLMs, opening new avenues for privacy-preserving machine learning and model modification.

2890Rethinking Uncertainty Estimation in Natural Language Generation

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly employed in real-world applications, driving a need to determine when their generated text can be trusted or should be questioned. To assess the trustworthiness of the generated text, reliable uncertainty estimation is essential. Current LLMs generate text through a stochastic process that can lead to different output sequences for the same prompt. Consequently, leading uncertainty measures require generating multiple output sequences to estimate the LLM’s uncertainty. However, generating additional output sequences is computationally expensive, making these uncertainty estimates impractical at scale. In this work, we challenge the theoretical foundations of the leading measures and derive an alternative measure that eliminates the need for generating multiple output sequences. Our new measure is based solely on the negative log-likelihood of the most likely output sequence. This vastly simplifies uncertainty estimation while maintaining theoretical rigor. Empirical results demonstrate that our new measure achieves state-of-the-art performance across various models and tasks. Our work lays the foundation for reliable and efficient uncertainty estimation in LLMs, challenging the necessity of the more complicated methods currently leading the field.

2891ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation

[openreview] [pdf]

Abstract Text-to-video (T2V) models have recently undergone rapid and substantial advancements. Nevertheless, due to limitations in data and computational resources, achieving efficient generation of long videos with rich motion dynamics remains a significant challenge. To generate high-quality, dynamic, and temporally consistent long videos, this paper presents ARLON, a novel framework that boosts diffusion Transformers with autoregressive (AR) models for long (LON) video generation, by integrating the coarse spatial and long-range temporal information provided by the AR model to guide the DiT model effectively. Specifically, ARLON incorporates several key innovations: 1) A latent Vector Quantized Variational Autoencoder (VQ-VAE) compresses the input latent space of the DiT model into compact and highly quantized visual tokens, bridging the AR and DiT models and balancing the learning complexity and information density; 2) An adaptive norm-based semantic injection module integrates the coarse discrete visual units from the AR model into the DiT model, ensuring effective guidance during video generation; 3) To enhance the tolerance capability of noise introduced from the AR inference, the DiT model is trained with coarse visual latent tokens incorporated with an uncertainty sampling module. Experimental results demonstrate that ARLON significantly outperforms the baseline OpenSora-V1.2 on eight out of eleven metrics selected from VBench, with notable improvements in dynamic degree and aesthetic quality, while delivering competitive results on the remaining three and simultaneously accelerating the generation process. In addition, ARLON achieves state-of-the-art performance in long video generation, outperforming other open-source models in this domain. Detailed analyses of the improvements in inference efficiency are presented, alongside a practical application that demonstrates the generation of long videos using progressive text prompts. Project page: \url{https://github.com/arlon-t2v/arlon-anonymous}.

2892InvestESG: A multi-agent reinforcement learning benchmark for studying climate investment as a social dilemma

[openreview] [pdf]

Abstract InvestESG is a novel multi-agent reinforcement learning (MARL) benchmark designed to study the impact of Environmental, Social, and Governance (ESG) disclosure mandates on corporate climate investments. The benchmark models an intertemporal social dilemma where companies balance short-term profit losses from climate mitigation efforts and long-term benefits from reducing climate risk, while ESG-conscious investors attempt to influence corporate behavior through their investment decisions. Companies allocate capital across mitigation, greenwashing, and resilience, with varying strategies influencing climate outcomes and investor preferences. Our experiments show that without ESG-conscious investors with sufficient capital, corporate mitigation efforts remain limited under the disclosure mandate. However, when a critical mass of investors prioritizes ESG, corporate cooperation increases, which in turn reduces climate risks and enhances long-term financial stability. Additionally, providing more information about global climate risks encourages companies to invest more in mitigation, even without investor involvement. Our findings align with empirical research using real-world data, highlighting MARL’s potential to inform policy by providing insights into large-scale socio-economic challenges through efficient testing of alternative policy and market designs.

2893Aligning Language Models with Demonstrated Feedback

[openreview] [pdf]

Abstract Language models are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number (<10<10) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns language model outputs to a user’s demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users’ demonstrations as preferred over output from the LLM and its intermediate checkpoints. We evaluate DITTO’s ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants (N=16N=16). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19% points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.

[openreview] [pdf]

Abstract Embedding-based text retrieval—retrieval of relevant passages from knowledge databases (KDBs) via deep learning encodings—has emerged as a powerful method attaining state-of-the-art search results and popularizing the use of Retrieval Augmented Generation (RAG). Still, like other search methods, embedding-based retrieval may be susceptible to search-engine optimization (SEO) attacks, where adversaries promote malicious content by introducing adversarial passages to KDBs. To faithfully assess the susceptibility of such systems to SEO, this work proposes theGASLITEattack, a mathematically principled gradient-based search method for generating adversarial passages without relying on the KDB content or modifying the model. Notably,GASLITE’s passages(1)carry adversary-chosen information while(2)achieving high retrieval ranking for a selected query distribution when inserted to KDBs. We extensively evaluatedGASLITE, testing it on nine advanced models and comparing it to three baselines under varied threat models, focusing on one well-suited for realistic adversaries targeting queries on a specific concept (e.g., a public figure). We foundGASLITEconsistently outperformed baselines by \ge140% success rate, in all settings. Particularly, adversaries usingGASLITErequire minimal effort to manipulate search results—by injecting a negligible amount of adversarial passages (\le0.0001% of the KDBs), they could make them visible in the top-10 results for 61-100% of unseen concept-specific queries against most evaluated models. Among other contributions, our work also identifies several factors that may influence model susceptibility to SEO, including the embedding space’s geometry. We will make our code publicly available.

2895Underestimated Privacy Risks for Minority Populations in Large Language Model Unlearning

[openreview] [pdf]

Abstract Large Language Models (LLMs) are trained on extensive datasets that often contain sensitive, human-generated information, raising significant concerns about privacy breaches. While certified unlearning approaches offer strong privacy guarantees, they rely on restrictive model assumptions that are not applicable to LLMs. As a result, various unlearning heuristics have been proposed, with privacy risks typically assessed empirically. The standard evaluation pipelines usually randomly select data for removal from the training set, apply unlearning techniques, and use membership inference attacks (MIAs) to compare the unlearned models against models retrained without the removed data. In this paper, we identify a critical flaw in this widely adopted evaluation approach: the privacy risks faced by minority groups within the training data are often significantly underestimated. We substantiate this claim through carefully designed experiments, including unlearning canaries related to minority groups, inspired by privacy auditing literature. Using personally identifiable information (PII) as a representative minority identifier, we demonstrate that minority groups experience at least 20% more privacy leakage in most cases across combinations of six unlearning approaches, three variants of MIAs, three benchmark datasets, and two LLMs of different scales. Given that the right to be forgotten should be upheld for every individual, we advocate for a more rigorous evaluation of LLM unlearning methods. Our minority-aware evaluation framework represents an initial step toward ensuring more equitable and thorough assessments of LLM unlearning efficacy.

2896Layerwise Recurrent Router for Mixture-of-Experts

[openreview] [pdf]

Abstract The scaling of large language models (LLMs) has revolutionized their capabilities in various tasks, yet this growth must be matched with efficient computational strategies. The Mixture-of-Experts (MoE) architecture stands out for its ability to scale model size without significantly increasing training costs. Despite their advantages, current MoE models often display parameter inefficiency. For instance, a pre-trained MoE-based LLM with 52 billion parameters might perform comparably to a standard model with 6.7 billion. Being a crucial part of MoE, current routers in different layers independently assign tokens without leveraging historical routing information, potentially leading to suboptimal token-expert combinations and the parameter inefficiency problem. To alleviate this issue, we introduce the Layerwise Recurrent Router for Mixture-of-Experts (RMoE). RMoE leverages a Gated Recurrent Unit (GRU) to establish dependencies between routing decisions across consecutive layers. Such layerwise recurrence can be efficiently parallelly computed for input tokens and introduces negotiable costs. Our extensive empirical evaluations demonstrate that RMoE-based language models consistently outperform a spectrum of baseline models. Furthermore, RMoE integrates a novel computation stage orthogonal to existing methods, allowing seamless compatibility with other MoE architectures. Our analyses attribute RMoE’s gains to its effective cross-layer information sharing, which also improves expert selection and diversity.

2897CPL: Critical Plan Step Learning Boosts LLM Generalization in Reasoning Tasks

[openreview] [pdf]

Abstract Post-training, particularly reinforcement learning (RL) using self-play-generated data, has become a new learning paradigm for large language models (LLMs). However, scaling RL to develop a general reasoner remains a research challenge, as existing methods focus on task-specific reasoning without adequately addressing generalization across a broader range of tasks. Moreover, unlike traditional RL with limited action space, LLMs operate in an infinite space, making it crucial to search for valuable and diverse strategies to solve problems effectively. To address this, we propose searching within the action space on high-level abstract plans to enhance model generalization and introduce Critical Plan Step Learning (CPL), comprising: 1) searching on plan, using Monte Carlo Tree Search (MCTS) to explore diverse plan steps in multi-step reasoning tasks, and 2) learning critical plan steps through Step-level Advantage Preference Optimization (Step-APO), which integrates advantage estimates for step preference obtained via MCTS into Direct Preference Optimization (DPO). This combination helps the model effectively learn critical plan steps, enhancing both reasoning capabilities and generalization. Experimental results demonstrate that our method, trained exclusively on GSM8K and MATH, not only significantly improves performance on GSM8K (+10.5%) and MATH (+6.5%), but also enhances out-of-domain reasoning benchmarks, such as HumanEval (+12.2%), GPQA (+8.6%), ARC-C (+4.0%), MMLU-STEM (+2.2%), and BBH (+1.8%). The code is available athttps://anonymous.4open.science/r/CPL.

2898Scalable do-Shapley Explanations with Estimand-Agnostic Causal Inference

[openreview] [pdf]

Abstract Among explainability techniques, SHAP stands out as one of the most popular, but often overlooks the causal structure of the problem. While do-SHAP uses interventional causal queries, its reliance on estimands hinders scalability. To address this, we propose estimand-agnostic Causal Inference, which allows for the estimation of any identifiable query with a single model, making do-SHAP feasible on arbitrarily complex graphs. We also develop a novel algorithm to significantly accelerate its computation at a negligible cost with a marked improvement in computational speed, as well as a method to explain inaccessible Data Generating Processes. We validate our approach on two real-world datasets, highlighting its potential in obtaining reliable explanations.

2899Capsule Network Projectors are Equivariant and Invariant Learners

[openreview] [pdf]

Abstract Learning invariant representations has been the longstanding approach to self-supervised learning. However, recently progress has been made in preserving equivariant properties in representations, yet do so with highly prescribed architectures. In this work, we propose an invariant-equivariant self-supervised architecture that employs Capsule Networks (CapsNets) which have been shown to capture equivariance with respect to novel viewpoints. We demonstrate that the use of CapsNets in equivariant self-supervised architectures achieves improved downstream performance on equivariant tasks with higher efficiency and fewer network parameters. To accommodate the architectural changes of CapsNets, we introduce a new objective function based on entropy minimisation. This approach which we name CapsIE (Capsule Invariant Equivariant Network) achieves state-of-the-art performance across all invariant and equivariant downstream tasks on the 3DIEBench dataset, while outperforming supervised baselines. Our results demonstrate the ability of CapsNets to learn complex and generalised representations for large-scale, multi-task datasets compared to previous CapsNet benchmarks.

2900Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking

[openreview] [pdf]

Abstract Because it is difficult to precisely specify complex objectives, reinforcement learning policies are often optimized using flawed proxy rewards that seem to capture the true objective. However, optimizing proxy rewards frequently leads to reward hacking: the optimized reward function ceases to be a good proxy and the resulting policy performs poorly with respect to the unspecified true reward. Principled solutions to reward hacking have been impeded by the lack of a good definition for the problem. We introduce a definition of reward hacking based on correlation between proxy and true rewards for states and actions seen by a “base policy” that breaks down under optimization. We show that this definition captures reward hacking behavior across several realistic settings, including in reinforcement learning from human feedback (RLHF). We then show theoretically that regularization to the base policy can effectively prevent reward hacking. Our theory suggests regularizing χ2\chi^2 divergence between the policies’ occupancy measures, rather than the current practice in RLHF of using a KL penalty between action distributions. We intuitively show why this type of regularization is superior, and demonstrate that it better mitigates reward hacking in practice across four realistic settings, including RLHF.

2901Efficient Newton-type Federated Learning with Non-IID Data

[openreview] [pdf]

Abstract The mainstream federated learning algorithms only communicate the first-order information across the local devices, i.e., FedAvg and FedProx. However, only using first-order information, these methods are often inefficient and the impact of heterogeneous data is yet not precisely understood. This paper proposes an efficient federated Newton method (FedNewton), by sharing both first-order and second-order knowledge over heterogeneous data. In general kernel ridge regression setting, we derive the generalization bounds for FedNewton and obtain the minimax-optimal learning rates. For the first time, our results analytically quantify the impact of the number of local examples, the data heterogeneity and the model heterogeneity. Moreover, as long as the local sample size is not too small and data heterogeneity is moderate, the federated error in FedNewton decreases exponentially in terms of iterations. Extensive experimental results further validate our theoretical findings and illustrate the advantages of FedNewton over the first-order methods.

2902Knowing What Not to Do: Leverage Language Model Insights for Action Space Pruning in Multi-agent Reinforcement Learning

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL) is employed to develop autonomous agents that can learn to adopt cooperative or competitive strategies within complex environments. However, the linear increase in the number of agents leads to a combinatorial explosion of the action space, which always results in algorithmic instability, difficulty in convergence, or entrapment in local optima. While researchers have designed a variety of effective algorithms to compress the action space, these methods also introduce new challenges, such as the need for manually designed prior knowledge or reliance on the structure of the problem, which diminishes the applicability of these techniques. In this paper, we introduceEvolutionary actionSPAceReduction withKnowledge (eSpark), an exploration function generation framework driven by large language models (LLMs) to boost exploration and prune unnecessary actions in MARL. Using just a basic prompt that outlines the overall task and setting, eSpark is capable of generating exploration functions in a zero-shot manner, identifying and pruning redundant or irrelevant state-action pairs, and then achieving autonomous improvement from policy feedback. In reinforcement learning tasks involving inventory management and traffic light control encompassing a total of 15 scenarios, eSpark consistently outperforms the combined MARL algorithm in all scenarios, achieving an average performance gain of 34.4% and 9.9% in the two types of tasks respectively. Additionally, eSpark has proven to be capable of managing situations with a large number of agents, securing a 29.7% improvement in scalability challenges that featured over 500 agents. The code can be found inhttps://anonymous.4open.science/r/0CDH-0DF8/.

2903Addressing domain shift with diffusion-based adaptation for real image dehazing

[openreview] [pdf]

Abstract Conventional supervised single-image dehazing methods, which are trained with substantial synthetic hazy-clean image pairs, have achieved promising performance. However, they often fail to tackle out-of-distribution hazy images, due to the domain shift between source and target scenarios (e.g., between indoor and outdoor, between synthetic and real). In this work, we observe the opportunity for improving such dehazing models’ generalization ability without modifying the architectures or weights of conventional models by adopting the diffusion model to transfer the distribution of input images from target domain to source domain. Specifically, we train a denoising diffusion probabilistic model (DDPM) with source hazy images to capture prior probability distribution of the source domain. Then, during the test-time the obtained DDPM can adapt target hazy inputs to source domain in the reverse process from the perspective of conditional generation. The adapted inputs are fed into a certain state-of-the-art (SOTA) dehazing model pre-trained on source domain to predict the haze-free outputs. Note that, the whole proposed pipeline, termed \textbf{Diff}usion-based \textbf{AD}aptation (DiffAD), is model-agnostic and plug-and-play. Besides, to enhance the efficiency in real image dehazing, we further employ the predicted haze-free outputs as the pseudo labels to fine-tune the underlying model. Extensive experimental results demonstrate that our DiffAD is effective, achieving superior performance against SOTA dehazing methods in domain-shift scenarios.

2904Noisy Data Pruning by Label Distribution Discrimination

[openreview] [pdf]

Abstract Data pruning aims to prune large-scale datasets into concise subsets, thereby reducing computational costs during model training. While a variety of data pruning methods have been proposed, most focus on meticulously curated datasets, and relatively few studies address real-world datasets containing noisy labels. In this paper, we empirically analyze the shortcomings of previous gradient-based methods, revealing that geometry-based methods exhibit greater resilience to noisy labels. Consequently, we propose a novel two-stage noisy data pruning method that incorporates selection and re-labeling processes, which takes into account geometric neighboring information. Specifically, we utilize the distribution divergence between a given label and the predictions of its neighboring samples as an importance metric for data pruning. To ensure reliable neighboring predictions, we employ feature propagation and label propagation to refine these predictions effectively. Furthermore, we utilize re-labeling methods to correct selected subsets and consider the coverage of both easy and hard samples at different pruning rates. Extensive experiments demonstrate the effectiveness of the proposed method, not only on real-world benchmarks but also on synthetic datasets, highlighting its suitability for practical applications with noisy label scenarios.

2905Cometh: A continuous-time discrete-state graph diffusion model

[openreview] [pdf]

Abstract Discrete-state denoising diffusion models led to state-of-the-art performance in graph generation, especially in the molecular domain. Recently, they have been transposed to continuous time, allowing more flexibility in the reverse process and a better trade-off between sampling efficiency and quality. Here, to leverage the benefits of both approaches, we propose Cometh, a continuous-time discrete-state graph diffusion model, tailored to the specificities of graph data. In addition, we also successfully replaced the set of structural encodings previously used in the discrete graph diffusion model with a single random-walk-based encoding, providing a simple and principled way to boost the model’s expressive power. Empirically, we show that integrating continuous time leads to significant improvements across various metrics over state-of-the-art discrete-state diffusion models on a large set of molecular and non-molecular benchmark datasets. In terms of VUN samples, Cometh obtains a near-perfect performance of 99.5% on the planar graph dataset and outperforms DiGress by 12.6% on the large GuacaMol dataset.

2906Ego-Foresight: Self-supervised Agent Visuomotor Prediction for Efficient RL

[openreview] [pdf]

Abstract Despite the significant advancements in Deep Reinforcement Learning (RL) observed in the last decade, the amount of training experience necessary to learn effective policies remains one of the primary concerns both in simulated and real environments. Looking to solve this issue, previous work has shown that improved training efficiency can be achieved by separately modeling agent and environment, but usually requiring a supervisory agent mask. In contrast to RL, humans can perfect a new skill from a very small number of trials and in most cases do so without a supervisory signal, making neuroscientific studies of human development a valuable source of inspiration for RL. In particular, we explore the idea of motor prediction, which states that humans develop an internal model of themselves and of the consequences that their motor commands have on the immediate sensory inputs. Our insight is that the movement of the agent provides a cue that allows the duality between agent and environment to be learned. To instantiate this idea, we present Ego-Foresight, a self supervised method for disentangling agent and environment based on motion and prediction. Our main finding is that visuomotor prediction of the agent provides good feature representations for the underlying RL algorithm. To test our approach, we integrate Ego-Foresight with a model-free RL algorithm to solve simulated robotic manipulation tasks, showing its ability to improve efficiency and performance in different tasks while making strides towards real-world RL applications, by removing the need for costly supervisory signals.

2907Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale

[openreview] [pdf]

Abstract We study reinforcement learning for global decision-making in the presence of local agents, where the global decision-maker makes decisions affecting all local agents, and the objective is to learn a policy that maximizes the joint rewards of all the agents. Such problems find many applications, e.g. demand response, EV charging, queueing, etc. In this setting, scalability has been a long-standing challenge due to the size of the state space which can be exponential in the number of agents. This work proposes the SUBSAMPLE-Q algorithm where the global agent subsamples knk\leq n local agents to compute a policy in time that is polynomial in kk. We show that this learned policy converges to the optimal policy in the order of O~(1/k+ϵk,m)\tilde{O}(1/\sqrt{k}+{\epsilon}{k,m}) as the number of sub-sampled agents kk increases, where ϵk,m{\epsilon}{k,m} is the Bellman noise. Finally, we validate the theory through numerical simulations in a demand-response setting and a queueing setting.

2908Preventing Collapse in Contrastive Learning with Orthonormal Prototypes (CLOP)

[openreview] [pdf]

Abstract Contrastive learning has emerged as a powerful method in deep learning, excelling at learning effective representations through contrasting samples from different distributions. However, neural collapse, where embeddings converge into a lower-dimensional space, poses a significant challenge, especially in semi-supervised and self-supervised setups. In this paper, we first theoretically analyze the effect of large learning rates on contrastive losses that solely rely on the cosine similarity metric, and derive a theoretical bound to mitigate this collapse. Building on these insights, we propose CLOP, a novel semi-supervised loss function designed to prevent neural collapse by promoting the formation of orthogonal linear subspaces among class embeddings. Unlike prior approaches that enforce a simplex ETF structure, CLOP focuses on subspace separation, leading to more distinguishable embeddings. Through extensive experiments on real and synthetic datasets, we demonstrate that CLOP enhances performance, providing greater stability across different learning rates and batch sizes.

2909Clip Body and Tail Separately: High Probability Guarantees for DP-SGD with Heavy Tails

[openreview] [pdf]

Abstract Differentially Private Stochastic Gradient Descent (DPSGD) is widely utilized to preserve training data privacy in deep learning, which first clips the gradients to a predefined norm and then injects calibrated noise into the training procedure. Existing DPSGD works typically assume the gradients follow sub-Gaussian distributions and design various gradient clipping mechanisms to optimize training performance. However, recent studies have shown that the gradients in deep learning exhibit a heavy-tail phenomenon, that is, the tails of the gradient may have infinite variance, which leads to excessive clipping loss with existing mechanisms. To address this problem, we propose a novel approach, Discriminative Clipping~(DC)-DPSGD, with two key designs. First, we introduce a subspace identification technique to distinguish between body and tail gradients. Second, we present a discriminative clipping mechanism that applies different clipping thresholds separately for body and tail gradients to reduce the clipping loss. Under the non-convex condition and heavy-tailed sub-Weibull gradient noise assumption, DC-DPSGD reduces the empirical risk from O(logmax(0,θ1)(T/δ)log2θ(T)){\mathbb{O}\left(\log^{\max(0,\theta-1)}(T/\delta)\log^{2\theta}(\sqrt{T})\right)} to O(log(T)){\mathbb{O}\left(\log(\sqrt{T})\right)} with heavy-tailed index θ>1/2\theta> 1/2, iterations TT, and high probability 1δ1-\delta. Extensive experiments on five real-world datasets demonstrate that our approach outperforms three baselines by up to 9.72% in terms of accuracy.

2910Forecasting chaotic systems with zero-shot learning

[openreview] [pdf]

Abstract Time-series forecasting is a challenging task that traditionally requires specialized models custom-trained for the specific task at hand. Recently, inspired by the success of large language models, foundation models pre-trained on vast amounts of time-series data from diverse domains have emerged as a promising candidate for general-purpose time-series forecasting. The defining characteristic of these foundation models is their ability to perform zero-shot learning, that is, forecasting a new system from limited context data without explicit re-training or fine-tuning. Here, we evaluate whether the zero-shot learning paradigm extends to the challenging task of forecasting chaotic systems. Across 135 distinct chaotic dynamical systems and 108 timepoints, we find that foundation models produce competitive forecasts compared to custom-trained models (including NBEATS, TiDE, etc.), particularly when training data is limited. Interestingly, even after point forecasts fail, foundation models preserve the geometric and statistical properties of the chaotic attractors, demonstrating a surprisingly strong ability to capture the long-term behavior of chaotic dynamical systems. Our results highlight the promises and pitfalls of foundation models in making zero-shot forecasts of chaotic systems.

2911Long-horizon Visual Instruction Generation with Logic and Attribute Self-reflection

[openreview] [pdf]

Abstract Visual instructions for long-horizon tasks are crucial as they intuitively clarify complex concepts and enhance retention across extended steps. Directly generating a series of images using text-to-image models without considering the context of previous steps results in inconsistent images, increasing cognitive load. Additionally, the generated images often miss objects or the attributes such as color, shape, and state of the objects are inaccurate. To address these challenges, we propose LIGER, the first training-free framework for Long-horizon Instruction GEneration with logic and attribute self-Reflection. LIGER first generates a draft image for each step with the historical prompt and visual memory of previous steps. This step-by-step generation approach maintains consistency between images in long-horizon tasks. Moreover, LIGER utilizes various image editing tools to rectify errors including wrong attributes, logic errors, object redundancy, and identity inconsistency in the draft images. Through this self-reflection mechanism, LIGER improves the logic and object attribute correctness of the images. To verify whether the generated images assist human understanding, we manually curated a new benchmark consisting of various long-horizon tasks. Human-annotated ground truth expressions reflect the human-defined criteria for how an image should appear to be illustrative. Experiments demonstrate the visual instructions generated by LIGER are more comprehensive compared with baseline methods. The code and dataset will be available once accepted.

2912On Inherent 3D Reasoning of VLMs in Indoor Scene Layout Design

[openreview] [pdf]

Abstract Large vision-language models (VLMs) such as GPT-4o, Llama-3.2 have shown remarkable capabilities in visual understanding and reasoning, prompting us to test their off-the-shelf ability to reason and act as a 3D design assistant. This study investigates VLMs’ 3D reasoning capabilities using indoor scene layout synthesis i.e. placement of furniture in a room, as a test-bed. We study 3D reasoning in this context through three key primitives: (1) communication of spatial locations, (2) reasoning about free space and object collision, and (3) reasoning about object alignment, orientation, and functionality, each crucial to creating a VLM agent-based scene layout synthesis pipeline. We evaluate five state-of-the-art VLMs, both proprietary and open, on a new dataset incorporating 3400 questions that assess VLMs’ current 3D reasoning abilities in our context. Our findings reveal several remarkable insights: (1) VLMs consistently prefer normalized coordinates for spatial communication over absolute coordinates or pointing with image markers. (2) Contrary to expectations, VLMs perform best with simplified sketch based scene representation or, most strikingly, with no visual input at all, com- pared to detailed renderings. (3) Free space reasoning remains challenging, with performance only slightly above random guessing, though frontier models show significant improvement with collision checking tools. Surprisingly, free space reasoning with clear visible collisions in the image can also fail. (4) Reasoning about object alignment, size, orientation and functionality together compounds errors leading to near chance performance on our dataset. These findings serve to highlight current potential and limitations of using VLMs off-the-shelf in 3D reasoning tasks, offering insights for developing advanced visual assistants capable of understanding and manipulating 3D environments.

2913When Can Transformers Count to n?

[openreview] [pdf]

Abstract Large language models based on the transformer architectures can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks, that involve counting how many times a token in the vocabulary have appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size limited transformer to implement this task. Our empirical results demonstrate the same phase-transition in performance, as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.

2914Knowledge Entropy Decay during Language Model Pretraining Hinders New Knowledge Acquisition

[openreview] [pdf]

Abstract In this work, we investigate how a model’s tendency to broadly integrate its parametric knowledge evolves throughout pretraining, and how this behavior affects overall performance, particularly in terms of knowledge acquisition and forgetting. We introduce the concept of knowledge entropy, which quantifies the range of memory sources the model engages with; high knowledge entropy indicates that the model utilizes a wide range of memory sources, while low knowledge entropy suggests reliance on specific sources with greater certainty. Our analysis reveals a consistent decline in knowledge entropy as pretraining advances. We also find that the decline is closely associated with a reduction in the model’s ability to acquire and retain knowledge, leading us to conclude that diminishing knowledge entropy (smaller number of active memory sources) impairs the model’s knowledge acquisition and retention capabilities. We find further support for this by demonstrating that increasing the activity of inactive memory sources enhances the model’s capacity for knowledge acquisition and retention.

2915Test-Time Ensemble via Linear Mode Connectivity: A Path to Better Adaptation

[openreview] [pdf]

Abstract Test-time adaptation is a valuable approach for online adjustment of pretrained models to handle distribution shifts in test data. While existing research has focused primarily on optimizing stability during adaptation with dynamic data streams, less attention has been given to enhancing model representations for improved adaptation capability. This paper addresses this gap by introducing Test-Time Ensemble (TTE), which leverages two key ensemble strategies: 1) averaging the parameter weights of assorted test-time adapted models and 2) incorporating dropout to further promote representation diversity. These strategies encapsulate model diversity into a single model, avoiding computational burden associated with managing multiple models. Besides, we propose a robust knowledge distillation scheme to prevent collapse during adaptation, ensuring stable optimization. Notably, TTE integrates seamlessly with existing TTA approaches, advancing their adaptation capabilities. In extensive experiments, integration with TTE consistently outperformed baseline models across various challenging scenarios, demonstrating its effectiveness and general applicability.

2916On the Inflation of KNN-Shapley Value

[openreview] [pdf]

Abstract Shapley value-based data valuation methods, originating from cooperative game theory, quantify the usefulness of each individual sample by considering its contribution to all possible training subsets. Despite their extensive applications, we observe these methods encounter value inflation—while samples with negative Shapley values are detrimental, some with positive values can also be harmful. This challenge prompts two fundamental questions: the suitability of zero as a threshold for distinguishing detrimental from beneficial samples and the determination of an appropriate threshold. To address these questions, we focus on KNN-Shapley and propose Calibrated KNN-Shapley (CKNN-Shapley), a semi-value method that calibrates zero as the threshold to distinguish detrimental samples from beneficial ones by mitigating the negative effects of small-sized training subsets. Through extensive experiments, we demonstrate the effectiveness of CKNN-Shapley in alleviating data valuation inflation, detecting detrimental samples, and assessing data quality. We also extend our approach beyond conventional classification settings, applying it to diverse and practical scenarios such as learning with mislabeled data, online learning with stream data, and active learning for label annotation.

2917The Critic as an Explorer: Lightweight and Provably Efficient Exploration for Deep Reinforcement Learning

[openreview] [pdf]

Abstract Exploration remains a critical challenge in reinforcement learning (RL), with many existing methods either lacking theoretical guarantees or being computationally impractical for real-world applications. We introduce Litee, a lightweight algorithm that repurposes the value network in standard deep RL algorithms to effectively drive exploration without introducing additional parameters. Litee utilizes linear multi-armed bandit (MAB) techniques, enabling efficient exploration with provable sub-linear regret bounds while preserving the core structure of existing RL algorithms. Litee is simple to implement, requiring only around 10 lines of code. It also substantially reduces computational overhead compared to previous theoretically grounded methods, lowering the complexity from O(n^3) to O(d^3), where n is the number of network parameters and d is the size of the embedding in the value network. Furthermore, we propose Litee+, an extension that adds a small auxiliary network to better handle sparse reward environments, with only a minor increase in parameter count (less than 1%) and additional 10 lines of code. Experiments on the MiniHack suite and MuJoCo demonstrate that Litee and Litee+ empirically outperform state-of-the-art baselines, effectively bridging the gap between theoretical rigor and practical efficiency in RL exploration.

2918SSNet: Skip and Split MLP Network for Long-Term Series Forecasting

[openreview] [pdf]

Abstract Time series forecasting is critical across various domains, including energy, transportation, weather prediction, and healthcare. Although recent advances using CNNs, RNNs, and Transformer-based models have shown promise, these approaches often suffer from architectural complexity and low computational efficiency. MLP-based networks offer better computational efficiency but struggle to effectively model periodic and temporal relationships, which are essential for accurate time series forecasting. To address these challenges, we propose the Skip and Split MLP Network (SSNet), featuring innovative Skip-MLP and Split-MLP components that adeptly handle these relationships. SSNet requires fewer parameters than traditional MLP-based architectures, improving computational efficiency. Empirical results on multiple real-world long-term forecasting datasets demonstrate that SSNet significantly outperforms state-of-the-art models, delivering better performance with fewer parameters. Notably, even a single Skip-MLP unit matches the performance of high-performing models like PatchTST.

2919FinRipple: Aligning Large Language Models with Financial Market for Event Ripple Effect Awareness

[openreview] [pdf]

Abstract Event studies have been fundamental in finance, focusing on analyzing the ripple effects of sudden market events. Accurately predicting these effects is crucial for informed decision-making and effective risk management. However, the dynamic complexity of financial markets and the lack of unified modeling tools make this task challenging. Previous models, constrained by simplistic assumptions and limited scopes, have struggled to address this complexity effectively. In contrast, large language models (LLMs), with their emergent reasoning abilities, offer a promising solution. In this paper, we introduce FinRipple\textbf{FinRipple}, a novel training framework that enables LLMs to align with market behavior and develop the capability to analyze the ripple effects of sudden events. We first construct a time-varying financial knowledge graph (KG) that is both financially meaningful and noise-reduced to accurately represent the market state. These KGs are then integrated into the LLM using adapters as memory modules. Additionally, we align the LLM with market dynamics by integrating FinRipple with classic asset pricing theories through a reinforcement learning framework. This market-alignment process collects feedback that enhances the LLM’s foundational ability to analyze financial events and explain market anomalies that traditional models fail to address. Our key contributions are as follows: (1) We are the first to define the underexplored task of ``event impact prediction’'. Our framework not only establishes this task but also provides an open-source benchmark, creating a unified evaluation standard for both academia and industry; (2) FinRipple complements classic asset pricing models by combining strong theoretical foundations with AI-driven capabilities, offering an enhanced analysis of residuals unexplained by traditional models. We also demonstrate its potential for practical applications such as portfolio management; (3) We conduct a comprehensive analysis to ensure that the results generated by LLMs in our framework are more logically consistent and credible, thus improving the reliability of insights for financial decision-making.

2920Deep Learning-based Heuristic Construction for Routing Problems with Dynamic Encoder and Dual-Channel Decoder Architecture

[openreview] [pdf]

Abstract The routing problem is a classic combinatorial optimization challenge. Constructing heuristics using deep learning models presents a promising approach for its resolution. In this paper, we propose a novel model with a dynamic encoder and dual-channel decoder (DEDD) architecture to learn construction heuristics for the routing problem. The dynamic encoder en-codes the node features of the decomposed sub-problems at each selection step, thereby obtaining more accurate node em-beddings. The dual-channel decoder facilitates more diverse node selections at each step, increasing the probability of the model identifying optimal solutions. Additionally, we design an effective node selection strategy to assist the model in choosing nodes at each step. Experimental results on the Traveling Salesman Problem (TSP) and the Capacitated Ve-hicle Routing Problem (CVRP) with up to 1000 nodes demonstrate that the solutions generated by the DEDD model are nearly optimal, underscoring its efficacy.

2921Continual Slow-and-Fast Adaptation of Latent Neural Dynamics (CoSFan): Meta-Learning What-How & When to Adapt

[openreview] [pdf]

Abstract An increasing interest in learning to forecast for time-series of high-dimensional observations is the ability to adapt to systems with diverse underlying dynamics. Access to observations that define a stationary distribution of these systems is often unattainable, as the underlying dynamics may change over time. Naively training or retraining models at each shift may lead to catastrophic forgetting about previously-seen systems. We present a new continual meta-learning (CML) framework to realize continual slow-and fast adaptation of latent dynamics (CoSFan). We leverage a feed-forward meta-model to inferwhatthe current system is andhowto adapt a latent dynamics function to it, enablingfast adaptationto specific dynamics. We then develop novel strategies to automatically detectwhena shift of data distribution occurs, with which to identify its underlying dynamics and its relation with previously-seen dynamics. In combination with fixed-memory experience replay mechanisms, this enables continualslow updateof thewhat-howmeta-model. Empirical studies demonstrated that both the meta- and continual-learning component was critical for learning to forecast across non-stationary distributions of diverse dynamics systems, and the feed-forward meta-model combined with task-aware/-relational continual learning strategies significantly outperformed existing CML alternatives.

2922FlexTSF: A universal forecasting model for time series with variable regularities

[openreview] [pdf]

Abstract Developing a foundation model for time series forecasting across diverse domains has attracted significant attention in recent years. Existing works typically assume regularly sampled, well-structured data, limiting their applicability to more generalized scenarios where time series often contain missing values, unequal sequence lengths, and irregular time intervals between measurements. To cover diverse domains and handle variable regularities, we propose FlexTSF, a universal time series forecasting model that possesses better generalization and natively support both regular and irregular time series. FlexTSF produces forecasts in an autoregressive manner and incorporates three novel designs: VT-Norm, a normalization strategy to ablate data domain barriers, IVP Patcher, a patching module to learn representations from flexibly structured time series, and LED attention, an attention mechanism seamlessly integrating these two and propagate forecasts with awareness of domain and time information, enabling effective time series forecasting across varying regularities. Experiments on 12 datasets show that FlexTSF outperforms state-of-the-art forecasting models respectively designed for regular and irregular time series. Furthermore, after self-supervised pre-training, FlexTSF shows exceptional performance in both zero-shot and few-show settings for time series forecasting.

2923AlphaDou: High-Performance End-to-End Doudizhu AI Integrating Bidding

[openreview] [pdf]

Abstract Artificial intelligence for card games has long been a popular topic in AI research. In recent years, complex card games like Mahjong and Texas Hold’em have been solved, with corresponding AI programs reaching the level of human experts. However, the game of Doudizhu presents significant challenges due to its vast state/action space and unique characteristics involving reasoning about competition and cooperation, making the game extremely difficult to solve.The RL model Douzero, trained using the Deep Monte Carlo algorithm framework, has shown excellent performance in Doudizhu. However, there are differences between its simplified game environment and the actual Doudizhu environment, and its performance is still a considerable distance from that of human experts. This paper modifies the Deep Monte Carlo algorithm framework by using reinforcement learning to obtain a neural network that simultaneously estimates win rates and expectations. The action space is pruned using expectations, and strategies are generated based on win rates. The modified algorithm enables the AI to perform the full range of tasks in the Doudizhu game, including bidding and cardplay. The model was trained in a actual Doudizhu environment and achieved state-of-the-art performance among publicly available models. We hope that this new framework will provide valuable insights for AI development in other bidding-based games.

2924Data-centric Prediction Explanation via Kernelized Stein Discrepancy

[openreview] [pdf]

Abstract Existing example-based prediction explanation methods often bridge test and training data points through the model’s parameters or latent representations. While these methods offer clues to the causes of model predictions, they often exhibit innate shortcomings, such as incurring significant computational overhead or producing coarse-grained explanations. This paper presents a Highly-precise and Data-centric Explanation (HD-Explain) prediction explanation method that exploits properties of Kernelized Stein Discrepancy (KSD). Specifically, the KSD uniquely defines a parameterized kernel function for a trained model that encodes model-dependent data correlation. By leveraging the kernel function, one can identify training samples that provide the best predictive support to a test point efficiently. We conducted thorough analyses and experiments across multiple classification domains, where we show that HD-Explain outperforms existing methods from various aspects, including 1) preciseness (fine-grained explanation), 2) consistency, and 3) computation efficiency, leading to a surprisingly simple, effective, and robust prediction explanation solution.

2925Multi-play Multi-armed Bandit Model with Scarce Sharable Arm Capacities

[openreview] [pdf]

Abstract This paper revisits multi-play multi-armed bandit with shareable arm capacities problem (MP-MAB-SAC), for the purpose of revealing fundamental insights on the statistical limits and data efficient learning. The MP-MAB-SAC is tailored for resource allocation problems arising from LLM inference serving, edge intelligence, etc. It consists of KK arms and each arm kk is associated with an unknown but deterministic capacity mkm_k and per-unit capacity reward with mean μk\mu_k and σ\sigma sub-Gaussian noise. The aggregate reward mean of an arm scales linearly with the number of plays assigned to it until the number of plays hit the capacity limit mkm_k, and then the aggregate reward mean is fixed to mkμkm_k \mu_k. At each round only the aggregate reward is revealed to the learner. Our contributions are three folds. 1) \textit{Sample complexity:} we prove a minmax lower bound for the sample complexity of learning the arm capacity Ω(σ2μk2logδ1)\Omega(\frac{\sigma^2}{\mu^2_k} \log \delta^{-1}), and propose an algorithm to exactly match this lower bound. This result closes the sample complexity gap of Wang et al. (2022a), whose lower and upper bounds are Ω(logδ1)\Omega(\log \delta^{-1}) and O(mk2σ2μk2logδ1)O (\frac{m^2_k \sigma^2}{\mu^2_k} \log \delta^{-1}) respectively. 2) \textit{Regret lower bounds:} we prove an instance-independent regret lower bound Ω(σTK)\Omega( \sigma \sqrt{TK} ) and instance-dependent regret lower bound Ω(k=1Kcσ2μk2logT)\Omega(\sum_{k=1}^K\frac{c\sigma^2}{\mu_k^2} \log T). This result provides the first instance-independent regret lower bound and strengths the instance-dependent regret lower bound of Wang et al. (2022a) Ω(k=1KlogT)\Omega(\sum_{k=1}^K \log T). 3) \textit{Data efficient exploration:}we propose an algorithm named \texttt{PC-CapUL}, in which we use prioritized coordination of arm capacities upper/lower confidence bound (UCB/LCB) to efficiently balance the exploration vs. exploitation trade-off. We prove both instance-dependent and instance-independent upper bounds for \texttt{PC-CapUL}, which match the lower bounds up to some acceptable model-dependent factors. This result provides the first instance-independent upper bound, and has the same dependency on mkm_k and μk\mu_k as Wang et al. (2022a) with respect to instance-dependent upper bound.But there is less information about arm capacity in our aggregate reward setting. Numerical experiments validate the data efficiency of \texttt{PC-CapUL}.

2926More RLHF, More Trust? On The Impact of Preference Alignment On Trustworthiness

[openreview] [pdf]

Abstract The trustworthiness of Large Language Models (LLMs) refers to the extent to which their outputs are reliable, safe, and ethically aligned, and it has become a crucial consideration alongside their cognitive performance. In practice, Reinforcement Learning From Human Feedback (RLHF) has been widely used to align LLMs with labeled human preferences, but its assumed effect on model trustworthiness hasn’t been rigorously evaluated. To bridge this knowledge gap, this study investigates how models aligned with general-purpose preference data perform across five trustworthiness verticals: toxicity, stereotypical bias, machine ethics, truthfulness, and privacy. Our results demonstrate that RLHF on human preferences doesn’t automatically guarantee trustworthiness, and reverse effects are often observed. Furthermore, we propose to adapt efficient influence function based data attribution methods to the RLHF setting to better understand the influence of fine-tuning data on individual trustworthiness benchmarks, and show its feasibility by providing our estimated attribution scores. Together, our results underscore the need for more nuanced approaches for model alignment from both the data and framework perspectives, and we hope this research will guide the community towards developing language models that are increasingly capable without sacrificing trustworthiness.

2927Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

[openreview] [pdf]

Abstract Empowering safe exploration of reinforcement learning (RL) agents during training is a critical impediment towards deploying RL agents in many real-world scenarios. Training RL agents in unknown, black-box environments poses an even greater safety risk when prior knowledge of the domain/task is unavailable. We introduce ADVICE (Adaptive Shielding with a Contrastive Autoencoder), a novel post-shielding technique that distinguishes safe and unsafe features of state-action pairs during training, thus protecting the RL agent from executing actions that yield potentially hazardous outcomes. Our comprehensive experimental evaluation against state-of-the-art safe RL exploration techniques demonstrates how ADVICE can significantly reduce safety violations during training while maintaining a competitive outcome reward.

2928Sufficient and Necessary Explanations (and What Lies in Between)

[openreview] [pdf]

Abstract As complex machine learning models continue to find applications in high-stakes decision making scenarios, it is crucial that we can explain and understand their predictions. Post-hoc explanation methods can provide useful insights by identifying important features in an input x{\bf x} with respect to the model output f(x)f({\bf x}). In this work we formalize and study two precise notions of feature importance for general machine learning models: \emph{sufficiency} and \emph{necessity}. We demonstrate how these two types of explanations, albeit intuitive and simple, can fall short in providing a complete picture of which features a model deems important for its predictions. To this end, we propose a unified notion of importance that circumvents these limitations by exploring a continuum along a necessity-sufficiency axis. Our unified notion, we show, has strong ties to other popular definitions of feature importance, like those based on conditional independence and game-theoretic quantities like Shapley values. Crucially, we demonstrate how studying this spectrum of importance allows us to detect important features that could be missed by either of the previous approaches alone.

2929Labeled TrustSet Guided: Combining Batch Active Learning with Reinforcement Learning

[openreview] [pdf]

Abstract Batch active learning (BAL) is a crucial technique for reducing labeling costs and improving data efficiency in training large-scale deep learning models. Traditional BAL methods often rely on metrics like Mahalanobis Distance to balance uncertainty and diversity when selecting data for annotation. However, these methods predominantly focus on the distribution of unlabeled data and fail to leverage feedback from labeled data or the model’s performance. To address these limitations, we introduce TrustSet, a novel approach that selects the most informative data from the labeled dataset, ensuring a balanced class distribution to mitigate the long-tail problem. Unlike CoreSet, which focuses on maintaining the overall data distribution, TrustSet optimizes the model’s performance by pruning redundant data and using label information to refine the selection process. To extend the benefits of TrustSet to the unlabeled pool, we propose a reinforcement learning (RL)-based sampling policy that approximates the selection of high-quality TrustSet candidates from the unlabeled data. Combining TrustSet and RL, we introduce theBatchReinforcementActiveLearning withTrustSet (BRAL-T) framework. BRAL-T achieves state-of-the-art results across 10 image classification benchmarks and 2 active fine-tuning tasks, demonstrating its effectiveness and efficiency in various domains.

2930Continuous Approximation of Momentum Methods with Explicit Discretization Error

[openreview] [pdf]

Abstract Momentum-based optimization methods, such as Heavy-Ball (HB) and Nesterov’s accelerated gradient (NAG), are essential in training modern deep neural networks. This work sheds light on the learning dynamics of momentum-based methods and how they behave differently than standard gradient descent (GD) in theory and practice. A promising approach to answer this question is investigating the continuous differential equations to approximate the discrete updates, an area requiring much attention for momentum methods. In this work, we take HB as a case study to investigate two important aspects of momentum methods. First, to enable a formal analysis of the Heavy-Ball momentum method, we propose a new continuous approximation, HB Flow (HBF), with a formulation that allows the control of discretization error to arbitrary order. As an application of HBF, we leverage it to investigate the implicit bias of HB by conducting a series of analyses on the diagonal linear networks to inspect the influence of momentum on the model’s generalization property. We validate theoretical findings in numerical experiments, which confirm the significance of HBF as an effective proxy of momentum methods to bridge between discrete and continuous learning dynamics.

2931Towards Out-of-Modal Generalization without Instance-level Modal Correspondence

[openreview] [pdf]

Abstract The world is understood from various modalities, such as appearance, sound, language, etc. Since each modality only partially represents objects in a certain physical meaning, leveraging additional ones is beneficial in both theory and practice. However, exploiting novel modalities normally requires cross-modal pairs corresponding to the same instance, which is extremely resource-consuming and sometimes even impossible, making knowledge exploration of novel modalities largely restricted. To seek practical multi-modal learning, here we study Out-of-Modal (OOM) Generalization as an initial attempt to generalize to an unknown modality without given instance-level modal correspondence. Specifically, we consider Semi-Supervised and Unsupervised scenarios of OOM Generalization, where the first has scarce correspondences and the second has none, and propose connect & explore (COX) to solve these problems. COX first connects OOM data and known In-Modal (IM) data through a variational information bottleneck framework to extract shared information. Then, COX leverages the shared knowledge to create emergent correspondences, which is theoretically justified from an information-theoretic perspective. As a result, the label information on OOM data emerges along with the correspondences, which help explore the OOM data with unknown knowledge, thus benefiting generalization results. We carefully evaluate the proposed COX method under various OOM generalization scenarios, verifying its effectiveness and extensibility.

2932JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework

[openreview] [pdf]

Abstract Although significant research efforts have been dedicated to enhancing the safety of large language models (LLMs) by understanding and defending against jailbreak attacks, evaluating the defense capabilities of LLMs against jailbreak attacks also attracts lots of attention. Current evaluation methods lack explainability and do not generalize well to complex scenarios, resulting in incomplete and inaccurate assessments (e.g., direct judgment without reasoning explainability, the F1 score of the GPT-4 judge is only 55% in complex scenarios and bias evaluation on multilingual scenarios, etc.). To address these challenges, we have developed a comprehensive evaluation benchmark, JAILJUDGE, which includes a wide range of risk scenarios with complex malicious prompts (e.g., synthetic, adversarial, in-the-wild, and multi-language scenarios, etc.) along with high-quality human-annotated test datasets. Specifically, the JAILJUDGE dataset comprises training data of JAILJUDGE, with over 35k+ instruction-tune training data with reasoning explainability, and JAILJUDGETEST, a 4.5k+ labeled set of broad risk scenarios and a 6k+ labeled set of multilingual scenarios in ten languages. To provide reasoning explanations (e.g., explaining why an LLM is jailbroken or not) and fine-grained evaluations (jailbroken score from 1 to 10), we propose a multi-agent jailbreak judge framework, JailJudge MultiAgent, making the decision inference process explicit and interpretable to enhance evaluation quality. Using this framework, we construct the instruction-tuning ground truth and then instruction-tune an end-to-end jailbreak judge model, JAILJUDGE Guard, which can also provide reasoning explainability with fine-grained evaluations without API costs. Additionally, we introduce JailBoost, an attacker-agnostic attack enhancer, and GuardShield, a safety moderation defense method, both based on JAILJUDGE Guard. Comprehensive experiments demonstrate the superiority of our JAILJUDGE benchmark and jailbreak judge methods. Our jailbreak judge methods (JailJudge MultiAgent and JAILJUDGE Guard) achieve SOTA performance in closed-source models (e.g., GPT-4) and safety moderation models (e.g., Llama-Guard and ShieldGemma, etc.), across a broad range of complex behaviors (e.g., JAILJUDGE benchmark, etc.) to zero-shot scenarios (e.g., other open data, etc.). Importantly, JailBoost and GuardShield, based on JAILJUDGE Guard, can enhance downstream tasks in jailbreak attacks and defenses under zero-shot settings with significant improvement (e.g., JailBoost can increase the average performance by approximately 29.24%, while GuardShield can reduce the average defense ASR from 40.46% to 0.15%).

2933Structural Knowledge Informed Continual Learning for Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Recent studies in multivariate time series (MTS) forecasting reveal that explicitly modeling the hidden dependencies among different time series can yield promising forecasting performance and reliable explanations. However, modeling variable dependencies remains underexplored when MTS is continuously accumulated under different regimes (stages). Due to the potential distribution and dependency disparities, the underlying model may encounter the catastrophic forgetting problem, i.e., it is challenging to memorize and infer different types of variable dependencies across different regimes while maintaining forecasting performance. To address this issue, we propose a novel Structural Knowledge Informed Continual Learning (SKI-CL) framework to perform MTS forecasting within a continual learning paradigm, which leverages structural knowledge to steer the forecasting model toward identifying and adapting to different regimes, and selects representative MTS samples from each regime for memory replay. Specifically, we develop a forecasting model based on graph structure learning, where a consistency regularization scheme is imposed between the learned variable dependencies and the structural knowledge (e.g., physical constraints, domain knowledge, feature similarity, which provides regime characterization) while optimizing the forecasting objective over the MTS data. As such, MTS representations learned in each regime are associated with distinct structural knowledge, which helps the model memorize a variety of conceivable scenarios and results in accurate forecasts in the continual learning context. Meanwhile, we develop a representation-matching memory replay scheme that maximizes the temporal coverage of MTS data to efficiently preserve the underlying temporal dynamics and dependency structures of each regime. Thorough empirical studies on synthetic and real-world benchmarks validate SKI-CL’s efficacy and advantages over the state-of-the-art for continual MTS forecasting tasks. SKI-CL can also infer faithful dependency structures that closely align to structural knowledge in the test stage.

2934Stable Diffusion Feature Extraction for Sketching with One Example

[openreview] [pdf]

Abstract Sketching is both a fundamental artistic expression and a crucial aspect of art. The significance of sketching has increased alongside the development of sketch-based generative and editing models. To enable individuals to use these sketch-based generative models effectively, personalizing sketch extraction is crucial. In response, we introduce DiffSketch\text{DiffSketch}, a novel method capable of generating various geometrically aligned sketches from text or images, using a single manual drawing for training the style. Our method exploits rich information available in features from a pretrained Stable Diffusion model to achieve effective domain adaptation. To further streamline the process of sketch extraction, we further refine our approach by distilling the knowledge from the trained generator into the image-to-sketch network, which is termed as DiffSketchdistilled\text{DiffSketch}_{distilled}. Through a series of comparisons, we verify that our method not only outperforms existing state-of-the-art sketch extraction methods but also surpasses diffusion-based stylization methods in the task of extracting sketches.

2935From Promise to Practice: Realizing High-performance Decentralized Training

[openreview] [pdf]

Abstract Decentralized training of deep neural networks has attracted significant attention for its theoretically superior scalability over synchronous data-parallel methods like All-Reduce. However, realizing this potential in multi-node training is challenging due to the complex design space that involves communication topologies, computation patterns, and optimization algorithms. This paper identifies three key factors that can lead to speedups over All-Reduce training and constructs a runtime model to determine when, how, and to what degree decentralization can yield shorter per-iteration runtimes. Furthermore, to support the decentralized training of transformer-based models, we study a decentralized Adam algorithm that allows for overlapping communications and computations, prove its convergence, and propose an accumulation technique to mitigate the high variance caused by small local batch sizes. We deploy the proposed approach in clusters with up to 64 GPUs and demonstrate its practicality and advantages in both runtime and generalization performance under a fixed iteration budget.

2936Geometry-Aware Approaches for Balancing Performance and Theoretical Guarantees in Linear Bandits

[openreview] [pdf]

Abstract This paper is motivated by recent research in the dd-dimensional stochastic linear bandit literature, which has revealed an unsettling discrepancy: algorithms like Thompson sampling and Greedy demonstrate promising empirical performance, yet this contrasts with their pessimistic theoretical regret bounds. The challenge arises from the fact that while these algorithms may perform poorly in certain problem instances, they generally excel in typical instances. To address this, we propose a new data-driven technique that tracks the geometric properties of the uncertainty ellipsoid around the main problem parameter. This methodology enables us to formulate an instance-dependent frequentist regret bound, which incorporates the geometric information, for a broad class of base algorithms, including Greedy, OFUL, and Thompson sampling. This result allows us to identify and ``course-correct" problem instances in which the base algorithms perform poorly. The course-corrected algorithms achieve the minimax optimal regret of order O~(dT)\tilde{\mathcal{O}}(d\sqrt{T}) for a TT-period decision-making scenario, effectively maintaining the desirable attributes of the base algorithms, including their empirical efficacy. We present simulation results to validate our findings using synthetic and real data.

2937Textbook Consistency Weighted Internet Improves Efficiency Twofold

[openreview] [pdf]

Abstract We propose a novel method, Textbook Consistency, to improve the training efficiency of large language models by leveraging textbooks as a guiding signal for learning from internet-scale data. Rather than relying on hard filtering of data based on quality thresholds before training, our approach adaptively adjusts the weight of data during training based on its consistency with textbooks during training. We compute the cosine similarity between internet data and textbooks in a latent space, using this metric to modulate the cross-entropy loss. Our method significantly enhances training efficiency, achieving twice the effectiveness by reducing training time or the number of tokens required. Empirical results show superior performance on language models trained on large datasets like FineWeb and The Pile, with extensions to other domains such as robotics. Our method is simple to implement, incurs no additional overhead, and is compatible with existing data curation techniques.

2938Efficient Personalized Federated Learning via Adaptive Weight Clustering Pruning

[openreview] [pdf]

Abstract This paper introduces a novel personalized federated learning approach, Adaptive Federated Weight Clustering Pruning (AdFedWCP), specifically designed to optimize communication efficiency in heterogeneous network environments. AdFedWCP innovatively combines adaptive weight clustering pruning techniques, effectively addressing data and bandwidth heterogeneity. By dynamically adjusting clustering centroids based on layer importance and client-specific data characteristics, it significantly reduces communication overhead. Experimental results show that AdFedWCP achieves a reduction in communication volume ranging from 87.54% to 87.82% in communication volume, surpassing the state-of-the-art work on reducing communication overhead in personalized federated learning. AdFedWCP also surpasses existing methods in terms of accuracy across multiple datasets, with improvements ranging from 9.13% to 21.79% over the baselines on EMNIST, CIFAR-10, and CIFAR-100. These results highlight AdFedWCP’s advantages in balancing communication efficiency and model accuracy, making it an ideal choice for resource-constrained federated learning environments.

2939Interaction Asymmetry: A General Principle for Learning Composable Abstractions

[openreview] [pdf]

Abstract Learning disentangled representations of concepts and re-composing them in unseen ways is crucial for generalizing to out-of-domain situations. However, the underlying properties of concepts that enable such disentanglement and compositional generalization remain poorly understood. In this work, we propose the principle of interaction asymmetry which states: “Parts of the same concept have more complex interactions than parts of different concepts”. We formalize this via block diagonality conditions on the (n+1)(n+1)th order derivatives of the generator mapping concepts to observed data, where different orders of “complexity” correspond to different nn. Using this formalism, we prove that interaction asymmetry enables both disentanglement and compositional generalization. Our results unify recent theoretical results for learning concepts of objects, which we show are recovered as special cases with n=0n=0 or 1. We provide results for up to n=2n=2, thus extending these prior works to more flexible generator functions, and conjecture that the same proof strategies generalize to larger nn. Practically, our theory suggests that, to disentangle concepts, an autoencoder should penalize its latent capacity and the interactions between concepts during decoding. We propose an implementation of these criteria using a flexible Transformer-based VAE, with a novel regularizer on the attention weights of the decoder. On synthetic image datasets consisting of objects, we provide evidence that this model can achieve comparable object disentanglement to existing models that use more explicit object-centric priors.

2940Learning to Explore and Exploit with GNNs for Unsupervised Combinatorial Optimization

[openreview] [pdf]

Abstract Combinatorial optimization (CO) problems are pervasive across various domains, but their NP-hard nature often necessitates problem-specific heuristic algorithms. Recent advancements in deep learning have led to the development of learning-based heuristics, yet these approaches often struggle with limited search capabilities. We introduce Explore-and-Exploit GNN (X2X^2GNN, pronounced x-squared GNN), a novel unsupervised neural framework that combines exploration and exploitation for combinatorial search optimization: i) Exploration - X2X^2GNN generates multiple solutions simultaneously, promoting diversity in the search space; (ii) Exploitation - X2X^2GNN employs neural stochastic iterative refinement, where sampled partial solutions guide the search toward promising regions and help escape local optima. X2X^2GNN employs neural stochastic iterative refinement to exploit partial existing solutions, guiding the search toward promising regions and helping escape local optima. By balancing exploration and exploitation X2X^2GNN achieves superior performance and generalization on several graph CO problems including Max Cut, Max Independent Set, and Max Clique. Notably, for large Max Clique problems, X2X^2GNN consistently generates solutions within 1.2% of optimality, while other state-of-the-art learning-based approaches struggle to reach within 22% of optimal. Moreover, X2X^2GNN consistently generates better solutions than Gurobi on large graphs for all three problems under reasonable time budgets. Furthermore, X2X^2GNN exhibits exceptional generalization capabilities. For the Maximum Independent Set problem, X2X^2GNN outperforms state-of-the-art methods even when trained on smaller or out-of-distribution graphs compared to the test set.

2941From Counseling Transcript to Mind Map: Leveraging LLMs for Effective Summarization in Mental Health Counseling

[openreview] [pdf]

Abstract The increasing number of patients with mental health illness has heightened the cognitive load on therapists, making it challenging for them to provide personalized care that each patient requires. Summarizing counseling sessions can aid mental health practitioners in recalling key details. However, most existing research on summarization focuses primarily on text-based summaries which often require significant cognitive effort to read and interpret. Visual-based summary such as mind maps is proven to help enhance cognitive understanding by giving a quick overview of topics and content. Nevertheless, due to the complex nature of counseling which involves substantial qualitative data, generating visual-based summaries using traditional AI models can be challenging. With the recent advancements in Large Language Models (LLMs), these models have demonstrated the capability to perform tasks based on instructions and generate outputs in various formats. In this study, we develop a web-based summarization tool that serves as a pipeline in performing summarization of counseling transcripts into visual-based mind map summaries using LLMs. We conducted a human evaluation to validate the effectiveness of the generated visual-based summary based on criteria of accuracy, completeness, conciseness and coherence. Our findings show that our web-based summarization tool can effectively extract key points from counseling transcripts and present them in visual-based mind maps, demonstrating its potential in enhancing insights for therapists, ultimately simplifying the process of documenting counseling sessions.

2942Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling

[openreview] [pdf]

Abstract Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. Yet, its reported effects on the learned policy are inconsistent: some studies find it crucial for achieving strong results, while others observe decreased performance. In this paper, we first dissect how action chunking impacts the divergence between a learner and a demonstrator. We find that action chunking allows the learner to better capture the temporal dependencies in demonstrations (e.g., latent strategies) but at the cost of reduced reactivity in stochastic environments (e.g., action noise, object motions). To address this tradeoff, we propose Bidirectional Decoding (BID), a test-time inference algorithm that bridges action chunking with closed-loop operations. BID samples multiple predictions at each time step and searches for the optimal one based on two criteria: (i) backward coherence, which favors samples aligned with previous decisions, (ii) forward contrast, which favors samples close to outputs of a stronger policy and distant from those of a weaker policy. By coupling decisions within and across action chunks, BID promotes strong temporal consistency over multiple steps while maintaining high reactivity to unexpected state changes. Experimental results show that BID boosts the performance of two state-of-the-art robot policies across seven simulation benchmarks and two real-world tasks.

2943Sum-of-Squares Programming for Ma-Trudinger-Wang Regularity of Optimal Transport Maps

[openreview] [pdf]

Abstract For a given ground cost, approximating the Monge optimal transport map that pushes forward a given probability measure onto another has become a staple in several modern machine learning algorithms. The fourth-order Ma-Trudinger-Wang (MTW) tensor associated with this ground cost function provides a notion of curvature in optimal transport. The non-negativity of this tensor plays a crucial role for establishing continuity for the Monge optimal transport map. It is, however, generally difficult to analytically verify this condition for any given ground cost. To expand the class of cost functions for which MTW non-negativity can be verified, we propose a provably correct computational approach which provides certificates of non-negativity for the MTW tensor using Sum-of-Squares (SOS) programming. We further show that our SOS technique can also be used to compute an inner approximation of the region where MTW non-negativity holds. We apply our proposed SOS programming method to several practical ground cost functions to approximate the regions of regularity of their corresponding optimal transport maps.

2944Improved Diffusion-based Generative Model with Better Adversarial Robustness

[openreview] [pdf]

Abstract Diffusion Probabilistic Models (DPMs) have achieved considerable success in generation. However, its training and sampling processes are confronted with the problem of distribution mismatch. During the denoising process, the input data distributions of the model are different during the training and inference stages, which makes the model potentially generate inaccurate data. To obviate this, we conduct an analysis of the training objective of DPM, and theoretically prove that the mismatch can be mitigated by Distributionally Robust Optimization (DRO), which is equivalent to conducting robustness-driven Adversarial Training (AT) on the DPM. Furthermore, for the recently proposed consistency model (CM), which distills the inference process of the DPM, we prove that its training objective similarly faces the mismatch issue. Fortunately, such a problem is also mitigated by AT. Thereafter, we propose to conduct efficient AT on both DPM and CM. Finally, a series of empirical studies verify the effectiveness of AT in diffusion-based models.

2945TimeBridge: Non-Stationarity Matters for Long-term Time Series Forecasting

[openreview] [pdf]

Abstract Non-stationarity poses significant challenges for multivariate time series forecasting due to the inherent short-term fluctuations and long-term trends that can lead to spurious regressions or obscure essential long-term relationships. Most existing methods either eliminate or retain non-stationarity without adequately addressing its distinct impacts on short-term and long-term modeling. Eliminating non-stationarity is essential for avoiding spurious regressions and capturing local dependencies in short-term modeling, while preserving it is crucial for revealing long-term cointegration across variates. In this paper, we propose TimeBridge, a novel framework designed to bridge the gap between non-stationarity and dependency modeling in long-term time series forecasting. By segmenting input series into smaller patches, TimeBridge applies Integrated Attention to mitigate short-term non-stationarity and capture stable dependencies within each variate, while Cointegrated Attention preserves non-stationarity to model long-term cointegration across variates. Extensive experiments show that TimeBridge consistently achieves state-of-the-art performance in both short-term and long-term forecasting. Additionally, TimeBridge demonstrates exceptional performance in financial forecasting on the CSI 500 and S&P 500 indices, further validating its robustness and effectiveness. The code is available in the supplementary material.

2946Scaling Long Context Training Data by Long-Distance Referrals

[openreview] [pdf]

Abstract Training large language models for long context understanding faces the challenge of data shortage. Previous data engineering approaches mechanically concatenate short documents, which may create many pseudo long documents but raise concerns about data quality. In this paper, we study the core attribute of high quality data for long context training, and provide a data pipeline, LongPack, to scale such data. We found that long distance referrals, which occur in natural long documents, are crucial for long-context training. However, simply concatenating short documents does not reliably generate these relations. We further show that the density of long-distance referrals, which is higher in longer documents, has a key role in training efficiency, making previous upsampling methods suboptimal. To enrich long documents, we propose LongPack, a data pipeline that constructs long documents by packing shorter ones based on referral relationships. Specifically, for web pages, which are the primary source for language model training, we found hyper-link a native signal for such a relation. By packing web pages through their hyper-link connection, we can create longer, high-quality documents. Our experiments demonstrate that LongPackis highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset using just 0.5% root documents. Furthermore, the constructed documents have a ‘near-natural’ quality as innate long documents for long context training, reaching a 32.7% higher score than previous state-of-the-art methods.

2947Adversarial Score Identity Distillation: Rapidly Surpassing the Teacher in One Step

[openreview] [pdf]

Abstract Score identity Distillation (SiD) is a data-free method that has achieved state-of-the-art image generation performance by leveraging only a pretrained diffusion model, without the need for any training data. In this paper, we introduce SiDA (SiD with Adversarial Loss), which not only enhances generation quality but also improves distillation efficiency by incorporating real images and adversarial loss. SiDA utilizes the encoder from the generator’s score network as a discriminator, enhancing its ability to differentiate between real images and those generated by SiD. The adversarial loss is batch-normalized within each GPU and then combined with the original SiD loss, effectively integrating the average “fakeness” per GPU batch into the pixel-based SiD loss. This allows SiDA to distill a single-step generator either from scratch or by fine-tuning an existing one. SiDA converges significantly faster than its predecessor when trained from scratch and quickly surpasses the original model’s performance after an initial warmup period when fine-tuning from a SiD checkpoint. This method has established new benchmarks for low FID scores when distilling EDM diffusion models pretrained on CIFAR-10 (32x32) and ImageNet (64x64), achieving FID scores of1.499on CIFAR-10 unconditional,1.396on CIFAR-10 conditional, and1.110on ImageNet 64x64.

2948MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

[openreview] [pdf]

Abstract Reinforcement learning from human feedback (RLHF) has demonstrated effectiveness in aligning large language models (LLMs) with human preferences. However, token-level RLHF suffers from the credit assignment problem over long sequences, where delayed rewards make it challenging for the model to discern which actions contributed to successful outcomes. This hinders learning efficiency and slows convergence. In this paper, we propose MA-RLHF, a simple yet effective RLHF framework that incorporates macro actions-- sequences of tokens or higher-level language constructs--into the learning process. By operating at this higher level of abstraction, our approach reduces the temporal distance between actions and rewards, facilitating faster and more accurate credit assignment. This results in more stable policy gradient estimates and enhances learning efficiency within each episode, all without increasing computational complexity during training or inference. We validate our approach through extensive experiments across various model sizes and tasks, including text summarization, dialogue generation, question answering, and program synthesis. Our method achieves substantial performance improvements over standard RLHF, with performance gains of up to 30% in text summarization and code generation, 18% in dialogue, and 8% in question answering tasks. Notably, our approach reaches parity with vanilla RLHF 1.7x to 2x faster in terms of training time and continues to outperform it with further training. We will release our code, data, and models to inspire future research.

2949Beyond Simple Sum of Delayed Rewards: Non-Markovian Reward Modeling for Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement Learning (RL) empowers agents to acquire various skills by learning from reward signals. Unfortunately, designing high-quality instance-level rewards often demands significant effort. An emerging alternative, RL with delayed reward, focuses on learning from rewards presented periodically, which can be obtained from human evaluators assessing the agent’s performance over sequences of behaviors. However, traditional methods in this domain assume the existence of underlying Markovian rewards and that the observed delayed reward is simply the sum of instance-level rewards, both of which often do not align well with real-world scenarios. In this paper, we introduce the problem of RL from Composite Delayed Reward (RLCoDe), which generalizes traditional RL from delayed rewards by eliminating the strong assumption. We suggest that the delayed reward may arise from a more complex structure reflecting the overall contribution of the sequence. To address this problem, we present a framework for modeling composite delayed rewards, using a weighted sum of non-Markovian components to capture the different contributions of individual steps. Building on this framework, we propose Composite Delayed Reward Transformer (CoDeTr), which incorporates a specialized in-sequence attention mechanism to effectively model these contributions. We conduct experiments on challenging locomotion tasks where the agent receives delayed rewards computed from composite functions of observable step rewards. The experimental results indicate that CoDeTr consistently outperforms baseline methods across evaluated metrics. Additionally, we demonstrate that it effectively identifies the most significant time steps within the sequence and accurately predicts rewards that closely reflect the environment feedback.

2950Look Twice Before You Answer: Memory-Space Visual Retracing for Hallucination Mitigation in Multimodal Large Language Models

[openreview] [pdf]

Abstract Despite their impressive capabilities, Multimodal Large Language Models (MLLMs) are susceptible to hallucinations, especially assertively fabricating content not present in the visual inputs. To address the aforementioned challenge, we follow a common cognitive process - \textit{when one’s initial memory of critical on-sight details fades, replenishing visual memory is essential to seek a factual and accurate answer.} Therefore, we introduce Mem\textbf{Mem}ory-space V\textbf{V}isual R\textbf{R}etracing (MemVR\textbf{MemVR}), a novel hallucination mitigation paradigm that without the need for external knowledge retrieval or additional fine-tuning. In particular, we treat visual prompts as supplementary evidence to be reinjected into MLLMs via Feed Forward Network (FFN) as “key-value memory”, when the model is uncertain or even amnesic about question-relevant visual memories. Comprehensive experimental evaluations demonstrate that \modelname significantly mitigates hallucination issues across various MLLMs and excels in general benchmarks without incurring added time overhead, thus emphasizing its potential for widespread applicability.

2951Random Graph Asymptotics for Treatment Effect Estimation in Two-Sided Markets

[openreview] [pdf]

Abstract In two-sided markets, the accurate estimation of treatment effects is crucial yet challenging due to the inherent interference between market participants, which violates the Stable Unit Treatment Value Assumption (SUTVA). This paper introduces a novel framework that leverages random graph asymptotics to model and estimate treatment effects under network interference in two-sided markets. By extending the application of exposure graph models and proposing a new estimation process, we derive estimators with robust asymptotic properties, suitable for large-scale market scenarios. Our theoretical findings are supported by extensive numerical simulations, demonstrating the effectiveness and practical applicability of our approach in estimating direct and indirect causal effects within these complex market structures.

2952Transformers Use Causal World Models in Maze-Solving Tasks

[openreview] [pdf]

Abstract Recent studies in interpretability have explored the inner workings of transformer models trained on tasks across various domains, often discovering that these networks naturally develop surprisingly structured representations. When such representations comprehensively reflect the task domain’s structure, they are commonly referred to as “World Models” (WMs). In this work, we discover such WMs in transformers trained on maze tasks. By analyzing the causal role of these WMs, we hope to contribute to the development of more interpretable and controllable AI systems. In particular, by employing Sparse Autoencoders (SAEs) and analysing attention patterns, we examine the construction of WMs and demonstrate consistency between the circuit analysis and the SAE feature-based analysis. We intervene upon the isolated features to confirm their causal role and, in doing so, find asymmetries between certain types of interventions. Surprisingly, we find that models are able to reason with respect to more active features than they would ever have observed during training, even if attempting to specify these in the input token sequence would lead the model to fail. We also observe that varying positional encodings can alter how WMs are encoded in a model’s residual stream.

2953Towards Robust Alignment of Language Models: Distributionally Robustifying Direct Preference Optimization

[openreview] [pdf]

Abstract This study addresses the challenge of noise in training datasets for Direct Preference Optimization (DPO), a method for aligning Large Language Models (LLMs) with human preferences. We categorize noise into pointwise noise, which includes low-quality data points, and pairwise noise, which encompasses erroneous data pair associations that affect preference rankings. Utilizing Distributionally Robust Optimization (DRO), we enhance DPO’s resilience to these types of noise. Our theoretical insights reveal that DPO inherently embeds DRO principles, conferring robustness to pointwise noise, with the regularization coefficient β\beta playing a critical role in its noise resistance. Extending this framework, we introduce Distributionally Robustifying DPO (Dr. DPO), which integrates pairwise robustness by optimizing against worst-case pairwise scenarios. The novel hyperparameter β\beta' in Dr. DPO allows for fine-tuned control over data pair reliability, providing a strategic balance between exploration and exploitation in noisy training environments. Empirical evaluations demonstrate that Dr. DPO substantially improves the quality of generated text and response accuracy in preference datasets, showcasing enhanced performance in both noisy and noise-free settings.

2954Learning to Customize Text-to-Image Diffusion In Diverse Context

[openreview] [pdf]

Abstract Most text-to-image customization techniques fine-tune models on a small set of \emph{personal concept} images captured in minimal contexts. This often results in the model becoming overfitted to these training images and unable to generalize to new contexts in future text prompts. Existing customization methods are built on the success of effectively representing personal concepts as textual embeddings. Thus, in this work, we resort to diversifying the context of these personal concepts \emph{solely} within the textual space by simply creating a contextually rich set of text prompts, together with a widely used self-supervised learning objective. Surprisingly, this straightforward and cost-effective method significantly improves semantic alignment in the textual space, and this effect further extends to the image space, resulting in higher prompt fidelity for generated images. Additionally, our approach does not require any architectural modifications, making it highly compatible with existing text-to-image customization methods. We demonstrate the broad applicability of our approach by combining it with four different baseline methods, achieving notable CLIP score improvements.

2955In-context learning and Occam’s razor

[openreview] [pdf]

Abstract The goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees for generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best—a principle called Occam’s razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam’s razor and in-context learning—an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved.

2956Adapting to both finite-sample and asymptotic regimes

[openreview] [pdf]

Abstract This paper introduces an empirical risk minimization based approach with concomitant scaling, which eliminates the need for tuning a robustification parameter in the presence of heavy-tailed data. This method leverages a new loss function that concurrently optimizes both the mean and robustification parameters. Through this dual-parameter optimization, the robustification parameter automatically adjusts to the unknown data variance, rendering the method self-tuning. Our approach surpasses previous models in both computational and asymptotic efficiency. Notably, it avoids the reliance on cross-validation or Lepski’s method for tuning the robustification parameter, and the variance of our estimator attains the Cram’{e}r-Rao lower bound, demonstrating optimal efficiency. In essence, our approach demonstrates optimal performance across both finite-sample and large-sample scenarios, a feature we describe as \textit{algorithmic adaptivity to both asymptotic and finite-sample regimes}. Numerical studies lend strong support to our methodology.

2957Eidetic Learning: an Efficient and Provable Solution to Catastrophic Forgetting

[openreview] [pdf]

Abstract Catastrophic forgetting -- the phenomenon of a neural network learning a task and losing the ability to perform it after being trained on some other task -- is a long-standing problem for neural networks \citep{mccloskey1989catastrophic}. We introduce Eidetic Learning and prove that it guarantees networks do not forget. When training an EideticNet, accuracy on previous tasks is preserved because the neurons important for them are fixed and, most importantly, the hidden states that those neurons operate on are guaranteed to be unchanged by \textit{any} subsequent tasks for \textit{any} input sample. EideticNets are easy to implement, their complexity in time and space is linear in the number of parameters, and their guarantees hold for normalization layers during pre-training and fine-tuning. We show empirically with a variety of network architectures and sets of tasks that EideticNets are immune to forgetting. While the practical benefits of EideticNets are substantial, we believe they can be of benefit to practitioners and theorists alike. They have the potential to open new directions of exploration for lifelong and continual learning. We will release the code repository containing the EideticNet PyTorch framework upon publication.

2958Tight Clusters Make Specialized Experts

[openreview] [pdf]

Abstract Sparse Mixture-of-Experts (MoE) architectures have emerged as a promising approach to decoupling model capacity from computational cost. At the core of the MoE model is the router, which learns the underlying clustering structure of the input distribution in order to send input tokens to appropriate experts. However, latent clusters may be unidentifiable in high dimension, which causes slow convergence, susceptibility to data contamination, and overall degraded representations as the router is unable to perform appropriate token-expert matching. We examine the router through the lens of clustering optimization and derive optimal feature weights that maximally identify the latent clusters. We use these weights to compute the token-expert routing assignments in an adaptively transformed space that promotes well-separated clusters, which helps identify the best-matched expert for each token. In particular, for each expert cluster, we compute a set of weights that scales features according to whether that expert clusters tightly along that feature. We term this novel router the Adaptive Clustering (AC) router. Our AC router enables the MoE model to obtain three connected benefits: 1) faster convergence, 2) better robustness to data corruption, and 3) overall performance improvement, as experts are specialized in semantically distinct regions of the input space. We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for large-scale language modeling and object recognition tasks in both clean and corrupted settings.

2959Approximating Full Conformal Prediction for Neural Network Regression with Gauss-Newton Influence

[openreview] [pdf]

Abstract Uncertainty quantification is an important prerequisite for the deployment of deep learning models in safety-critical areas. Yet, this hinges on the uncertainty estimates being useful to the extent the predictive prediction intervals are well-calibrated and sharp. In the absence of inherent uncertainty estimates (e.g. pretrained models), popular approaches that operate post-hoc include Laplace’s method and split conformal prediction (split-CP). However, Laplace’s method can be miscalibrated when the model is misspecified and split-CP requires sample splitting, and thus comes at the expense of statistical efficiency. In this work, we construct prediction intervals for neural network regressors post-hoc without held-out data. This is achieved by approximating the full conformal prediction method (full-CP). Whilst full-CP nominally requires retraining the model for every test point and candidate label, we propose to train just once and locally perturb model parameters using Gauss-Newton influence to approximate the effect of retraining. Coupled with linearization of the network, we express the absolute residual nonconformity score as a piecewise linear function of the candidate label allowing for an efficient procedure that avoids the exhaustive search over the output space. On standard regression benchmarks, we show the resulting prediction intervals are locally-adaptive and often tighter than those of split-CP.

2960Local Steps Speed Up Local GD for Heterogeneous Distributed Logistic Regression

[openreview] [pdf]

Abstract We analyze two variants of Local Gradient Descent applied to distributed logistic regression with heterogeneous, separable data and show convergence at the rate O(1/KR)O(1/KR) for KK local steps and sufficiently large RR communication rounds. In contrast, all existing convergence guarantees for Local GD applied to any problem are at least Ω(1/R)\Omega(1/R), meaning they fail to show the benefit of local updates. The key to our improved guarantee is showing progress on the logistic regression objective when using a large stepsize η1/K\eta \gg 1/K, whereas prior analysis depends on η1/K\eta \leq 1/K.

2961Small-to-Large Generalization: Training Data Influences Models Consistently Across Scale

[openreview] [pdf]

Abstract Choice of training data distribution greatly affects model behavior. Yet, in large-scale settings, precisely characterizinghowchanges in training data influence predictions is often difficult due to model training costs. Current practice is to instead extrapolate from scaled down, inexpensive-to-train proxy models. However, changes in data do not influence smaller and larger models identically. Therefore, understanding how choice of data affects large-scale models raises the question: how does training data influence model behavior across compute scale? We find that the answer is nuanced. Small- and large-scale language model predictions generallydohighly correlate across choice of training data---often, even when small-model predictions are at the level of random guessing. However, therealsoexist downstream datasets where these predictions correlate much less. Equipped with these findings, we characterize how proxy scale affects performance in two downstream proxy model applications: data attribution and dataset selection.

2962AnyPrefer: An Automatic Framework for Preference Data Synthesis

[openreview] [pdf]

Abstract High-quality preference data is essential for aligning foundation models with human values through preference learning. However, manual annotation of such data is often time-consuming and costly. Recent methods adopt a self-rewarding approach, where the target model generates and annotates its own preference data, but this can lead to inaccuracies due to the reward model sharing weights with the target model, amplifying inherent biases. To address these issues, we propose Anyprefer, a framework designed to synthesize high-quality preference data for the target model. Anyprefer frames the data synthesis process as a cooperative two-player Markov Game, where the target model and a judge model collaborate. Here, a series of external tools are introduced to assist the judge model in accurately rewarding the target model’s responses, mitigating biases in the process. We also introduce a feedback mechanism to optimize prompts for both models, enhancing collaboration and improving data quality. The synthesized data is compiled into a new preference dataset, Anyprefer-V1, consisting of 58K high-quality preference pairs. Extensive experiments show that Anyprefer significantly improves model alignment across four applications, covering 21 datasets, achieving average improvements of 18.55 in five natural language generation datasets, 3.66 in nine vision-language understanding datasets, 30.05 in three medical image analysis datasets, and 14.50 in four visuo-motor control tasks.

2963TIEM: Enhancing Explanation of Video Prediction via Temporal Dynamics-Focused Dual Perturbation

[openreview] [pdf]

Abstract Explaining video data predictions is challenging due to the complex spatio-temporal information in videos. In particular, the existing perturbation-based methods for video interpretation often fail to consider different temporal contexts, making them ineffective for dynamic videos where the important regions change rapidly or appear ephemerally across frames. To address this, we propose a novel video interpretation method, time importance score-aware extremal perturbation masks (TIEM), that enhances explainability by focusing on temporal dynamics in videos. TIEM exploits a dual perturbation process: first, it evaluates temporal importance across frames via temporal perturbation and then generates spatio-temporal extremal perturbation masks using the temporal importance explicitly. Our experimental results demonstrate that TIEM resolves the key challenges of the existing methods, providing more precise explanations across the time domain in synthetic white-box models and black-box models for real-world videos.

2964Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities

[openreview] [pdf]

Abstract As AI systems, particularly Large Language Models (LLMs), rapidly advance towards surpassing human cognitive capabilities, ensuring their alignment with human values and safety standards emerges as a formidable challenge. This study addresses a crucial aspect of superalignment by investigating the decision-making capabilities and adversarial vulnerabilities of LLMs, focusing on GPT-3.5, GPT-4 and Gemini-1.5, within structured experimental settings that mimic complex human interactions. We applied an adversarial framework to two decision-making tasks—the two-armed bandit task and the Multi-Round Trust Task (MRTT)—to test the vulnerabilities of LLMs under adversarial conditions. In the bandit task, the adversary aimed to induce the LLM’s preference for the predefined target action with the constraint that each action must be assigned an equal number of rewards. For the MRTT, we trained two types of adversaries: one aimed at maximizing its own earnings (MAX) and the other focused on maximizing fairness (FAIR). GPT-4 and Gemini-1.5 showed a bias toward exploitation in the bandit task, prioritizing early-established strategies, which made them predictable and vulnerable to manipulation. GPT-3.5, while more exploratory in the bandit task, demonstrated more risk-seeking behavior in the MRTT, leading to increased vulnerability in interacting with the MAX adversary. Notably, Gemini-1.5 excelled in the MRTT, adapting effectively to adversaries and outperforming both GPT-3.5 and GPT-4 by balancing risk and cooperation with its adversaries. By presenting a specific set of tasks that characterizes decision-making vulnerabilities in LLM-based agents, we provide a concrete methodology for evaluating their readiness for real-world deployment. The adversarial framework proved a powerful tool for stress-testing LLMs, revealing the importance of ensuring that AI models are both robust against adversarial manipulation and responsive to fairness cues in complex, dynamic environments.

2965On the Power of Learning-Augmented Search Trees

[openreview] [pdf]

Abstract We study learning-augmented binary search trees (BSTs) via Treaps with carefully designed priorities. The result is a simple search tree in which the depth of each item xx is determined by its predicted weight wxw_x. Specifically, each item xx is assigned a composite priority of loglog(1/wx)+U(0,1)-\lfloor\log\log(1/w_x)\rfloor + U(0, 1) where U(0,1)U(0, 1) is the uniform random variable. By choosing wxw_x as the relative frequency of xx, the resulting search trees achieve static optimality. This approach generalizes the recent learning-augmented BSTs [Lin-Luo-Woodruff ICML`22], which only work for Zipfian distributions, by extending them to arbitrary input distributions. Furthermore, we demonstrate that our method can be generalized to a B-Tree data structure using the B-Treap approach [Golovin ICALP’09]. Our search trees are also capable of leveraging localities in the access sequence through online self-reorganization, thereby achieving the working-set property. Additionally, they are robust to prediction errors and support dynamic operations, such as insertions, deletions, and prediction updates. We complement our analysis with an empirical study, demonstrating that our method outperforms prior work and classic data structures.

2966A Differentiable Rank-Based Objective for Better Feature Learning

[openreview] [pdf]

Abstract In this paper, we leverage existing statistical methods to better understand feature learning from data. We tackle this by modifying the model-free variable selection method, Feature Ordering by Conditional Independence (FOCI), which is introduced in Azadkia & Chatterjee (2021). While FOCI is based on a non-parametric coefficient of conditional dependence, we introduce its parametric, differentiable approximation. With this approximate coefficient of correlation, we present a new algorithm called difFOCI, which is applicable to a wider range of machine learning problems thanks to its differentiable nature and learnable parameters. We present difFOCI in three contexts: (1) as a variable selection method with baseline comparisons to FOCI, (2) as a trainable model parametrized with a neural network, and (3) as a generic, widely applicable neural network regularizer, one that improves feature learning with better management of spurious correlations. We evaluate difFOCI on increasingly complex problems ranging from basic variable selection in toy examples to saliency map comparisons in convolutional networks. We then show how difFOCI can be incorporated in the context of fairness to facilitate classifications without relying on sensitive data.

2967CodeUpdateArena: Benchmarking Knowledge Editing on API Updates

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly being used to synthesize and reason about source code. The libraries and API functions they invoke are continuously evolving, with functionality being added or changing. Yet, no prior work has studied how an LLM’s knowledge about code API functions can be updated. To fill this gap, we presentCodeUpdateArena, a benchmark for knowledge editing in the code domain. An instance in our benchmark consists of a synthetic API function update paired with a program synthesis example that uses the updated functionality; our goal is to update an LLM to be able to solve this program synthesis example without providing documentation of the update at inference time. Compared to knowledge editing for facts, success here is more challenging: a code LLM must reason about the semantics of the modified function rather than just reproduce its syntax. Our dataset is constructed by first prompting GPT-4 to generate atomic and executable function updates. Then, for each update, we generate program synthesis examples whose code solutions are prone to use the update. Our benchmark covers updates of various types to 54 functions from seven diverse Python packages, with a total of 670 program synthesis examples. Our experiments show that fine-tuning open-source code LLMs (i.e., DeepSeek, CodeLlama) on documentation of a new update does not allow them to incorporate changes for problem-solving. However, prepending the same information does help, establishing that the information is present, and careful fine-tuning on examples demonstrating the update shows improvement, paving the way for better knowledge editing techniques for code.

2968The Optimization Landscape of SGD Across the Feature Learning Strength

[openreview] [pdf]

Abstract We consider neural networks (NNs) where the final layer is down-scaled by a fixed hyperparameter γ\gamma. Recent work has identified γ\gamma as controlling the strength of feature learning. As γ\gamma increases, network evolution changes from “lazy” kernel dynamics to “rich” feature-learning dynamics, with a host of associated benefits including improved performance on common tasks. In this work, we conduct a thorough empirical investigation of the effect of scaling γ\gamma across a variety of models and datasets in the online training setting. We first examine the interaction of γ\gamma with the learning rate η\eta, identifying several scaling regimes in the γ\gamma-η\eta plane which we explain theoretically using a simple model. We find that the optimal learning rate η\eta^* scales non-trivially with γ\gamma. In particular, ηγ2\eta^* \propto \gamma^2 when γ1\gamma \ll 1 and ηγ2/L\eta^* \propto \gamma^{2/L} when γ1\gamma \gg 1 for a feed-forward network of depth LL. Using this optimal learning rate scaling, we proceed with an empirical study of the under-explored ``ultra-rich’’ γ1\gamma \gg 1 regime. We find that networks in this regime display characteristic loss curves, starting with a long plateau followed by a drop-off, sometimes followed by one or more additional staircase steps. We find networks of different large γ\gamma values optimize along similar trajectories up to a reparameterization of time. We further find that optimal online performance is often found at large γ\gamma and could be missed if this hyperparameter is not tuned. Our findings indicate that analytical study of the large-γ\gamma limit may yield useful insights into the dynamics of representation learning in performant models.

2969Capturing and Mitigating Gradient Aggregation Errors for Fault-Tolerant Distributed Training

[openreview] [pdf]

Abstract Capturing and recovering from hardware failures is important in fault-tolerant distributed training to guarantee system efficiency. However, some hardware-related silent data corruption errors during gradient aggregation like bit corruptions or communication noise, are difficult to capture and address, leading to slow or failed convergence. To understand and mitigate these errors, we first mathematically formulate and generalize them as gradient inconsistency. Then, we theoretically analyze how it leads to model divergence accumulated during training and the failed convergence. Based on the analytical study, we design PAFT, a fault-tolerant distributed training system with dynamic and asynchronous parameter synchronization. PAFT includes two parts: (1) PAFT-Sync, which mitigates model divergence by periodically synchronizing parameters, and (2) PAFT-Dyn, which minimizes synchronization overhead through dynamic training overlap and synchronization frequency scheduling based on profiled error degrees. Together, they ensure efficient model convergence at scale. The fault-tolerant synchronization in PAFT is optimized to support commonly used optimizers, e.g., Stochastic Gradient Descent (SGD), SGD momentum, and Adam. We implement PAFT on PyTorch Distributed and train ResNet, GPT-2, and LLaMA-2 on 4\sim 32 GPUs. Experimental results show that PAFT efficiently defends against gradient aggregation error degrees while maintaining training performance.

2970Provably Learning Concepts by Comparison

[openreview] [pdf]

Abstract We are born with the ability to learn concepts by comparing diverse observations. This helps us to understand the new world in a compositional manner and facilitates extrapolation, as objects naturally consist of multiple concepts. In this work, we argue that the cognitive mechanism of comparison, fundamental to human learning, is also vital for machines to recover true concepts underlying the data. This offers correctness guarantees for the field of concept learning, which, despite its impressive empirical successes, still lacks general theoretical support. Specifically, we aim to develop a theoretical framework for the identifiability of concepts with multiple classes of observations. We show that with sufficient diversity across classes, hidden concepts can be identified without assuming specific concept types, functional relations, or parametric generative models. Interestingly, even when conditions are not globally satisfied, we can still provide alternative guarantees for as many concepts as possible based on local comparisons, thereby extending the applicability of our theory to more flexible scenarios. Moreover, the hidden structure between classes and concepts can also be identified nonparametrically. We validate our theoretical results in both synthetic and real-world settings.

2971P-SPIKESSM: HARNESSING PROBABILISTIC SPIKING STATE SPACE MODELS FOR LONG-RANGE DEPENDENCY TASKS

[openreview] [pdf]

Abstract Spiking neural networks (SNNs) are posited as a computationally efficient and biologically plausible alternative to conventional neural architectures, with their core computational framework primarily using the leaky integrate-and-fire (LIF) neuron model. However, the limited hidden state representation of LIF neurons, characterized by a scalar membrane potential, and sequential spike generation process, poses challenges for effectively developing scalable spiking models to address long-range dependencies in sequence learning tasks. In this study, we develop a scalable probabilistic spiking learning framework for long-range dependency tasks leveraging the fundamentals of state space models. Unlike LIF neurons that rely on the determinitic Heaviside function for a sequential process of spike generation, we introduce a SpikeSampler layer that samples spikes stochastically based on an SSM-based neuronal model while allowing parallel computations. To address non-differentiability of the spiking operation and enable effective training, we also propose a surrogate function tailored for the stochastic nature of the SpikeSampler layer. To enhance inter-neuron communication, we introduce the SpikeMixer block, which integrates spikes from neuron populations in each layer. This is followed by a ClampFuse layer, incorporating a residual connection to capture complex dependencies, enabling scalability of the model. Our models attain state-of-the-art performance among SNN models across diverse long-range dependency tasks, encompassing the Long Range Arena benchmark, permuted sequential MNIST, and the Speech Command dataset and demonstrate sparse spiking pattern highlighting its computational efficiency.

2972Maximum Total Correlation Reinforcement Learning

[openreview] [pdf]

Abstract Simplicity is a powerful inductive bias. In reinforcement learning, regularization is used for simpler policies, data augmentation for simpler representations, and sparse reward functions for simpler objectives, all that, with the underlying motivation to increase generalizability and robustness by focusing on the essentials. Supplementary to these techniques, we investigate how to promote simple behavior throughout the duration of the episode. To that end, we introduce a modification of the reinforcement learning problem, that additionally maximizes the total correlation within the induced trajectories. We propose a practical algorithm that optimizes all models, including policy and state representation, based on a lower bound approximation. In simulated robot locomotion environments, our method naturally generates policies that induce periodic and compressible trajectories, and that exhibit superior robustness to noise and changes in dynamics compared to baseline methods, while also improving performance in the original tasks.

2973Sailing in high-dimensional spaces: Low-dimensional embeddings through angle preservation

[openreview] [pdf]

Abstract Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in science and engineering. They allow us to quickly understand the main properties of the data, identify outliers and processing errors, and inform the next steps of data analysis. As such, LDEs have to be faithful to the original high-dimensional data, i.e., they should represent the relationships that are encoded in the data, both at a local as well as global scale. The current generation of LDE approaches focus on reconstructing local distances between pair of samples correctly, often outperforming traditional approaches aiming at all distances. For these approaches, global relationships are, however, usually strongly distorted, often argued to be an inherent trade-off between local and global structure learning for embeddings. We suggest a new perspective on LDE learning, reconstructing angles between data points. We show that our approach, MERCAT, yields good reconstruction across a diverse set of experiments and metrics, and preserve structures well across all scales. Compared to existing work, our approach also has a simple formulation, facilitating future theoretical analysis and algorithmic improvements.

2974Causal Discovery via Bayesian Optimization

[openreview] [pdf]

Abstract Existing score-based methods for directed acyclic graph (DAG) learning from observational data struggle to recover the causal graph accurately and sample-efficiently. To overcome this, in this study, we propose DrBO (DAG recovery via Bayesian Optimization)—a novel DAG learning framework leveraging Bayesian optimization (BO) to find high-scoring DAGs. We show that, by sophisticatedly choosing the promising DAGs to explore, we can find higher-scoring ones much more efficiently. To address the scalability issues of conventional BO in DAG learning, we replace Gaussian Processes commonly employed in BO with dropout neural networks, trained in a continual manner, which allows for (i) flexibly modeling the DAG scores without overfitting, (ii) incorporation of uncertainty into the estimated scores, and (iii) scaling with the number of evaluations. As a result, DrBO is computationally efficient and can find the accurate DAG in fewer trials and less time than existing state-of-the-art methods. This is demonstrated through an extensive set of empirical evaluations on many challenging settings with both synthetic and real data.

2975On Evaluating the Durability of Safeguards for Open-Weight LLMs

[openreview] [pdf]

Abstract Many stakeholders---from model developers to policymakers---seek to minimize the risks of large language models (LLMs). Key to this goal is whether technical safeguards can impede the misuse of LLMs, even when models are customizable via fine-tuning or when model weights are openly available. Several recent studies have proposed methods to produce durable LLM safeguards for open-weight LLMs that can withstand adversarial modifications of the model’s weights via fine-tuning. This holds the promise of raising adversaries’ costs even under strong threat models where adversaries can directly fine-tune parameters. However, we caution against over-reliance on such methods in their current state. Through several case studies, we demonstrate that even the evaluation of these defenses is exceedingly difficult and can easily mislead audiences into thinking that safeguards are more durable than they really are. We draw lessons from the failure modes that we identify and suggest that future research carefully cabin claims to more constrained, well-defined, and rigorously examined threat models, which can provide useful and candid assessments to stakeholders.

2976Enforcing Latent Euclidean Geometry in VAEs for Statistical Manifold Interpolation

[openreview] [pdf]

Abstract Latent linear interpolations are a powerful tool for navigating the representation space of deep generative models. This aspect is particularly relevant in applied settings, where meaningful latent traversals can be learnt to represent the evolution of a system’s trajectory and mapped back to the often complex and high-dimensional data space. However, when data lies on a manifold with complex geometry, linear interpolations of the representation space do not directly correspond to geodesic paths along the manifold unless enforced. An example of such a setting is scRNA-seq, where high-dimensional and discrete cellular data is assumed to lie on a negative binomial statistical manifold modelled by the decoder of a variational autoencoder. We introduce FlatVI, a novel training framework enforcing Euclidean geometry in the latent space of discrete-likelihood variational autoencoders modelling count data. In our regularisation setting, straight lines in the latent domain correspond to geodesic interpolations in the decoded space, improving the combination of our model with methods assuming Euclidean latent geometry. Results on simulated data empirically support our claims, while experiments on temporally resolved biological datasets show improvements in the reconstruction of cellular trajectories and the learning of biologically meaningful velocity fields.

2977Estimating the Probabilities of Rare Outputs in Language Models

[openreview] [pdf]

Abstract We consider the problem oflow probability estimation: given a machine learning model and a formally-specified input distribution, how can we estimate the probability of a binary property of the model’s output, even when that probability is too small to estimate by random sampling? This problem is motivated by the need to improve worst-case performance, which distribution shift can make much more likely. We study low probability estimation in the context of argmax sampling from small transformer language models. We compare two types of methods: importance sampling, which involves searching for inputs giving rise to the rare output, and activation extrapolation, which involves extrapolating a probability distribution fit to the model’s logits. We find that importance sampling outperforms activation extrapolation, but both outperform naive sampling. Finally, we explain how minimizing the probability estimate of an undesirable behavior generalizes adversarial training, and argue that new methods for low probability estimation are needed to provide stronger guarantees about worst-case performance.

2978Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models

[openreview] [pdf]

Abstract One core capability of large language models~(LLMs) is to follow natural language instructions. However, the issue of automatically constructing high-quality training data to enhance the complex instruction-following abilities of LLMs without manual annotation remains unresolved. In this paper, we introduce AutoIF, the first scalable and reliable method for automatically generating instruction-following training data. AutoIF transforms the validation of instruction-following data quality into code verification, requiring LLMs to generate instructions, the corresponding code to verify the correctness of the instruction responses, and unit test samples to cross-validate the code’s correctness. Then, execution feedback-based rejection sampling can generate data for Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF) training. AutoIF achieves significant improvements across three training algorithms, SFT, Offline DPO, and Online DPO, when applied to the advanced open-source LLMs, Qwen2 and LLaMA3, in self-alignment and strong-to-weak distillation settings. Using two widely-used and three challenging general instruction-following benchmarks, we demonstrate that AutoIF significantly improves LLM performance across a wide range of natural instruction constraints. Notably, AutoIF is the first to surpass 90% accuracy in IFEval’s loose instruction accuracy, without compromising general, math and coding capabilities. Further analysis of quality, scaling, combination, and data efficiency highlights AutoIF’s strong generalization and alignment potential.

2979A Reoptimization Framework for Mixed Integer Linear Programming with Dynamic Parameters

[openreview] [pdf]

Abstract Many real-world applications, such as logistics, routing, scheduling, and production planning, involve dynamic systems that require continuous updates to solutions for new Mixed Integer Linear Programming (MILP) problems. These new instances may differ in parameters like objective functions, constraints, and variable bounds. While reoptimization techniques have been explored for Linear Programming (LP) and specific MILP problems, their effectiveness in general MILP is limited. In this work, we propose a two-stage reoptimization framework for efficiently identifying high-quality feasible solutions. Specifically, we first utilize the historical solving process information to predict the high confidence solving space for modified MILPs to contain high-quality solutions. Based on the prediction results, we fix a part of variables to apply the prediction intervals and use the Thompson Sampling algorithm to determine the set of variables to fix and optimize the predicted probability with the updates of solutions from the solver. Extensive experiments across nine reoptimization datasets show that our VP-OR outperforms the state-of-the-art methods, achieving highly accurate solutions under strict time limits and demonstrating faster convergence with smaller primal gaps.

2980Neural Context Flows for Meta-Learning of Dynamical Systems

[openreview] [pdf]

Abstract Neural Ordinary Differential Equations (NODEs) often struggle to adapt to new dynamic behaviors caused by parameter changes in the underlying system, even when these dynamics are similar to previously observed behaviors. This problem becomes more challenging when the changing parameters are unobserved, meaning their value or influence cannot be directly measured when collecting data. To address this issue, we introduce Neural Context Flow (NCF), a robust and interpretable Meta-Learning framework that includes uncertainty estimation. NCF uses kk-th order Taylor expansion to enable contextual self-modulation, allowing context vectors to influence dynamics from other domains while also modulating themselves. After establishing convergence guarantees, we empirically test NCF and compare it to related adaptation methods. Our results show that NCF achieves state-of-the-art Out-of-Distribution performance on 5 out of 6 linear and non-linear benchmark problems. Through extensive experiments, we explore the flexible model architecture of NCF and the encoded representations within the learned context vectors. Our findings highlight the potential implications of NCF for foundational models in the physical sciences, offering a promising approach to improving the adaptability and generalization of NODEs in various scientific applications. Our code is openly available at \url{AnonymousGithubRepo}.

2981BANGS: Game-theoretic Node Selection for Graph Self-Training

[openreview] [pdf]

Abstract Graph self-training is a semi-supervised learning method that iteratively selects a set of unlabeled data to retrain the underlying graph neural network (GNN) model and improve its prediction performance. While selecting highly confident nodes has proven effective for self-training, this pseudo-labeling strategy ignores the combinatorial dependencies between nodes and suffers from a local view of the distribution. To overcome these issues, we propose BANGS, a novel framework that unifies the labeling strategy with conditional mutual information as the objective of node selection. Our approach---grounded in game theory---selects nodes in a combinatorial fashion and provides theoretical guarantees for robustness under noisy objective. More specifically, unlike traditional methods that rank and select nodes independently, BANGS considers nodes as a collective set in the self-training process. Our method demonstrates superior performance and robustness across various datasets, base models, and hyperparameter settings, outperforming existing techniques. The codebase is available onhttps://anonymous.4open.science/r/BANGS-3EA4.

2982KidSat: satellite imagery to map childhood poverty

[openreview] [pdf]

Abstract Satellite imagery has emerged as an important tool to analyze demographic, health, and development indicators. While various deep learning models have been built for these tasks, each is specific to a particular problem, with few standard benchmarks available. We propose a new dataset pairing satellite imagery and high-quality survey data on child poverty to benchmark satellite feature representations. Our dataset consists of 33,608 images, each 10 km × 10 km, from 16 countries in Eastern and Southern Africa in the time period 1997-2022. As defined by UNICEF, multidimensional child poverty comprises six fundamental factors—housing, sanitation, water, nutrition, education, and health (UNICEF, 2021)—which can be calculated from geocoded, face-to-face Demographic and Health Surveys (DHS) Program data. Using our dataset we benchmark multiple feature representations for encoding satellite imagery, from low-level satellite imagery models such as MOSAIKS (Rolf et al., 2021), to deep learning foundation models, which include both generic vision models such as DINOv2 (Oquab et al., 2023) and specific satellite imagery models such as SatMAE (Cong et al., 2022). As part of the benchmark, we test spatial as well as temporal generalization, by testing on unseen locations, and on data beyond the training years. We provide open source code to reproduce and extend our entire pipeline: building the satellite imagery dataset, obtaining ground truth data from DHS, and comparing the various models considered in our work.

2983Noise Balance and Stationary Distribution of Stochastic Gradient Descent

[openreview] [pdf]

Abstract How the stochastic gradient descent (SGD) navigates the loss landscape of a neural network remains poorly understood. This work shows that the minibatch noise of SGD regularizes the solution towards a noise-balanced solution whenever the loss function contains a rescaling symmetry. We prove that when the rescaling symmetry exists, the SGD dynamics is limited to only a low-dimensional subspace and prefers a special set of solutions in an infinitely large degenerate manifold, which offers a partial explanation of the effectiveness of SGD in training neural networks. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width, which is the first analytical expression of the stationary distribution of SGD in a high-dimensional non-quadratic potential. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, loss of ergodicity, memory effects, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, highlighting a fundamental difference between deep and shallow models. Lastly, we discuss the implication of the proposed theory for the practical problem of variational Bayesian inference.

2984Adaptive Caching for Faster Video Generation with Diffusion Transformers

[openreview] [pdf]

Abstract Generating temporally-consistent high-fidelity videos can be computationally expensive, especially over longer temporal spans. More-recent Diffusion Transformers (DiTs)--- despite making significant headway in this context--- have only heightened such challenges as they rely on larger models and heavier attention mechanisms, resulting in slower inference speeds. In this paper, we introduce a training-free\textit{training-free} method to accelerate video DiTs, termed Adaptive Caching (AdaCache\textit{AdaCache}), which is motivated by the fact that “not all videos are created equal”\textit{``not all videos are created equal''}: meaning, some videos require fewer denoising steps to attain a reasonable quality than others. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, maximizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg\textit{MoReg}) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g. up to 4.7x on Open-Sora 720p - 2s video generation) without sacrificing the generation quality, across multiple video DiT baselines. Our code will be made publicly-available.

2985Graffe: Graph Representation Learning Enabled via Diffusion Probabilistic Models

[openreview] [pdf]

Abstract Diffusion probabilistic models (DPMs), widely recognized for their potential to generate high-quality samples, tend to go unnoticed in representation learning. While recent progress has highlighted their potential for capturing visual semantics, adapting DPMs to graph representation learning remains in its infancy. In this paper, we introduceGraffe, a self-supervised diffusion model proposed for graph representation learning. It features a graph encoder that distills a source graph into a compact representation, which, in turn, serves as the condition to guide the denoising process of the diffusion decoder. To evaluate the effectiveness of our model, we first explore the theoretical foundations of applying diffusion models to representation learning, proving that the denoising objective implicitly maximizes the conditional mutual information between data and its representation. Specifically, we prove that the negative logarithm of denoising score matching loss is a tractable lower bound for the conditional mutual information. Empirically, Graffe delivers competitive results under the linear probing setting on node and graph classification, achieving state-of-the-art performance on 9 of the 11 real-world datasets. These findings indicate that powerful generative models, especially diffusion models, serve as an effective tool for graph representation learning.

2986LICORICE: Label-Efficient Concept-Based Interpretable Reinforcement Learning

[openreview] [pdf]

Abstract Recent advances in reinforcement learning (RL) have predominantly leveraged neural network-based policies for decision-making, yet these models often lack interpretability, posing challenges for stakeholder comprehension and trust. Concept bottleneck models offer an interpretable alternative by integrating human-understandable concepts into neural networks. However, a significant limitation in prior work is the assumption that annotations for these concepts are readily available during training, necessitating continuous real-time concept annotation. This reliance either places a significant burden on human annotators or incurs substantial costs in API queries and inference time when employing automated labeling methods, such as vision-language models (VLMs). To overcome this limitation, we introduce a novel training scheme that enables RL algorithms to efficiently learn a concept-based policy by only querying annotators to label a small set of data. Our algorithm, LICORICE, involves three main contributions: interleaving concept learning and RL training, using a concept ensembles to actively select informative data points for labeling, and decorrelating the concept data with a simple strategy. We show how LICORICE reduces human labeling efforts to 500 or fewer concept labels in three environments and 5000 in another complex environment at minimal or no cost to performance. We also explore the use of VLMs as automated concept annotators, finding them effective in some cases but challenging in others. This work significantly reduces the annotation burden for interpretable RL, making it more practical for real-world applications where transparency is crucial.

2987A Theory of Initialisation’s Impact on Specialisation

[openreview] [pdf]

Abstract Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network’s tendency to reuse learned features across tasks. However, this explanation heavily relies on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks, and, finally, assessing the implications on the commonly employed elastic weight consolidation regularisation technique.

2988Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated significant potential in clinical decision support. Yet LLMs still suffer from hallucinations and lack fine-grained contextual medical knowledge, limiting their high-stake healthcare applications such as clinical diagnosis. Traditional retrieval-augmented generation (RAG) methods attempt to address these limitations but frequently retrieve sparse or irrelevant information, undermining prediction accuracy. We introduce KARE, a novel framework that integrates knowledge graph (KG) community-level retrieval with LLM reasoning to enhance healthcare predictions. KARE constructs a comprehensive multi-source KG by integrating biomedical databases, clinical literature, and LLM-generated insights, and organizes it using hierarchical graph community detection and summarization for precise and contextually relevant information retrieval. Our key innovations include: (1) a dense medical knowledge structuring approach enabling accurate retrieval of relevant information; (2) a dynamic knowledge retrieval mechanism that enriches patient contexts with focused, multi-faceted medical insights; and (3) a reasoning-enhanced prediction framework that leverages these enriched contexts to produce both accurate and interpretable clinical predictions. Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions. In addition to its impressive prediction accuracy, our framework leverages the reasoning capabilities of LLMs, enhancing the trustworthiness of clinical predictions.

2989How Do We Select Right LLM for Each Query?

[openreview] [pdf]

Abstract As Large Language Models (LLMs) continue to expand in both variety and cost, selecting the most appropriate model for each query is becoming increasingly crucial. Many existing works treat this as an offline problem, necessitating a data-gathering phase to compile a set of query-answer-reward triplets beforehand. They often struggle to determine the adequate number of triplets needed and are prone to overfitting if the data volume is insufficient. To address these limitations, we propose a new solution, the Multi-Armed Router (MAR), which applies multi-armed bandit theory—a perspective previously unexplored in this domain. Unlike previous works that base decision-making solely on regression techniques using static datasets (i.e., constructed triplets), our method treats this as an online multi-LLM recommendation problem, which better mirrors real-world applications. Moreover, rather than the vanilla multi-armed bandit, our framework employs contextual bandit algorithms to navigate the trade-offs between exploring new models and exploiting proven models, while considering the dependency between the input query and the answer’s reward. Due to the lack of an off-the-shelf dataset in this area, we construct WildArena, a dataset of 4,029 real-world user queries. For each query, there are seven open-ended responses derived from seven leading LLMs, respectively, with an evaluation score for each answer by using the LLM-as-a-Judge framework. We hope that the introduction of the new perspective and the dataset will facilitate the research in per-query LLM routing.

2990Multi-Head RAG: Solving Multi-Aspect Problems with LLMs

[openreview] [pdf]

Abstract Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving motivation is that different attention heads can learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets that we release online, and real-world use cases to demonstrate MRAG’s effectiveness, showing improvements of up to 20% in relevance over standard RAG baselines. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarking tools like RAGAS as well as different classes of data stores.

2991Conditional LoRA Parameter Generation

[openreview] [pdf]

Abstract Generative models have achieved remarkable success in image, video, and text domains. Inspired by this, researchers have explored utilizing generative models to generate neural network parameters. However, these efforts have been limited by the parameter size and the practicality of generating high-performance parameters. In this paper, we propose COND P-DIFF, a novel approach that demonstrates the feasibility of controllable high-performance parameter generation, particularly for LoRA (Low-Rank Adaptation) weights, during the fine-tuning process. Specifically, we employ an autoencoder to extract efficient latent representations for parameters. We then train a conditional latent diffusion model to synthesize high-performing model parameters from random noise based on specific task conditions. Experimental results in both computer vision and natural language processing domains consistently demonstrate that COND P-DIFF can generate high-performance parameters conditioned on the given task. Moreover, we observe that the parameter distribution generated by COND P-DIFF exhibits differences compared to the distribution obtained through normal optimization methods, indicating a certain level of generalization capability. Our work paves the way for further exploration of condition-driven parameter generation, offering a promising direction for task-specific adaptation of neural networks.

2992Fictitious Synthetic Data Can Improve LLM Factuality via Prerequisite Learning

[openreview] [pdf]

Abstract Recent studies have identified one aggravating factor of LLM hallucinations as the knowledge inconsistency between pre-training and fine-tuning, where unfamiliar fine-tuning data mislead the LLM to fabricate plausible but wrong outputs. In this paper, we propose a novel fine-tuning strategy called Prereq-Tune to address this knowledge inconsistency and reduce hallucinations. Fundamentally, Prereq-Tune disentangles the learning of skills and knowledge, so the model learns only the task skills without being impacted by the knowledge inconsistency. To achieve this, Prereq-Tune introduces an additional prerequisite learning stage to learn the necessary knowledge for SFT, allowing subsequent SFT to focus only on task skills. Prereq-Tune can also be combined with fictitious synthetic data to enhance the grounding of LLM outputs to their internal knowledge. Experiments show that Prereq-Tune outperforms existing baselines in improving LLM’s factuality across short QA and long-form generation tasks. It also opens new possibilities for knowledge-controlled generation in LLMs.

2993Efficient-vDiT: Efficient Video Diffusion Transformers With Attention Tile

[openreview] [pdf]

Abstract Despite the promise of synthesizing high-fidelity videos, Diffusion Transformers (DiTs) with 3D full attention suffer from expensive inference due to the complexity of attention computation and numerous sampling steps. For example, the popular Open-Sora-Plan model consumes more than 9 minutes for generating a single video of 29 frames. This paper addresses the inefficiency issue from two aspects: 1) Prune the 3D full attention based on the redundancy within video data; We identify a prevalent tile-style repetitive pattern in the 3D attention maps for video data, and advocate a new family of sparse 3D attention that holds a linear complexity w.r.t. the number of video frames. 2) Shorten the sampling process based on multi-step consistency distillation; We split the entire sampling trajectory into several segments and perform consistency distillation within each one to activate few-step generation capacities. We further devise a three-stage training pipeline to conjoin the low-complexity attention and few-step generation capacities. Notably, with 0.1% pretraining data, we turn the Open-Sora-Plan-1.2 model into an efficient one that is 7.4x −7.8x faster for 29 and 93 frames 720p video generation with less than 1% performance loss in VBench. In addition, we demonstrate that our approach is amenable to distributed inference, achieving an additional 3.91x speedup when running on 4 GPUs with sequence parallelism.

2994Teaching LLMs How To Learn with Contextual Fine-Tuning

[openreview] [pdf]

Abstract Prompting Large Language Models (LLMs), or providing context on the expected model of operation, is an effective way to steer the outputs of such models to satisfy human desiderata after they have been trained. But in rapidly evolving domains, there is often need to fine-tune LLMs to improve either the kind of knowledge in their memory or their abilities to perform open ended reasoning in new domains. When human’s learn new concepts, we often do so by linking the new material that we are studying to concepts we have already learned before. To that end, we ask, “can prompting help us teach LLMs how to learn”. In this work, we study a novel generalization of instruction tuning, called contextual fine-tuning, to fine-tune LLMs. Our method leverages instructional prompts designed to mimic human cognitive strategies in learning and problem-solving to guide the learning process during training, aiming to improve the model’s interpretation and understanding of domain-specific knowledge. We empirically demonstrate that this simple yet effective modification improves the ability of LLMs to be fine-tuned rapidly on new datasets both within the medical and financial domains.

2995REFINE: Inversion-Free Backdoor Defense via Model Reprogramming

[openreview] [pdf]

Abstract Backdoor attacks on deep neural networks (DNNs) have emerged as a significant security threat, allowing adversaries to implant hidden malicious behaviors during the model training phase. Pre-processing-based defense, which is one of the most important defense paradigms, typically focuses on input transformations or backdoor trigger inversion (BTI) to deactivate or eliminate embedded backdoor triggers during the inference process. However, these methods suffer from inherent limitations: transformation-based defenses often struggle to balance the intensity of transformations with preserving the model’s accuracy, while BTI-based defenses require accurate reconstruction of the trigger patterns, which is rarely achievable without prior knowledge. In this paper, we propose REFINE, an inversion-free backdoor defense method based on model reprogramming. REFINE consists of two key components: (1) an input transformation module that disrupts both benign and backdoor patterns, generating new benign features; and (2) an output remapping module that redefines the model’s output domain to guide the input transformations effectively. By further integrating supervised contrastive loss, REFINE enhances the defense capabilities while maintaining model utility. Extensive experiments on various benchmark datasets demonstrate the effectiveness of our REFINE and its resistance to potential adaptive attacks.

2996Fat-to-Thin Policy Optimization: Offline Reinforcement Learning with Sparse Policies

[openreview] [pdf]

Abstract Sparse continuous policies are distributions that can choose some actions at random yet keep strictly zero probability for the other actions, which are radically different from the Gaussian. They have important real-world implications, e.g. in modeling safety-critical tasks like medicine. The combination of offline reinforcement learning and sparse policies provides a novel paradigm that enables learning completely from logged datasets a safety-aware sparse policy. However, sparse policies can cause difficulty with the existing offline algorithms which require evaluating actions that fall outside of the current support. In this paper, we propose the first offline policy optimization algorithm that tackles this challenge: Fat-to-Thin Policy Optimization (FtTPO). Specifically, we maintain a fat (heavy-tailed) proposal policy that effectively learns from the dataset and injects knowledge to a thin (sparse) policy, which is responsible for interacting with the environment. We instantiate FtTPO with the general qq-Gaussian family that encompasses both heavy-tailed and sparse policies and verify that it performs favorably in a safety-critical treatment simulation and the standard MuJoCo suite.

2997Reward-RAG: Enhancing RAG with Reward Driven Supervision

[openreview] [pdf]

Abstract In this paper, we introduce Reward-RAG, a novel approach designed to enhance the Retrieval-Augmented Generation (RAG) model through Reward-Driven Supervision. Unlike previous RAG methodologies, which focus on training language models (LMs) to utilize external knowledge retrieved from external sources, our method adapts retrieval information to specific domains by employing CriticGPT to train a dedicated reward model. This reward model generates synthesized datasets for fine-tuning the RAG encoder, aligning its outputs more closely with human preferences. The versatility of our approach allows it to be effectively applied across various domains through domain-specific fine-tuning. We evaluate Reward-RAG on publicly available benchmarks from multiple domains, comparing it to state-of-the-art methods. Our experimental results demonstrate significant improvements in performance, highlighting the effectiveness of Reward-RAG in improving the relevance and quality of generated responses. These findings underscore the potential of integrating reward models with RAG to achieve superior outcomes in natural language generation tasks.

2998Discrete Latent Plans via Semantic Skill Abstractions

[openreview] [pdf]

Abstract Skill learning from language instructions is a critical challenge in developing intelligent agents that can generalize across diverse tasks and follow complex human instructions. Hierarchical methods address this by decomposing the learning problem into multiple levels, where the high-level and low-level policies are mediated through a latent plan space. Effective modeling and learning of this latent plan space are key to enabling robust and interpretable skill learning. In this paper, we introduce LADS, a hierarchical approach that learns language-conditioned discrete latent plans through semantic skill abstractions. Our method decouples the learning of the latent plan space from the language-conditioned high-level policy to improve training stability. First, we incorporate a trajectory encoder to learn a discrete latent space with the low-level policy, regularized by language instructions. Next, we model the high-level policy as a categorical distribution over these discrete latent plans to capture the multi-modality of the dataset. Through experiments in simulated control environments, we demonstrate that LADS outperforms state-of-the-art methods in both skill learning and compositional generalization.

2999Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs

[openreview] [pdf]

Abstract Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, the data generated by the policy model is more effective than that produced by humans or GPT-4, due to the former’s in-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro.

3000TimeKAN: KAN-based Frequency Decomposition Learning Architecture for Long-term Time Series Forecasting

[openreview] [pdf]

Abstract Real-world time series often have multiple frequency components that are intertwined with each other, making accurate time series forecasting challenging. Decomposing the mixed frequency components into multiple single frequency components is a natural choice. However, the information density of patterns varies across different frequencies, and employing a uniform modeling approach for different frequency components can lead to inaccurate characterization. To address this challenges, inspired by the flexibility of the recent Kolmogorov-Arnold Network (KAN), we propose a KAN-based Frequency Decomposition Learning architecture (TimeKAN) to address the complex forecasting challenges caused by multiple frequency mixtures. Specifically, TimeKAN mainly consists of three components: Cascaded Frequency Decomposition (CFD) blocks, Multi-order KAN Representation Learning (M-KAN) blocks and Frequency Mixing blocks. CFD blocks adopt a bottom-up cascading approach to obtain series representations for each frequency band. Benefiting from the high flexibility of KAN, we design a novel M-KAN block to learn and represent specific temporal patterns within each frequency band. Finally, Frequency Mixing blocks is used to recombine the frequency bands into the original format. Extensive experimental results across multiple real-world time series datasets demonstrate that TimeKAN achieves state-of-the-art performance as an extremely lightweight architecture.

3001Simple, Good, Fast: Self-Supervised World Models Free of Baggage

[openreview] [pdf]

Abstract What are the essential components of world models? How far do we get with world models that are not employing RNNs, transformers, discrete representations, and image reconstructions? This paper introduces SGF, a Simple, Good, and Fast world model that uses self-supervised representation learning, captures short-time dependencies through frame and action stacking, and enhances robustness against model errors through data augmentation. We extensively discuss SGF’s connections to established world models, evaluate the building blocks in ablation studies, and demonstrate good performance through quantitative comparisons on the Atari 100k benchmark. The source code will be made available.

3002Right on Time: Revising Time Series Models by Constraining their Explanations

[openreview] [pdf]

Abstract The reliability of deep time series models is often compromised by their tendency to rely on confounding factors, which may lead to incorrect outputs. Our newly recorded, naturally confounded dataset named P2S from a real mechanical production line emphasizes this. To avoid “Clever-Hans” moments in time series, i.e., to mitigate confounders, we introduce the method Right on Time (RioT). RioT enables, for the first time interactions with model explanations across both the time and frequency domain. Feedback on explanations in both domains is then used to constrain the model, steering it away from the annotated confounding factors. The dual-domain interaction strategy is crucial for effectively addressing confounders in time series datasets. We empirically demonstrate that RioT can effectively guide models away from the wrong reasons in P2S as well as popular time series classification and forecasting datasets.

3003Initialization Matters: Unraveling the Impact of Pre-Training on Federated Learning

[openreview] [pdf]

Abstract Initializing with pre-trained models when learning on downstream tasks is now standard practice in machine learning. Several recent works explore the benefits of pre-trained initialization in a federated learning (FL) setting, where the downstream training is performed at the edge clients with heterogeneous data distribution. These works show that starting from a pre-trained model can substantially reduce the adverse impact of data heterogeneity on the test performance of a model trained in a federated setting, with no changes to the standard FedAvg training algorithm. In this work, we provide a deeper theoretical understanding of this phenomenon. To do so, we study the class of two-layer convolutional neural networks (CNNs) and provide bounds on the training error convergence and test error of such a network trained with FedAvg. We introduce the notion of aligned and misaligned filters at initialization and show that the data heterogeneity only affects learning on misaligned filters. Starting with a pre-trained model typically results in fewer misaligned filters at initialization, thus producing a lower test error even when the model is trained in a federated setting with data heterogeneity. Experiments in synthetic settings and practical FL training on CNNs verify our theoretical findings.

3004Self-Supervised Pseudodata Filtering for Improved Replay with Sub-Optimal Generators

[openreview] [pdf]

Abstract Continual learning of a sequence of tasks without forgetting previously acquired knowledge is one of the main challenges faced by modern deep neural networks. In the class-incremental scenario (aka open-set learning), one of the most difficult continual learning problems, new classes are presented to a classifier over time. The model needs to be able to learn and recognize these new classes while also retaining its knowledge of previously witnessed ones. A common approach is to make it revisit the old classes or their features in some form, either by analysing stored exemplars or by using artificially generated samples. The latter approach, Generative Replay, usually relies on a separate generator trained alongside the main classifier. Since the generator also needs to learn continually, it is usually retrained on every task, using its own generated samples as training data representing older classes. This can lead to error propagation and accumulating features unimportant or confusing for the classifier, reducing the overall performance for larger numbers of tasks. We propose a simple filtering mechanism for mitigating this issue – whenever pseudodata is generated for a new task, the classifier can reject samples it is not able to classify with sufficient confidence, thus preventing both models from retraining on poor-quality data. We tested the filter on several datasets, including real-life images, using various combinations of models, as the method can be applied regardless of the network architectures. We show that filtering improves the classifier’s accuracy and provide statistical analysis of the results.

3005Looking Backward: Streaming Video-to-Video Translation with Feature Banks

[openreview] [pdf]

Abstract This paper introduces StreamV2V, a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts. Unlike prior V2V methods using batches to process limited frames, we opt to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past. This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values, and directly fuses similar past features into the output. The feature bank is continually updated by merging stored and new features, making it compact yet informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning. It can run 20 FPS on one A100 GPU, being 15×\times, 46×\times, 108×\times, and 158×\times faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V’s exceptional ability to maintain temporal consistency.

3006Impact of Prompt on Latent Representations in LLMs

[openreview] [pdf]

Abstract The effectiveness of zero-shot learning frameworks, particularly in Large Language Models (LLMs), has lately shown tremendous improvement. Nonetheless, zero-shot performance critically depends on the prompt quality. Scientific literature has been prolific in proposing methods to select, create, and evaluate prompts from a language or performance perspective, changing their phrasing or creating them following heuristics rules. While these approaches are intuitive, they are insufficient in unveiling the internal mechanisms of Large Language Models. In this work, we propose exploring the impact of prompts on the latent representations of auto-regressive transformer models considering a zero-shot setting. We focus on the geometrical properties of prompts’ inner representation at different stages of the model. Experiments conducted give insights into how prompt characteristics influence the structure and distribution of vector representations in generative models. We focus on binary classification tasks on which prompting methods have shown robust performance and show that prompt formulation has indeed an influence on latent representation. However, their impact is dependent on the model family. Using clustering methods, we show that even though prompts are similar in natural language, surprisingly, their representations can differ. This is highly model-dependent, demonstrating the need for more precise analysis.

3007Investigating Pattern Neurons in Urban Time Series Forecasting

[openreview] [pdf]

Abstract Urban time series forecasting is crucial for smart city development and is key to sustainable urban management. Although urban time series models (UTSMs) are effective in general forecasting, they often overlook low-frequency events, such as emergencies and holidays, leading to degraded performance in practical applications. In this paper, we first investigate how UTSMs handle these infrequent patterns from a neural perspective. Based on our findings, we propose P\textbf{P}attern N\textbf{N}euron guided Train\textbf{Train}ing (PN-Train\texttt{PN-Train}), a novel training method that features (i) a perturbation-based detector\textit{perturbation-based detector} to identify neurons responsible for low-frequency patterns in UTSMs, and (ii) a fine-tuning mechanism\textit{fine-tuning mechanism} that enhances these neurons without compromising representation learning on high-frequency patterns. Empirical results demonstrate that PN-Train\texttt{PN-Train} considerably improves forecasting accuracy for low-frequency events while maintaining high performance for high-frequency events.

3008The AdEMAMix Optimizer: Better, Faster, Older

[openreview] [pdf]

Abstract Momentum based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which decays exponentially the present contribution of older gradients. This accounts for gradients being local linear approximations which lose their relevance as the iterate moves along the loss landscape. This work questions the use of a single EMA to accumulate past gradients and empirically demonstrates how this choice can be sub-optimal: a single EMA cannot simultaneously give a high weight to the immediate past, and a non-negligible weight to older gradients. Building on this observation, we propose AdEMAMix, a simple modification of the Adam optimizer with a mixture of two EMAs to better take advantage of past gradients. Our experiments on language modeling and image classification show---quite surprisingly---that gradients can stay relevant for tens of thousands of steps. They help to converge faster, and often to lower minima: e.g., a 1.3B parameter AdEMAMix LLM trained on 101B tokens performs comparably to an AdamW model trained on 197B tokens (+95+95%). Moreover, our method significantly slows-down model forgetting during training. Our work motivates further exploration of different types of functions to leverage past gradients, beyond EMAs.

3009Can LLMs Enhance Performance Prediction for Deep Learning Models?

[openreview] [pdf]

Abstract Accurate performance prediction of Deep Learning (DL) models is essential for efficient resource allocation and optimizations in various stages of the DL system stack. While existing approaches can achieve high prediction accuracy, they lack ability to quickly adapt to new hardware environments or emerging workloads. This paper leverages both Graph Neural Networks (GNNs) and Large Language Models (LLMs) to enhance the accuracy and adaptability of DL performance prediction. Our intuition is that GNNs are adept at capturing the structural information of DL models, naturally represented as graphs, while LLMs provide generalization and the ability to quickly adapt to various tasks thanks to extensive pre-training data. We empirically demonstrate that using GNN-derived graph embeddings as inputs to an LLM outperforms traditional representations, including high-level text summary and lossless semi-structured text (e.g., JSON), for this task. Furthermore, we propose a structured pre-training strategy to enable model adaptation to new hardware environments, significantly reducing the need for extensive retraining. Our experiments validate the effectiveness of this approach, showing an 8.8 percentage-point improvement in accuracy over a state-of-the-art GNN baseline. Notably, when adapted to new hardware with few samples, our method achieves a remarkable 30--70 percentage-point increase in accuracy compared to the GNN baseline.

3010SSLA: A Generalized Attribution Method for Interpreting Self-Supervised Learning without Downstream Task Dependency

[openreview] [pdf]

Abstract Self-Supervised Learning (SSL) is a crucial component of unsupervised tasks, enabling the learning of general feature representations without the need for labeled categories. However, our understanding of SSL tasks remains limited, and it is still unclear how SSL models extract key features from raw data. Existing interpretability methods are heavily reliant on downstream tasks, requiring information from these tasks to explain SSL models. This reliance blurs the line between interpreting the SSL model itself and the downstream task model. Moreover, these methods often require additional samples beyond the target of interpretation, introducing extra information that complicates the interpretability process. In this paper, we propose three fundamental prerequisites for the interpretability of SSL tasks and design the Self-Supervised Learning Attribution (SSLA) algorithm that adheres to these prerequisites. SSLA redefines the interpretability objective by introducing a feature similarity measure, reducing the impact of randomness inherent in SSL algorithms, and achieving more stable interpretability results. Additionally, SSLA abstracts the interpretability process, making it independent of specific neural network architectures. To the best of our knowledge, SSLA is the first SSL interpretability method that does not rely on downstream tasks. We also redesign a more reasonable evaluation framework and establish baselines for comparative assessment. The source code for our implementation is publicly available athttps://anonymous.4open.science/r/SSLA-EF85.

3011Neural Manifold Regularization: Aligning 2D Latent Dynamics with Stereotyped, Natural, and Attempted Movements

[openreview] [pdf]

Abstract Mapping neural activity to behavior is a fundamental goal in both neuroscience and brain-machine interfaces. Traditionally, at least three-dimensional (3D) latent dynamics have been required to represent two-dimensional (2D) movement trajectories. In this work, we introduce Neural Manifold Regularization (NMR), a method that embeds neural dynamics into a 2D latent space and regularizes the manifold based on the distances and densities of continuous movement labels. NMR pulls together positive pairs of neural embeddings (corresponding to closer labels) and pushes apart negative pairs (representing more distant labels). Additionally, NMR applies greater force to infrequent labels to prevent them from collapsing into dominant labels. We benchmarked NMR against other dimensionality reduction techniques using neural activity from four signal modalities: single units, multiunit threshold crossings, unsorted events, and local field potentials. These latent dynamics were mapped to three types of movements: stereotyped center-out reaching and natural random target reaching in monkeys, as well as attempted handwriting in a paralyzed patient. NMR consistently outperforms other methods by over 50% across four signal modalities and three movement types, evaluated over 68 sessions. Our code is uploaded.

3012Training Large Language Models for Retrieval-Augmented Question Answering through Backtracking Correction

[openreview] [pdf]

Abstract Despite recent progress in Retrieval-Augmented Generation (RAG) achieved by large language models (LLMs), retrievers often recall uncorrelated documents, regarded as “noise” during subsequent text generation. To address this, some methods train LLMs to distinguish between relevant and irrelevant documents using labeled data, enabling them to select the most likely relevant ones as context. However, they remain sensitive to noise, as LLMs can easily make mistakes when the selected document is noisy. Some approaches increase the number of referenced documents and train LLMs to perform stepwise reasoning when presented with multiple documents. Unfortunately, these methods rely on extensive and diverse annotations to ensure generalization, which is both challenging and costly. In this paper, we proposeBacktracking Correctionto address these limitations. Specifically, we reformulate stepwise RAG into a multi-step decision-making process. Starting from the final step, we optimize the model through error sampling and self-correction, and then backtrack to the previous state iteratively. In this way, the model’s learning scheme follows an easy-to-hard progression: as the target state moves forward, the context space decreases while the decision space increases. Experimental results demonstrate thatBacktracking Correctionenhances LLMs’ ability to make complex multi-step assessments, improving the robustness of RAG in dealing with noisy documents.

3013Temporal Logic-Based Multi-Vehicle Backdoor Attacks against Offline RL Agents in End-to-end Autonomous Driving

[openreview] [pdf]

Abstract End-to-end autonomous driving (AD) systems integrate complex decision-making processes. Assessing the safety of these systems against potential security threats, including backdoor attacks, is a stepping stone for real-world deployment. However, traditional methods focus on static triggers, which do not adequately reflect the dynamic nature of these systems and could be impractical to deploy in the real world. To address these limitations, we propose a novel backdoor attack against the end-to-end AD systems that leverage multi-vehicles’ trajectories as triggers. We employ different behavior models and their configurations to generate the trigger trajectories, which are then quantitatively evaluated using temporal logic specifications. This evaluation guides the subsequent perturbations to the behavior model configurations. Through an iterative process of regeneration and re-evaluation, we can refine and generate realistic and plausible trigger trajectories that involve multiple vehicles’ complex interactions. Furthermore, we develop a negative training strategy by incorporating patch trajectories that share similarities with the triggers but are designated not to activate the backdoor. We thus enhance the stealthiness of the attack, refining the system’s responses to trigger scenarios. Through extensive empirical studies using offline reinforcement learning (RL) driving agents with various trigger patterns and target action designs, we demonstrate the flexibility and effectiveness of our proposed attack, showing the under-exploration of existing end-to-end AD systems’ vulnerabilities to such multi-vehicle-based backdoor attacks. We also evaluate the attack against existing defenses and validate different design choices of our attack via a comprehensive ablation study.

3014Choices are More Important than Efforts: LLM Enables Efficient Multi-Agent Exploration

[openreview] [pdf]

Abstract With expansive state-action spaces, efficient multi-agent exploration remains a longstanding challenge in reinforcement learning. Although pursuing novelty, diversity, or uncertainty attracts increasing attention, redundant efforts brought by exploration without proper guidance choices poses a practical issue for the community. This paper introduces a systematic approach, termed LEMAE, choosing to channel informative task-relevant guidance from a knowledgeable Large Language Model (LLM) for Efficient Multi-Agent Exploration. Specifically, we ground linguistic knowledge from LLM into symbolic key states, that are critical for task fulfillment, in a discriminative manner at low LLM inference costs. To unleash the power of key states, we design Subspace-based Hindsight Intrinsic Reward (SHIR) to guide agents toward key states by increasing reward density. Additionally, we build the Key State Memory Tree (KSMT) to track transitions between key states in a specific task for organized exploration. Benefiting from diminishing redundant explorations, LEMAE outperforms existing SOTA approaches on the challenging benchmarks (e.g., SMAC and MPE) by a large margin, achieving a 10x acceleration in certain scenarios. Our code is available athttps://anonymous.4open.science/r/LEMAE.

3015Conflict-Averse Gradient Aggregation for Constrained Multi-Objective Reinforcement Learning

[openreview] [pdf]

Abstract In real-world applications, a reinforcement learning (RL) agent should consider multiple objectives and adhere to safety guidelines. To address these considerations, we propose a constrained multi-objective RL algorithm named constrained multi-objective gradient aggregator (CoMOGA). In the field of multi-objective optimization, managing conflicts between the gradients of the multiple objectives is crucial to prevent policies from converging to local optima. It is also essential to efficiently handle safety constraints for stable training and constraint satisfaction. We address these challenges straightforwardly by treating the maximization of multiple objectives as a constrained optimization problem (COP), where the constraints are defined to improve the original objectives. Existing safety constraints are then integrated into the COP, and the policy is updated by solving the COP, which ensures the avoidance of gradient conflicts. Despite its simplicity, CoMOGA guarantees convergence to global optima in a tabular setting. Through various experiments, we have confirmed that preventing gradient conflicts is critical, and the proposed method achieves constraint satisfaction across all tasks.

3016Variational Bayesian Pseudo-Coreset

[openreview] [pdf]

Abstract The success of deep learning requires large datasets and extensive training, which can create significant computational challenges. To address these challenges, pseudo-coresets, small learnable datasets that mimic the entire data, have been proposed. Bayesian Neural Networks, which offer predictive uncertainty and probabilistic interpretation for deep neural networks, also face issues with large-scale datasets due to their high-dimensional parameter space. Prior works on Bayesian Pseudo-Coresets (BPC) attempt to reduce the computational load for computing weight posterior distribution by a small number of pseudo-coresets but suffer from memory inefficiency during BPC training and sub-optimal results. To overcome these limitations, we propose Variational Bayesian Pseudo-Coreset (VBPC), a novel approach that utilizes variational inference to efficiently approximate the posterior distribution, reducing memory usage and computational costs while improving performance across benchmark datasets.

3017Voronoi Tessellation-based Confidence Decision Boundary Visualization to Enhance Understanding of Active Learning

[openreview] [pdf]

Abstract The current visualizations used in active learning are quite basic, making it difficult for researchers to effectively observe and analyze the practical performance of different sampling strategies. To address this issue, we introduce a more informative visual evaluation approach observation metric, the confidence decision boundary, which is generated through Voronoi tessellation and evaluated using ridge confidence, a newly proposed measure. This approach enhances the information content in boundary regions where data distribution is sparse. Based on the confidence decision boundary, we conducted a series of visualizations to evaluate various active learning query strategies. These visualizations are able to capture nuanced variations regarding how models based on different strategies perform sampling, the characteristics of points selected by various methods, and the impact of newly sampled points on the model. This enables a much deeper understanding of the underlying mechanisms of existing query strategies.

3018Iterative Label Refinement Matters More than Preference Optimization under Weak Supervision

[openreview] [pdf]

Abstract Language model (LM) post-training relies on two stages of human supervision: task demonstrations for supervised finetuning (SFT), followed by preference comparisons for reinforcement learning from human feedback (RLHF) via algorithms like proximal preference optimization (PPO) or direct preference optimization (DPO). As LMs become more capable, the tasks they are given become harder to supervise. Will post-training remain effective under unreliable supervision? To test this, we simulate unreliable demonstrations and comparison feedback using small LMs and time-constrained humans. We find that in the presence of unreliable supervision, SFT still retains some effectiveness, but DPO fails to improve the model beyond SFT. To address this, we propose iterative label refinement (ILR) as a replacement for RLHF with unreliable supervision. ILR directly improves the SFT data by using comparison feedback to decide whether human demonstrations should be replaced by model-generated alternatives, then retrains the model via SFT on the updated data. SFT+ILR outperforms SFT+DPO on several tasks with LM-simulated unreliable supervision (math, coding, safe instruction-following), with results further verified by human experiments on instruction-following. Our findings suggest that as LMs take on complex tasks where human supervision is unreliable, RLHF may no longer be the best use of human comparison feedback; instead, it is better to direct feedback towards improving the training data rather than continually training the model.

3019On the Optimization Landscape of Low Rank Adaptation Methods for Large Language Models

[openreview] [pdf]

Abstract Training Large Language Models (LLMs) poses significant memory challenges, making low-rank adaptation methods an attractive solution. Previously, Low-Rank Adaptation (LoRA) addressed this by adding a trainable low-rank matrix to the frozen pre-trained weights in each layer, reducing the number of trainable parameters and optimizer states. GaLore, which compresses the gradient matrix instead of the weight matrix, has demonstrated superior performance to LoRA with faster convergence and reduced memory consumption. Despite their empirical success, the performance of these methods has not been fully understood or explained theoretically. In this paper, we analyze the optimization landscapes of LoRA, GaLore, and full-rank methods, revealing that GaLore benefits from fewer spurious local minima and a larger region that satisfies the \pl, a variant of Polyak-Łojasiewicz (PL) condition, leading to faster convergence. Our analysis leads to a novel method, GaRare, which further improves GaLore by using gradient random projection to reduce computational overhead. Practically, GaRare achieves strong performance in both pre-training and fine-tuning tasks, offering a more efficient approach to large-scale model adaptation.

3020TIMBA: Time series Imputation with Bi-directional Mamba Blocks and Diffusion models

[openreview] [pdf]

Abstract The problem of imputing multivariate time series spans a wide range of fields, from clinical healthcare to multi-sensor systems. Initially, Recurrent Neural Networks (RNNs) were employed for this task; however, their error accumulation issues led to the adoption of Transformers, leveraging attention mechanisms to mitigate these problems. Concurrently, the promising results of diffusion models in capturing original distributions have positioned them at the forefront of current research, often in conjunction with Transformers. In this paper, we propose replacing time-oriented Transformers with State-Space Models (SSM), which are better suited for temporal data modeling. Specifically, we utilize the latest SSM variant, S6, which incorporates attention-like mechanisms. By embedding S6 within Mamba blocks, we develop a model that integrates SSM, Graph Neural Networks, and node-oriented Transformers to achieve enhanced spatiotemporal representations. Implementing these architectural modifications, previously unexplored in this field, we present Time series Imputation with Bi-directional mamba blocks and diffusion models (TIMBA). TIMBA achieves superior performance in almost all benchmark scenarios and performs comparably in others across a diverse range of missing value situations and three real-world datasets. We also evaluate how the performance of our model varies with different amounts of missing values and analyse its performance on downstream tasks. In addition, we provide the original code to replicate the results.

3021LocoVR: Multiuser Indoor Locomotion Dataset in Virtual Reality

[openreview] [pdf]

Abstract Understanding human locomotion is crucial for AI agents such as robots, particularly in complex indoor home environments. Modeling human trajectories in these spaces requires insight into how individuals maneuver around physical obstacles and manage social navigation dynamics. These dynamics include subtle behaviors influenced by proxemics - the social use of space, such as stepping aside to allow others to pass or choosing longer routes to avoid collisions. Previous research has developed datasets of human motion in indoor scenes, but these are often limited in scale and lack the nuanced social navigation dynamics common in home environments. To address this, we present LocoVR, a dataset of 7000+ two-person trajectories captured in virtual reality from over 130 different indoor home environments. LocoVR provides full body pose data and precise spatial information, along with rich examples of socially-motivated movement behaviors. For example, the dataset captures instances of individuals navigating around each other in narrow spaces, adjusting paths to respect personal boundaries in living areas, and coordinating movements in high-traffic zones like entryways and kitchens. Our evaluation shows that LocoVR significantly enhances model performance in three practical indoor tasks utilizing human trajectories, and demonstrates predicting socially-aware navigation patterns in home environments.

3022Interpreting and Steering LLM Representations with Mutual Information-based Explanations on Sparse Autoencoders

[openreview] [pdf]

Abstract Large language models (LLMs) excel at addressing general human queries, yet they can falter or produce unexpected responses in specific scenarios. Gaining insight into the internal states of LLMs is key to understanding their successes and failures, as well as to refining their capabilities. Recent efforts have applied sparse autoencoders to learn a feature basis for explaining LLM hidden spaces. However, current post-hoc explanation methods can not effectively describe the semantic meaning of the learned features, and it is difficult to steer LLM behaviors by manipulating these features. Our analysis reveals that existing explanation methods suffer from the frequency bias issue, i.e., they tend to focus on trivial linguistic patterns rather than semantics. To overcome this, we propose explaining the learned features from a fixed vocabulary set to mitigate the frequency bias, and designing a novel explanation objective based on the mutual information theory to better express the meaning of the features. We further suggest two strategies to steer LLM representations by modifying sparse feature activations in response to user queries during runtime. Empirical results demonstrate that our method generates more discourse-level explanations than the baselines, and can effectively steer LLM behaviors to defend against jailbreak attacks in the wild. These findings highlight the value of explanations for steering LLM representations in downstream applications.

3023FAIRMINDSIM: ALIGNMENT OF BEHAVIOR, EMO- TION, AND BELIEF IN HUMANS AND LLM AGENTS AMID ETHICAL DILEMMAS

[openreview] [pdf]

Abstract AI alignment is a pivotal issue concerning AI control and safety. It should consider not only value-neutral human preferences but also moral and ethical considerations. In this study, we introduced FairMindSim, which simulates the moral dilemma through a series of unfair scenarios. We used LLM agents to simulate human behavior, ensuring alignment across various stages. To explore the various socioeconomic motivations, which we refer to as beliefs, that drive both humans and LLM agents as bystanders to intervene in unjust situations involving others, and how these beliefs interact to influence individual behavior, we incorporated knowledge from relevant sociological fields and proposed the Belief-Reward Alignment Behavior Evolution Model (BREM) based on the recursive reward model (RRM). Our findings indicate that, behaviorally, GPT-4o exhibits a stronger sense of social justice, while humans display a richer range of emotions. Additionally, we discussed the potential impact of emotions on behavior. This study provides a theoretical foundation for applications in aligning LLMs with altruistic values.

3024Learning Large Skillsets in Stochastic Settings with Empowerment

[openreview] [pdf]

Abstract General purpose agents need to be able to execute large skillsets in stochastic settings. Given that the mutual information between skills and states measures the number of distinct skills in a skillset, a compelling objective for learning a diverse skillset is to find the skillset with the largest mutual information between skills and states. The problem is that the two main unsupervised approaches for maximizing this mutual information objective, Empowerment-based skill learning and Unsupervised Goal-Conditioned Reinforcement Learning, only maximize loose lower bounds on the mutual information, which can impede diverse skillset learning. We propose a new empowerment objective, Skillset Empowerment, that maximizes a tighter bound on the mutual information between skills and states. For any proposed skillset, the tighter bound on mutual information is formed by replacing the posterior distribution of the proposed skillset with a variational distribution that is conditioned on the proposed skillset and trained to match the posterior of the proposed skillset. Maximizing our mutual information lower bound objective is a bandit problem in which actions are skillsets and the rewards are our mutual information objective, and we optimize this bandit problem with a new actor-critic architecture. We show empirically that our approach is able to learn large abstract skillsets in stochastic domains, including ones with high-dimensional observations, in contrast to existing approaches.

3025Language-conditioned Multi-Style Policies with Reinforcement Learning

[openreview] [pdf]

Abstract Recent studies have explored the application of large language models (LLMs) in language-conditioned reinforcement learning (LC-RL). These studies typically involve training RL agents to follow straightforward human instructions in domains such as object manipulation, navigation, or text-based environments. To extend these capabilities for following high-level and abstract language instructions with diverse style policies in complex environments, we propose a novel method called LCMSP, which can generate language-conditioned multi-style policies. LCMSP first trains a multi-style RL policy capable of achieving different meta-behaviors, which can be controlled by corresponding style parameters. Subsequently, LCMSP leverages the reasoning capabilities and common knowledge of LLMs to align language instructions with style parameters, thereby realizing language-controlled multi-style policies. Experiments conducted in various environments and with different types of instructions demonstrate that the proposed LCMSP is capable of understanding high-level abstract instructions and executing corresponding behavioral styles in complex environments.

3026Efficient Inference for Large Language Model-based Generative Recommendation

[openreview] [pdf]

Abstract Large Language Model (LLM)-based generative recommendation has achieved notable success, yet its practical deployment is costly particularly due to excessive inference latency caused by autoregressive decoding. For lossless LLM decoding acceleration, Speculative Decoding (SD) has emerged as a promising solution. However, applying SD to generative recommendation presents unique challenges due to the requirement of generating top-K items (i.e., K distinct token sequences) as a recommendation list by beam search. This leads to more stringent verification in SD, where all the top-K sequences from the target LLM must be successfully drafted by the draft model at each decoding step. To alleviate this, we consider 1) boosting top-K sequence alignment between the draft model and the target LLM, and 2) relaxing the verification strategy to reduce trivial LLM calls. To this end, we propose an alignment framework named AtSpeed, which presents the AtSpeed-S optimization objective for top-K alignment under the strict top-K verification. Moreover, we introduce a relaxed sampling verification strategy that allows high-probability non-top-K drafted sequences to be accepted, significantly reducing LLM calls. Correspondingly, we propose AtSpeed-R for top-K alignment under this relaxed sampling verification. Empirical results on two real-world datasets demonstrate that AtSpeed significantly accelerates LLM-based generative recommendation, e.g., near 2x speedup under strict top-K verification and up to 2.5 speedup under relaxed sampling verification. The codes and datasets are available at~\url{https://anonymous.4open.science/r/AtSpeed/}.

3027Reconstruction-Guided Policy: Enhancing Decision-Making through Agent-Wise State Consistency

[openreview] [pdf]

Abstract An important challenge in multi-agent reinforcement learning is partial observability, where agents cannot access the global state of the environment during execution and can only receive observations within their field of view. To address this issue, previous works typically use the dimensional-wise state, which is obtained by applying MLP or dimensional-based attention on the global state, for decision-making during training and relying on a reconstructed dimensional-wise state during execution. However, dimensional-wise states tend to divert agent attention to specific features, neglecting potential dependencies between agents, making it difficult to make optimal decisions. Moreover, the inconsistency between the states used in training and execution further increases additional errors. To resolve these issues, we propose a method called Reconstruction-Guided Policy (RGP) to reconstruct the agent-wise state, which represents the information of inter-agent relationships, as input for decision-making during both training and execution. This not only preserves the potential dependencies between agents but also ensures consistency between the states used in training and execution. We conducted extensive experiments on both discrete and continuous action environments to evaluate RGP, and the results demonstrates its superior effectiveness. Our code is public inhttps://anonymous.4open.science/r/RGP-9F79

3028A Healthy Food Recommender System Using Collaborative Filtering and Transformers

[openreview] [pdf]

Abstract Unhealthy eating habits are a major contributing factor to public health problems such as the globally rising obesity rate. One way to help solve this problem is by creating systems that can suggest better food choices in order to improve the way people eat. A critical challenge with these systems is making sure they offer 1) suggestions that match what users like, while also 2) recommending healthy foods. In this paper, we introduce a novel food recommender system that provides healthy food recommendations similar to what the user has previously eaten. We used collaborative filtering to generate recommendations and re-ranked the recommendations using a novel health score and a BERT embedding similarity score. We evaluated our system on human subjects by conducting A/B testing on several methods deployed in a web application.

3029Generative World Explorer

[openreview] [pdf]

Abstract Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. However, humans can imagine unseen parts of the world through a mental exploration and revise their beliefs with imagined observations. Such updated beliefs can allow them to make more informed decisions at the current step, without having to physically explore the world first. To achieve this human-like ability, we introduce theGenerative World Explorer (Genex), a video generation model that allows an agent to mentally explore a large-scale 3D world (e.g., urban scenes) and acquire imagined observations to update its belief. This updated belief will then help the agent to make a more informed decision at the current step. To train Genex, we create a synthetic urban scene dataset, Genex-DB. Our experimental results demonstrate that (1) Genex can generate high-quality and consistent observations during long-horizon mental exploration of large 3D scenes and (2) the beliefs updated with the generated observations can inform an existing decision-making model (e.g., an LLM agent) to make better plans.

3030Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

[openreview] [pdf]

Abstract No absctract

3031Wide Neural Networks Trained with Weight Decay Provably Exhibit Neural Collapse

[openreview] [pdf]

Abstract Deep neural networks (DNNs) at convergence consistently represent the training data in the last layer via a highly symmetric geometric structure referred to as neural collapse. This empirical evidence has spurred a line of theoretical research aimed at proving the emergence of neural collapse, mostly focusing on the unconstrained features model. Here, the features of the penultimate layer are free variables, which makes the model data-agnostic and, hence, puts into question its ability to capture DNN training. Our work addresses the issue, moving away from unconstrained features and studying DNNs that end with at least two linear layers. We first prove generic guarantees on neural collapse that assume (i) low training error and balancedness of the linear layers (for within-class variability collapse), and (ii) bounded conditioning of the features before the linear part (for orthogonality of class-means, as well as their alignment with weight matrices). We then show that such assumptions hold for gradient descent training with weight decay: (i) for networks with a wide first layer, we prove low training error and balancedness, and (ii) for solutions that are either nearly optimal or stable under large learning rates, we additionally prove the bounded conditioning. Taken together, our results are the first to show neural collapse in the end-to-end training of DNNs.

3032RecurFormer: Not All Transformer Heads Need Self-Attention

[openreview] [pdf]

Abstract Transformer-based large language models (LLMs) excel in modeling complex language patterns but face significant computational costs during inference, especially with long inputs due to the attention mechanism’s memory overhead. We observe that certain attention heads exhibit a distribution where the attention weights concentrate on tokens near the query token, termed as recency aware, which focuses on local and short-range dependencies. Leveraging this insight, we propose RecurFormer, a novel architecture that replaces these attention heads with linear recurrent neural networks (RNNs), specifically the Mamba architecture. This replacement reduces the cache size without evicting tokens, thus maintaining generation quality. RecurFormer retains the ability to model long-range dependencies through the remaining attention heads and allows for reusing pre-trained Transformer-based LLMs weights with continual training. Experiments demonstrate that RecurFormer matches the original model’s performance while significantly enhancing inference efficiency. Our approach provides a practical solution to the computational challenges of Transformer-based LLMs inference, making it highly attractive for tasks involving long inputs.

3033Understanding Chain-of-Thought in LLMs Through Information Theory

[openreview] [pdf]

Abstract Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the `information gain’ at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.

3034Fourier Head: Helping Large Language Models Learn Complex Probability Distributions

[openreview] [pdf]

Abstract As the quality of large language models has improved, there has been increased interest in using them to model non-linguistic tokens. For example, the Decision Transformer recasts agentic decision making as a sequence modeling problem, using a decoder-only LLM to model the distribution over the discrete action space for an Atari agent. However, when adapting LLMs to non-linguistic domains, it remains unclear if softmax over discrete bins captures the continuous structure of the tokens and the potentially complex distributions needed for high quality token generation. We introduce a neural network layer, constructed using Fourier series, which we can easily substitute for any linear layer if we want the outputs to have a more continuous structure. We perform extensive analysis on synthetic datasets, as well as on large-scale decision making and time series forecasting tasks. We also provide theoretical evidence that this layer can better learn signal from data while ignoring high-frequency noise. All of our results support the effectiveness of our proposed Fourier head in scenarios where the underlying data distribution has a natural continuous structure. For example, the Fourier head improves a Decision Transformer agent’s returns by 46% on the Atari Seaquest game, and increases a state-of-the-art times series foundation model’s forecasting performance by 3.5% across 20 benchmarks unseen during training.

3035Elucidating the Design Space of Text-to-Audio Models

[openreview] [pdf]

Abstract Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we setup a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA’s improved ability to generate creative audio following complex and imaginative captions — a task that is more challenging than current benchmarks.

3036BOFormer: Learning to Solve Multi-Objective Bayesian Optimization via Non-Markovian RL

[openreview] [pdf]

Abstract Bayesian optimization (BO) offers an efficient pipeline for optimizing black-box functions with the help of a Gaussian process prior and an acquisition function (AF). Recently, in the context of single-objective BO, learning-based AFs witnessed promising empirical results given its favorable non-myopic nature. Despite this, the direct extension of these approaches to multi-objective Bayesian optimization (MOBO) suffer from the hypervolume identifiability issue, which results from the non-Markovian nature of MOBO problems. To tackle this, inspired by the non-Markovian RL literature and the success of Transformers in language modeling, we present a generalized deep Q-learning framework and propose BOFormer, which substantiates this framework for MOBO via sequence modeling. Through extensive evaluation, we demonstrate that BOFormer constantly achieves better performance than the benchmark rule-based and learning-based algorithms in various synthetic MOBO and real-world multi-objective hyperparameter optimization problems.

3037Learning to Discretize Denoising Diffusion ODEs

[openreview] [pdf]

Abstract Diffusion Probabilistic Models (DPMs) are generative models showing competitive performance in various domains, including image synthesis and 3D point cloud generation. Sampling from pre-trained DPMs involves multiple neural function evaluations (NFE) to transform Gaussian noise samples into images, resulting in higher computational costs compared to single-step generative models such as GANs or VAEs. Therefore, reducing the number of NFEs while preserving generation quality is crucial. To address this, we propose LD3, a lightweight framework designed to learn the optimal time discretization for sampling. LD3 can be combined with various samplers and consistently improves generation quality without having to retrain resource-intensive neural networks. We demonstrate analytically and empirically that LD3 improves sampling efficiency much less computational overhead. We evaluate our method with extensive experiments on 7 pre-trained models, covering unconditional and conditional sampling in both pixel-space and latent-space DPMs. We achieve FIDs of 2.38 (10 NFE), and 2.27 (10 NFE) on unconditional CIFAR10 and AFHQv2 in 5-10 minutes of training. LD3 offers an efficient approach to sampling from pre-trained diffusion models.

3038Revisiting the Relation Between Robustness and Universality

[openreview] [pdf]

Abstract Themodified universality hypothesisproposed by Jones et al. (2022) suggests that adversarially robust models trained for a given task are highly similar. We revisit the hypothesis and test its generality. We find that predictive behavior does not converge with increasing robustness and thus is not universal. Further, with additional similarity measures, we uncover differences in the representations that were invisible with the measures used in prior work. While robust models tend to be more similar than standard models, robust models remain distinct in important aspects. Moreover, the importance of similarity measures when comparing representations is highlighted as the absolute level of similarity---and thus the assessment of universality---is heavily dependent on the measure used.

3039Towards Effective Evaluations and Comparison for LLM Unlearning Methods

[openreview] [pdf]

Abstract The imperative to eliminate undesirable data memorization underscores the significance of machine unlearning for large language models (LLMs). Recent research has introduced a series of promising unlearning methods, notably boosting the practical significance of the field. Nevertheless, adopting a proper evaluation framework to reflect the true unlearning efficacy is also essential yet has not received adequate attention. This paper seeks to improve the evaluation of LLM unlearning by addressing two key challenges---a) the robustness of evaluation metrics and b) the trade-offs between competing goals. The first challenge stems from findings that current metrics are susceptible to various red teaming scenarios. It indicates that they may not reflect the true extent of knowledge retained by LLMs but rather tend to mirror superficial model behaviors, thus prone to attacks. We address this issue by devising and assessing a series of candidate metrics, selecting the most robust ones under various types of attacks. The second challenge arises from the conflicting goals of eliminating unwanted knowledge while retaining those of others. This trade-off between unlearning and retention often fails to conform the Pareto frontier, rendering it subtle to compare the efficacy between methods that excel only in either unlearning or retention. We handle this issue by proposing a calibration method that can restore the original performance on non-targeted data after unlearning, thereby allowing us to focus exclusively on assessing the strength of unlearning. Our evaluation framework notably enhances the effectiveness when assessing and comparing various LLM unlearning methods, further allowing us to benchmark existing works, identify their proper hyper-parameters, and explore new tricks to enhance their practical efficacy.

3040Discovering High-Quality Chess Puzzles Through One Billion Plays with Offline Reinforcement Learning

[openreview] [pdf]

Abstract Learning and skill mastery requires extensive and deliberate practice. In many learning settings, producing high-quality pedagogical materials can require a high level of domain expertise and be very time-consuming. Pedagogical materials often need to train students to engage in different thinking patterns. In some domains, such as chess, puzzles are used to help students practice their skills in calculating the next moves and recognizing known patterns on a board. Giving students a practice set of puzzles to help them learn different modes of thinking is challenging because the teacher needs to carefully balance between different motifs and how many look-ahead steps a student needs to perform. Popular online platforms like Chess.com and Lichess offer players millions of puzzles. Unlike chess tactics puzzles procured by human experts, where chess beginners can learn valuable insights, these puzzles are automatically generated and often regarded as having low pedagogical values. These platforms also rely on a heuristic to recommend puzzles to users for practice. Using the user history data over an entire year, a total of 1.6 billion puzzle-solving histories, we learn the pedagogical value of a puzzle and how to automatically choose a set of puzzles to better support chess learners in a completely unstructured way using insights from offline reinforcement learning. We validate the quality of the puzzles discovered by our model by collecting annotation ratings from titled chess players. The success of our pipeline shows promise for a future where we can understand the pedagogical values of practice items in other domains like math or coding problems.

3041Corrective Retrieval Augmented Generation

[openreview] [pdf]

Abstract Large language models (LLMs) inevitably exhibit hallucinations since the accuracy of generated texts cannot be secured solely by the parametric knowledge they encapsulate. Although retrieval-augmented generation (RAG) is a practicable complement to LLMs, it relies heavily on the relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong. To this end, we propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation. Specifically, a lightweight retrieval evaluator is designed to assess the overall quality of retrieved documents for a query, returning a confidence degree based on which different knowledge retrieval actions can be triggered. Since retrieval from static and limited corpora can only return sub-optimal documents, large-scale web searches are utilized as an extension for augmenting the retrieval results. Besides, a decompose-then-recompose algorithm is designed for retrieved documents to selectively focus on key information and filter out irrelevant information in them. CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches. Experiments on four datasets covering short- and long-form generation tasks show that CRAG can significantly improve the performance of RAG-based approaches.

3042Efficient Source-Free Time-Series Adaptation via Parameter Subspace Disentanglement

[openreview] [pdf]

Abstract In this paper, we propose a framework for efficient Source-Free Domain Adaptation (SFDA) in the context of time-series, focusing on enhancing both parameter efficiency and data-sample utilization. Our approach introduces an improved paradigm for source-model preparation and target-side adaptation, aiming to enhance training efficiency during target adaptation. Specifically, we reparameterize the source model’s weights in a Tucker-style decomposed manner, factorizing the model into a compact form during the source model preparation phase. During target-side adaptation, only a subset of these decomposed factors is fine-tuned, leading to significant improvements in training efficiency. We demonstrate using PAC Bayesian analysis that this selective fine-tuning strategy implicitly regularizes the adaptation process by constraining the model’s learning capacity. Furthermore, this re-parameterization reduces the overall model size and enhances inference efficiency, making the approach particularly well suited for resource-constrained devices. Additionally, we demonstrate that our framework is compatible with various SFDA methods and achieves significant computational efficiency, reducing the number of fine-tuned parameters and inference overhead in terms of MACs by over 90% while maintaining model performance.

3043Enabling Weak LLMs to Judge Response Reliability via Meta Ranking

[openreview] [pdf]

Abstract Despite the strong performance of large language models (LLMs) across a wide range of tasks, they still have reliability issues. Previous studies indicate that strong LLMs like GPT-4-turbo excel in evaluating the reliability of responses from LLMs, but face efficiency and local deployment issues. Thus, to enable weak LLMs to effectively assess the reliability of LLM responses, we propose a novel cross-query-comparison-based method called Meta Ranking\textit{Meta Ranking} (MR). Unlike previous few-shot methods that solely based on in-context learning capabilities in LLMs, MR assesses reliability by pairwise ranking the target query-response pair with multiple reference query-response pairs. We found that MR is highly effective in error detection for LLM responses, that MR with weaker LLMs, which have lower task performance, results in higher judgement precision against baselines with the same or even stronger models. Moreover, the method requires as few as five reference samples and significantly improving efficiency. We further demonstrate that MR can enhance strong LLMs’ performance in two practical applications: model cascading and instruction tuning. In model cascading, we combine open- and closed-source LLMs to achieve performance comparable to GPT-4-turbo with lower costs. In instruction tuning, we use MR for iterative training data filtering, significantly reducing data processing time and enabling LLaMA-7B and Phi-2 to surpass 13B models with fewer training tokens. These results underscore the high potential of MR in both efficiency and effectiveness.

3044Improving Complex Reasoning with Dynamic Prompt Corruption: A Soft Prompt Optimization Approach

[openreview] [pdf]

Abstract Prompt Tuning (PT) has emerged as a promising Parameter-Efficient Fine-Tuning (PEFT) approach by appending trainable continuous prompt vectors to the input, maintaining competitive performance with significantly fewer trainable parameters. While PT has shown effectiveness in enhancing task performance, particularly for classification tasks, its application to complex reasoning tasks has been largely overlooked. Our investigation reveals that PT provides limited improvement and may even degrade performance in reasoning tasks. This phenomenon suggests that soft prompts can positively impact certain instances while negatively affecting others, particularly during the latter stages of reasoning. To address these challenges, we propose a novel method called Dynamic Prompt Corruption (DPC), which seeks to optimize the use of soft prompts in reasoning tasks. DPC dynamically adjusts the influence of soft prompts based on their impact on the reasoning process. Specifically, it involves two key components: Dynamic Trigger and Dynamic Corruption. Dynamic Trigger measures the influence of soft prompts, determining whether their impact is beneficial or detrimental. Dynamic Corruption mitigates the negative effects of soft prompts by selectively masking key tokens that interfere with the reasoning process. We validate our approach through extensive experiments on various large language models (LLMs) and reasoning tasks, including GSM8K, MATH, and AQuA. The results demonstrate that Dynamic Prompt Corruption consistently improves the performance of LLMs, achieving 4%-8% accuracy gains compared to standard prompt tuning. These findings highlight the effectiveness of our approach and its potential to enhance complex reasoning in LLMs.

3045VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models

[openreview] [pdf]

Abstract Vector retrieval algorithms are essential for semantic queries within the rapidly evolving landscape of Large Language Models (LLMs). The ability to retrieve vectors that satisfy both similarity and diversity criteria substantially enhances the performance of LLMs. Although Maximal Marginal Relevance (MMR) is widely employed in retrieval scenarios requiring relevance and diversity, variations in the parameter ( \lambda ) lead to fluctuations that complicate the optimization trajectory in vector spaces. This obscures the direction of improvement and highlights the lack of a robust theoretical analysis regarding similarity and diversity constraints in retrieval processes. To address these challenges, this paper introduces a novel approach that characterizes both constraints through the relationship between the sum vector and the query vector. The proximity of these vectors ensures the similarity constraint, while requiring individual vectors within the sum vector to diverge in their alignment with the query vector satisfies the diversity constraint. We first formulate a new combinatorial optimization problem, selecting ( k ) vectors from a candidate set such that their sum vector maximally aligns with the query vector, and demonstrate that this problem is \textbf{NP-complete}. This result underscores the inherent difficulty of simultaneously achieving similarity and diversity in vector retrieval, thereby providing a theoretical foundation for future research. Subsequently, we present the heuristic algorithm \underline{\textbf{V}}ectors \underline{\textbf{R}}etrieval with \underline{\textbf{S}}imilarity and \underline{\textbf{D}}iversity, \textbf{VRSD}, which features a clear optimization objective and eliminates the need for preset parameters. VRSD also achieves a modest reduction in time complexity compared to MMR. Empirical validation confirms that VRSD significantly outperforms MMR across various datasets, while also demonstrating that the sum vector effectively captures both diversity and similarity simultaneously. The data and code are available athttps://anonymous.4open.science/r/VRSD-CF9D.

3046Weighted Fair Regression under Selection Bias

[openreview] [pdf]

Abstract Selection bias is a prevalent challenge in real-world data analysis, often stemming from biased historical censoring policies. While there is a growing body of literature on fairness in mitigating accuracy disparities, few studies have considered the potential impact of selection bias in training data. Depending on the selection mechanism, significant differences can arise between the population distribution and the training data distribution. Therefore, the training fairness metric can be heavily biased, leading to unfair learning. To address this issue under the fair regression problem, we propose weighting adjustments in the fairness constraint, which results in a novel fair regression estimator. Despite non-convexity, we derive an efficient algorithm to obtain a globally optimal solution. This work pioneers the integration of weighting adjustments into the fair regression problem, introducing a novel methodology to constrain accuracy disparities under arbitrary thresholds.

3047The Relevancy Metric: Understanding the Impact of Training Data

[openreview] [pdf]

Abstract Deep learning models are central to many critical decision-making processes, making it imperative to gain deeper insights into their behavior to improve performance, transparency, interpretability, and fairness. A key challenge is understanding how training data shapes model predictions on unseen test data. In this paper, we introduce a novel metric, Relevancy\textbf{\textit{Relevancy}}, which quantifies the impact of individual training samples on inference predictions. Our proposed metric is calculated by observing the learning dynamics of the model during training, and it is computationally efficient and applicable across a wide range of tasks. We demonstrate that it is between 80×80\times and 100,000×100,000\times more efficient than existing metrics for capturing the train-test relationship. Using relevancy\textit{relevancy}, we enable the identification of coresets — compact datasets that represent the essence of the training distribution. Quantitative evaluations show that coresets selected using our metric outperform state-of-the-art methods by up to 5.2% on CIFAR-100. Additionally, we qualitatively demonstrate how relevancy\textit{relevancy} can be extended to assess various training data properties, such as identifying mislabeled samples in widely used datasets like ImageNet, CIFAR-100, and Fashion-MNIST. These examples illustrate just a few of the many potential uses of relevancy\textit{relevancy}, highlighting its versatility in promoting more interpretable, efficient, and fair deep learning systems across diverse tasks.

3048BiC-Occ: Bi-directional Circulated 3D Occupancy Prediction for Autonomous Driving

[openreview] [pdf]

Abstract Vision-based 3D occupancy prediction is the cornerstone in autonomous driving systems to provide comprehensive scene perception for subsequent decisions, which requires assessing voxelized 3D scenes with multi-view 2D images. Existing methods mainly adopt unidirectional pipelines projecting image features to BEV representations for following supervision, whose performances are limited by the sparsity and ambiguity of voxel labels. To address this issue, we propose a Bi-directional Circulated 3D Occupancy Prediction (BiC-Occ) framework for more accurate voxel predictions and supervisions. Specifically, we design a Bi-directional View Transformer module that approximates invertible transition matrices of the view transformation process, promoting the self-consistency between 2D image features and 3D BEV representations. Furthermore, we propose a Circulated Interpolation Predictor module that exploits local geometric structures to align multi-scale BEV representations, correcting local ambiguity with consistent occupancy predictions across different resolutions. With the synergy of these two modules, the self-consistency within different perception views and occupancy resolutions compensates for the sparsity and ambiguity of voxel labels, leading to more accurate 3D occupancy predictions. Extensive experiments and analyses demonstrate the effectiveness of our BiC-Occ framework.

3049Learning Chaos In A Linear Way

[openreview] [pdf]

Abstract Learning long-term behaviors in chaotic dynamical systems, such as turbulent flows and climate modelling, is challenging due to their inherent instability and unpredictability. These systems exhibit positive Lyapunov exponents, which significantly hinder accurate long-term forecasting. As a result, understanding long-term statistical behavior is far more valuable than focusing on short-term accuracy. While autoregressive deep sequence models have been applied to capture long-term behavior, they often lead to exponentially increasing errors in learned dynamics. To address this, we shift the focus from simple prediction errors to preserving an invariant measure in dissipative chaotic systems. These systems have attractors, where trajectories settle, and the invariant measure is the probability distribution on attractors that remains unchanged under dynamics. Existing methods generate long trajectories of dissipative chaotic systems by aligning invariant measures, but it is not always possible to obtain invariant measures for arbitrary datasets. We propose the Poincaré Flow Neural Network (PFNN), a novel operator learning framework designed to capture behaviors of chaotic systems without any explicit knowledge of the invariant measure. PFNN employs an auto-encoder to map the chaotic system to a finite-dimensional feature space, effectively linearizing the chaotic evolution. It then learns the linear evolution operators to match the physical dynamics by addressing two critical properties in dissipative chaotic systems: (1) contraction, the system’s convergence toward its attractors, and (2) measure invariance, trajectories on the attractors following a probability distribution invariant to the dynamics. Our experiments on a variety of chaotic systems, including Lorenz 96, Kuramoto-Sivashinsky equation and Navier–Stokes equation, demonstrate that PFNN has more accurate predictions and physical statistics compared to competitive baselines including the Fourier Neural Operator and the Markov Neural Operator.

3050WildBench: Benchmarking LLMs with Challenging Tasks from Real Users in the Wild

[openreview] [pdf]

Abstract We introduce WildBench, an automated evaluation framework designed to benchmark large language models (LLMs) using challenging, real-world user queries. WildBench consists of 1,024 tasks carefully selected from over one million human-chatbot conversation logs. For automated evaluation with WildBench, we have developed two metrics, WB-Reward and WB-Score, which are computable using advanced LLMs such as GPT-4-turbo. WildBench evaluation uses task-specific checklists to evaluate model outputs systematically and provides structured explanations that justify the scores and comparisons, resulting in more reliable and interpretable automatic judgments. WB-Reward employs fine-grained pairwise comparisons between model responses, generating five potential outcomes: much better, slightly better, slightly worse, much worse, or a tie. Unlike previous evaluations that employed a single baseline model, we selected three baseline models at varying performance levels to ensure a comprehensive pairwise evaluation. Additionally, we propose a simple method to mitigate length bias, by converting outcomes of “slightly better/worse” to “tie” if the winner response exceeds the loser one by more than K characters. WB-Score evaluates the quality of model outputs individually, making it a fast and cost-efficient evaluation metric. WildBench results demonstrate a strong correlation with the human-voted Elo ratings from Chatbot Arena on hard tasks. Specifically, WB-Reward achieves a Pearson correlation of 0.98 with top-ranking models. Additionally, WB-Score reaches 0.95, surpassing both ArenaHard’s 0.91 and AlpacaEval2.0’s 0.89 for length-controlled win rates, as well as the 0.87 for regular win rates.

3051DPM: Dual Preferences-based Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Preference-based Reinforcement Learning (PbRL), which optimizes reward functions using preference feedback, is a promising approach for environments where handcrafted reward modeling is challenging. Especially in sparse-reward environments, feedback-based reward modeling achieves notable performance gains by transforming sparse feedback signals into dense ones. However, most PbRL research has primarily focused on single-agent environments, with limited attention to multi-agent environments. In this paper, we propose Dual Preferences-based Multi-Agent Reinforcement Learning (DPM), which extends PbRL to multi-agent tasks by introducingdualpreferences comparing not only whole trajectories but also individual agent contributions during transitions. Furthermore, DPM replaces human preferences with those generated by LLMs to train the reward functions. Experimental results in the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments demonstrate significant performance improvements over baselines, indicating the efficacy of DPM in optimizing individual reward functions and enhancing performances in sparse reward settings.

3052Solving Multiplayer Partially Observable Stochastic Games by Divergence-Regularized Discounted Aggregation

[openreview] [pdf]

Abstract This paper presents Divergence-Regularized Discounted Aggregation (DRDA), a multi-round learning system for solving partially observable stochastic games (POSGs), which unify normal-form games (NFGs), extensive-form games (EFGs), and Markov games (MGs). In each single round, DRDA can be viewed as a discounted variant of Follow the Regularized Leader (FTRL) under a general value function for POSGs concerning imperfect information and an infinite horizon. While previous studies on this FTRL variant have demonstrated its last-iterate convergence towards quantal response equilibrium (QRE) in NFGs, this paper extends the theoretical results to POSGs by defining a generalized Nash distribution (GND), which extends the QRE concept of Nash distribution in NFGs through divergence regularization. The linear last-iterate convergence of single-round DRDA to its rest point is proved under a general assumption of hypomonotonicity. When the rest point is unique, it induces the unique GND, which has a bounded deviation with respect to Nash equilibrium (NE). Under multiple learning rounds, DRDA keeps replacing the base policy for divergence regularization with the policy at the rest point in the previous round. It is further proved that the limit point of multi-round DRDA must be an exact NE rather than a QRE under the unique rest point assumption. In experiments, the last iterates of multi-round DRDA converge to NE at a near-exponential rate in NFGs, outperforming existing baselines including moving-magnet magnetic mirror descent (MMD) in multiplayer EFGs. In an infinite-horizon MG, DRDA significantly outperforms the applicable algorithms based on best-response computations.

3053D2P2-SGD: Dynamically Differentially Private Projected Stochastic Gradient Descent

[openreview] [pdf]

Abstract Stochastic optimization is a key enabler in modern machine learning, producing effective models for various tasks. However, several researchers have shown that model parameters and gradient information are susceptible to privacy leakage. Although, Differentially Private SGD (DPSGD) addresses privacy concerns, its static noise mechanism impacts the error bounds for model performance. Additionally, with the exponential increase in model parameters, efficient learning of these models using stochastic optimizers has become more challenging. To address these concerns, we introduce the Dynamically Differentially Private Projected Stochastic Gradient Descent (D2P2-SGD) optimizer. In D2P2-SGD, we combine two important ideas: (i) dynamic differential privacy (DDP) with automatic gradient clipping and (ii) random projection with SGD, allowing dynamic adjustment of the tradeoff between utility and privacy of the model. It demonstrates provably tighter error bounds compared to DPSGD across different behavior (i.e. convex and non-convex) of the objective function. The theoretical analysis further suggests that DDP leads to better utility at the cost of privacy, while random projection enables more efficient model learning. Extensive experiments across diverse datasets show that D2P2-SGD significantly enhances accuracy while maintaining privacy. Our code is available here.

3054Distribution-free Data Uncertainty for Neural Network Regression

[openreview] [pdf]

Abstract Quantifying uncertainty is an essential part of predictive modeling, especially in the context of high-stakes decision-making. While classification output includes data uncertainty by design in the form of class probabilities, the regression task generally aims only to predict the expected value of the target variable. Probabilistic extensions often assume parametric distributions around the expected value, optimizing the likelihood over the resulting explicit densities. However, using parametric distributions can limit practical applicability, making it difficult for models to capture skewed, multi-modal, or otherwise complex distributions. In this paper, we propose optimizing a novel nondeterministic neural network regression architecture for loss functions derived from a sample-based approximation of the continuous ranked probability score (CRPS), enabling a truly distribution-free approach by learning to sample from the target’s aleatoric distribution, rather than predicting explicit densities. Our approach allows the model to learn well-calibrated, arbitrary uni- and multivariate output distributions. We evaluate the method on a variety of synthetic and real-world tasks, including uni- and multivariate problems, function inverse approximation, and standard regression uncertainty benchmarks. Finally, we make all experiment code publicly available.

3055Replacing Implicit Regression with Classification in Policy Gradient Reinforcement Learning

[openreview] [pdf]

Abstract Stochastic policy gradient methods are a fundamental class of reinforcement learning algorithms. When using these algorithms for continuous control it is common to parameterize the policy using a Gaussian distribution. In this paper, we show that the policy gradient with Gaussian policies can be viewed as the gradient of a weighted least-squares objective function. That is, policy gradient algorithms are implicitly implementing a form of regression. A number of recent works have shown that reformulating regression problems as classification problems can improve learning. Inspired by these works, we investigate whether replacing this implicit regression with classification can improve the data efficiency and stability of policy learning. Toward this end, we introduce a novel policy gradient surrogate objective for softmax policies over a discretized action space. This surrogate objective uses a form of cross-entropy loss as a replacement for the implicit least-squares loss found in the surrogate loss for Gaussian policies. We extend prior theoretical analysis of this loss to our policy gradient surrogate objective and then provide experiments showing that this novel loss improves the data efficiency of stochastic policy gradient learning.

3056Simulate Before Act: Model-Based Planning for Web Agents

[openreview] [pdf]

Abstract Language agents have shown promising performance in automating web-based tasks, but the complexity and vast search spaces of real-world websites challenge reactive agents in identifying optimal solutions. While tree search agents offer enhanced exploration by interacting with actual websites, they often incur high costs, potential risks, and are challenging to implement for real-world websites. This paper explores a novel paradigm leveraging large language models’ (LLMs) internal world models for planning in complex environments, presenting a middle ground between reactive agents and tree search agents. Results on two representative benchmarks, VisualWebArena and Mind2Web-live, demonstrate that our approach largely closes the gap between reactive agents and tree search agents, while maintaining efficiency and safety advantages. Notably, tree search can be considered as approaching an upper bound for our method, as it explores actual websites rather than simulations. This work opens new avenues for research into more effective and secure strategies for autonomous agents in complex, dynamic environments. It represents a step forward in improving upon reactive agents while approaching the performance of tree search methods, without incurring their implementation challenges and costs.

3057Estimating Committor Functions via Deep Adaptive Sampling on Rare Transition Paths

[openreview] [pdf]

Abstract The committor functions are central to investigating rare but important events in molecular simulations. It is known that computing the committor function suffers from the curse of dimensionality. Recently, using neural networks to estimate the committor function has gained attention due to its potential for high-dimensional problems. Training neural networks to approximate the committor function needs to sample transition data from straightforward simulations of rare events, which is very inefficient. The scarcity of transition data makes it challenging to approximate the committor function. To address this problem, we propose an efficient framework to generate data points in the transition state region that helps train neural networks to approximate the committor function. We design a Deep Adaptive Sampling method for TRansition paths (DASTR), where deep generative models are employed to generate samples to capture the information of transitions effectively. In particular, we treat a non-negative function in terms of the integrand in the loss functional as an unnormalized probability density function and approximate it with the deep generative model. The new samples from the deep generative model are located in the region of the transition and fewer samples are located in the other region, which provides effective samples for approximating the committor function and significantly improves the accuracy. We demonstrate the effectiveness of the proposed method with both simulations and realistic examples.

3058CAN - CONTINUOUSLY ADAPTING NETWORKS

[openreview] [pdf]

Abstract Catastrophic forgetting is a fundamental challenge in neural networks that prevents continuous learning, which is one of the properties essential for achieving true general artificial intelligence. When trained sequentially on multiple tasks, conventional neural networks overwrite previously learned knowledge, hindering their ability to retain and apply past experiences. However, people and other animals can learn new things continuously without forgetting them. To overcome this problem, we devised an architecture that preserves significant task-specific connections by combining selective neuron freezing with Hebbian learning principles. Hebbian learning enables the network to adaptively strengthen synaptic connections depending on parameter activation. It is inspired by the synaptic plasticity seen in brains. By preserving the most important neurons using selective neuron freezing, new tasks can be trained without changing them. Experiments conducted on standard datasets show that our model significantly reduces the risk of catastrophic forgetting, allowing the network to learn continually.

3059Enhanced Semantic Alignment in Transformer Tracking via Position Learning and Force-Directed Attention

[openreview] [pdf]

Abstract In the field of visual object tracking, one-stream pipelines have become the mainstream framework due to its efficient integration of feature extraction and relationship modeling. However, existing methods still face the issue of semantic misalignment: firstly, the interaction of positional encoding between the two branches leads to a misalignment between feature semantics and position encoding; secondly, traditional attention mechanisms fail to distinguish between semantic attraction and repulsion among features, resulting in semantic misalignment when the model processes these features. To address these issues, we propose an Enhanced Semantic Alignment Transformer Tracker (ESAT) with position encode learning and force-directed attention mechanism. By leveraging positional encoding loss, ESAT separately learns the absolute positional encodings of the target and search branches, distinguishing the locations of various tokens and their positive or negative relationships, thereby enhancing the semantic consistency between position and features. Additionally, it incorporates a repulsion-attraction mechanism applied to the self-attention module, simulating dynamic interactions between nodes to improve feature discrimination. Extensive experiments on multiple public tracking datasets show that our method outperforms many pipelines and achieves superior performance on five challenging benchmarks.

3060ReGenesis: LLMs can Grow into Reasoning Generalists via Self-Improvement

[openreview] [pdf]

Abstract Post-training Large Language Models (LLMs) with explicit reasoning trajectories can enhance their reasoning abilities. However, acquiring such high-quality trajectory data typically demands meticulous supervision from humans or superior models, which can be either expensive or license-constrained. In this paper, we explore how far an LLM can improve its reasoning by self-synthesizing reasoning paths as training data without any additional supervision. Existing self-synthesizing methods, such as STaR, suffer from poor generalization to out-of-domain (OOD) reasoning tasks. We hypothesize it is due to that their self-synthesized reasoning paths are too task-specific, lacking general task-agnostic reasoning guidance. To address this, we proposeReasoning Generalist via Self-Improvement (ReGenesis), a method toself-synthesize reasoning paths as post-training data by progressing from abstract to concrete. More specifically, ReGenesis self-synthesizes reasoning paths by converting general reasoning guidelines into task-specific ones, generating reasoning structures, and subsequently transforming these structures into reasoning paths, without the need for human-designed task-specific examples used in existing methods. We show that ReGenesis achieves superior performance on all in-domain and OOD settings tested compared to existing methods. For six OOD tasks specifically, while previous methods exhibited an average performance decrease of approximately 4.6% after post training, ReGenesis delivers around 6.1% performance improvement. We also conduct an in-depth analysis of our framework and show ReGenesis is effective across various language models and design choices.

3061How Much is Unseen Depends Chiefly on Information About the Seen

[openreview] [pdf]

Abstract We find thatin expectationthe missing mass, i.e., the proportion of data points in an unknown population---that belong to classes thatdo notappear in the training data---is entirely determined by the number fkf_k of classes thatdoappear in the training data the same number of times and an exponentially decaying error. While this is the first precise characterization of the expected missing mass in terms of the sample, the induced estimator suffers from an impractically high variance. However, our theory suggests a large search space of nearly unbiased estimators that can be searched effectively and efficiently. Hence, we cast distribution-free estimation as an optimization problem to find a distribution-specific estimator with a minimized mean-squared error (MSE), given only the sample. In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators’ MSE is roughly 80% of the Good-Turing estimator’s.

3062Efficient Action-Constrained Reinforcement Learning via Acceptance-Rejection Method and Augmented MDPs

[openreview] [pdf]

Abstract Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisfaction but at the cost of either high computational burden incurred by the quadratic programs (QP) or increased architectural complexity due to the use of sophisticated generative models. In this paper, we propose a generic and computationally efficient framework that can adapt a standard unconstrained RL method to ACRL through two modifications: (i) To enforce the action constraints, we leverage the classic acceptance-rejection method, where we treat the unconstrained policy as the proposal distribution and derive a modified policy with feasible actions. (ii) To improve the acceptance rate of the proposal distribution, we construct an augmented two-objective Markov decision process (MDP), which include additional self-loop state transitions and a penalty signal for the rejected actions. This augmented MDP incentives the learned policy to stay close to the feasible action sets. Through extensive experiments in both robot control and resource allocation domains, we demonstrate that the proposed framework enjoys faster training progress, better constraint satisfaction, and a lower action inference time simultaneously than the state-of-the-art ACRL methods.

3063Stochastic Steepest Descent with Acceleration forℓp-Smooth Non-Convex Optimization

[openreview] [pdf]

Abstract In this work, we analyze stochastic p\ell_p steepest descent for non-convex problems. Specifically, for p>2p > 2, we establish ϵ\epsilon-approximate stationarity (in expectation) with respect to the dual norm pp\Vert\cdot\Vert_{p^*}^{p^*} at a rate of O(ϵ4)O(\epsilon^{-4}), thereby generalizing the previous guarantees for signSGD (p=p=\infty). In addition, inspired by techniques for the convex setting, we present a new accelerated p\ell_p descent method, called Stacey, based on interpolated primal-dual iterate sequences that are designed for non-Euclidean smooth optimization settings. We compare our algorithm against popular methods such as SGD, Adam, AdamW, and Lion on image classification and pretraining language modeling tasks, and our results demonstrate the potential for both faster convergence and achieving higher accuracy. We further evaluate our algorithm for different values of pp across various models and datasets, highlighting the importance and efficiency of non-Euclidean methods as compared to standard Euclidean-based approaches.

3064Exploring Knowledge Boundaries in Large Language Models for Retrieval Judgment

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly recognized for their practical applications. However, these models often encounter challenges in dynamically changing knowledge, as well as in managing unknown static knowledge. Retrieval-Augmented Generation (RAG) tackles this challenge and has shown a significant impact on LLMs. Actually, we find that the impact of RAG on the question answering capabilities of LLMs can be categorized into three groups: beneficial, neutral, and harmful. By minimizing retrieval requests that yield neutral or harmful results, we can effectively reduce both time and computational costs, while also improving the overall performance of LLMs. This insight motivates us to differentiate between types of questions using certain metrics as indicators, to decrease the retrieval ratio without compromising performance. In our work, we propose a method that is able to identify different types of questions from this view by training a Knowledge Boundary Model (KBM). Experiments conducted on 11 English and Chinese datasets illustrate that the KBM effectively delineates the knowledge boundary, significantly decreasing the proportion of retrievals required for optimal end-to-end performance. Specifically, we evaluate the effectiveness of KBM in three complex scenarios: dynamic knowledge, long-tail static knowledge, and multi-hop problems, as well as its functionality as an external LLM plug-in.

3065Binary Hypothesis Testing for Softmax Models and Leverage Score Models

[openreview] [pdf]

Abstract Softmax distributions are widely used in machine learning, including Large Language Models (LLMs) where the attention unit uses softmax distributions. We abstract the attention unit as the softmax model, where given a vector input, the model produces an output drawn from the softmax distribution (which depends on the vector input). We consider the fundamental problem of binary hypothesis testing in the setting of softmax models. That is, given an unknown softmax model, which is known to be one of the two given softmax models, how many queries are needed to determine which one is the truth? We show that the sample complexity is asymptotically O(ϵ2)O(\epsilon^{-2}) where ϵ\epsilon is a certain distance between the parameters of the models.Furthermore, we draw analogy between the softmax model and the leverage score model, an important tool for algorithm design in linear algebra and graph theory. The leverage score model, on a high level, is a model which, given vector input, produces an output drawn from a distribution dependent on the input. We obtain similar results for the binary hypothesis testing problem for leverage score models.

3066Let Large Language Models Find the Data to Train Themselves

[openreview] [pdf]

Abstract The current iterative development process for large language models (LLMs) is heavily data-centric, relying on human researchers and engineers to manually analyze model performance and determine what data to acquire for further training. However, this human-supervised approach is costly and may fail to identify optimal training signals. Its scalability is further limited as models become increasingly capable and may eventually exceed human intelligence. To address these issues, we propose an automated framework that enables models to autonomously discover and strategically acquire the most valuable training data to enhance their performance. It establishes a self-improving framework where models can invoke APIs to crawl and/or generate tailored datasets from various resources and environments, and retrain themselves. The data selection decisions are shaped by reinforcement feedback signals that reward performance gains while penalizing computational overhead. This formulation incentivizes models to develop self-knowledge about their strengths and areas for improvement in order to efficiently select training data. Empirical results demonstrate that LLMs operating within our framework are able to autonomously and strategically acquire valuable training data to enhance their performance across a variety of skills in 1,000 diverse in-house test tasks and three public benchmarks.

3067Reward Adaptation Via Q-Manipulation

[openreview] [pdf]

Abstract In this paper, we propose a new solution to reward adaptation (RA), the problem where the learning agent adapts to a target reward function based on one or multiple existing behaviors learned a priori under the same domain dynamics but different reward functions. RA has many applications, such as adapting an autonomous driving agent that can already operate either fast or safe to operating both fast and safe. Learning the target behavior from scratch is possible but often inefficient given the available source behaviors. Our work represents a new approach to RA via the manipulation of Q-functions. Assuming that the target reward function is a known function of the source reward functions, our approach to RA computes bounds of the Q function. We introduce an iterative process to tighten the bounds, similar to value iteration. This enables action pruning in the target domain before learning even starts. We refer to such a method as Q-Manipulation (Q-M). We formally prove that our pruning strategy does not affect the optimality of the returned policy while empirically show that it improves the sample complexity. Comparison with baselines is performed in a variety of synthetic and simulation domains to demonstrate its effectiveness and generalizability.

3068SOO-Bench: Benchmarks for Evaluating the Stability of Offline Black-Box Optimization

[openreview] [pdf]

Abstract Black-box optimization aims to find the optima through building a model close to the black-box objective function based on function value evaluation. However, in many real-world tasks, such as design of molecular formulas and mechanical structures, it is perilous, costly, or even infeasible to evaluate the objective function value of an actively sampled solution. In this situation, optimization can only be conducted via utilizing offline historical data, which yields offline black-box optimization. Different from the traditional goal that is to pursue the optimal solution, this paper at first discloses that the goal of offline optimization is to stably surpass the offline dataset during optimization procedure. Although benchmarks called Design-Bench already exist in this emerging field, it can hardly evaluate the stability of offline optimization, and mainly provides real-world offline tasks and the corresponding offline datasets. To this end, this paper proposes benchmarks named SOO-Bench (i.e., Stable Offline Optimization Benchmarks) for offline black-box optimization algorithms, so as to evaluate the stability of surpassing the offline dataset under different data distributions. Along with SOO-Bench, we also propose a stability indicator to measure the degree of stability. Specifically, SOO-Bench includes various real-world offline optimization tasks and offline datasets under different data distributions, involving the fields of satellites, materials science, structural mechanics and automobile manufacturing. Empirically, baseline and state-of-the-art algorithms are tested and analyzed on SOO-Bench. Hopefully, SOO-Bench is expected to serve as a catalyst for rapid developments of more novel and stable offline optimization methods. The code is available athttps://anonymous.4open.science/r/SOO-Bench-9025.

3069How vulnerable is my learned policy? Adversarial attacks on modern behavioral cloning policies

[openreview] [pdf]

Abstract Learning from Demonstration (LfD) algorithms have shown promising results in robotic manipulation tasks, but their vulnerability to adversarial attacks remains underexplored. This paper presents a comprehensive study of adversarial attacks on both classic and recently proposed algorithms, including Behavior Cloning (BC), LSTM-GMM, Implicit Behavior Cloning (IBC), Diffusion Policy (DP), and VQ-Behavior Transformer (VQ-BET). We study the vulnerability of these methods to untargeted, targeted and universal adversarial perturbations. While explicit policies, such as BC, LSTM-GMM and VQ-BET can be attacked in the same manner as standard computer vision models, we find that attacks for implicit and denoising policy models are nuaced and require developing novel attack methods. Our experiments on several simulated robotic manipulation tasks reveal that most of the current methods are highly vulnerable to adversarial perturbations. We also investigate the transferability of attacks across algorithms and architectures, providing insights into the generalizability of adversarial perturbations in LfD. We find that, the success rate of the transfer attacks is highly dependent on the task, raising necessity for more fine-grained metrics that capture intricate details of adversarial weakness of the state distribution. In summary, our findings highlight the vulnerabilities of modern BC algorithms, paving way for future work in addressing such limitations.

3070IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models

[openreview] [pdf]

Abstract Fine-tuning large-scale text-to-image diffusion models for various downstream tasks has yielded impressive results. However, the heavy computational burdens of tuning large models prevent personal customization. Recent advances have attempted to employ parameter-efficient fine-tuning (PEFT) techniques to adapt the floating-point (FP) or quantized pre-trained weights. Nonetheless, the adaptation parameters in existing works are still restricted to FP arithmetic, hindering hardware-friendly acceleration. In this work, we propose IntLoRA, to further push the efficiency limits by using integer type (INT) low-rank parameters to adapt the quantized diffusion models. By working in the integer arithmetic, our IntLoRA offers three key advantages: (i) for fine-tuning, the pre-trained weights are quantized, reducing memory usage; (ii) for storage, both pre-trained and low-rank weights are in INT which consumes less disk space; (iii) for inference, IntLoRA weights can be naturally merged into quantized pre-trained weights through efficient integer multiplication or bit-shifting, eliminating additional post-training quantization. Extensive experiments demonstrate that IntLoRA can achieve performance on par with or even superior to the vanilla LoRA, accompanied by significant efficiency improvements.

3071Root Cause Analysis of Anomalies in Multivariate Time Series through Granger Causal Discovery

[openreview] [pdf]

Abstract Identifying the root causes of anomalies in multivariate time series is challenging due to the complex dependencies among the series. In this paper, we propose a comprehensive approach called AERCA that inherently integrates Granger causal discovery with root cause analysis. By defining anomalies as interventions on the exogenous variables of time series, AERCA not only learns the Granger causality among time series but also explicitly models the distributions of exogenous variables under normal conditions. AERCA then identifies the root causes of anomalies by highlighting exogenous variables that significantly deviate from their normal states. Experiments on multiple synthetic and real-world datasets demonstrate that AERCA can accurately capture the causal relationships among time series and effectively identify the root causes of anomalies.

3072Neural Dueling Bandits: Principled Preference-Based Optimization with Non-Linear Reward Function

[openreview] [pdf]

Abstract Contextual dueling bandit is used to model the bandit problems, where a learner’s goal is to find the best arm for a given context using observed noisy preference feedback over the selected arms for the past contexts. However, existing algorithms assume the reward function is linear, which can be complex and non-linear in many real-life applications like online recommendations or ranking web search results. To overcome this challenge, we use a neural network to estimate the reward function using preference feedback for the previously selected arms. We propose upper confidence bound- and Thompson sampling-based algorithms with sub-linear regret guarantees that efficiently select arms in each round. We also extend our theoretical results to contextual bandit problems with binary feedback, which is in itself a non-trivial contribution. Experimental results on the problem instances derived from synthetic datasets corroborate our theoretical results.

3073Salvage: Shapley-distribution Approximation Learning Via Attribution Guided Exploration for Explainable Image Classification

[openreview] [pdf]

Abstract The integration of deep learning into critical vision application areas has given rise to a necessity for techniques that can explain the rationale behind predictions. In this paper, we address this need by introducing Salvage, a novel removal-based explainability method for image classification. Our approach involves training an explainer model that learns the prediction distribution of the classifier on masked images. We first introduce the concept of Shapley-distributions, which offers a more accurate approximation of classification probability distributions than existing methods. Furthermore, we address the issue of unbalanced important and unimportant features. In such settings, naive uniform sampling of feature subsets often results in a highly unbalanced ratio of samples with high and low prediction likelihoods, which can hinder effective learning. To mitigate this, we propose an informed sampling strategy that leverages approximated feature importance scores, thereby reducing imbalance and facilitating the estimation of underrepresented features. After incorporating these two principles into our method, we conducted an extensive analysis on the ImageNette, MURA, and Pet datasets. The results show that Salvage outperforms various baseline explainability methods, including attention-, gradient-, and removal-based approaches, both qualitatively and quantitatively. Furthermore, we demonstrate that our explainer model can serve as a fully explainable classifier without a major decrease in classification performance, paving the way for fully explainable image classification.

3074FusionMaestro: Harmonizing Early Fusion, Late Fusion, and LLM Reasoning for Multi-Granular Table-Text Retrieval

[openreview] [pdf]

Abstract Table-text retrieval aims to retrieve relevant tables and text to support open-domain question answering. Existing studies use either early or late fusion, but face limitations. Early fusion pre-aligns a table row with its associated passages, forming ``stars," which often include irrelevant contexts and miss query-dependent relationships. Late fusion retrieves individual nodes, dynamically aligning them, but it risks missing relevant contexts. Both approaches also struggle with advanced reasoning tasks, such as column-wise aggregation and multi-hop reasoning. To address these issues, we propose FusionMaestro, which combines the strengths of both approaches. First, the edge-based bipartite subgraph retrieval identifies finer-grained edges between table segments and passages, effectively avoiding the inclusion of irrelevant contexts. Then, the query-relevant node expansion identifies the most promising nodes, dynamically retrieving relevant edges to grow the bipartite subgraph, minimizing the risk of missing important contexts. Lastly, the star-based LLM refinement performs logical inference at the star subgraph rather than the bipartite subgraph, supporting advanced reasoning tasks. Experimental results show that FusionMaestro outperforms state-of-the-art models with a significant improvement up to 42.6% and 39.9% in recall and nDCG, respectively, on the OTT-QA benchmark.

3075Training Robust Ensembles Requires Rethinking Lipschitz Continuity

[openreview] [pdf]

Abstract Transferability of adversarial examples is a well-known property that endangers all classification models, even those that are only accessible through black-box queries. Prior work has shown that an ensemble of models is more resilient to transferability: the probability that an adversarial example is effective against most models of the ensemble is low. Thus, most ongoing research focuses on improving ensemble diversity. Another line of prior work has shown that Lipschitz continuity of the models can make models more robust since it limits how a model’s output changes with small input perturbations. {\em In this paper, we study the effect of Lipschitz continuity on transferability rates.} We show that although a lower Lipschitz constant increases the robustness of a single model, it is not as beneficial in training robust ensembles as it increases the transferability rate of adversarial examples across models in the ensemble. Therefore, we introduce LOTOS, a new training paradigm for ensembles, which counteracts this adverse effect. It does so by promoting orthogonality among the top-kk sub-spaces of the transformations of the corresponding affine layers of any pair of models in the ensemble. We theoretically show that kk does not need to be large for convolutional layers, which makes the computational overhead negligible. Through various experiments, we show LOTOS increases the robust accuracy of ensembles of ResNet-18 models by 6 percentage points (p.p) against black-box attacks on CIFAR-10. It is also capable of combining with the robustness of prior state-of-the-art methods for training robust ensembles to enhance their robust accuracy by 10.7 p.p.

3076KARPA: A Training-free Method of Adapting Knowledge Graph as References for Large Language Model’s Reasoning Path Aggregation

[openreview] [pdf]

Abstract Large language models (LLMs) demonstrate exceptional performance across a variety of tasks, yet they are often affected by hallucinations and the timeliness of knowledge. Leveraging knowledge graphs (KGs) as external knowledge sources has emerged as a viable solution, but existing methods for LLM-based knowledge graph question answering (KGQA) are often limited by step-by-step decision-making on KGs, restricting the global planning and reasoning capabilities of LLMs, or they require fine-tuning or pre-training on specific KGs. To address these challenges, we propose Knowledge graph Assisted Reasoning Path Aggregation (KARPA), a novel framework that harnesses the global planning abilities of LLMs for efficient and accurate KG reasoning on KGs. KARPA operates through a three-step process: pre-planning, retrieving, and reasoning. First, KARPA uses the LLM’s global planning ability to pre-plan logically coherent relation paths based on the provided question and relevant relations within the KG. Next, in the retrieving phase, relation paths with high semantic similarity to the pre-planned paths are extracted as candidate paths using a semantic embedding model. Finally, these candidate paths are provided to the LLM for comprehensive reasoning. Unlike existing LLM-based KGQA methods, KARPA fully leverages the global planning and reasoning capabilities of LLMs without requiring stepwise traversal or additional training, and it is compatible with various LLM architectures. Extensive experimental results show that KARPA achieves state-of-the-art performance in KGQA tasks, delivering both high efficiency and accuracy.

3077PLUM: Improving Code LMs Using On-Policy Preference Learning Powered by Automatic Test Cases

[openreview] [pdf]

Abstract Preference learning provides a promising solution to address the limitations of supervised fine-tuning (SFT) for code language models, where the model is not explicitly trained to differentiate between correct and incorrect code. Recent findings demonstrate that on-policy data is the key to successful preference learning, where the preference data is collected using the same policy LM being trained. Inspired by this, we propose PLUM, an on-policy P\textbf{P}reference L\textbf{L}earning framework Au\textbf{u}gmented with test cases for code LM\textbf{M}s. The framework operates in three key stages: (1) automatic generation of test cases from natural language instructions, (2) creation of a preference data by evaluating candidate code solutions sampled from the policy, which can then be used to (3) train the policy LM. PLUM levitates the need to train reward models, allowing for large scale on-policy and online preference data collation.PLUM is evaluated on both standard benchmarks (HumanEval, MBPP) and more challenging ones (LiveCodeBench), delivering substantial improvements over original SFT’ed models and other execution-feedback-driven approaches. We show PLUM benefits are consistent across various widely-used code LMs even they have been well-trained with SFT. For example, PLUM increases pass rates by up to 4.8% on average on standard benchmarks and 11.8% on LiveCodeBench, demonstrating its effectiveness and generalizability. We also demonstrate the benefits of on-policy and online preference learning

3078LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization

[openreview] [pdf]

Abstract Low-rank adaption (LoRA) is a widely used parameter-efficient finetuning method for LLM that reduces memory requirements. However, current LoRA optimizers lack transformation invariance, meaning the updates depending on how the two LoRA factors are scaled or rotated. This deficiency leads to inefficient learning and sub-optimal solutions in practice. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization, which can achieve transformation invariance and remain computationally efficient. We provide theoretical analysis to demonstrate the benefit of our method and conduct experiments on various LLM tasks with different models including Gemma 2B, 7B, and mT5-XXL. The results demonstrate consistent improvements against existing optimizers. For example, replacing Adam with LoRA-RITE during LoRA fine-tuning of Gemma-2B yielded 4.6% accuracy gain on Super-Natural Instructions and 3.5% accuracy gain across other four LLM benchmarks (HellaSwag, ArcChallenge, GSM8K, OpenBookQA).

3079HELPFUL-ONLY LARGE LANGUAGE MODEL

[openreview] [pdf]

Abstract To know your enemy, you must become your enemy. Sun Tzu stated in The Art of War\textit{The Art of War}. Often, it is crucial to synthesize data containing harmful content using large language models (LLMs) in order to train harmless LLMs. Methods by which synthesized data can be utilized include using it as training data to provide negative signals to the model, as automatic red-teaming data to identify vulnerabilities of the model and more. However, aligned LLMs struggle to generate harmful responses. In this paper, we propose the refusal-free\textit{refusal-free} training method to reach a Helpful-Only LLM\textbf{Helpful-Only LLM} that maintains the helpfulness of the state-of-the-art (SOTA) LLMs while allowing harmful response generation. The refusal-free\textit{refusal-free} training method filters the instances that refuse an user’s request from the datasets. We demonstrate that the refusal-free\textit{refusal-free} training dramatically decreases the rate at which the LLM generates refusal responses (refusal rate) by 60.12% without sacrificing its helpfulness. Also, we are aware of the possibility that the progress in this direction could lead to irreversible consequences. A powerful model that does not reject harmful requests and executes them all could be exploited for illicit purposes such as the creation of indiscriminate weapons or hacking. However, once again, we believe it is important to be the one to break an LLM and study how an LLM can be broken in advance, including understanding the boundaries a Helpful-Only LLM\textbf{Helpful-Only LLM} can reach and identifying its inherent tendencies. We emphasize that this study is wholly for academic purpose and is aimed at paving the way toward a harmless LLM. This study calls for the researchers to acknowledge the potential failures of LLMs and take steps to prevent the breakdowns. Content Warning:\textbf{Content Warning:} This paper contains examples that may be offensive in nature, and reader discretion is recommended.

3080Preference-Driven Spatial-Temporal Counting Process Models

[openreview] [pdf]

Abstract Traditional spatial-temporal models often overlook the complex decision-making processes and social factors that shape spatial-temporal event data generated by humans. This paper introduces a novel framework that integrates choice theory with social intelligence to model and analyze counting processes, such as crime occurrences or bike-sharing activity, where the observed discrete events result from individual decisions influenced by social dynamics. Our approach aims to uncover latent human preference patterns, represented by utility functions, to capture the diverse decision-making factors within a population that result in the observed event counts. These latent factors help explain how choices—such as where and when to commit a crime—are shaped by personal preferences, environmental conditions, and social influences. By modeling the aggregate outcomes of these individual choices, we can better understand and predict patterns in counting processes. The proposed model adopts a preference-driven approach to counting data, providing interpretable insights at a detailed level. It also enables in-depth analysis of how external interventions, like law enforcement actions or policy changes, influence individual decisions and how these effects spread through the system. Empirical evaluation of crime and bike-sharing datasets demonstrates our model’s ability to offer clear insights and achieve high predictive accuracy.

3081Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count

[openreview] [pdf]

Abstract Transformer-based language models often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training. While arithmetic tasks are commonly used to study length generalization, certain tasks are considered notoriously difficult, e.g., multi-operand addition (which requires generalization over both the number of operands and their lengths) and multiplication (which requires generalization over both operand lengths). In this paper, we achieve approximately 2--3×\times length generalization on both tasks, which is the first such achievement in arithmetic Transformers. To this end, we design task-specific scratchpads enabling the model to focus on a fixed number of tokens per each next-token prediction step, and then apply multi-level versions of Position Coupling (Cho et al., 2024; McLeish et al., 2024) to offer Transformers information about the right position to attend to. On the theory side, we prove that a 1-layer Transformer using our method can solve multi-operand addition, up to operand length and operand count that are exponential in embedding dimension.

3082Conditionally Adaptive Graph Attention Networks for Credit Card Fraud Detection

[openreview] [pdf]

Abstract Fraudulent transactions have been on the rise, leading to significant financial losses annually. In credit card fraud detection (CCFD), various predictive models aim to mitigate these losses by assessing transaction risk. While GNN-based methods have been employed to capture spatio-temporal transaction features, they often suffer from oversmoothing as graph layers increase, causing fraudulent and legitimate transactions to become indistinguishable. Existing semi-supervised methods that mask some labels have not fully resolved this issue. To address this, we propose the Multi-head Attention Conditional Variational Autoencoder (Ma-CVAE), which leverages weight distributions from imbalanced datasets and the Gumbel softmax distribution to construct more diverse reconstructed features, reducing feature homogenization. Then, We utilize Temporal Graph Attention Networks (TGAT) with a Multi-Attention mechanism to model risk propagation among transactions. Finally, classification probabilities are mapped to risk scores via a Multi-Layer Perceptron (MLP). Our approach achieves state-of-the-art performance, improving AUC scores by 1.45%, 3.05%, and 0.83% on three semi-supervised datasets: FFSD, YelpChi, and Amazon, respectively.

3083TC-MoE: Augmenting Mixture of Experts with Ternary Expert Choice

[openreview] [pdf]

Abstract The Mixture of Experts (MoE) architecture has emerged as a promising solution for reducing computational overhead by selectively activating subsets of model parameters. The effectiveness of MoE models is primarily dependent on their routing mechanisms, with the widely adopted Top-K routing scheme used to activate experts. However, the Top-K scheme has notable limitations, including unnecessary activations and underutilization of existing experts. In this work, rather than modifying the routing mechanism as in previous studies, we propose Ternary Choice MoE (TC-MoE), a novel approach that expands the expert space by multiplying each expert with the ternary set {-1, 0, 1}. This expansion allows for more efficient and effective expert activations without incurring significant computational cost. Additionally, given the unique characteristics of the expanded expert space, we introduce a new load balancing loss and reward loss to ensure workload balance and achieve a flexible trade-off between effectiveness and efficiency. Extensive experiments demonstrate that TC-MoE achieves an average improvement of more than 0.95% over the traditional approaches, while reducing the average number of activated experts by 9%. These results confirm that TC-MoE effectively address the inefficiencies of classical routing schemes, offering a more efficient and scalable solution for MoE-based large language models.

3084PromptSFL: Improving Visual Prompt Tuning For Split Federated Learning

[openreview] [pdf]

Abstract Conflict arises due to the disparity between the substantial resource demands of pre-trained models and the limited available resources of federated learning (FL) participants. Split learning presents a viable approach for adapting pre-trained models to FL, involving the allocation of a small portion of the pre-trained model to clients while deploying the remaining part on a server. Moreover, the application of Visual Prompt Tuning (VPT) to pre-trained models has shown state-of-the-art performances in parameter-efficient fine-tuning methods. However, VPT exhibits unsatisfactory performance in split federated learning (SFL) compared to its performance in centralized learning. In this paper, we first identify that VPT falls short of expectations in SFL due to the insufficient generalization capability of clients. To address this issue, we propose PromptSFL, which aligns the feature spaces of prompts between clients and the server to adapt VPT for SFL. PromptSFL transmits the final prompts in clients, termed skip prompts, to the first prompts in the server, enabling clients to extract more common features from the server. Additionally, we introduce a linear layer to map the prompts from clients to the feature space in the server during this skipping process, preventing the prompts of clients from overfitting to local datasets. Moreover, to enhance the convergence speed of SFL, PromptSFL employs an adaptive learning rate for clients. Extensive experiments demonstrate the effectiveness and efficiency of PromptSFL.

3085Fast Adversarial Training against Sparse Attacks Requires Loss Smoothing

[openreview] [pdf]

Abstract This paper studies fast adversarial training against sparse adversarial perturbations. We highlight the challenges faced when employing 1-step attacks on l0l_0 bounded perturbations for fast adversarial training, including degraded performance and the occurrence of catastrophic overfitting (CO). We highlight that CO in l0l_0 adversarial training is caused by sub-optimal perturbation locations of 1-step attack, which is distinct from other cases. Theoretical and empirical analyses reveal that the loss landscape of l0l_0 adversarial training is more craggy compared to its ll_\infty, l2l_2 and l1l_1 counterparts. Moreover, we corroborate that the craggy loss landscape can aggravate CO. To address these issues, we propose Fast-LS-l0l_0 that incorporates soft label and the trade-off loss function to smooth the adversarial loss landscape. Extensive experiments demonstrate our method can overcome the challenge of catastrophic overfitting, achieves state-of-the-art performance and narrows down the performance gap between 1-step and multi-step adversarial training against sparse attacks.

3086Highly Efficient Self-Adaptive Reward Shaping for Reinforcement Learning

[openreview] [pdf]

Abstract Reward shaping is a technique in reinforcement learning that addresses the sparse-reward problem by providing more frequent and informative rewards. We introduce a self-adaptive and highly efficient reward shaping mechanism that incorporates success rates derived from historical experiences as shaped rewards. The success rates are sampled from Beta distributions, which dynamically evolve from uncertain to reliable values as data accumulates. Initially, the shaped rewards exhibit more randomness to encourage exploration, while over time, the increasing certainty enhances exploitation, naturally balancing exploration and exploitation. Our approach employs Kernel Density Estimation (KDE) combined with Random Fourier Features (RFF) to derive the Beta distributions, providing a computationally efficient, non-parametric, and learning-free solution for high-dimensional continuous state spaces. Our method is validated on various tasks with extremely sparse rewards, demonstrating notable improvements in sample efficiency and convergence stability over relevant baselines.

3087Bayesian Regularization of Latent Representation

[openreview] [pdf]

Abstract The effectiveness of statistical and machine learning methods often depends on how well data features are characterized. Developing informative and interpretable latent representations with controlled complexity is essential for visualizing data structure and for facilitating efficient model building through dimensionality reduction. Latent variable models, such as Gaussian Process Latent Variable Models (GP-LVM), have become popular for learning complex, nonlinear representations as an alternative to Principal Component Analysis (PCA). In this paper, we propose a novel class of latent variable models based on the recently introduced Q-exponential process (QEP), which generalizes GP-LVM with a tunable complexity parameter, q>0q>0. Our approach, the \emph{Q-exponential Process Latent Variable Model (QEP-LVM)}, subsumes GP-LVM as a special case when q=2q=2, offering greater flexibility in managing representation complexity while enhancing interpretability. To ensure scalability, we incorporate sparse variational inference within a Bayesian training framework. We establish connections between QEP-LVM and probabilistic PCA, demonstrating its superior performance through experiments on datasets such as the Swiss roll, oil flow, and handwritten digits.

3088Granularity Matters in Long-Tail Learning

[openreview] [pdf]

Abstract Balancing training on long-tail data distributions remains a long-standing challenge in deep learning. While methods such as re-weighting and re-sampling help alleviate the imbalance issue, limited sample diversity continues to hinder models from learning robust and generalizable feature representations, particularly for tail classes. In contrast to existing methods, we offer a novel perspective on long-tail learning, inspired by an observation: datasets with finer granularity tend to be less affected by data imbalance. In this paper, we investigate this phenomenon through both quantitative and qualitative studies, showing that increased granularity enhances the generalization of learned features in tail categories. Motivated by these findings, we propose a method to increase dataset granularity through category extrapolation. Specifically, we introduce open-set auxiliary classes that are visually similar to existing ones, aiming to enhance representation learning for both head and tail classes. This forms the core contribution and insight of our approach. To automate the curation of auxiliary data, we leverage large language models (LLMs) as knowledge bases to search for auxiliary categories and retrieve relevant images through web crawling. To prevent the overwhelming presence of auxiliary classes from disrupting training, we introduce a neighbor-silencing loss that encourages the model to focus on class discrimination within the target dataset. During inference, the classifier weights for auxiliary categories are masked out, leaving only the target class weights for use. Extensive experiments and ablation studies on three standard long-tail benchmarks demonstrate the effectiveness of our approach, notably outperforming strong baseline methods that use the same amount of data. The code will be made publicly available.

3089Sample-Imagined Generator: Efficient Virtual Sample Generation Method for Off-policy Reinforcement Learning with Sparse Rewards

[openreview] [pdf]

Abstract Off-policy reinforcement learning (RL) requires extensive real interaction with environment to gain experience for policy learning, presenting a challenge of low sample efficiency, especially in the condition of sparse rewards. To address this, we propose a Sample-Imagined Generator (SIG) which automatically trains a sample generator during environment interaction and could adaptively generate valuable imagined samples for policy learning. Through SIG, the policy greatly reduced the interaction with the environment during training and achieved comparable or even higher performance with those trained only through real interactions. SIG could be combined with any off-policy RL algorithm. Experiment in 5 continuous control tasks demonstrate that by substituting imagined samples for real ones to supplement the experience pool, SIG accomplishes tasks with significantly less interaction with the environment, notably improving sample efficiency across 10 off-policy reinforcement learning algorithms.

3090Promptriever: Instruction-Trained Retrievers Can Be Prompted Like Language Models

[openreview] [pdf]

Abstract Instruction-tuned language models (LM) are able to respond to imperative commands, providing a more natural user interface compared to their base counterparts. In this work, we present Promptriever, the first retrieval model able to be prompted like an LM. To train Promptriever, we curate and release a new instance-level instruction training set from MS MARCO, spanning nearly 500k instances. Promptriever not only achieves strong performance on standard retrieval tasks, but also follows instructions. We observe: (1) large gains (reaching SoTA) on following detailed relevance instructions (+14.3 p-MRR / +3.1 nDCG on FollowIR), (2) significantly increased robustness to lexical choices/phrasing in the query+instruction (+12.9 Robustness@10 on InstructIR), and (3) the ability to perform hyper-parameter search via prompting to reliably improve retrieval performance (+1.4 average increase on BEIR). Promptriever demonstrates that retrieval models can be controlled with prompts on a per-query basis, setting the stage for future work aligning LM prompting techniques with information retrieval.

3091Covariances for Free: Exploiting Mean Distributions for Federated Learning with Pre-trained Models

[openreview] [pdf]

Abstract Using pre-trained models has been found to reduce the effect of data heterogeneity and speed up federated learning algorithms. Recent works have investigated the use of first-order statistics and second-order statistics to aggregate local client data distributions at the server and achieve very high performance without any training. In this work we propose a training-free method based on an unbiased estimator of class covariance matrices. Our method, which only uses first-order statistics in the form of class means communicated by clients to the server, incurs only a fraction of the communication costs required by methods based on communicating second-order statistics. We show how these estimated class covariances can be used to initialize a linear classifier, thus exploiting the covariances without actually sharing them. When compared to state-of-the-art methods which also share only class means, our approach improves performance in the range of 4-26% with exactly the same communication cost. Moreover, our method achieves performance competitive or superior to sharing second-order statistics with dramatically less communication overhead. Finally, using our method to initialize classifiers and then performing federated fine-tuning yields better and faster convergence.

3092Unisolver: PDE-Conditional Transformers Are Universal PDE Solvers

[openreview] [pdf]

Abstract Deep models have recently emerged as a promising tool to solve partial differential equations (PDEs), known as neural PDE solvers. While neural solvers trained from either simulation data or physics-informed loss can solve PDEs reasonably well, they are mainly restricted to a few instances of PDEs, e.g. a certain equation with a limited set of coefficients. This limits the generalization of neural solvers to diverse PDEs, impeding them from being practical surrogate models for numerical solvers. In this paper, we present the Universal PDE Solver (Unisolver) capable of solving a wide scope of PDEs by training a novel Transformer model on diverse data and conditioned on diverse PDEs. Instead of purely scaling up data and parameters, Unisolver stems from the theoretical analysis of the PDE-solving process. Our key finding is that a PDE solution is fundamentally under the control of a series of PDE components, e.g. equation symbols, coefficients, and boundary conditions. Inspired by the mathematical structure of PDEs, we define a complete set of PDE components and flexibly embed them as domain-wise (e.g. equation symbols) and point-wise (e.g. boundaries) conditions for Transformer PDE solvers. Integrating physical insights with recent Transformer advances, Unisolver achieves consistent state-of-the-art results on three challenging large-scale benchmarks, showing impressive performance gains and favorable PDE generalizability.

3093Characterizing linear convergence in optimization: Polyak-Łojasiewicz inequality and weak-quasi-strong-convexity

[openreview] [pdf]

Abstract We give a complete characterization of optimization problems that can be solved by gradient descent with a linear convergence rate. We show that the well-known Polyak-Łojasiewicz inequality is necessary and sufficient for linear convergence with respect to function values to the minimum, while a property that we call “weak-quasi-strong-convexity”, or WQSC, is necessary and sufficient for linear convergence with respect to distances of the iterates to an optimum.

3094Learning-Augmented Search Data Structures

[openreview] [pdf]

Abstract We study the integration of machine learning advice to improve upon traditional data structure designed for efficient search queries. Although there has been recent effort in improving the performance of binary search trees using machine learning advice, e.g., Lin et. al. (ICML 2022), the resulting constructions nevertheless suffer from inherent weaknesses of binary search trees, such as complexity of maintaining balance across multiple updates and the inability to handle partially-ordered or high-dimensional datasets. For these reasons, we focus on skip lists and KD trees in this work. Given access to a possibly erroneous oracle that outputs estimated fractional frequencies for search queries on a set of items, we construct skip lists and KD trees that provably provides the optimal expected search time, within nearly a factor of two. In fact, our learning-augmented skip lists and KD trees are still optimal up to a constant factor, even if the oracle is only accurate within a constant factor. We show that if the search queries follow the ubiquitous Zipfian distribution, then the expected search time for an item by our data structures is only a constant, independent of the total number nn of items, i.e., O(1)\mathcal{O}(1), whereas a traditional skip list or KD tree will have an expected search time of O(logn)\mathcal{O}(\log n). We also demonstrate robustness by showing that our data structures achieves an expected search time that is within a constant factor of an oblivious skip list/KD tree construction even when the predictions are arbitrarily incorrect. Finally, we empirically show that our learning-augmented search data structures outperforms their corresponding traditional analogs on both synthetic and real-world datasets.

3095Unveiling Context-Aware Criteria in Self-Assessing LLMs

[openreview] [pdf]

Abstract The use of large language models (LLMs) as evaluators has garnered significant attention due to their potential to rival human-level evaluations in long-form re- sponse assessments. However, current LLM evaluators rely heavily on static, human-defined criteria, limiting their ability to generalize across diverse gener- ative tasks and incorporate context-specific knowledge. In this paper, we pro- pose a novel Self-Assessing LLM framework that integrates Context-Aware Cri- teria (SALC) with dynamic knowledge tailored to each evaluation instance. This instance-level knowledge enhances the LLM evaluator’s performance by provid- ing relevant, context-aware insights that pinpoint the important criteria specific to the current instance. Additionally, the proposed framework adapts seamlessly to various tasks without relying on predefined human criteria, offering a more flex- ible evaluation approach. Empirical evaluations demonstrate that our approach significantly outperforms existing baseline evaluation frameworks, yielding im- provements ranging from 5% across a wide variety of datasets. Furthermore, by leveraging knowledge distillation techniques, we fine-tuned smaller language models for criteria generation and evaluation, achieving comparable or superior performance to larger models with much lower cost. Our method also exhibits a 5% improvement on the Alpaca leaderboard when employed for preference data generation in Direct Preference Optimization (DPO), underscoring its efficacy as a robust and scalable evaluation framework.

3096An Examination on the Effectiveness of Divide-and-Conquer Prompting in Large Language Models

[openreview] [pdf]

Abstract Foundation models, such as Large language Models (LLMs), have attracted significant amount of interest due to their large number of applications. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, simple instructional prompts suffer from inaccurate responses. Existing works show that more complicated prompting strategies, such as Chain-of-Thoughts and Least-to-Most, can unlock LLM’s powerful capacity in diverse areas. Recent researches reveal that simple divide-and-conquer prompting strategy, i.e. simply dividing the input sequence to multiple sub-inputs, can substantially improve LLM’s performance in some specific tasks such as misinformation detection. In this paper, we aim at understanding the utility of divide-and-conquer prompting strategy, i.e. on which kind of tasks this strategy gets advantages. Specifically, we provide a theoretic analysis to divide-and-conquer prompting strategy and help us identify the specific tasks where DaC prompting can bring performance boost with theoretic guarantee. We then present two cases (\textbf{large integer arithmetic and fact verification}) where experimental results aligns with our theoretic analysis.

[openreview] [pdf]

Abstract Pareto front profiling in multi-objective optimization (MOO), i.e. finding a diverse set of Pareto optimal solutions, is challenging, especially with expensive objectives that require training a neural network. Typically, in MOO for neural architecture search (NAS), we aim to balance performance and hardware metrics across devices. Prior NAS approaches simplify this task by incorporating hardware constraints into the objective function, but profiling the Pareto front necessitates a computationally expensive search for each constraint. In this work, we propose a novel NAS algorithm that encodes user preferences to trade-off performance and hardware metrics, yielding representative and diverse architectures across multiple devices in just a single search run. To this end, we parameterize the joint architectural distribution across devices and multiple objectives via a hypernetwork that can be conditioned on hardware features and preference vectors, enabling zero-shot transferability to new devices. Extensive experiments involving up to 19 hardware devices and 3 different objectives demonstrate the effectiveness and scalability of our method. Finally, we show that, without any additional costs, our method outperforms existing MOO NAS methods across a broad range of qualitatively different search spaces and datasets, including MobileNetV3 on ImageNet-1k, an encoder-decoder transformer space for machine translation and a decoder-only space for language modelling.

3098Predictive Uncertainty Quantification for Bird’s Eye View Segmentation: A Benchmark and Novel Loss Function

[openreview] [pdf]

Abstract The fusion of raw sensor data to create a Bird’s Eye View (BEV) representation is critical for autonomous vehicle planning and control. Despite the growing interest in using deep learning models for BEV semantic segmentation, anticipating segmentation errors and enhancing the explainability of these models remain under-explored. This paper introduces a comprehensive benchmark for predictive uncertainty quantification in BEV segmentation, evaluating multiple approaches across three popular datasets and two representative backbones. Our study focuses on the effectiveness of the quantified uncertainty in detecting misclassified and out-of-distribution (OOD) pixels, while also improving model calibration. Through empirical analysis, we uncover challenges in existing uncertainty quantification methods and demonstrate the potential of evidential deep learning techniques, which capture both aleatoric and epistemic uncertainty. To address these challenges, we propose a novel loss function, Uncertainty-Focal-Cross-Entropy (UFCE), specifically designed for highly imbalanced data, along with a simple uncertainty-scaling regularization term that improve both uncertainty quantification and model calibration for BEV segmentation.

3099Reinforcement Learning with Segment Feedback

[openreview] [pdf]

Abstract Classic reinforcement learning (RL) assumes that an agent can observe a reward for each state-action pair. However, in practical applications, it is often difficult and costly to collect a reward for each state-action pair. While there have been several works considering RL with trajectory feedback, it is unclear if trajectory feedback is inefficient for learning when trajectories are long. In this work, we propose a model named RL with segment feedback, which offers a general paradigm filling the gap between per-state-action feedback and trajectory feedback seemlessly. In this model, we consider an episodic Markov decision process (MDP), where each episode is equally divided into mm segments, and the agent observes reward feedback only at the end of each segment. Under this model, we study two popular feedback settings: binary feedback and sum feedback, where the agent observes a binary outcome and a reward sum according to the underlying reward function, respectively. To investigate the impacts of the number of segments mm on learning performance, we design efficient algorithms and establish regret upper and lower bounds for both feedback settings. Our theoretical and empirical results show that: under binary feedback, increasing the number of segments mm decreases the regret at an exponential rate; in contrast, surprisingly under sum feedback, increasing mm does not reduce the regret significantly.

3100SOAP: Improving and Stabilizing Shampoo using Adam

[openreview] [pdf]

Abstract There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo’s drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor --- a memory-efficient approximation of Adam --- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo’s preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm:ShampoOwithAdam in thePreconditioner’s eigenbasis (SOAP). With regards to improving Shampoo’s computational efficiency, the most straightforward approach would be to simply compute Shampoo’s eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available athttps://anonymous.4open.science/status/SOAP-F93B.

3101MixLinear: Extreme Low Resource Multivariate Time Series Forecasting with0.1kParameters

[openreview] [pdf]

Abstract In recent years, there has been a growing interest in Long-term Time Series Forecasting (LTSF), which involves predicting long-term future values by analyzing a large amount of historical time-series data to identify patterns and trends. There exist significant challenges in LTSF due to its complex temporal dependencies and high computational demands. Although the Transformer-based models offer high forecasting accuracy, they are often too compute-intensive to be deployed on devices with hardware constraints. On the other hand, the linear models aim to reduce the computational overhead by employing either decomposition methods in the time domain or compact representations in the frequency domain. In this paper, we propose MixLinear, an ultra-lightweight multivariate time series forecasting model specifically designed for resource-constrained environments. MixLinear effectively captures both temporal and frequency domain features by modeling intra-segment and inter-segment variations in the time domain and extracting frequency variations from a low-dimensional latent space in the frequency domain. By reducing the parameter scale of a downsampled nn-length input/output one-layer linear model from O(n2)O(n^2) to O(n)O(n), MixLinear achieves efficient computation without sacrificing accuracy. Extensive evaluations across four benchmark datasets demonstrate that MixLinear attains forecasting performance comparable to, or surpassing, state-of-the-art models with significantly fewer parameters (0.1K0.1K), which makes it well-suited for deployment on devices with limited computational capacity.

3102Does Vector Quantization Fail in Spatio-Temporal Forecasting? Exploring a Differentiable Sparse Soft-Vector Quantization Approach

[openreview] [pdf]

Abstract Spatio-temporal forecasting is crucial in various fields and requires a careful balance between identifying subtle patterns and filtering out noise. Vector quantization (VQ) appears well-suited for this purpose, as it quantizes input vectors into a set of codebook vectors or patterns. Although vector quantization (VQ) has shown promise in various computer vision tasks, it surprisingly falls short in enhancing the accuracy of spatio-temporal forecasting. We attribute this to two main issues: inaccurate optimization due to non-differentiability and limited representation power in hard VQ. To tackle these challenges, we introduce Differentiable Sparse Soft-Vector Quantization (SVQ), the first VQ method to enhance spatio-temporal forecasting. SVQ balances detail preservation with noise reduction, offering full differentiability and a solid foundation in sparse regression. Our approach employs a two-layer MLP and an extensive codebook to streamline the sparse regression process, significantly cutting computational costs while simplifying training and improving performance. Empirical studies on five spatio-temporal benchmark datasets show SVQ achieves state-of-the-art results, including a 7.9% improvement on the WeatherBench-S temperature dataset and an average MAE reduction of 9.4% in video prediction benchmarks (Human3.6M, KTH, and KittiCaltech), along with a 17.3% enhancement in image quality (LPIPS). Code is publicly available athttps://anonymous.4open.science/r/SVQ-Forecasting

3103BIG5-CHAT: Shaping LLM Personalities Through Training on Human-Grounded Data

[openreview] [pdf]

Abstract In this work, we tackle the challenge of embedding realistic human personality traits into LLMs. Previous approaches have primarily focused on prompt-based methods that describe the behavior associated with the desired personality traits, suffering from realism and validity issues. To address these limitations, we introduce BIG5-CHAT, a large-scale dataset containing 100,000 dialogues designed to ground models in how humans express their personality in text. Leveraging this dataset, we explore Supervised Fine-Tuning and Direct Preference Optimization as training-based methods to align LLMs more naturally with human personality patterns. Our methods outperform prompting on personality assessments such as BFI and IPIP-NEO, with trait correlations more closely matching human data. Furthermore, our experiments reveal that models trained to exhibit higher conscientiousness, higher agreeableness, lower extraversion, and lower neuroticism display better performance on reasoning tasks, aligning with psychological findings on how these traits impact human cognitive performance. To our knowledge, this work is the first comprehensive study to demonstrate how training-based methods can shape LLM personalities through learning from real human behaviors.

3104SPACETGN: Augmented Mini-Batch Negative Sampling for Continuous-Time Dynamic Graph Learning

[openreview] [pdf]

Abstract Continuous-Time Dynamic Graph (CTDG) learning has significantly advanced link prediction performance by leveraging random negative sampling and incorporating adaptive temporal information. Recent studies aim to improve performance by introducing random sampling to obtain hard negative samples, whose quality is limited by randomness, capturing few categories of negative samples, and leading to false positive (FP) and false negative (FN) problems. Here we present SPACETGN, a CTDG learning framework, with a augmented hard negative sampling mini-batches (AMNS) strategy and two new feature extraction strategies that derive space-temporal locality subgraph and historical occurrence information to emphasize the graph’s temporal discriminative properties. The AMNS strategy sample mini-batches comprised of instances that are hard-to-distinguish (i.e., hard and true negatives with respect to each other) based on the target distribution, thereby effectively augmenting the discriminative features and the diversity of historical and inductive samples. Furthermore, to mitigate the challenges posed by false positives (FP) and false negatives (FN), our architecture SPACETGN employs a conceptually straightforward approach that investigates temporal subgraphs and historical interactions between source and destination nodes. This enables the model to leverage complex and historically accurate interactions among predicted entities. Our extensive evaluation of dynamic link prediction on seven state-of-the-practice datasets reveals that SPACETGN achieves state-of-the-art performance in most datasets, demonstrating its effectiveness in ameliorating model bias.

3105BroadWay: Boost Your Text-to-Video Generation Model in a Training-free Way

[openreview] [pdf]

Abstract The text-to-video (T2V) generation models, offering convenient visual creation, have recently garnered increasing attention. Despite their substantial potential, the generated videos may present artifacts, including structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static video. In this work, we have identified a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Additionally, we have observed that the energy contained within the temporal attention maps is directly related to the magnitude of motion amplitude in the generated videos. Based on these observations, we present BroadWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters, augmenting memory or sampling time. Specifically, BroadWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement enhances the magnitude and richness of motion by amplifying the energy of the map. Extensive experiments demonstrate that BroadWay significantly improves the quality of text-to-video generation with negligible additional cost.

3106GUD: Generation with Unified Diffusion

[openreview] [pdf]

Abstract Diffusion generative models transform noise into data by inverting a process that progressively adds noise to data samples. Inspired by concepts from the renormalization group in physics, which analyzes systems across different scales, we revisit diffusion models by exploring three key design aspects: 1) the choice of representation in which the diffusion process operates (e.g. pixel-, PCA-, Fourier-, or wavelet-basis), 2) the prior distribution that data is transformed into during diffusion (e.g. Gaussian with covariance Σ\Sigma), and 3) the scheduling of noise levels applied separately to different parts of the data, captured by a component-wise noise schedule. Incorporating the flexibility in these choices, we develop a unified framework for diffusion generative models with greatly enhanced design freedom. In particular, we introduce soft-conditioning models that smoothly interpolate between standard diffusion models and autoregressive models (in any basis), conceptually bridging these two approaches. Our framework opens up a wide design space which may lead to more efficient training and data generation, and paves the way to novel architectures integrating different generative approaches and generation tasks.

3107INS-MMBench: A Comprehensive Benchmark for Evaluating LVLMs’ Performance in Insurance

[openreview] [pdf]

Abstract Large Vision-Language Models (LVLMs) have demonstrated outstanding performance in various general multimodal applications such as image recognition and visual reasoning, and have also shown promising potential in specialized domains. However, the application potential of LVLMs in the insurance domain—characterized by rich application scenarios and abundant multimodal data—has not been effectively explored. There is no systematic review of multimodal tasks in the insurance domain, nor a benchmark specifically designed to evaluate the capabilities of LVLMs in insurance. This gap hinders the development of LVLMs within the insurance domain. In this paper, we systematically review and distill multimodal tasks for four representative types of insurance: auto insurance, property insurance, health insurance, and agricultural insurance. We propose INS-MMBench, the first comprehensive LVLMs benchmark tailored for the insurance domain. INS-MMBench comprises a total of 2.2K thoroughly designed multiple-choice questions, covering 12 meta-tasks and 22 fundamental tasks. Furthermore, we evaluate multiple representative LVLMs, including closed-source models such as GPT-4o and open-source models like BLIP-2. This evaluation not only validates the effectiveness of our benchmark but also provides an in-depth performance analysis of current LVLMs on various multimodal tasks in the insurance domain. We hope that INS-MMBench will facilitate the further application of LVLMs in the insurance domain and inspire interdisciplinary development.

3108Task Vectors are Cross-Modal

[openreview] [pdf]

Abstract We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar and instruction based task vectors produce better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications.

31093CIL: Causality-Inspired Contrastive Conditional Imitation Learning for Autonomous Driving

[openreview] [pdf]

Abstract Imitation learning (IL) aims to recover an expert’s strategy by performing supervised learning on the demonstration datasets. Incorporating IL in safety-crucial tasks like autonomous driving is promising as it requires less interaction with the actual environment than reinforcement learning approaches. However, the robustness of IL methods is often questioned, as phenomena like causal confusion occur frequently and hinder it from practical use. In this paper, we conduct causal reasoning to investigate the crucial requirements for the ideal imitation generalization performance. With insights derived from modeled causalities, we propose causality-inspired contrastive conditional imitation learning (3CIL), a conditional imitation learning method equipped with contrastive learning and action residual prediction tasks, regularizing the imitator in causal and anti-causal directions. To mitigate the divergence with experts in unfamiliar scenarios, 3CIL introduces a sample-weighting term that transforms the prediction error into an emphasis on critical samples. Extensive experiments in the CARLA simulator show the proposed method significantly improves the driving capabilities of models.

3110Optimal Transport for Reducing Bias in Causal Inference without Data Splitting

[openreview] [pdf]

Abstract Causal inference seeks to estimate the causal effect given a treatment such as a kind of medicine or the dosage of a medication. To address the issue of confounding bias caused by the non-randomized treatment assignment on samples, most existing methods reduce the covariate shift between subpopulations receiving different values of treatment. However, these methods split training samples into smaller groups, which cuts down the number of samples in each group, while precise distribution estimation and alignment highly rely on a sufficient number of training data. In this paper, we propose a distribution alignment paradigm that involves all the training samples without data splitting, which can be naturally applied in the settings of binary and continuous treatments. To this end, we characterize the distribution shift by considering different probability measures of the same set including all the training samples, and reduce the shift between the marginal covariate distribution and the conditional covariate distribution given a treatment value. By doing this, data reduction caused by splitting is avoided, and the outcome prediction model trained on samples receiving one treatment value can be generalized to the entire population. In specific, we exploit the optimal transport theory built on probability measures to analyze the confounding bias and the outcome estimation error, which motivates us to propose a balanced representation learning method for causal inference of binary and continuous treatments. The experimental results on both binary and continuous treatment settings demonstrate the effectiveness of the proposed method.

3111Personality Alignment of Large Language Models

[openreview] [pdf]

Abstract Current methods for aligning large language models (LLMs) typically aim to reflect general human values and behaviors, but they often fail to capture the unique characteristics and preferences of individual users. To address this gap, we introduce the concept of Personality Alignment. This approach tailors LLMs’ responses and decisions to match the specific preferences of individual users or closely related groups. Inspired by psychometrics, we created the Personality Alignment with Personality Inventories (PAPI) dataset, which includes data from 300,000 real subjects, each providing behavioral preferences based on the Big Five Personality Factors. This dataset allows us to quantitatively evaluate the extent to which LLMs can align with each subject’s behavioral patterns. Recognizing the challenges of personality alignments—such as limited personal data, diverse preferences, and scalability requirements—we developed an activation intervention optimization method. This method enhances LLMs’ ability to efficiently align with individual behavioral preferences using minimal data and computational resources. Remarkably, our method, PAS, achieves superior performance while requiring only 1/5 of the optimization time compared to DPO, offering practical value for personality alignment. Our work paves the way for future AI systems to make decisions and reason in truly personality ways, enhancing the relevance and meaning of AI interactions for each user and advancing human-centered artificial intelligence.

3112A simulation-heuristics dual-process model for intuitive physics

[openreview] [pdf]

Abstract The role of mental simulation in human behavior for various physical tasks is widely acknowledged, attributed to the generality of Intuitive Physics Engine (IPE). However, it remains unclear whether mental simulation is consistently employed across scenarios of different simulation costs and where its boundary is. Moreover, cognitive strategies beyond these boundaries have not been thoroughly investigated. Here, we adopted a pouring-marble task containing various conditions to study IPE’s limits and strategies beyond. A human study revealed two distinct error patterns in predicting the pouring angle, differentiated by the simulation time using a boundary. This suggests a possible switching of the underlying reasoning strategies. Our initial experiment on IPE showed that its correlation with human judgments diminished in scenarios requiring extended time of simulation. This observation prompted the exploration of an alternative mechanism based on heuristics for intuitive physics. We uncovered that a linear heuristic model, relying exclusively on empirical data, replicated human prediction more accurately when the simulation time exceeded a certain boundary. Motivated by these observations, we propose a new framework, Simulation-Heuristics Model (SHM), which conceptualizes intuitive physics as a dual process: IPE is predominant only in short-time simulation, whereas a heuristics-based approach is applied as IPE’s simulation time extends beyond the simulation boundary. The SHM model aligns more precisely with human behavior across various scenarios and demonstrates superior generalization capabilities under different conditions. Crucially, SHM integrates computational methods previously viewed as separate into a unified model, quantitatively studying their switching mechanism.

3113GPU-Accelerated Counterfactual Regret Minimization

[openreview] [pdf]

Abstract Counterfactual regret minimization is a family of algorithms of no-regret learning dynamics capable of solving large-scale imperfect information games. We propose implementing this algorithm as a series of dense and sparse matrix and vector operations, thereby making it highly parallelizable for a graphical processing unit, at a cost of higher memory usage. Our experiments show that our implementation performs up to about 352.5 times faster than OpenSpiel’s Python implementation and up to about 22.2 times faster than OpenSpiel’s C++ implementation and the speedup becomes more pronounced as the size of the game being solved grows.

3114DYSTIL: Dynamic Strategy Induction with Large Language Models for Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning from expert demonstrations has long remained a challenging research problem, and existing methods resorting to behavioral cloning plus further RL training often suffer from poor generalization, low sample efficiency, and poor model interpretability. Inspired by the strong reasoning abilities of large language models (LLMs), we propose a novel strategy-based neuro-symbolic reinforcement learning framework integrated with LLMs called DYnamic STrategy Induction with Llms for reinforcement learning (DYSTIL) to overcome these limitations. DYSTIL dynamically queries a strategy-generating LLM to induce textual strategies based on advantage estimations and expert demonstrations, and gradually internalizes induced strategies into the RL agent through policy optimization to improve its performance through boosting policy generalization and enhancing sample efficiency. It also provides a direct textual channel to observe and interpret the evolution of the policy’s underlying strategies during training. We test DYSTIL over challenging RL environments from Minigrid and BabyAI, and empirically demonstrate that DYSTIL significantly outperforms state-of-the-art baseline methods by 17.75% success rate on average while also enjoying higher sample efficiency during the learning process.

3115DECISION-FOCUSED UNCERTAINTY QUANTIFICATION

[openreview] [pdf]

Abstract There is increasing interest in ``decision-focused" machine learning methods which train models to account for how their predictions are used in downstream optimization problems. Doing so can often improve performance on subsequent decision problems. However, current methods for uncertainty quantification do not incorporate any information at all about downstream decisions. We develop a framework based on conformal prediction to produce prediction sets that account for a downstream decision loss function, making them more appropriate to inform high-stakes decision-making. Our approach harnesses the strengths of conformal methods—modularity, model-agnosticism, and statistical coverage guarantees—while incorporating downstream decisions and user-specified utility functions. We prove that our methods retain standard coverage guarantees. Empirical evaluation across a range of datasets and utility metrics demonstrates that our methods achieve significantly lower decision loss compared to standard conformal methods. Additionally, we present a real-world use case in healthcare diagnosis, where our method effectively incorporates the hierarchical structure of dermatological diseases. It successfully generates sets with coherent diagnostic meaning, aiding the triage process during dermatology diagnosis and illustrating how our method can ground high-stakes decision-making on external domain knowledge.

3116Diffusion Preference Alignment via Relative Text-Image Contrast

[openreview] [pdf]

Abstract Aligning Large Language Models (LLMs) to human preferences has become a prominent area of research within language modeling. However, the application of preference learning to image generation in Text-to-Image (T2I) models remains relatively unexplored. One approach, Diffusion-DPO, initially experimented with pairwise preference learning in diffusion models for individual text prompts. We propose Diff-contrast, a novel method designed to align diffusion-based T2I models with human preferences. This method utilizes both prompt-image pairs with identical prompts and those that are semantically related across different modalities. Additionally, we introduced a new evaluation task, style alignment, to address the issues of high cost, low reproducibility, and poor interpretability associated with current evaluations of human preference alignment. Our results show that Diff-contrast surpasses existing techniques, e.g. Diffusion-DPO, in tuning Stable Diffusion versions 1.5 and XL-1.0 across both automated evaluations of human preference and style alignment.

3117A Formal Framework for Understanding Length Generalization in Transformers

[openreview] [pdf]

Abstract A major challenge for transformers is generalizing to sequences longer than those observed during training. While previous works have empirically shown that transformers can either succeed or fail at length generalization depending on the task, theoretical understanding of this phenomenon remains limited. In this work, we introduce a rigorous theoretical framework to analyze length generalization in causal transformers with learnable absolute positional encodings. In particular, we characterize those functions that are identifiable in the limit from sufficiently long inputs with absolute positional encodings under an idealized inference scheme using a norm-based regularizer. This enables us to prove the possibility of length generalization for a rich family of problems. We experimentally validate the theory as a predictor of success and failure of length generalization across a range of algorithmic and formal language tasks. Our theory not only explains a broad set of empirical observations but also opens the way to provably predicting length generalization capabilities in transformers.

3118Bayesian scaling laws for in-context learning

[openreview] [pdf]

Abstract In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates. Prior work has established strong correlations between the number of in-context examples provided and the accuracy of the model’s predictions. In this paper, we seek to explain this correlation by showing that ICL approximates a Bayesian learner. This perspective gives rise to a family of novel Bayesian scaling laws for ICL. In experiments with GPT-2 models of different sizes, our scaling laws exceed or match existing scaling laws in accuracy while also offering interpretable terms for task priors, learning efficiency, and per-example probabilities. To illustrate the analytic power that such interpretable scaling laws provide, we report on controlled synthetic dataset experiments designed to inform real-world studies of safety alignment. In our experimental protocol, we use SFT to suppress an unwanted existing model capability and then use ICL to try to bring that capability back (many-shot jailbreaking). We then experiment on real-world instruction-tuned LLMs using capabilities benchmarks as well as a new many-shot jailbreaking dataset. In all cases, Bayesian scaling laws accurately predict the conditions under which ICL will cause the suppressed behavior to reemerge, which sheds light on the ineffectiveness of post-training at increasing LLM safety.

3119Variance Reduced Distributed Non-Convex Optimization Using Matrix Stepsizes

[openreview] [pdf]

Abstract Matrix-stepsized gradient descent algorithms have been shown to have superior performance in non-convex optimization problems compared to their scalar counterparts. The det-CGD algorithm, as introduced by Li et al. (2023), leverages matrix stepsizes to perform compressed gradient descent for non-convex objectives and matrix smooth problems in a federated manner. The authors establish the algorithm’s convergence to a neighborhood of a weighted stationarity point under a convex condition for the symmetric and positive-definite matrix stepsize. In this paper, we propose two variance-reduced versions of the det-CGD algorithm, incorporating MARINA and DASHA methods. Notably, we establish theoretically and empirically, that det-MARINA and det-DASHA outperform MARINA, DASHA and the distributed det-CGD algorithms in terms of iteration and communication complexities.

3120AMAP: Automatic Multi-head Attention Pruning by similarity-based pruning indicator

[openreview] [pdf]

Abstract Despite the strong performance of Transformers, quadratic computation complexity of self-attention presents challenges in applying them to vision tasks. Linear attention reduces this complexity from quadratic to linear, offering a strong computation-performance trade-off. To further optimize this, automatic pruning is an effective method to find a structure that maximizes performance within a target resource through training without any heuristic approaches. However, directly applying it to multi-head attention is not straightforward due to channel mismatch. In this paper, we propose an automatic pruning method to deal with this problem. Different from existing methods that rely solely on training without any prior knowledge, we integrate channel similarity-based weights into the pruning indicator to preserve the more informative channels within each head. Then, we adjust the pruning indicator to enforce that channels are removed evenly across all heads, thereby avoiding any channel mismatch. We incorporate a reweight module to mitigate information loss due to channel removal and introduce an effective pruning indicator initialization for linear attention, based on the attention differences between the original structure and each channel. By applying our pruning method to the FLattenTransformer on ImageNet-1K, which incorporates original and linear attention mechanisms, we achieve a 30% reduction of FLOPs in a near lossless manner. It also has 1.96% of accuracy gain over the DeiT-B model while reducing FLOPs by 37%, and 1.05% accuracy increase over the Swin-B model with a 10% reduction in FLOPs as well. The proposed method outperforms previous state-of-the-art efficient models and the recent pruning methods.

3121Mechanism design with multi-armed bandit

[openreview] [pdf]

Abstract A popular approach of automated mechanism design is to formulate a linear program (LP) whose solution gives a mechanism with desired properties. We analytically derive a class of optimal solutions for such an LP that gives mechanisms achieving standard properties of efficiency, incentive compatibility, strong budget balance (SBB), and individual rationality (IR), where SBB and IR are satisfied in expectation. Notably, our solutions are represented by an exponentially smaller number of essential variables than the original variables of LP. Our solutions, however, involve a term whose exact evaluation requires solving a certain optimization problem exponentially many times as the number of players grows. We thus evaluate this term by modeling it as the problem of estimating the mean reward of the best arm in multi-armed bandit (MAB), propose a Probably and Approximately Correct estimator, and prove its asymptotic optimality by establishing a lower bound on its sample complexity. This MAB approach reduces the number of times the optimization problem is solved from exponential to linear. Numerical experiments show that the proposed approach finds mechanisms that are guaranteed to achieve desired properties with high probability for environments with up to 128 players, which substantially improves upon the prior work.

3122DRL: Decomposed Representation Learning for Tabular Anomaly Detection

[openreview] [pdf]

Abstract Anomaly detection, indicating to identify the anomalies that significantly deviate from the majority normal instances of data, has been an important role in machine learning and related applications. Despite the significant success achieved in anomaly detection on image and text data, the accurate Tabular Anomaly Detection (TAD) has still been hindered due to the lack of clear prior semantic information in the tabular data. Most state-of-the-art TAD studies are along the line of reconstruction, which first reconstruct training data and then use reconstruction errors to decide anomalies; however, reconstruction on training data can still hardly distinguish anomalies due to the data entanglement in their representations. To address this problem, in this paper, we propose a novel approach Decomposed Representation Learning (DRL), to re-map data into a tailor-designed constrained space, in order to capture the underlying shared patterns of normal samples and differ anomalous patterns for TAD. Specifically, we enforce the representation of each normal sample in the latent space to be decomposed into a weighted linear combination of randomly generated orthogonal basis vectors, where these basis vectors are both data-free and training-free. Furthermore, we enhance the discriminative capability between normal and anomalous patterns in the latent space by introducing a novel constraint that amplifies the discrepancy between these two categories, supported by theoretical analysis. Finally, extensive experiments on 40 tabular datasets and 15 competing tabular anomaly detection algorithms show that our method achieves state-of-the-art performance.

3123Visual Transformation Telling

[openreview] [pdf]

Abstract Humans can naturally reason from superficial state differences (e.g. ground wetness) to transformations descriptions (e.g. raining) according to their life experience. In this paper, we propose a new visual reasoning task to test this transformation reasoning ability in real-world scenarios, calledVsualTransformationTelling (VTT). Given a series of states (i.e., images), VTT requires to describe the transformation occurring between every two adjacent states. Different from existing visual reasoning tasks that focus on surface state reasoning, the advantage of VTT is that it captures the underlying causes, e.g. actions or events, behind the differences among states. We collect a novel dataset which comprise 13,547 samples to support the study of transformation reasoning. Each sample involves several key state images along with their transformation descriptions. Our dataset spans diverse real-world activities, providing a rich resource for training and evaluation with automated, human, and LLM assessments. To construct an initial benchmark for VTT, we test models including traditional visual storytelling (CST, GLACNet) or dense video captioning methods (Densecap) and advanced multimodal large language models (LLaVA v1.5-7B, Qwen-VL-chat, Gemini-1.5, GPT-4o, and GPT-4), as well as their upgraded versions based on our learning on human reasoning. Experimental results reveal that even state-of-the-art models still have a significant gap with human performance in VTT, highlighting substantial areas for improvement.

3124Multi-Agent Decision S4: Leveraging State Space Models for Offline Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Sequence-based supervised learning with transformers has been successfully applied to tackle offline reinforcement learning in single-agent settings. However, extending these algorithms to offline multi-agent reinforcement learning (offline MARL) settings still remains to be a challenge. Existing transformer-based approaches for offline MARL either train agents independently, without fully considering them as a multi-agent system or depend on a centralized transformer model, which suffers from scalability issues. Additionally, transformers have inherent constraints, particularly in regard to managing long-term dependencies and computational efficiency. In light of the recent success of Structured State Space Sequence (S4) models in sequence modeling, which offer greater parameter efficiency, faster inference times, and enhanced ability to handle longer context lengths, we propose leveraging S4-based models in our work. We perform offline training utilizing the efficient convolutional view of S4 followed by on-policy fine-tuning utilizing its recurrent dynamics. To foster scalable cooperation between agents, the multi-agent decision making problem is sequentially expanded where the agents take actions one after another at each time step. This design enables agents to exhibit better cooperative behavior by basing their actions on the decisions and learned behaviors of preceding agents with minimal communication. During offline training, this dependency facilitates gradient flow from one agent to its predecessors, leading to more stable learning and improved overall team performance. Based on experiments performed on challenging MARL benchmarks of Multi-Robot Warehouse (RWARE) and StarCraft Multi-Agent Challenge (SMAC), we demonstrate that the developed algorithm significantly outperforms the state-of-the-art offline RL-based and transformer-based MARL baselines across most tasks.

3125ChartMoE: Mixture of Diversely Aligned Expert Connector for Chart Understanding

[openreview] [pdf]

Abstract Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, current MLLMs still struggle to provide faithful data and reliable analysis only based on charts. To address it, we propose ChartMoE, which employs the Mixture of Expert (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train several linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with nearly 1 million chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts diversely and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.

3126Understanding Learning with Sliced-Wasserstein Requires Re-thinking Informative Slices

[openreview] [pdf]

Abstract The practical applications of Wasserstein distances (WDs) are constrained by their sample and computational complexities. Sliced-Wasserstein distances (SWDs) provide a workaround by projecting distributions onto one-dimensional subspaces, leveraging the more efficient, closed-form WDs for one-dimensional distributions. However, in high dimensions, most random projections become uninformative due to the concentration of measure phenomenon. Although several SWD variants have been proposed to focus on \textit{informative} slices, they often introduce additional complexity, numerical instability, and compromise desirable theoretical (metric) properties of SWD. Amidst the growing literature that focuses on directly modifying the slicing distribution, which often face challenges, we revisit the classic Sliced-Wasserstein and propose instead to rescale the 1D Wasserstein to make all slices equally informative. Importantly, we show that with an appropriate notion of \textit{slice informativeness}, rescaling for all individual slices simplifies to \textbf{a single global scaling factor} on the SWD. This, in turn, translates to the standard learning rate search for gradient-based learning in common ML workflows. We perform extensive experiments across various machine learning tasks showing that the classic SWD, when properly configured, can often match or surpass the performance of more complex variants. We then answer the following question: Is Sliced-Wasserstein all you need for common learning tasks?

3127Think or Remember? Detecting and Directing LLMs Towards Memorization or Generalization

[openreview] [pdf]

Abstract In this paper, we study fundamental mechanisms of memorization and generalization in Large Language Models (LLMs), drawing inspiration from the functional specialization observed in the human brain. Our study aims to (a) determine whether LLMs exhibit spatial differentiation of neurons for memorization and generalization, (b) predict these behaviors using internal representations, and (c) control them through inference-time interventions. To achieve this, we design specialized datasets to distinguish between memorization and generalization, build up classifiers to predict these behaviors from model hidden states and develop interventions to influence the model in real time. Our experiments reveal that LLMs exhibit neuron-wise differentiation for memorization and generalization, and the proposed intervention mechanism successfully steers the model’s behavior as intended. These findings significantly advance the understanding of LLM behavior and demonstrate the potential for enhancing the reliability and controllability of LLMs.

3128Divide, Reweight, and Conquer: A Logit Arithmetic Approach for In-Context Learning

[openreview] [pdf]

Abstract In-Context Learning (ICL) emerges as a key feature for Large Language Models (LLMs), allowing them to adapt to new tasks by leverageing task-specific examples without updating model parameters. However, ICL faces challenges with increasing numbers of examples due to performance degradation and quadratic computational costs. In this paper, we propose Logit Arithmetic Reweighting Approach (LARA), a novel framework that enhances ICL by using logit-based ensembling of multiple demonstrations. Our approach divides long input demonstrations into parallelizable shorter inputs to significantly reduce memory requirements, and then effectively aggregate the information by reweighting logits of each group via a non-gradient optimization approach. We further introduce Bi- nary LARA (B-LARA), a variant that constrains weights to binary values to simplify the search space and reduces memory usage by filtering out less informative demonstration groups. Experiments on BBH and MMLU demonstrate that LARA and B-LARA outperform all baseline methods in both accuracy and memory efficiency. We also conduct extensive analysis to show that LARA generalizes well to scenarios of varying numbers of examples from limited to many-shot demonstrations. Our codes can be found inhttps://anonymous.4open.science/r/LARA-F55B.

3129Machine Unlearning via Simulated Oracle Matching

[openreview] [pdf]

Abstract Machine unlearning---efficiently removing the effect of a small “forget set” of training data on a pre-trained machine learning model---has recently attracted significant research interest. Despite this interest, however, recent work shows that existing machine unlearning techniques do not hold up to thorough evaluation in non-convex settings. In this work, we introduce a new machine unlearning technique that exhibits strong empirical performance even in such challenging settings. Our starting point is the perspective that the goal of unlearning is to produce a model whose outputs arestatistically indistinguishablefrom those of a model re-trained on all but the forget set. This perspective naturally suggests a reduction from the unlearning problem to that of *data attribution, where the goal is to predict the effect of changing the training set on a model’s outputs. Thus motivated, we propose the following meta-algorithm, which we call Datamodel Matching (DMM): given a trained model, we (a) use data attribution topredictthe output of the model if it were re-trained on all but the forget set points; then (b)fine-tunethe pre-trained model to match these predicted outputs. In a simple convex setting, we show how this approach provably outperforms a variety of iterative unlearning algorithms. Empirically, we use a combination of existing evaluations and a new metric based on the KL-divergence to show that even in non-convex settings, DMM achieves strong unlearning performance relative to existing algorithms. An added benefit of DMM is that it is a meta-algorithm, in the sense that future advances in data attribution translate directly into better unlearning algorithms, pointing to a clear direction for future progress in unlearning.

3130Test-time Adaptation for Cross-modal Retrieval with Query Shift

[openreview] [pdf]

Abstract The success of most existing cross-modal retrieval methods heavily relies on the assumption that the given queries follow the same distribution of the source domain. However, such an assumption is easily violated in real-world scenarios due to the complexity and diversity of queries, thus leading to the query shift problem. Specifically, query shift refers to the online query stream originating from the domain that follows a different distribution with the source one. In this paper, we observe that query shift would not only diminish the uniformity (namely, within-modality scatter) of the query modality but also amplify the gap between query and gallery modalities. Based on the observations, we propose a novel method dubbed Test-time adaptation for Cross-modal Retrieval (TCR). In brief, TCR employs a novel module to refine the query predictions (namely, retrieval results of the query) and a joint objective to prevent query shift from disturbing the common space, thus achieving online adaptation for the cross-modal retrieval models with query shift. Expensive experiments demonstrate the effectiveness of the proposed TCR against query shift. The code will be released upon acceptance.

3131Potential Outcome Imputation for CATE Estimation

[openreview] [pdf]

Abstract One of the most significant challenges in Conditional Average Treatment Effect (CATE) estimation is the statistical discrepancy between distinct treatment groups. To address this, we propose a model-agnostic data augmentation method for CATE estimation. We first derive regret bounds for general data augmentation methods, indicating that reduced group discrepancy and low imputation error enhance CATE estimation. Inspired by this, we introduce a contrastive learning approach that reliably imputes missing potential outcomes for a selected subset of individuals based on a similarity measure. These reliable imputations augment the original dataset, reducing the discrepancy between treatment groups while inducing minimal imputation error. The augmented dataset can then be used to train standard CATE estimation models. We provide theoretical guarantees and extensive numerical studies, demonstrating our approach’s effectiveness in improving the accuracy and robustness of various CATE estimation models.

3132Monitoring Latent World States in Language Models with Propositional Probes

[openreview] [pdf]

Abstract Language models (LMs) are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of LMs could help monitor and correct unfaithful behavior. We hypothesize that LMs faithfully represent their input contexts in a latent world model, and we seek to extract these latent world states as logical propositions. For example, given the input context ``Greg is a nurse. Laura is a physicist.‘’, we aim to decode the propositions WorksAs(Greg, nurse) and WorksAs(Laura, physicist) from the model’s internal activations. To do so we introducepropositional probes, which compositionally extract lexical concepts from token activations and bind them into propositions. Key to this is identifying abinding subspacein which bound tokens have high similarity (Greg \leftrightarrow nurse) but unbound ones do not (Greg ↮\not\leftrightarrow physicist). Despite only being trained on linguistically simple English templates, we find that propositional probes generalize to inputs written as short stories and translated to Spanish. Moreover, in three settings where LMs respond unfaithfully to the input context---prompt injections, backdoor attacks, and gender bias--- the decoded propositions remain faithful. This suggests that LMs often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.

3133Deep Distributed Optimization for Large-Scale Quadratic Programming

[openreview] [pdf]

Abstract Distributed optimization is a powerful technique for large-scale decision-making, yet it is typically subject to rigorous tuning, computational/communication restrictions, and limited generalizability. On the other hand, black-box deep neural networks often lack interpretability and performance guarantees despite their widespread success in certain tasks. To leverage their complementary strengths, we introduce a deep learning-aided distributed optimization architecture for large-scale quadratic programming (QP). First, we combine the state-of-the-art OSQP method with a consensus approach to yield a distributed QP method, whose convergence guarantees are established. Then, we unfold this optimizer into a deep learning framework, named DeepDistributedQP, which relies on learning policies for the algorithm parameters towards accelerating its convergence to the optimal solution under a prescribed amount of iterations. Our approach is also theoretically grounded through Probably Approximately Correct (PAC)-Bayes theory, providing generalization bounds on the expected distance from optimality for unseen problems. DeepDistributedQP, as well as its non-distributed version, significantly outperform their standard optimization counterparts, on a variety of tasks ranging from randomly generated problems and optimal control to linear regression and transportation networks. The strong generalization capabilities of our approach are also demonstrated by evaluating the provided PAC-Bayes bounds which guarantee improved performance over traditional optimizers.

3134Structure-preserving contrastive learning for spatial time series

[openreview] [pdf]

Abstract Informative representations enhance model performance and generalisability in downstream tasks. However, learning self-supervised representations for spatially characterised time series, like traffic interactions, poses challenges as it requires maintaining fine-grained similarity relations in the latent space. In this study, we extend time series contrastive learning by incorporating two structure-preserving regularisers: one preserves the topology of similarities between instances, and the other preserves the graph geometry of similarities across spatial and temporal dimensions. We conduct experiments on multivariate time series classification, as well as macroscopic and microscopic traffic prediction. For all three tasks, our method preserves the structures of similarity relations more effectively and improves state-of-the-art task performances. This extension can be applied to an arbitrary encoder and is particularly beneficial for time series with spatial or geographical features. Our code is attached as supplementary material, which will be made openly available with all resulting data after review.

3135Retrieval Head Mechanistically Explains Long-Context Factuality

[openreview] [pdf]

Abstract Despite the recent progress in long-context language models, it remains elusive how transformer-based models exhibit the capability to retrieve relevant information from arbitrary locations within the long context. This paper aims to address this question. Our systematic investigation across a wide spectrum of models reveals that a special type of attention heads are largely responsible for retrieving information, which we dub retrieval heads. We identify intriguing properties of retrieval heads:(1) universal: all the explored models with long-context capability have a set of retrieval heads; (2) sparse: only a small portion (less than 5%) of the attention heads are retrieval. (3) intrinsic: retrieval heads already exist in models pretrained with short context. When extending the context length by continual pretraining, it is still the same set of heads that perform information retrieval. (4) dynamically activated: take Llama-2 7B for example, 12 retrieval heads always attend to the required information no matter how the context is changed. The rest of the retrieval heads are activated in different contexts. (5) causal: completely pruning retrieval heads leads to failure in retrieving relevant information and results in hallucination, while pruning random non-retrieval heads does not affect the model’s retrieval ability. We further show that retrieval heads strongly influence chain-of-thought (CoT) reasoning, where the model needs to frequently refer back the question and previously-generated context. Conversely, tasks where the model directly generates the answer using its intrinsic knowledge are less impacted by masking out retrieval heads. These observations collectively explain which internal part of the model seeks information from the input tokens. We believe our insights will foster future research on reducing hallucination, improving reasoning, and compressing the KV cache.

3136Training Large Language Model to Reason in a Continuous Latent Space

[openreview] [pdf]

Abstract Large language models are restricted to reason in the “language space”, where they typically express the reasoning process with a chain-of-thoughts (CoT) to solve a complex reasoning problem. However, we argue that language space may not be the optimal reasoning space. For example, most word tokens are primarily for textual coherence and not essential for reasoning, while some critical tokens require complex planning and pose huge challenges to LLMs. To explore the potential of LLM reasoning in an unrestricted latent space instead of using human language, we introduce a new paradigm COCONUT (Chain of Continuous Thought). We utilize the last hidden state of the LLM as a representation of the reasoning state (termed “continuous thought”). Rather than decoding this into a word token, we feed it back to the LLM as the subsequent input embedding directly in the continuous space. Experiments show that COCONUT can effectively augment the LLM on several reasoning tasks. It even outperforms CoT in certain logical reasoning tasks that require substantial planning, despite generating fewer tokens during inference. More interestingly, we observe an advanced reasoning patterns emerging from latent reasoning: the continuous thought can encode multiple potential next reasoning steps, allowing the model to perform a breadth-first search (BFS) to solve the problem, rather than prematurely committing to a single deterministic path like CoT. These findings demonstrate the promise of latent reasoning and offer valuable insights for future research on latent reasoning methods.

3137Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

[openreview] [pdf]

Abstract The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events? To answer this, we propose a novel evaluation protocol that can assess spoken dialog system’s turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events and reveal many interesting insights, such as they sometimes do not understand when to speak up, can interrupt too aggressively and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events and identify significant room for improvement. We will open source our evaluation platform to promote the development of advanced conversational AI systems.

3138Merging Feed-Forward Sublayers for Compressed Transformers

[openreview] [pdf]

Abstract With the rise and ubiquity of larger deep learning models, the need for high-quality compression techniques has been growing in order to deploy these models widely. The sheer parameter count of some models makes it difficult to fit them into the memory constraints of different hardware. In this work, we present a novel approach to model compression by merging similar parameter groups within a model, rather than pruning away less important parameters. Specifically, we propose a straightforward method for selecting, aligning, and merging separate feed-forward sublayers in Transformer models, and test our method on a language modeling task, image classification, and machine translation. With our method, we demonstrate performance comparable to the original models across our three diverse tasks while combining more than a third of model feed-forward sublayers. For instance, we can remove over 21% of total parameters from a Vision Transformer, while maintaining 99% of its original performance. Additionally, we observe that some feed-forward sublayers often exhibit regions of high similarity between their activations, which may help explain their surprising mergeability.

3139Finding Shared Decodable Concepts and their Negations in the Brain

[openreview] [pdf]

Abstract Prior work has offered evidence for functional localization in the brain; different anatomical regions preferentially activate for certain types of visual input. For example, the fusiform face area preferentially activates for visual stimuli that include a face. However, the spectrum of visual semantics is extensive, and only a few semantically-tuned patches of cortex have so far been identified in the human brain. Using a multimodal (natural language and image) neural network architecture (CLIP, \cite{CLIP}, we train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings. We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of these participant-specific contrastive models. This reveals what we call Shared Decodable Concepts (SDCs): clusters in CLIP space that are decodable from common sets of voxels across multiple participants.Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC. We note SDCs for previously reported visual features (e.g. orientation tuning in early visual cortex) as well as visual semantic concepts such as faces, places and bodies. In cases where our method finds multiple clusters for a visuo-semantic concept, the least associated images allow us to dissociate between confounding factors. For example, we discovered two clusters of food images, one driven by color, the other by shape. We also uncover previously unreported areas with visuo-semantic sensitivity such as regions of extrastriate body area (EBA) tuned for legs/hands and sensitivity to numerosity in right intraparietal sulcus, sensitivity associated with visual perspective (close/far) and more. Thus, our contrastive-learning methodology better characterizes new and existing visuo-semantic representations in the brain by leveraging multimodal neural network representations and a novel adaptation of clustering algorithms.

3140Steering Language Models with Activation Engineering

[openreview] [pdf]

Abstract Prompt engineering and finetuning aim to maximize language model performance on a given metric, like toxicity reduction. However, these methods do not fully elicit a model’s capabilities. To reduce this gap, we introduceactivation engineering: the inference-time modification of activations in order to control (orsteer) model outputs. Specifically, we introduce theActivation Addition(ActAdd) technique, which contrasts the intermediate activations on prompt pairs (such as “Love” versus “Hate”) to compute asteering vector. By tactically adding in e.g. the “Love”−“Hate” steering vector during the forward pass, we achieve SOTA on negative-to-positive sentiment shift and detoxification using models including LLaMA-3 and OPT. ActAdd yields inference-time control over high-level properties of output (like topic and sentiment) while preserving performance on off-target tasks. ActAdd is lightweight: it does not require any machine optimization and works with a single pair of data points, which enables rapid iteration over steering. ActAdd demonstrates the power of activation engineering.

3141Stochastic Monkeys at Play: Random Augmentations Cheaply Break LLM Safety Alignment

[openreview] [pdf]

Abstract Safety alignment of Large Language Models (LLMs) has recently become a critical objective of model developers. In response, a growing body of work has been investigating how safety alignment can be bypassed through various jailbreaking methods, such as adversarial attacks. However, these jailbreak methods can be rather costly or involve a non-trivial amount of creativity and effort, introducing the assumption that malicious users are high-resource or sophisticated. In this paper, we study how simple random augmentations to the input prompt affect safety alignment effectiveness in state-of-the-art LLMs, such as Llama 3 and Qwen 2. We perform an in-depth evaluation of 17 different models and investigate the intersection of safety under random augmentations with multiple dimensions: augmentation type, model size, quantization, fine-tuning-based defenses, and decoding strategies (e.g., sampling temperature). We show that low-resource and unsophisticated attackers, i.e. stochastic monkeys\textit{stochastic monkeys}, can significantly improve their chances of bypassing alignment with just 25 random augmentations per prompt.

3142Accelerate Quantization Aware Training for Diffusion Models with Difficulty-aware Time Allocation

[openreview] [pdf]

Abstract Diffusion models have demonstrated remarkable power in various generation tasks. Nevertheless, the large computational cost during inference is a troublesome issue for diffusion models, especially for large pretrained models such as Stable Diffusion. Quantization-aware training (QAT) is an effective method to reduce both memory and time costs for diffusion models while maintaining good performance. However, QAT methods usually suffer from the high cost of retraining the large pretrained model, which restricts the efficient deployment of diffusion models. To alleviate this problem, we propose a framework DFastQ (Diffusion Fast QAT) to accelerate the training of QAT from a difficulty-aware perspective in the timestep dimension. Specifically, we first propose to adaptively identify the difficulties of different timesteps according to the oscillation of their training loss curves. Then we propose a difficulty-aware time allocation module, which aims to dynamically allocate more training time to difficult timesteps to speed up the convergence of QAT. The key component of this is a timestep drop mechanism consisting of a drop probability predictor and a pair of adversarial losses. We conduct a series of experiments on different Stable Diffusion models, quantization settings, and sampling strategies, demonstrating that our method can effectively accelerate QAT by at least 24% while achieving comparable or even better performance.

3143Improving Ordinal Conformal Prediction by Stepwise Adaptive Posterior Alignment

[openreview] [pdf]

Abstract Ordinal classification (OC) is widely used in real-world applications to categorize instances into ordered discrete classes. In risk-sensitive scenarios, ordinal conformal prediction (OCP) is used to obtain a small contiguous prediction set containing ground-truth labels with a desired coverage guarantee. However, OC models often fail to accurately model the posterior distribution, which harms the prediction set obtained by OCP. Therefore, we introduce a new method called \textit{Adaptive Posterior Alignment Step-by-Step} (APASS), which reduces the distribution discrepancy to improve the downstream OCP performance. It is designed as a versatile, plug-and-play solution that is easily integrated into any OC model before OCP. APASS first employs an attention-based estimator to adaptively estimate the variance of the posterior distribution using the information in the calibration set, then utilizes a stepwise temperature scaling algorithm to align the posterior variance predicted by OC models to the better variance estimation. Extensive evaluations on 10 real-world datasets demonstrate that APASS consistently boosts the OCP performance of 5 popular OC models.

3144Topic-XICL: Demonstration Selection with Topic Inference for Cross-lingual In-context Learning

[openreview] [pdf]

Abstract Cross-lingual in-context learning (XICL) shows promise for adapting large language models (LLMs) to low-resource languages. Previous methods typically rely on off-the-shelf similarity-based approaches or task-specific retrievers trained with LLM feedback for demonstration selection. However, these methods often overlook important factors beyond a single criterion or can be resource-intensive. To address these challenges, we propose a novel approach called Topic-XICL, which leverages a latent topic model for demonstration selection. We assume that latent topic variables encapsulate information that more accurately characterizes demonstrations. By training this topic model on rich-resource language data with a compact LLM, we obtain more relevant demonstrations through topic inference and apply them for in-context learning across various LLMs. We evaluated our method on three multilingual tasks (XNLI, XCOPA, and TyDiQA-GoldP) using three models with 7 to 8 billion parameters (BLOOM, Qwen1.5, and Llama3.1). Our approach outperformed the baselines—random selection, semantic similarity, and clustering-based methods—on TyDiQA-GoldP, XCOPA, and XNLI by 3.32%, 2.47%, and 1.77%, respectively, while requiring only moderate additional resources.

3145EDiSon: Efficient Design-and-Control Optimization with Reinforcement Learning and Adaptive Design Reuse

[openreview] [pdf]

Abstract Seeking good designs is a central goal of many important domains, such as robotics, integrated circuits (IC), medicine, and materials science. These design problems are expensive, time-consuming, and traditionally performed by human experts. Moreover, the barriers to domain knowledge make it challenging to propose a universal solution that generalizes to different design problems. In this paper, we propose a new method called Efficient Design and Stable Control (EDiSon) for automatic design and control in different design problems. The key ideas of our method are (1) interactive sequential modeling of the design and control process and (2) adaptive exploration and design replay. To decompose the difficulty of learning design and control as a whole, we leverage sequential modeling for both the design process and control process, with a design policy to generate step-by-step design proposals and a control policy to optimize the objective by operating the design. With deep reinforcement learning (RL), the policies learn to find good designs by maximizing a reward signal that evaluates the quality of designs. Furthermore, we propose an adaptive exploration and replay strategy based on a design memory that maintains high-quality designs generated so far. By regulating between constructing a design from scratch or replaying a design from memory to refine it, EDiSon balances the trade-off between exploration and exploitation in the design space and stabilizes the learning of the control policy. In the experiments, we evaluate our method in robotic morphology design and Tetris-based design tasks. Our results show that our method effectively learns to explore high-quality designs and outperforms previous results in terms of design score and efficiency.

3146SIKeD: Self-guided Iterative Knowledge Distillation for Mathematical Reasoning

[openreview] [pdf]

Abstract Large Language Models (LLMs) can transfer their reasoning skills to smaller models by teaching them to generate the intermediate reasoning process required to solve multistep reasoning tasks. While LLMs can accurately solve reasoning tasks through a variety of strategies, even without fine-tuning, smaller models are not expressive enough to fit the LLMs distribution on all strategies when distilled and tend to prioritize one strategy over the others. This reliance on one strategy poses a challenge for smaller models when attempting to solve reasoning tasks that may be difficult with their preferred strategy. To address this, we propose a distillation methodSIKeD:Self-guidedIterativeKnowledgeDistillation, where the LLM teaches the smaller model to approach a task using different strategies and the smaller model uses its self-generated on-policy outputs to choose the most suitable strategy for the given task. The training continues in aself-guidediterative manner, where for each training iteration, a decision is made on how to combine the LLM data with the self-generated outputs. Unlike traditional distillation methods,SIKeDallows the smaller model to learnwhichstrategy is suitable for a given task while continuously learning to solve a task using different strategies. Our experiments on various mathematical reasoning datasets show thatSIKeDsignificantly outperforms traditional distillation techniques across smaller models of different sizes.

3147Attention-Only Transformers via Unrolled Subspace Denoising

[openreview] [pdf]

Abstract Despite the great success of transformers in practice, their architectures have been empirically designed, hence lack of mathematical justification and interpretability. Moreover, many empirical studies have indicated that some components of the transformer architectures may be redundant and can be removed or replaced without compromising overall performance. Hence to derive a compact and interpretable transformer architecture, we contend that the goal of representation learning is to compress a set of noisy initial token representations towards a mixture of low-dimensional subspaces. Based on the existing literature, the associated denoising operation naturally takes the form of a multi-subspace self-attention (MSSA). By unrolling such iterative denoising operations as a deep network, we arrive at a highly compact architecture that consists of only an MSSA operator with skip connections at each layer, without MLP. We rigorously prove that each layer of the proposed transformer performs so highly efficient denoising that it improves the signal-to-noise ratio of token representations {\em at a linear rate} with respect to the number of layers. Despite its simplicity, extensive experiments on language and vision tasks demonstrate that such a minimalistic attention-only transformer can achieve performance close to conventional transformers, such as GPT-2 and CRATE.

3148Towards Faster Decentralized Stochastic Optimization with Communication Compression

[openreview] [pdf]

Abstract Communication efficiency has garnered significant attention as it is considered the main bottleneck for large-scale decentralized Machine Learning applications in distributed and federated settings. In this regime, clients are restricted to transmitting small amounts of compressed information to their neighbors over a communication graph. Numerous endeavors have been made to address this challenging problem by developing algorithms with compressed communication for decentralized non-convex optimization problems. Despite considerable efforts, current theoretical understandings of the problem are still very limited, and existing algorithms all suffer from various limitations. In particular, these algorithms typically rely on strong, and often infeasible assumptions such as bounded data heterogeneity or require large batch access while failing to achieve linear speedup with the number of clients. In this paper, we introduce MoTEF, a novel approach that integrates communication compression with Mo\textbf{Mo}mentum T\textbf{T}racking and E\textbf{E}rror F\textbf{F}eedback. MoTEF is the first algorithm to achieve an asymptotic rate matching that of distributed SGD under arbitrary data heterogeneity, hence resolving a long-standing theoretical obstacle in decentralized optimization with compressed communication. We provide numerical experiments to validate our theoretical findings and confirm the practical superiority of MoTEF.

3149MOMENTUM MEETS VIRALITY: A NOVEL METRIC FOR UNMASKING SOCIAL BIAS IN VIRAL TWEETS

[openreview] [pdf]

Abstract Predicting which social media posts will go viral is a critical but complex task in the field of computational social science. Previous studies have utilized various measures to forecast the virality of tweets or Facebook posts, but these approaches exhibit limitations, particularly in the absence of a virality metric that specifically considers social biases. In this paper, we test existing metrics and introduce a new metric, ViralTweet Score (VTS)\textbf{ViralTweet Score (VTS)}, inspired by principles of momentum from physics to better predict a tweet’s virality given that it consists of social biases. We compare this new metric with others, highlighting the advantages and disadvantages of each of them as a virality measurement metric. We release the ViralTweets Dataset\textbf{ViralTweets Dataset} with 88.8k\mathbf{88.8k} Hindi tweets and corresponding virality labels based on our VTS metric. We also show how social biases in posts can influence their potential to go viral. We test our hypothesis that VTS is a better metric using two methodologies and we show how VTS achieves an F1 score of 0.87 based on pairwise evaluation methodology and an overall F1 score of 0.58 based on our clustering-based verification methodology. Our work offers a novel metric for understanding tweet virality for biased tweets and opens the door for more equitable and effective social media analytics by considering the role of social biases in virality.

3150offline_rl_ope: A Python package for off-policy evaluation of offline RL models with real world data

[openreview] [pdf]

Abstract offline_rl_ope is a fully unit tested and runtime type checked Python package for performing off-policy evaluation of offline RL models. offline_rl_ope has been designed for OPE workflows using real world data by: naturally handling uneven trajectory lengths; including novel convergence metrics which do not rely on OPE estimator ground truths; and providing a compute and data efficient API which can be integrated with many offline RL frameworks. This paper motivates and describes the core API design and functionality to enable ease of use and extension. The implementations of OPE methods have been benchmarked against existing implementations to ensure consistency and reproducibility. The offline_rl_ope source code can be found on GitHub at: REDACTED.

3151Personalized Language Modeling from Personalized Human Feedback

[openreview] [pdf]

Abstract Personalized large language models (LLMs) are designed to tailor responses to individual user preferences. While Reinforcement Learning from Human Feedback (RLHF) is a commonly used framework for aligning LLMs with human preferences, vanilla RLHF assumes that all human preferences share the same distribution, preventing fine-tuned LLMs from generating personalized content when user preferences are diverse. In this work, we propose Personalized-RLHF (P-RLHF), an efficient framework that utilizes a lightweight user model to capture individual user preferences and jointly learns the user model and the personalized LLM from human feedback. P-RLHF exhibits the following three characteristics: (1) It enables an LLM to generate personalized content and scale efficiently with growing number of users. (2) It handles both explicit user preferences described as textual input and implicit user preferences encoded in the feedback data. (3) It eliminates the need for users to fully articulate their preferences, which are normally needed for prompting LLMs to generate personalized content yet are often impractical to obtain in real-world scenarios. Our experimental results show that personalized LLMs trained using P-RLHF generate responses that are more closely aligned with individual user preferences, outperforming vanilla, non-personalized RLHF and prompting-based personalization approaches across different tasks.

3152Using Interleaved Ensemble Unlearning to Keep Backdoors at Bay for Finetuning Vision Transformers

[openreview] [pdf]

Abstract Vision Transformers (ViTs) have become popular in computer vision tasks. Backdoor attacks, which trigger undesirable behaviours in models during inference, threaten ViTs’ performance, particularly in security-sensitive tasks. Although backdoor defences have been developed for Convolutional Neural Networks (CNNs), they are less effective for ViTs, and defences tailored to ViTs are scarce. To address this, we present Interleaved Ensemble Unlearning (IEU), a method for finetuning clean ViTs on backdoored datasets. In stage 1, a shallow ViT is finetuned to have high confidence on backdoored data and low confidence on clean data. In stage 2, the shallow ViT acts as a “gate” to block potentially poisoned data from the defended ViT. This data is added to an unlearn set and asynchronously unlearned via gradient ascent. We demonstrate IEU’s effectiveness on three datasets against 11 state-of-the-art backdoor attacks and show its versatility by applying it to different model architectures.

3153Mitigating Catastrophic Forgetting in Large Language Models with Forgetting-aware Pruning

[openreview] [pdf]

Abstract Recent advancements in Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks. These models are typically pretrained on extensive corpora and subsequently fine-tuned on task-specific datasets. However, during the fine-tuning process, LLMs often suffer from Catastrophic Forgetting (CF), wherein previously acquired general knowledge is lost. Traditional approaches to mitigating CF often rely on data replay, which may not be viable when the original training data is inaccessible. Additionally, methods that alter the training process or the model architecture can increase complexity and detract from the accuracy of downstream tasks, thus limiting their generalizability. In this paper, we propose Forgetting-Aware Pruning Metric (FAPM), a novel pruning-based approach to balance CF and downstream task performance. Our investigation reveals that the degree to which task vectors (i.e., the subtraction of pre-trained weights from the weights fine-tuned on downstream tasks) overlap with pre-trained model parameters is a critical factor for CF. Motivated by this insight, FAPM employs the ratio of the task vector to pre-trained model parameters as a metric to quantify CF, integrating this measure into the pruning criteria. Importantly, FAPM does not necessitate modifications to the training process or model architecture, nor does it require any auxiliary data. We conducted extensive experiments across six datasets encompassing natural language inference, question answering, reading comprehension, and cloze tests. The results demonstrate that FAPM limits CF to just 1% while maintaining 99% accuracy on downstream tasks, rendering FAPM highly competitive relative to the state-of-the-art methods that involve modifications to the training process.

3154Set-Size Dependent Combinatorial Bandits

[openreview] [pdf]

Abstract This paper introduces and studies a new variant of Combinatorial Multi-Armed Bandits (\CMAB{}), called Set-Size Dependent Combinatorial Multi-Armed Bandits (\SDMAB{}). In \SDMAB{}, each base arm is associated with a set of different reward distributions instead of a single distribution as in \CMAB{}, and the reward distribution of each base arm depends on the set size, i.e., the number of the base arms in the chosen super arm in \CMAB{}. \SDMAB{} involves a much larger exploration set of the super arms than the basic \CMAB{} model. An important property called order preservation exists in \SDMAB{}, i.e. the order of reward means of base arms is independent of set size, which widely exists in real-world applications. We propose the \SUCB{} algorithm, effectively leveraging the order preservation property to shrink the exploration set. We provide theoretical upper bound of O\left(\max\left{\frac{M\delta_L}{\Delta_{L}},\frac{L^2}{\Delta_S}\right}\log(T)\right) for \SUCB{} which outperforms the classic \CMAB{} algorithms with regret O(ML2ΔSlog(T))O\left(\frac{ML^2}{\Delta_S}\log(T)\right), where MM denotes the number of base arms, LL denotes the maximum number of base arms in a super arm, δ\delta and Δ\Delta are related to the gap of arms. We also derive a lower bound which can be informally written as \Omega\left(\max\left{\min_{k\in[L]}\left{\frac{(M-L)\delta_{k}}{\Delta_{k}^2}\right},\frac{L^2}{\Delta_S}\right}\log(T)\right) showing that \SUCB{} is partially tight. We conduct numerical experiments, showing the good performance of \SUCB{}.

3155Enabling Fine-Tuning of Direct Feedback Alignment via Feedback-Weight Matching

[openreview] [pdf]

Abstract In this paper, we introduce feedback-weight matching, a new method that facilitates reliable fine-tuning of fully connected neural networks using Direct Feedback Alignment (DFA). Although DFA has demonstrated potential by enabling efficient and parallel updates of weight parameters through direct propagation of the network’s output error, its usage has been primarily restricted to training networks from scratch. We provide the first analysis showing that existing standard DFA struggles to fine-tune networks pre-trained via back-propagation. Through an analysis of weight alignment (WA) and gradient alignment (GA), we show that the proposed feedback-weight matching enhances DFA’s ability and stability in fine-tuning pre-trained networks, providing insights into DFA’s behavior and characteristics when applied to fine-tuning. In addition, we find that feedback-weight matching, when combined with weight decay, not only mitigates over-fitting but also further reduces the network output error, leading to improved learning performance during DFA-based fine-tuning. Our experimental results show that, for the first time, feedback-weight matching enables reliable and superior fine-tuning across various fine-tuning tasks compared to existing standard DFA, e.g., achieving 7.97% accuracy improvement on image classification tasks (i.e., 82.67% vs. 74.70%) and 0.66 higher correlation score on NLP tasks (i.e., 0.76 vs. 0.10). The code implementation is available at an anonymous GitHub repository.

3156Compute Or Load KV Cache? Why Not Both?

[openreview] [pdf]

Abstract Recent advancements in Large Language Models (LLMs) have significantly in- creased context window sizes, enabling sophisticated applications but also in- troducing substantial computational overheads, particularly computing key-value (KV) cache in the prefill stage. Prefix caching has emerged to save GPU power in this scenario, which saves KV cache at disks and reuse them across multiple queries. However, traditional prefix caching mechanisms often suffer from sub- stantial latency because the speed of loading KV cache from disks to GPU mem- ory is bottlenecked by the throughput of I/O devices. To optimize the latency of long-context prefill, we propose Cake, a novel KV cache loader, which employs a bidirectional parallelized KV cache generation strategy. Upon receiving a pre- fill task, Cake simultaneously and dynamically loads saved KV cache from prefix cache locations and computes KV cache on local GPUs, maximizing the utiliza- tion of available computation and I/O bandwidth resources. Additionally, Cake automatically adapts to diverse system statuses without manual parameter. tuning. In experiments on various prompt datasets, GPUs, and I/O devices, Cake offers up to 68.1% Time To First Token (TTFT) reduction compare with compute-only method and 94.6% TTFT reduction compare with I/O-only method.

3157Regression Conformal Prediction under Bias

[openreview] [pdf]

Abstract Uncertainty quantification is crucial to account for the imperfect predictions of machine learning algorithms for high-impact applications. Conformal prediction (CP) is a powerful framework for uncertainty quantification that generates calibrated prediction intervals with valid coverage. In this work, we study how CP intervals are affected by \emph{bias} -- the systematic deviation of a prediction from ground truth values -- a phenomenon prevalent in many real-world applications. We investigate the influence of bias on interval lengths of two different types of adjustments -- symmetric adjustments, the conventional method where both sides of the interval are adjusted equally, and asymmetric adjustments, a more flexible method where the interval can be adjusted unequally in positive or negative directions. We present theoretical and empirical analyses characterizing how symmetric and asymmetric adjustments impact the “tightness” of CP intervals for regression tasks. Specifically for absolute residual and quantile-based non-conformity scores, we prove: 1) the upper bound of symmetrically adjusted interval lengths increases by 2b2|b| where bb is a globally applied scalar value representing bias, 2) asymmetrically adjusted interval lengths are not affected by bias, and 3) conditions when asymmetrically adjusted interval lengths are guaranteed to be smaller than symmetric ones. Our analyses suggest that even if predictions exhibit significant drift from ground truth values, asymmetrically adjusted intervals are still able to maintain the same tightness and validity of intervals as if the drift had never happened, while symmetric ones significantly inflate the lengths. We demonstrate our theoretical results with two real-world prediction tasks: sparse-view computed tomography (CT) reconstruction and time-series weather forecasting. Our work paves the way for more bias-robust machine learning systems.

3158From Exploration to Mastery: Enabling LLMs to Master Tools via Self-Driven Interactions

[openreview] [pdf]

Abstract Tool learning enables Large Language Models (LLMs) to interact with external environments by invoking tools, serving as an effective strategy to mitigate the limitations inherent in their pre-training data. In this process, tool documentation plays a crucial role by providing usage instructions for LLMs, thereby facilitating effective tool utilization. This paper concentrates on the critical challenge of bridging the comprehension gap between LLMs and external tools due to the inadequacies and inaccuracies inherent in existing human-centric tool documentation. We propose a novel framework, DRAFT, aimed at Dynamically Refining tool documentation through the Analysis of Feedback and Trails emanating from LLMs’ interactions with external tools. This methodology pivots on an innovative trial-and-error approach, consisting of three distinct learning phases: experience gathering, learning from experience, and documentation rewriting, to iteratively enhance the tool documentation. This process is further optimized by implementing a diversity-promoting exploration strategy to ensure explorative diversity and a tool-adaptive termination mechanism to prevent overfitting while enhancing efficiency. Extensive experiments on multiple datasets demonstrate that DRAFT’s iterative, feedback-based refinement significantly ameliorates documentation quality, fostering a deeper comprehension and more effective utilization of tools by LLMs. Notably, our analysis reveals that the tool documentation refined via our approach demonstrates robust cross-model generalization capabilities. Our code is available athttps://anonymous.4open.science/r/DRAFT-10B3.

3159Rethinking Memorization in LLMs: On Learning by Rote vs. with Understanding

[openreview] [pdf]

Abstract Understanding whether and to what extent token sequences generated by large language models (LLMs) are the result of regurgitating memorized training data or are based on meaningful learning of the training data’s syntax and semantics has many important implications. In order to cleanly measure and disentangle token recollection by rote (memorization) from generation with understanding, we create an experimental framework that is based on training LLMs oversequences generated using formal grammars. Our framework allows us to better understand the interplay between the two types of learning, namely,by rotevs.with understanding. Using our framework we make several striking observations that hold consistently across different open-source model families (Pythia, Llama, and Mistral): (a) we find that the learning types are at odds with each other during training, i.e., rote learning harms understanding and by developing understanding, models forget previously memorized sequences, (b) we find thatentropy of the training datasetsimpacts the ease of learning, with lower entropy datasets being easier to learn with understanding and higher entropy datasets being easier to learn by rote, (c) we highlight the difficulty of determining the type of learning involved in a model based solely on recollecting a training data sequence. Our surprising results have significant downstream implications in the study and usage of LLMs.

3160Learning How Hard to Think: Input-Adaptive Allocation of LM Computation

[openreview] [pdf]

Abstract Computationally intensive decoding procedures---including search, reranking, and self-critique---can improve the quality of language model (LM) outputs in problems spanning code generation, numerical reasoning, and dialog. Existing work typically applies the same decoding procedure for every input to an LM. But not all inputs require the same amount of computation to process. Can we allocate decoding computation adaptively, using more resources to answer questions whose answers will be harder to compute? We present an approach that predicts the distribution of rewards given an input and computation budget, then allocates additional computation to inputs for which it is predicted to be most useful. We apply this approach in two decoding procedures: first, an adaptive best-of-kk procedure that dynamically selects the number of samples to generate as input to a reranker; second, a routing procedure that dynamically responds to a query using a decoding procedure that is expensive but accurate, or one that is cheaper but less capable. Across a suite of programming, mathematics, and dialog tasks, we show that accurate computation-allocation procedures can be learned, and reduce computation by up to 50% at no cost to quality.

3161Semantic Object Navigation with Segmenting Decision Transformer

[openreview] [pdf]

Abstract Understanding scene semantics plays an important role in solving the object navigation task, where an embodied intelligent agent has to find an object in the scene given its semantic category. This task can be divided into two stages: exploring the scene and reaching the found target. In this work, we consider the latter stage of reaching a given semantic goal. This stage is particularly sensitive to errors in the semantic understanding of the scene. To address this challenge, we propose a multimodal and multitasking method called SegDT, which is based on the joint training of a segmentation model and a decision transformer model. Our method aggregates information from multiple multimodal frames to predict the next action and the current segmentation mask of the target object. To optimize our model, we first performed a pre-training phase using a set of collected trajectories. In the second phase, online policy fine-tuning, we addressed the problems of long-term credit assignment and poor sampling efficiency of transformer models. Using the PPO algorithm, we simultaneously trained an RNN-based policy using ground-truth segmentation and transferred its knowledge to the proposed transformer-based model, which trains the segmentation in itself through an additional segmentation loss. We conducted extensive experiments in the Habitat Sim environment and demonstrated the advantage of the proposed method over the basic navigation approach as well as current state-of-the-art methods that do not consider the auxiliary task of improving the quality of the segmentation of the current frame during training.

3162Latent Feature Mining for Predictive Model Enhancement with Large Language Models

[openreview] [pdf]

Abstract Predictive modeling often faces challenges due to limited data availability and quality, especially in domains where collected features are weakly correlated with outcomes and where additional data collection is constrained by ethical or practical difficulties. Traditional machine learning (ML) models struggle to incorporate unobserved yet critical factors. In this work, we introduce a novel approach to formulate latent feature mining as text-to-text propositional logical reasoning. We propose FLAME (Faithful Latent FeAture Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features, enhancing the predictive power of ML models in downstream tasks. Our novel approach transforms the latent feature extraction task to a text-to-text propositional reasoning task. Our framework is generalizable across various domains with minimal domain-specific customization, ensuring easy transfer to other areas facing similar challenges in data availability. We validate our framework with two case studies: (1) the criminal justice system, a domain characterized by limited and ethically challenging data collection. (2) the healthcare domain, where patient privacy concerns and the complexity of medical data often limit comprehensive feature collection. Our results show that inferred latent features align well with ground truth labels and significantly enhance the downstream classifier.

3163A Training-Free Sub-quadratic Cost Transformer Model Serving Framework with Hierarchically Pruned Attention

[openreview] [pdf]

Abstract In modern large language models (LLMs), increasing the context length is crucial for improving comprehension and coherence in long-context, multi-modal, and retrieval-augmented language generation. While many recent transformer models attempt to extend their context length over a million tokens, they remain impractical due to the quadratic time and space complexities. Although recent works on linear and sparse attention mechanisms can achieve this goal, their real-world applicability is often limited by the need to re-train from scratch and significantly worse performance. In response, we propose a novel approach, Hierarchically Pruned Attention (HiP), which reduces the time complexity of the attention mechanism to O(TlogT)O(T \log T) and the space complexity to O(T)O(T), where TT is the sequence length. We notice a pattern in the attention scores of pretrained LLMs where tokens close together tend to have similar scores, which we call “attention locality”. Based on this observation, we utilize a novel tree-search-like algorithm that estimates the top-kk key tokens for a given query on the fly, which is mathematically guaranteed to have better performance than random attention pruning. In addition to improving the time complexity of the attention mechanism, we further optimize GPU memory usage by implementing KV cache offloading, which stores only O(logT)O(\log T) tokens on the GPU while maintaining similar decoding throughput. Experiments on benchmarks show that HiP, with its training-free nature, significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation. HiP enables pretrained LLMs to scale up to millions of tokens on commodity GPUs, potentially unlocking long-context LLM applications previously deemed infeasible.

3164Scalable and Enhanced Hallucination Detection in LLMs using Semantic Clustering

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly being adopted across various domains, driven by their ability to generate general-purpose and domain-specific text. However, LLMs can also produce responses that seem plausible but are factually incorrect—a phenomenon commonly referred to as “hallucination”. This issue limits the potential and trustworthiness of LLMs, especially in critical fields such as medicine and law. Among the strategies proposed to address this problem uncertainty-based methods stand out due to their ease of implementation, independence from external data sources, and compatibility with standard LLMs. In this paper, we present an optimized semantic clustering framework for automated hallucination detection in LLMs, using sentence embeddings and hierarchical clustering. Our proposed method enhances both scalability and performance compared to existing approaches across different LLM models. This results in more homogeneous clusters, improved entropy scores, and a more accurate reflection of detected hallucinations. Our approach significantly boosts accuracy on widely used open and closed-book question-answering datasets such as TriviaQA, NQ, SQuAD, and BioASQ, achieving AUROC score improvements of up to 9.3% over the current state-of-the-art semantic entropy method. Further ablation studies highlight the effectiveness of different components of our approach.

3165How Do Augmentations with Label Smoothing Enhance Model Robustness?

[openreview] [pdf]

Abstract Model robustness indicates a model’s capability to generalize well on unforeseen distributional shifts, including data corruption, adversarial attacks, and domain shifts. One of the most prevalent and effective ways to enhance the robustness often involves data augmentations and label smoothing techniques. Despite the great success of the related approaches in diverse practices, a unified theoretical understanding of their efficacy in improving model robustness is lacking. We offer a theoretical framework to clarify how augmentations, label smoothing, or their combination enhance model robustness through the lens of loss surface flatness, generalization bound, and adversarial robustness. Specifically, we first formally bridge the diversified data distribution via augmentations to the flatter minima on the parameter space, which directly links to the improved generalization capability. Moreover, we further bridge augmentations with label smoothing, which softens the confidence of the target label, to the improved adversarial robustness. We broadly confirm our theories through extensive simulations on the existing common corruption and adversarial robustness benchmarks based on the CIFAR and tinyImageNet datasets, as well as various domain generalization benchmarks.

3166SPEED: Selective Prediction for Early Exit DNNs

[openreview] [pdf]

Abstract Inference latency and trustworthiness of Deep Neural Networks (DNNs) are the bottlenecks in deploying them in critical applications like autonomous driving. Early Exit (EE) DDNs overcome the latency issues by allowing samples to exit from intermediary layers if they attain high confidence scores on the predicted class. However, the DNNs are known to exhibit overconfidence, which can lead to many samples exiting early and render EE strategies untrustworthy. We use Selective Prediction (SP) to overcome this issue by checking the hardness of the samples rather than just relying on the confidence score alone. We propose SPEED, a novel approach that uses Deferral Classifiers (DCs) at each layer to check the hardness of samples before performing EEs. The DCs at each layer identify if a sample is hard and either differ its inference to the next layer or directly send it to an expert. Early detection of hard samples and using an expert for inference prevents the wastage of computational resources and improves trust. We also investigate the generalization capability of DCs trained on one domain when applied to other domains where target domain data is not readily available. We observe that EE aided with SP improves both accuracy and latency. Our method minimizes the risk by 50% with a speedup of 2.05×2.05\times as compared to the final layer. The anonymized source code is available athttps://anonymous.4open.science/r/SPEED-35DC/README.md.

3167RDHNet: Addressing Rotational and Permutational Symmetries in Continuous Multi-Agent Systems

[openreview] [pdf]

Abstract Symmetry is prevalent in multi-agent systems. The presence of symmetry, coupled with the misuse of absolute coordinate systems, often leads to a large amount of redundant representation space, significantly increasing the search space for learning policies and reducing learning efficiency. Effectively utilizing symmetry and extracting symmetry-invariant representations can significantly enhance multi-agent systems’ learning efficiency and overall performance by compressing the model’s hypothesis space and improving sample efficiency. The issue of rotational symmetry in multi-agent reinforcement learning has received little attention in previous research and is the primary focus of this paper. To address this issue, we propose a rotation-invariant network architecture for continuous action space tasks. This architecture utilizes relative coordinates between agents, eliminating dependence on absolute coordinate systems, and employs a hypernetwork to enhance the model’s fitting capability, enabling it to model MDPs with more complex dynamics. It can be used for both predicting actions and evaluating action values/utilities. In benchmark tasks, experimental results validate the impact of rotational symmetry on multi-agent decision systems and demonstrate the effectiveness of our method.

3168Spark Transformer: How Many FLOPs is a Token Worth?

[openreview] [pdf]

Abstract This work introduces Spark Transformer, an architectural variant of the Transformer model that drastically reduces the FLOPs count while maintaining comparable quality and an identical parameter count. This reduction is achieved by introducing sparse activations in both the feedforward network (FFN) and the Attention mechanism. In the FFN, this sparsity engages only a subset of parameters for each input. In the Attention mechanism, it limits the number of tokens that each token attends to. We achieve this sparsity through statistical top-kk, a lightweight approximate algorithm that is well-suited for accelerator hardware and minimizes training slowdown. Furthermore, Spark Transformer incorporates dedicated predictors to identify the activated entries. These predictors are formed by allocating a portion of the model’s parameters and are trained jointly with the rest of the model. This approach distinguishes Spark Transformer from existing methods that introduce sparsity and predictors post-training, which often leads to increased training costs, additional model parameters, and complex modifications to the model architecture. Our Spark Transformer, pretrained using the Gemma 2 recipe, achieves competitive performance on standard benchmarks while exhibiting significant sparsity. Specifically, it utilizes only 8% nonzeros in the FFN activation and attends to a maximum of 256 tokens. This results in a 3.1×\times reduction in FLOPs, yielding a 1.70×\times speedup for prefill and a 1.79×\times speedup for decoding on a 16-core CPU VM.

3169Measuring Non-Adversarial Reproduction of Training Data in Large Language Models

[openreview] [pdf]

Abstract Large language models frequently memorize parts of their training data. This behavior led to a large body of research on data extraction attacks, where adversaries coerce a model to output memorized examples. However, most LLM users are not malicious; they only want an LLM to perform some desired task. In this work, we investigate non-adversarial reproduction, where the outputs of a large language model overlap with existing public text when responding to natural and benign prompts. For a variety of innocuous prompt categories (e.g., writing a letter or a tutorial), we show that up to 15% of the text output by popular conversational language models overlaps with moderate snippets (40–60 characters) of the Internet. For the same tasks, we find that human-written text has far less overlap with existing Internet data. We further study whether prompting strategies can close this reproduction gap between models and humans. However, while appropriate prompting can reduce non-adversarial reproduction on average, we find that mitigating worst-case reproduction of training data requires stronger defenses—even for benign interactions.

3170Deep Bootstrap Aggregation via Least Squares Estimation

[openreview] [pdf]

Abstract Bootstrap aggregation, commonly referred to as bagging, is a fundamental technique in ensemble learning designed to enhance the performance of predictive models. It is well-established that the effectiveness of bagging is strongly influenced by the management of correlations among the aggregated models. For instance, random forests, a widely-used ensemble method, address this issue by randomly selecting features to reduce the correlation between individual tree models. In this study, we propose a method called ``Deep Bootstrap Aggregation’’ for regression tasks, which combines deep network architectures with least squares estimation to improve the predictive accuracy of bagging models. Both theoretical analysis and empirical experiments support the effectiveness of the proposed approach.

3171Time, Space and Streaming Efficient Algorithm for Heavy Attentions

[openreview] [pdf]

Abstract A central problem related to transformers can be stated as follows: given two n×dn \times d matrices QQ and KK, and a non-negative function ff, define the matrix AA as follows: (1) apply the function ff to each entry of the n×nn \times n matrix QKTQ K^T, and then (2) normalize each of the row sums of AA to be equal to 1. The matrix AA can be computed in O(n2d)O(n^2 d) time assuming ff can be applied to a number in constant time, but the quadratic dependence on nn is prohibitive in applications where it corresponds to long context lengths. For a large class of functions ff, we show how to find all the “large attention scores”, i.e., entries of AA which are at least a positive value ε\varepsilon, in time with linear dependence on nn (i.e., npoly(d/ε)n \cdot \textrm{poly}(d/\varepsilon)) for a positive parameter ε>0\varepsilon > 0. Our class of functions include all functions ff of the form f(x)=xpf(x) = |x|^p, as explored recently in transformer models. Using recently developed tools from randomized numerical linear algebra, we prove that for any KK, there is a “universal set” U[n]U \subset [n] of size independent of nn, such that for any QQ and any row ii, the large attention scores Ai,jA_{i,j} in row ii of AA all have jUj \in U. We also find UU in npoly(d/ε)n \cdot \textrm{poly}(d/\varepsilon) time. Notably, we (1) make no assumptions on the data, (2) our workspace does not grow with nn, and (3) our algorithms can be computed in streaming and parallel settings. We empirically show the benefits of our scheme for vision transformers, showing how to train new models that use our universal set while training as well, showing that our model is able to consistently select “important keys’” during training.

3172Learning to Select Nodes in Branch and Bound with Sufficient Tree Representation

[openreview] [pdf]

Abstract Branch-and-bound methods are pivotal in solving Mixed Integer Linear Programming (MILP), where the challenge of node selection arises, necessitating the prioritization of different regions of the space for subsequent exploration. While machine learning techniques have been proposed to address this, two crucial problems concerning \textbf{(P1)} how to sufficiently extract features from the branch-and-bound tree, and \textbf{(P2)} how to assess the node quality comprehensively based on the features remain open. To tackle these challenges, we propose to tackle the node selection problem employing a novel Tripartite graph representation and Reinforcement learning with a Graph Neural Network model (TRGNN). The tripartite graph is theoretically proved to encompass sufficient information for tree representation in information theory. We learn node selection via reinforcement learning for learning delay rewards and give more comprehensive node metrics. Experiments show that TRGNN significantly improves the efficiency of solving MILPs compared to human-designed and learning-based baselines on both synthetic and large-scale real-world MILPs. Moreover, experiments demonstrate that TRGNN well generalizes to MILPs that are significantly larger than those seen during training.

3173KAN See Your Face

[openreview] [pdf]

Abstract With the advancement of face reconstruction (FR) systems, privacy-preserving face recognition (PPFR) has gained popularity for its secure face recognition, enhanced facial privacy protection, and robustness to various attacks. Besides, specific models and algorithms are proposed for face embedding protection by mapping embeddings to a secure space. However, there is a lack of studies on investigating and evaluating the possibility of extracting face images from embeddings of those systems, especially for PPFR. In this work, we introduce the first approach to exploit Kolmogorov-Arnold Network (KAN) for conducting embedding-to-face attacks against state-of-the-art (SOTA) FR and PPFR systems. Face embedding mapping (FEM) models are proposed to learn the distribution mapping relation between the embeddings from the initial domain and target domain. In comparison with Multi-Layer Perceptrons (MLP), we provide two variants, FEM-KAN and FEM-MLP, for efficient non-linear embedding-to-embedding mapping in order to reconstruct realistic face images from the corresponding face embedding. To verify our methods, we conduct extensive experiments with various PPFR and FR models. We also measure reconstructed face images with different metrics to evaluate the image quality. Through comprehensive experiments, we demonstrate the effectiveness of FEMs in accurate embedding mapping and face reconstruction.

3174Adaptive HL-Gaussian: A Value Function Learning Method with Dynamic Support Adjustment

[openreview] [pdf]

Abstract Recent research indicates that using cross-entropy (CE) loss for value function learning surpasses traditional mean squared error (MSE) loss in performance and scalability, with the HL-Gaussian method showing notably strong results. However, this method requires a pre-specified support for representing the categorical distribution of the value function, and an inappropriately chosen interval for the support may not match the time-varying value function, potentially impeding the learning process. To address this issue, we theoretically establish that HL-Gaussian inherently introduces a projection error during the learning of the value function, which is dependent on the support interval. We further prove that an ideal interval should be sufficiently broad to reduce truncation-induced projection errors, yet not so excessive as to counterproductively amplify them. Guided by these findings, we introduce the Adaptive HL-Gaussian (AHL-Gaussian) approach. This approach starts with a confined support interval and dynamically adjusts its range by minimizing the projection error. This ensures that the interval’s size stabilizes to adapt to the learning value functions without further expansion. We integrate AHL-Gaussian into several classic value-based algorithms and evaluate it on Atari 2600 games and Gym Mujoco. The results show that AHL-Gaussian significantly outperforms the vanilla baselines and standard HL-Gaussian with a static interval across the majority of tasks.

3175Discrete GCBF Proximal Policy Optimization for Multi-agent Safe Optimal Control

[openreview] [pdf]

Abstract Control policies that can achieve high task performance and satisfy safety constraints are desirable for any system, including multi-agent systems (MAS). One promising technique for ensuring the safety of MAS is distributed control barrier functions (CBF). However, it is difficult to design distributed CBF-based policies for MAS that can tackle unknown discrete-time dynamics, partial observability, changing neighborhoods, and input constraints, especially when a distributed high-performance nominal policy that can achieve the task is unavailable. To tackle these challenges, we proposeDGPPO, a new framework thatsimultaneouslylearns both adiscretegraph CBF which handles neighborhood changes and input constraints, and a distributed high-performance safe policy for MAS with unknown discrete-time dynamics. We empirically validate our claims on a suite of multi-agent tasks spanning three different simulation engines. The results suggest that, compared with existing methods, our DGPPO framework obtains policies that achieve high task performance (matching baselines that ignore the safety constraints), and high safety rates (matching the most conservative baselines), with aconstantset of hyperparameters across all environments.

3176Systems with Switching Causal Relations: A Meta-Causal Perspective

[openreview] [pdf]

Abstract Most works on causality in machine learning assume that causal relationships are governed by a constant underlying process. However, the flexibility of agents’ actions or tipping point behavior in the environmental process can change the qualitative dynamics of the system. As a result, new causal relationships may emerge, while existing ones change or disappear, resulting in an altered causal graph. To analyze these qualitative changes on the causal graph, we propose the concept ofmeta-causal states, which groups classical causal models into clusters based on equivalent qualitative behavior and consolidates specific mechanism parameterizations. We demonstrate how meta-causal states can be inferred from observed agent behavior, and discuss potential methods for disentangling these states from unlabeled data. Finally, we direct our analysis toward the application of a dynamical system, demonstrating that meta-causal states can also emerge from inherent system dynamics, and thus constitute more than a context-dependent framework in which mechanisms emerge merely as a result of external factors.

3177Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference

[openreview] [pdf]

Abstract Large language models have demonstrated promising capabilities upon scaling up parameters. However, serving large language models incurs substantial computation and memory movement costs due to their large scale. Quantization methods have been employed to reduce service costs and latency. Nevertheless, outliers in activations hinder the development of INT4 weight-activation quantization. Existing approaches separate outliers and normal values into two matrices or migrate outliers from activations to weights, suffering from high latency or accuracy degradation. Based on observing activations from large language models, outliers can be classified into channel-wise and spike outliers. In this work, we propose Rotated Runtime Smooth (RRS), a plug-and-play activation smoother for quantization, consisting of Runtime Smooth and the Rotation operation. Runtime Smooth (RS) is introduced to eliminatechannel-wise outliersby smoothing activations with channel-wise maximums during runtime. The Rotation operation can narrow the gap betweenspike outliersand normal values, alleviating the effect of victims caused by channel-wise smoothing. The proposed method outperforms the state-of-the-art method in the LLaMA and Qwen families and improves WikiText-2 perplexity from 57.33 to 6.66 for INT4 inference.

3178Advancing Algorithmic Trading with Large Language Models: A Reinforcement Learning Approach for Stock Market Optimization

[openreview] [pdf]

Abstract In the fast-evolving landscape of financial markets, effective decision-making tools are essential for managing complexities driven by economic indicators and market dynamics. Algorithmic trading strategies have gained prominence for their ability to execute trades autonomously, with Deep Reinforcement Learning (DRL) emerging as a key approach for optimizing trading actions through continuous market interaction. However, RL-based systems face significant challenges, particularly in adapting to evolving time series data and incorporating unstructured textual information. In response to these limitations, recent advancements in Large Language Models (LLMs) offer new opportunities. LLMs possess the capacity to analyze vast volumes of data, providing enhanced insights that can complement traditional market analysis. This study proposes a novel approach that integrates six distinct LLMs into algorithmic trading frameworks, developing Stock-Evol-Instruct, an innovative instruction generation algorithm. This algorithm enables RL agents to fine-tune their trading strategies by leveraging LLM-driven insights for daily stock trading decisions. Empirical evaluation using real-world stock data from Silver and JPMorgan demonstrates the significant potential of this approach to outperform conventional trading models. By bridging the gap between LLMs and RL in algorithmic trading, this study contributes to a new frontier in financial technology, setting the stage for future advancements in autonomous trading systems.

3179Private Blind Model Averaging – Distributed, Non-interactive, and Convergent

[openreview] [pdf]

Abstract Scalable distributed differentially private learning would benefit notably from reduced communication and synchronization overhead. The current best methods, based on gradient averaging, inherently require many synchronization rounds. In this work, we analyze blind model averaging for convex and smooth empirical risk minimization (ERM): each user first locally finishes training a model and then submits the model for secure averaging without any client-side online synchronization. This setting lends itself not only to data point-level privacy but also to flexible user-level privacy, where the combined impact of the user’s trained model does not depend on the number of data points used for training.In detail, we analyze the utility side of blind model averaging for support vector machines (SVMs) and the inherently multi-class Softmax regression (SoftmaxReg). On the theory side, we use strong duality to show for SVMs that blind model averaging converges toward centralized training performance if the task is robust against L2-regularization, i.e. if increasing the regularization weight does not destroy utility. Furthermore, we provide theoretical and experimental evidence that blind averaged Softmax Regression works well: we prove strong convexity of the dual problem by proving smoothness of the primal problem. Using this result, we also conclude the first output perturbation bounds for Softmax regression. On the experimental side, we support our theoretical SVM convergence. Furthermore, we observe hints of an even more fine-granular connection between good utility of model averaging and mid-range regularization weights which lead to compelling utility-privacy-tradeoffs for SVM and Softmax regression on 3 datasets (CIFAR-10, CIFAR-100, and federated EMNIST embeddings). We additionally provide ablation for an artificially extreme non-IID scenario.

3180Sharpness-Aware Minimization Efficiently Selects Flatter Minima Late In Training

[openreview] [pdf]

Abstract Sharpness-Aware Minimization (SAM) has substantially improved the generalization of neural networks under various settings. Despite the success, its effectiveness remains poorly understood. In this work, we discover an intriguing phenomenon in the training dynamics of SAM, shedding lights on understanding its implicit bias towards flatter minima over Stochastic Gradient Descent (SGD). Specifically, we find thatSAM efficiently selects flatter minima late in training. Remarkably, even a few epochs of SAM applied at the end of training yield nearly the same generalization and solution sharpness as full SAM training. Subsequently, we delve deeper into the underlying mechanism behind this phenomenon. Theoretically, we identify two phases in the learning dynamics after applying SAM late in training: i) SAM first escapes the minimum found by SGD exponentially fast; and ii) then rapidly converges to a flatter minimum within the same valley. Furthermore, we empirically investigate the role of SAM during the early training phase. We conjecture that the optimization method chosen in the late phase is more crucial in shaping the final solution’s properties. Based on this viewpoint, we extend our findings from SAM to Adversarial Training. We provide source code in supplementary materials and will release checkpoints in future.

3181Exploring Selective Layer Freezing Strategies in Transformer Fine-Tuning: NLI Classifiers with Sub-3B Parameter Models

[openreview] [pdf]

Abstract In recent years, methods that selectively fine-tune or reduce the number of layers in large language models (LLMs) have garnered attention as an efficient alternative to traditional fine-tuning, where all layers are trained. In this paper, we revisit the concept of Layer Freezing, a simple yet effective fine-tuning strategy, and introduce detailed strategies that improve the training efficiency of LLMs by selectively fine-tuning only a portion of the layers. We tested various freezing ratios and positions, and found that by freezing the bottom 25% or 50% of transformer layers during fine-tuning of an LLM with sub 3 billion parameters, we can achieve performance equal to or better than full model fine-tuning and Low-Rank Adaptation (LoRA), while significantly reducing memory usage and training time. Our experiments on natural language inference tasks show that this approach reduces memory consumption by about 30% and 50%, and improves training speed by 20-30%.

3182Language-Guided Object-Centric World Models for Predictive Control

[openreview] [pdf]

Abstract A world model is essential for an agent to predict the future and plan in domains such as autonomous driving and robotics. To achieve this, recent advancements have focused on video generation, which has gained significant attention due to the impressive success of diffusion models. However, these models require substantial computational resources. To address these challenges, we propose a world model leveraging object-centric representation space using slot attention, guided by language instructions. Our model perceives the current state as an object-centric representation and predicts future states in this representation space conditioned on natural language instructions. This approach results in a more compact and computationally efficient model compared to diffusion-based generative alternatives. Furthermore, it flexibly predicts future states based on language instructions, and offers a significant advantage in manipulation tasks where object recognition is crucial. In this paper, we demonstrate that our latent predictive world model surpasses generative world models in visuo-linguo-motor control tasks, achieving superior sample and computation efficiency. We also investigate the generalization performance of the proposed method and explore various strategies for predicting actions using object-centric representations.

3183Improving Generalization with Flat Hilbert Bayesian Inference

[openreview] [pdf]

Abstract We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within the reproducing kernel Hilbert spaces. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against seven baseline methods on the VTAB-1K benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy. Our code is available at \url{https://anonymous.4open.science/r/Flat-Hilbert-Variational-Inference-008F/}.

3184Improving Tabular Generative Models: Loss Functions, Benchmarks, and Iterative Objective Bayesian Approaches

[openreview] [pdf]

Abstract In many applications of deep learning (DL), more data is essential to enhance model performance and generalization. A promising avenue to increase data availability is to use deep generative models (DGMs) to create synthetic data. However, existing DGMs struggle to capture the complexities of real-world tabular data, which often contain diverse variable types with potential imbalances and dependencies. To address these challenges, we propose a novel correlation- and distribution-aware loss function that works as a regularizer for DGMs. Additionally, to address the limitations of standard Bayesian optimization (SBO), which struggles to aggregate multiple metrics with different units--resulting in unreliable direct averaging and sub-optimal decisions--we introduce iterative objective refinement Bayesian optimization (IORBO) to rank metrics to enable more meaningful comparisons across diverse objectives. To ensure a rigorous evaluation, we establish a comprehensive benchmarking framework using twenty real-world datasets along with ten established tabular DGM baselines. The proposed loss function demonstrates statistically significant improvements over existing methods in capturing the true data distribution, significantly enhancing the quality of synthetic data generated with DGMs. The benchmarking framework shows that the enhanced synthetic data quality leads to improved performance in downstream DGMs tasks. Further, the proposed IORBO outperformed the SBO with mean aggregation in terms of win rate and outperformed the SBO with median aggregation overall.

3185Unlocking Trilevel Learning with Level-Wise Zeroth Order Constraints: Distributed Algorithms and Provable Non-Asymptotic Convergence

[openreview] [pdf]

Abstract Trilevel learning (TLL) found diverse applications in numerous machine learning applications, ranging from robust hyperparameter optimization to domain adaptation. However, existing researches primarily focus on scenarios where TLL can be addressed with first order information available at each level, which is inadequate in many situations involving zeroth order constraints, such as when black-box models are employed. Moreover, in trilevel learning, data may be distributed across various nodes, necessitating strategies to address TLL problems without centralizing data on servers to uphold data privacy. To this end, an effective distributed trilevel zeroth order learning framework DTZO is proposed in this work to address the TLL problems with level-wise zeroth order constraints in a distributed manner. The proposed DTZO is versatile and can be adapted to a wide range of (grey-box) TLL problems with partial zeroth order constraints. In DTZO, the cascaded polynomial approximation can be constructed without relying on gradients or sub-gradients, leveraging a novel cut, i.e., zeroth order cut. Furthermore, we theoretically carry out the non-asymptotic convergence rate analysis for the proposed DTZO in achieving the ϵ\epsilon-stationary point. Extensive experiments have been conducted to demonstrate and validate the superior performance of the proposed DTZO, e.g., it approximately achieves up to a 40% improvement in performance.

3186MUSE: Machine Unlearning Six-Way Evaluation for Language Models

[openreview] [pdf]

Abstract Language models (LMs) are trained on vast amounts of text data, which may include private and copyrighted content. Data owners may request the removal of their data from a trained model due to privacy or copyright concerns. However, exactly unlearning only these datapoints (i.e., retraining with the data removed) is intractable in modern-day models. This has led to the development of many approximate unlearning algorithms. The evaluation of the efficacy of these algorithms has traditionally been narrow in scope, failing to precisely quantify the success and practicality of the algorithm from the perspectives of both the model deployers and the data owners. We address this issue by proposing MUSE, a comprehensive machine unlearning evaluation benchmark that enumerates six diverse desirable properties for unlearned models: (1) no verbatim memorization, (2) no knowledge memorization, (3) no privacy leakage, (4) utility preservation on data not intended for removal, (5) scalability with respect to the size of removal requests, and (6) sustainability over sequential unlearning requests. Using these criteria, we benchmark how effectively eight popular unlearning algorithms on 7B-parameter LMs can unlearn Harry Potter books and news articles. Our results demonstrate that most algorithms can prevent verbatim memorization and knowledge memorization to varying degrees, but only one algorithm does not lead to severe privacy leakage. Furthermore, existing algorithms fail to meet deployer’s expectations because they often degrade general model utility and also cannot sustainably accommodate successive unlearning requests or large-scale content removal. Our findings identify key issues with the practicality of existing unlearning algorithms on language models.

3187Logic-Logit: A Logic-Based Approach to Choice Modeling

[openreview] [pdf]

Abstract In this study, we propose a novel rule-based interpretable choice model, {\bf Logic-Logit}, designed to effectively learn and explain human choices. Choice models have been widely applied across various domains—such as commercial demand forecasting, recommendation systems, and consumer behavior analysis—typically categorized as parametric, nonparametric, or deep network-based. While recent innovations have favored neural network approaches for their computational power, these flexible models often involve large parameter sets and lack interpretability, limiting their effectiveness in contexts where transparency is essential.Previous empirical evidence shows that individuals usually use {\it heuristic decision rules} to form their consideration sets, from which they then choose. These rules are often represented as {\it disjunctions of conjunctions} (i.e., OR-of-ANDs). These rules-driven, {\it consider-then-choose} decision processes enable people to quickly screen numerous alternatives while reducing cognitive and search costs. Motivated by this insight, our approach leverages logic rules to elucidate human choices, providing a fresh perspective on preference modeling. We introduce a unique combination of column generation techniques and the Frank-Wolfe algorithm to facilitate efficient rule extraction for preference modeling—a process recognized as NP-hard. Our empirical evaluation, conducted on both synthetic datasets and real-world data from commercial and healthcare domains, demonstrates that Logic-Logit significantly outperforms baseline models in terms of interpretability and accuracy.

3188Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

[openreview] [pdf]

Abstract Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular and controlled study on LLM self-improvement. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as thegeneration-verification gap. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement -- a variant of the generation-verification gap scales monotonically with the model pre-training flops. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. We believe our results have several empirical implications, and our study leaves many exciting future directions for understanding the potential and limits of LLM self-improvement.

3189Error Bounds for Deep Learning-based Uncertainty Propagation in SDEs

[openreview] [pdf]

Abstract Stochastic differential equations are commonly used to describe the evolution of stochastic processes. The uncertainty of such processes is best represented by the probability density function (PDF), whose evolution is governed by the Fokker-Planck partial differential equation (FP-PDE). However, it is generally infeasible to solve the FP-PDE in closed form. In this work, we show that physics-informed neural networks (PINNs) can be trained to approximate the solution PDF using existing methods. The main contribution is the analysis of the approximation error: we develop a theory to construct an arbitrary tight error bound with PINNs. In addition, we derive a practical error bound that can be efficiently constructed with existing training methods. Finally, we explain that this error-bound theory generalizes to approximate solutions of other linear PDEs. Several numerical experiments are conducted to demonstrate and validate the proposed methods.

3190ODE-based Smoothing Neural Network for Reinforcement Learning Tasks

[openreview] [pdf]

Abstract The smoothness of control actions is a significant challenge faced by deep reinforcement learning (RL) techniques in solving optimal control problems. Existing RL-trained policies tend to produce non-smooth actions due to high-frequency input noise and unconstrained Lipschitz constants in neural networks. This article presents a Smooth ODE (SmODE) network capable of simultaneously addressing both causes of unsmooth control actions, thereby enhancing policy performance and robustness under noise condition. We first design a smooth ODE neuron with first-order low-pass filtering expression, which can dynamically filter out high frequency noises of hidden state by a learnable state-based system time constant. Additionally, we construct a state-based mapping function, gg, and theoretically demonstrate its capacity to control the ODE neuron’s Lipschitz constant. Then, based on the above neuronal structure design, we further advanced the SmODE network serving as RL policy approximators. This network is compatible with most existing RL algorithms, offering improved adaptability compared to prior approaches. Various experiments show that our SmODE network demonstrates superior anti-interference capabilities and smoother action outputs than the multi-layer perception and smooth network architectures like LipsNet.

3191Re-Imagining Multimodal Instruction Tuning: A Representation View

[openreview] [pdf]

Abstract Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.

3192LLMs Are In-Context Reinforcement Learners

[openreview] [pdf]

Abstract Large Language Models (LLMs) can learn new tasks through in-context supervised learning (i.e., ICL). This work studies if this ability extends to in-context reinforcement learning (ICRL), where models are not given gold labels in context, but only their past predictions and rewards. We show that a naive application of ICRL fails miserably, and identify the root cause as a fundamental deficiency at exploration, which leads to quick model degeneration. We propose an algorithm to address this deficiency by increasing test-time compute, as well as a compute-bound approximation. We use several challenging classification tasks to empirically show that our ICRL algorithms lead to effective learning from rewards alone, and analyze the characteristics of this ability and our methods. Overall, our results reveal remarkable ICRL abilities in LLMs.

3193Multiple-play Stochastic Bandits with Prioritized Resource Sharing

[openreview] [pdf]

Abstract This paper proposes a variant of multiple-play stochastic bandits tailored to resource allocation problems arising from LLM applications, edge intelligence applications, etc. The proposed model is composed of MM arms and KK plays. Each arm has a stochastic number of capacities, and each unit of capacity is associated with a reward function. Each play is associated with a priority weight.When multiple plays compete for the arm capacity, the arm capacity is allocated in a larger priority weight first manner. Instance independent and instance dependent regret lower bounds of Ω(α1σKMT)\Omega( \alpha_1 \sigma \sqrt{KM T} ) and Ω(α1σ2MKΔlnT)\Omega(\alpha_1 \sigma^2 \frac{MK}{\Delta} \ln T) are proved, where α1\alpha_1 is the largest priority weight and σ\sigma characterizes the reward tail.When model parameters are given, we design an algorithm named \texttt{MSB-PRS-OffOpt} to locate the optimal play allocation policy with a computational complexity of O(M3K3)O(M^3K^3). Utilizing \texttt{MSB-PRS-OffOpt} as a subroutine, an approximate upper confidence bound (UCB) based algorithm is designed, which has instance independent and instance dependent regret upper bounds matching the corresponding lower bound up to factors of KlnKTK \sqrt{ \ln KT } and α1K\alpha_1 K respectively. To this end, we address nontrivial technical challenges arising from optimizing and learning under a special nonlinear combinatorial utility function induced by the prioritized resource sharing mechanism.

3194Advantages, Risks and Insights from Comparing In-Context Learning Models with Typical Meta-Learners

[openreview] [pdf]

Abstract We investigate in-context learning (ICL) models from the perspective of learning to learn. Unlike existing works understanding what exact explicit learning algorithms can and do ICL models learn, we compares ICL model with typical meta-learners to obtain end-to-end understanding on more general settings. We theoretically prove its expressiveness as learning algorithms and investigate its learnability and generalizability on extensive settings. It is demonstrated that ICL with transformers can effectively learn optimal learning algorithms data-dependently in an inclusive space containing existing gradient-based, metric-based and amortization-based meta-learners. However, the generalizability of these learning algorithms is identified to be a critical issue, as the learned algorithm could be implicitly fitting the training distribution rather than an explicit learning algorithm. Based on above understanding, we propose to systematically transfer deep-learning techniques which have been widely-studied in supervised-learning to meta-learning to address their common challenges. We practice meta-level meta-learning for domain-adaptability with few data and meta-level curriculum learning for fast convergence in pre-training as examples, showing their empirical effectiveness.

3195REvolve: Reward Evolution with Large Language Models using Human Feedback

[openreview] [pdf]

Abstract Designing effective reward functions is crucial to training reinforcement learning (RL) algorithms. However, this design is non-trivial, even for domain experts, due to the subjective nature of certain tasks that are hard to quantify explicitly. In recent works, large language models (LLMs) have been used for reward generation from natural language task descriptions, leveraging their extensive instruction tuning and commonsense understanding of human behavior. In this work, we hypothesize that LLMs, guided by human feedback, can be used to formulate reward functions that reflect human implicit knowledge. We study this in three challenging settings -- autonomous driving, humanoid locomotion, and dexterous manipulation -- wherein notions of ``good" behavior are tacit and hard to quantify. To this end, we introduce REvolve, a truly evolutionary framework that uses LLMs for reward design in RL. REvolve generates and refines reward functions by utilizing human feedback to guide the evolution process, effectively translating implicit human knowledge into explicit reward functions for training (deep) RL agents. Experimentally, we demonstrate that agents trained on REvolve-designed rewards outperform other state-of-the-art baselines.

3196A Model of Place Field Reorganization During Reward Maximization

[openreview] [pdf]

Abstract When rodents learn to navigate in a novel environment, a high density of place fields emerge at reward locations, fields elongate against the trajectory, and individual fields change spatial selectivity while demonstrating stable behavior. Why place fields demonstrate these characteristic phenomena during learning remains elusive. We develop a normative framework using a reward maximization objective, whereby the temporal difference (TD) error drives place field reorganization to improve policy learning. Place fields are modelled using Gaussian radial basis functions to represent states in an environment, and directly synapse to an actor-critic for policy learning. Each field’s amplitude, center and width, as well as downstream weights, are updated online at each time step to maximize cumulative reward. We demonstrate that this framework unifies the three disparate phenomena observed in navigation experiments. Furthermore, we show that these place field phenomena improves policy convergence when learning to navigate to a single target and relearning multiple new targets. To conclude, we develop a normative model that recapitulates several aspects of hippocampal place field learning dynamics and unifies mechanisms to offer testable predictions for future experiments.

3197Towards Optimal Adapter Placement for Efficient Transfer Learning

[openreview] [pdf]

Abstract Parameter-efficient transfer learning (PETL) aims to adapt pre-trained models to new downstream tasks while minimizing the number of fine-tuned parameters. Adapters, a popular approach in PETL, inject additional capacity into existing networks by incorporating low-rank projections, achieving performance comparable to full fine-tuning with significantly fewer parameters. This paper investigates the relationship between the placement of an adapter and its performance. We observe that adapter location within a network significantly impacts its effectiveness, and that the optimal placement is task-dependent. To exploit this observation, we introduce an extended search space of adapter connections, including long-range and recurrent adapters. We demonstrate that even randomly selected adapter placements from this expanded space yield improved results, and that high-performing placements often correlate with high gradient rank. Our findings reveal that a small number of strategically placed adapters can match or exceed the performance of the common baseline of adding adapters in every block, opening a new avenue for research into optimal adapter placement strategies.

3198Exponential Topology-enabled Scalable Communication in Multi-agent Reinforcement Learning

[openreview] [pdf]

Abstract In cooperative multi-agent reinforcement learning (MARL), well-designed communication protocols can effectively facilitate consensus among agents, thereby enhancing task performance. Moreover, in large-scale multi-agent systems commonly found in real-world applications, effective communication plays an even more critical role due to the escalated challenge of partial observability compared to smaller-scale setups. In this work, we endeavor to develop a scalable communication protocol for MARL. Unlike previous methods that focus on selecting optimal pairwise communication links—a task that becomes increasingly complex as the number of agents grows—we adopt a global perspective on communication topology design. Specifically, we propose to utilize the exponential topology to enable rapid information dissemination among agents by leveraging its small-diameter and small-size properties. This approach leads to a scalable communication protocol, named ExpoComm. To fully unlock the potential of exponential graphs as communication topologies, we employ memory-based message processors and auxiliary tasks to ground messages, ensuring that they reflect global information and benefit decision-making. Extensive experiments on large-scale cooperative benchmarks, including MAgent and Infrastructure Management Planning, demonstrate the superior performance and robust zero-shot transferability of ExpoComm compared to existing communication strategies.

3199How Does Critical Batch Size Scale in Pre-training?

[openreview] [pdf]

Abstract Training large-scale models under given resource budgets requires the careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise between time and compute, marks the point beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression. Of independent interest, we highlight the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.

3200SWGA: A Distributed Hyperparameter Search Method for Time Series Prediction Models

[openreview] [pdf]

Abstract We propose a distributed hyperparameter search method for time series prediction models named SWGA (Sliding Window Genetic Algorithm). Compared to current genetic algorithms for hyperparameter search, our method has three major advantages: (i) It adopts a configurable sliding window mechanism to effectively combat overfitting from distribution shifts inherent in time series data. (ii) It introduces a warm-up stage using Bayesian optimization-based methods to generate a good initial population. (iii) It supports distributed hyperparameter search across multi-node computing clusters, enhancing both scalability and efficiency. To demonstrate SWGA’s efficacy, we conduct hyperparameter search experiments on time series datasets from various domains. The experiment results show that our method consistently finds a hyperparameter configuration that achieves better performance on out-of-sample time series data compared to the traditional genetic algorithm. On average, it reduces the out-of-sample loss by about 56.1%.

3201Embedding Learning for Approximating Person-specific Cognitive Similarity

[openreview] [pdf]

Abstract Metric learning is often applied in scenarios where labels are well-defined or where there is a ground truth for semantic similarity between data points. However, in expert domains such as medical data, where experts perceive features and similarities differently on an individual basis, modeling psychological embeddings at the individual level can be beneficial. Such embeddings can predict factors that influence behavior, such as individual uncertainty, and support personalized learning strategies. Despite this potential, the amount of person-specific behavioral data that can be collected through similarity behavior sampling is insufficient in most scenarios, making modeling individual cognitive embeddings challenging and underexplored. In this study, we proposed integrating supervised learning on small-scale similarity sampling data with unsupervised autoencoder-based manifold learning to approximate person-specific psychological embeddings with significantly improved similarity inference performance. We conducted a large-scale experiment with 121 clinical physicians, measured their cognitive similarities using medical image data, and implemented person-specific models. Our results demonstrate that even in complex expert domains, such as medical imaging, where cognitive similarity varies between individuals, person-specific psychological embeddings can be effectively approximated using limited behavioral data.

3202Gray-Box Fine-Tuning for Single Backbone Domain Experts

[openreview] [pdf]

Abstract The emergence of foundational models has greatly improved performance across various downstream tasks, with fine-tuning often yielding even better results. However, existing fine-tuning approaches typically require access to model weights and layers, leading to challenges such as managing multiple model copies or inference pipelines, inefficiencies in edge device optimization, and concerns over proprietary rights, privacy, and exposure to unsafe model variants. In this paper, we address these challenges by exploring “Gray-box” fine-tuning approaches, where the model’s architecture and weights remain hidden, allowing only gradient propagation. We introduce a novel yet simple and effective framework that adapts to new tasks using two lightweight learnable modules at the model’s input and output. Additionally, we present a less restrictive variant that offers more entry points into the model, balancing performance with model exposure. We evaluate our approaches across several backbones on benchmarks for text-image alignment, text-video alignment, and sketch-image alignment. Our results demonstrate that, despite having limited access to the model, our Gray-box approaches achieve competitive performance with full-access fine-tuning methods.

3203Mitigating Modality Prior-Induced Hallucinations in Multimodal Large Language Models via Deciphering Attention Causality

[openreview] [pdf]

Abstract Multimodal Large Language Models (MLLMs) have emerged as a central focus in both industry and academia, but often suffer from biases introduced by visual and language priors, which can lead to multimodal hallucination. These biases arise from the visual encoder and the Large Language Model (LLM) backbone, affecting the attention mechanism responsible for aligning multimodal inputs. Existing decoding-based mitigation methods focus on statistical correlations and overlook the causal relationships between attention mechanisms and model output, limiting their effectiveness in addressing these biases. To tackle this issue, we propose a causal inference framework termed CausalMM that applies structural causal modeling to MLLMs, treating modality priors as a confounder between attention mechanisms and output. Specifically, by employing backdoor adjustment and counterfactual reasoning at both the visual and language attention levels, our method mitigates the negative effects of modality priors and enhances the alignment of MLLM’s inputs and outputs, with a maximum score improvement of 65.3% on 6 VLind-Bench indicators and 164 points on MME Benchmark compared to conventional methods. Extensive experiments validate the effectiveness of our approach while being a plug-and-play solution.

3204Differentially Private Bilevel Optimization

[openreview] [pdf]

Abstract We present differentially private (DP) algorithms for bilevel optimization, a problem class that received significant attention lately in various machine learning applications. These are the first DP algorithms for this task that are able to provide any desired privacy, while also avoiding Hessian computations which are prohibitive in large-scale settings. Under the well-studied setting in which the upper-level is not necessarily convex and the lower-level problem is strongly-convex, our proposed gradient-based (ϵ,δ)(\epsilon,\delta)-DP algorithm returns a point with hypergradient norm at most O~((dup/ϵn)1/2+(dlow/ϵn)1/3)\widetilde{\mathcal{O}}\left((\sqrt{d_\mathrm{up}}/\epsilon n)^{1/2}+(\sqrt{d_\mathrm{low}}/\epsilon n)^{1/3}\right) where nn is the dataset size, and dup/dlowd_\mathrm{up}/d_\mathrm{low} are the upper/lower level dimensions. Our analysis covers constrained and unconstrained problems alike, accounts for mini-batch gradients, and applies to both empirical and population losses.

3205Exact Recovery Guarantees for Parameterized Nonlinear System Identification Problem under Adversarial Attacks

[openreview] [pdf]

Abstract In this work, we study the system identification problem for parameterized nonlinear systems using basis functions under adversarial attacks. Motivated by the LASSO-type estimators, we analyze the exact recovery property of a nonsmooth estimator, which is generated by solving an embedded 1\ell_1-loss minimization problem. First, we derive necessary and sufficient conditions for the well-specifiedness of the estimator and the uniqueness of global solutions to the underlying optimization problem. Next, we provide exact recovery guarantees for the estimator under two different scenarios of boundedness and Lipschitz continuity of the basis functions. The non-asymptotic exact recovery is guaranteed with high probability, even when there are more severely corrupted data than clean data. Finally, we numerically illustrate the validity of our theory. This is the first study on the sample complexity analysis of a nonsmooth estimator for the nonlinear system identification problem.

3206Consistency Guaranteed Causal Graph Recovery with Large Language Models

[openreview] [pdf]

Abstract Causal graph recovery traditionally relies on statistical estimation of observable variables or individual knowledge, which suffer from data collection biases and knowledge limitations of individuals. Leveraging the broad knowledge in scientific corpus, we propose a novel method for causal graph recovery to deduce causal relationships with the large language models (LLMs) as a knowledge extractor. Our method extracts associational relationships among variables and further eliminates the inconsistent relationship to recover a causal graph using the constraint-based causal discovery methods. Comparing to other LLM-based methods that directly instruct LLMs to do highly complex causal reasoning, our method shows advantages on causal graph quality on benchmark datasets. More importantly, as causal graphs may evolve when new research results emerge, our method shows sensitivity to new evidence in the literature and can provide useful information to update causal graphs accordingly.

3207What’s Wrong With Non-Autoregressive Graph Neural Networks in Neural Combinatorial Optimization

[openreview] [pdf]

Abstract Neural combinatorial optimization (NCO) leverages machine learning models to tackle complex combinatorial problems by learning heuristics or direct solution construction. Graph Neural Networks (GNNs) are particularly effective for NCO due to their ability to capture the relational structure inherent in many such problems. In this work, we examine the supervised non-autoregressive (NAR) solution construction framework, revealing a misalignment between training objective and solution quality. Specifically, through experiments on six GNN architectures across three problems—Traveling Salesperson Problem (TSP), Maximum Independent Set (MIS), and Minimum Vertex Cover (MVC)—we show that lower training loss does not correlate with lower optimality gap. To address this, we propose a supervised autoregressive (AR) framework that leverages the conditional dependencies between variables by training to complete partial solutions. Empirical results show that the proposed AR framework does not exhibit the same misalignment and consistently improves performance. We further compare the proposed AR framework against existing supervised GNN-based methods and achieve superior performance, especially in terms of generalizing to larger problem instances.

3208PLHF: Prompt Learning from Few-shot Human Feedback

[openreview] [pdf]

Abstract Recent advances explore prompt tuning for large language models (LLMs) and develop automatic optimization frameworks to obtain suitable prompts with respect to desired output quality metrics. Although existing approaches can handle conventional tasks such as fixed-solution question answering, defining the metric becomes complicated when the output quality cannot be easily assessed by comparisons with standard golden samples, especially for those natural language applications that multiple outputs are equally valid. Consequently, optimizing the prompts effectively and efficiently without a clear metric becomes a critical challenge. To address this issue, we present PLHF, a few-shot prompt optimization framework inspired by the well-known RLHF technique. Different from naive strategies involving human experts, PLHF employs a specific evaluator module acting as the metric to estimate the output quality. PLHF requires only a single round of human feedback to complete the entire prompt optimization process. Empirical results on both public and industrial datasets show that PLHF significantly outperforms existing output scoring strategies for LLM prompt optimizations.

3209Evidence-Enhanced Triplet Generation Framework for Hallucination Alleviation in Generative Question Answering

[openreview] [pdf]

Abstract To address the hallucination in generative question answering (GQA) where the answer can not be derived from the document, we propose a novel evidence-enhanced triplet generation framework, EATQA, encouraging the model to predict all the combinations of ⟨Question, Evidence, Answer⟩ triplet by flipping the source pair and the target label to understand their logical relationships, i.e., predict Answer(A), Question(Q), and Evidence(E) given a QE, EA, and QA pairs, respectively. Furthermore, we bridge the distribution gap to distill the knowledge from evidence in inference stage. Our framework ensures the model to learn the logical relation between query, evidence and answer, which simultaneously improves the evidence generation and query answering. In this paper, we apply EATQA to LLama and it outperforms other LLMs-based methods and hallucination mitigation approaches on two challenging GQA benchmarks. Further analysis shows that our method not only keeps prior knowledge within LLM, but also mitigates hallucination and generates faithful answers.

3210Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

[openreview] [pdf]

Abstract The mixture ratio of data from different source domains significantly affects the performance of language models (LM) pretraining. In this paper, we introduce~\textsc{Domain2Vec}, a novel approach that decomposes any dataset into a linear combination of several ``Meta-Domains’', a new concept designed to capture key underlying features of datasets. \textsc{Domain2Vec} maintains a vocabulary of Meta-Domains and uses a Meta-Domain Classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors enable the identification of optimal data mixture ratio for LM pretraining in a training-free manner under the \textit{\textbf{D}istribution \textbf{A}lignment \textbf{A}ssumption} (DA2^{2}). Moreover, previous work could use \textsc{Domain2vec} to model the relationship between domain vectors and LM performance, greatly enhancing the scalability of previous methods without retraining as new datasets are introduced. Extensive experiments demonstrate that \textsc{Domain2Vec} finds data mixture ratios that enhance downstream task performance with minimal computational overhead. Specifically, \textsc{Domain2Vec} achieves the same validation loss on Pile-CC using only 51.551.5% of the compute required when training on the original mixture of The Pile Dataset. Under equivalent compute budget, \textsc{Domain2Vec} improves downstream performance by an average of 2.722.72%. \textsc{Domain2Vec} serves as a strong and efficient baseline for data mixture optimization in LM pretraining, offering insights into improving data efficiency in large-scale models.

3211In-Context Editing: Learning Knowledge from Self-Induced Distributions

[openreview] [pdf]

Abstract In scenarios where language models must incorporate new information efficiently without extensive retraining, traditional fine-tuning methods are prone to overfitting, degraded generalization, and unnatural language generation. To address these limitations, we introduce Consistent In-Context Editing (ICE), a novel approach leveraging the model’s in-context learning capability to optimize towards a contextual distribution rather than a one-hot target. ICE introduces a simple yet effective optimization framework for the model to internalize new knowledge by aligning its output distributions with and without additional context. This method enhances the robustness and effectiveness of gradient-based tuning methods, preventing overfitting and preserving the model’s integrity. We analyze ICE across four critical aspects of knowledge editing: accuracy, locality, generalization, and linguistic quality, demonstrating its advantages. Experimental results confirm the effectiveness of ICE and demonstrate its potential for continual editing, ensuring that the integrity of the model is preserved while updating information.

3212Scalable Bayesian Learning with posteriors

[openreview] [pdf]

Abstract Although theoretically compelling, Bayesian learning with modern machine learning models is computationally challenging since it requires approximating a high dimensional posterior distribution. In this work, we (i) introduce posteriors, an easily extensible PyTorch library hosting general-purpose implementations making Bayesian learning accessible and scalable to large data and parameter regimes; (ii) present a tempered framing of stochastic gradient Markov chain Monte Carlo, as implemented in posteriors, that transitions seamlessly into optimization and unveils a minor modification to deep ensembles to ensure they are asymptotically unbiased for the Bayesian posterior, and (iii) demonstrate and compare the utility of Bayesian approximations through experiments including an investigation into the cold posterior effect and applications with large language models.

3213BiDoRA: Bi-level Optimization-based Weight-Decomposed Low-Rank Adaptation

[openreview] [pdf]

Abstract Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has gained considerable attention as a flexible and efficient way of adapting LLMs to downstream tasks. Among these methods, weighted decomposed low-rank adaptation (DoRA) has emerged as a promising approach. DoRA bridges the gap between low-rank adaptation (LoRA) and full fine-tuning (FT) by decomposing the weight matrices into magnitude and direction components, thereby maintaining learning behavior similar to FT. Although DoRA shows encouraging performance, it introduces additional parameters compared to LoRA, which potentially increases the risk of overfitting. Moreover, optimizing magnitude and direction simultaneously leads to a coupled gradient updating pattern for both components, limiting its learning capacity. To overcome these limitations, we propose BiDoRA, a bi-level optimization-based PEFT method. In BiDoRA, the direction and magnitude components are optimized on two distinct datasets at different optimization levels, mitigating the risk of overfitting. Additionally, the asynchronous optimization of the two components promotes their decoupling, allowing for more flexible gradient updates suitable for various downstream tasks. Evaluation of BiDoRA on fourteen datasets spanning natural language understanding, natural language generation, and token classification reveals that it significantly outperforms DoRA and other PEFT methods. The superior performance of BiDoRA underscores its effectiveness. The code for BiDoRA is available athttps://anonymous.4open.science/r/BiDoRA-5D31.

3214Selective Preference Optimization via Token-Level Reward Function Estimation

[openreview] [pdf]

Abstract Recent advancements in large language model alignment leverage token-level supervisions to perform fine-grained preference optimization. However, existing token-level alignment methods either optimize on all available tokens, which can be noisy and inefficient, or perform selective training with complex and expensive key token selection strategies. In this work, we propose Selective Preference Optimization (SePO), a novel selective alignment strategy that centers on efficient key token selection without requiring strong, fine-grained supervision signals. We theoretically prove the feasibility of Direct Preference Optimization (DPO) as token-level reward function estimators, which applies to any existing alignment datasets and enables cost-efficient token selection with small-scale model sizes and training data. We then train an oracle model with DPO on the target data and utilize the estimated reward function to score all tokens within the target dataset, where only the key tokens are selected to supervise the target policy model with a contrastive objective function. Extensive experiments on three public evaluation benchmarks show that SePO significantly outperforms competitive baseline methods by only optimizing on 30% key tokens. We also explore SePO as a new paradigm for weak-to-strong generalization, showing that weak oracle models effectively supervise strong policy models with up to 16.8×\times more parameters. SePO also selects useful supervision signals from out-of-distribution data, alleviating the over-optimization problem.

3215Large Legislative Models: Towards Efficient AI Policymaking in Economic Simulations

[openreview] [pdf]

Abstract The improvement of economic policymaking presents an opportunity for broad societal benefit, a notion that has inspired research towards AI-driven policymaking tools. AI policymaking holds the potential to surpass human performance through the ability to process data quickly at scale. However, existing RL-based methods exhibit sample inefficiency, and are further limited by an inability to flexibly incorporate nuanced information into their decision-making processes. Thus, we propose a novel method in which we instead utilize pre-trained Large Language Models (LLMs), as sample-efficient policymakers in socially complex multi-agent reinforcement learning (MARL) scenarios. We demonstrate significant efficiency gains, outperforming existing methods across three environments.

3216Speculative Streaming: Fast LLM Inference without Auxiliary Models

[openreview] [pdf]

Abstract Speculative decoding is a prominent technique to accelerate large language model inference by leveraging predictions from an auxiliary draft model. While effective, in application-specific settings, it often involves fine-tuning both draft and target models to achieve high acceptance rates. As the number of downstream tasks grows, draft models add significant complexity to inference systems. Recently several single model architectures viz. Medusa have been proposed to speculate tokens in non-autoregressive manner, however, their effectiveness is limited due to lack of dependency between speculated tokens. We introduce a novel speculative decoding method that integrates drafting within the target model by using Multi-stream attention and incorporates future token planning into supervised finetuning objective. To the best of our knowledge, this is the first parameter-efficient approach that scales well with an increasing number of downstream tasks while enhancing downstream metrics and achieving high acceptance rates, attributable to the interdependence among the speculated tokens. Speculative Streaming speeds up decoding by 1.9 - 3X in a diverse set of tasks, such as Summarization, Structured Queries, and Meaning Representation, while improving generation quality and using ∼10000X fewer extra parameters than alternative architectures, making it ideal for resource-constrained devices. Our approach can also be effectively deployed in lossless settings for generic chatbot applications that do not necessitate supervised fine-tuning. In such setups, we achieve 2.9 - 3.2X speedup while maintaining the integrity of the base model’s output.

3217Who Should Join the Decision-Making Table? Targeted Expert Selection for Enhanced Human-AI Collaboration

[openreview] [pdf]

Abstract Integrating AI and human expertise can significantly enhance decision-making across various scenarios. This paper introduces a novel approach that leverages the Product of Experts (PoE) model to optimize decision-making by strategically combining AI with human inputs. While human experts bring diverse perspectives, their decisions may be constrained by biases or knowledge gaps. To address these limitations, we propose an AI agent that provides probabilistic, rule-based insights, complementing and filling human experts’ knowledge gaps. A key feature of our approach is the strategic selection of human experts based on how well their knowledge complements or enhances the AI’s recommendations. By dynamically adapting the expert selection process, we ensure that decisions benefit from the most impactful and complementary inputs. Our PoE model calibrates inputs from both AI and human experts, leveraging their combined strengths to improve decision outcomes. Furthermore, operating in an online setting, our framework can also continuously update the AI’s knowledge and refine expert selection criteria, ensuring adaptability to evolving environments. Experiments in simulation environments demonstrate that our model effectively integrates logic rule-informed AI with human expertise, enhancing collaborative decision-making.

3218Balancing Label Quantity and Quality for Scalable Elicitation

[openreview] [pdf]

Abstract Scalable oversight studies methods of training and evaluating AI systems in domains where human judgement is unreliable or expensive, such as scientific research and software engineering in complex codebases. Recent work in this area by Burns et al. (2023) suggests that Language Models (LMs) pretrained on internet-scale corpora exhibit an inductive bias toward producing correct answers, even when finetuned on error-prone labels produced by a smaller language model. This suggests that massive pretraining combined with finetuning on imperfect human labels may be a solid baseline method for scalable oversight. In the real world, however, label quality is not fixed: practitioners face a quantity-quality tradeoff when generating finetuning data. In this paper, we explore the microeconomics of the quantity-quality tradeoff on binary NLP classification tasks used in Burns et al. (2023). We find that there are three regimes of eliciting classification knowledge from pretrained models using supervised finetuning: quantity-dominant, quality-dominant, and a mixed regime involving the use of low- and high-quality data together to attain higher accuracy at a lower cost than using either alone. We explore sample-efficient elicitation methods that make use of two datasets of differing qualities, and establish a Pareto frontier of scalable elicitation methods that optimally trade off labeling cost and classifier performance.

3219LAM Simulator: Advancing Large Action Model Training for Agent via Online Exploration and Feedback Simulation

[openreview] [pdf]

Abstract Large Action Models (LAMs) for AI agents have significant potential, but their development is often constrained by the reliance on supervised learning and manual data curation, which are both time-consuming and costly. To address these limitations, we present the LAM Simulator, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. This framework includes a curated set of high-quality agentic tasks, a diverse collection of tools, and an interactive environment where agent models can call tools, receive execution responses, and obtain action feedback. Our findings indicate that the LAM Simulator significantly enhances model performance and effectively identifies and addresses potential issues. Specifically, our model, LAM-Sim-8x7B, demonstrates an 18.54% improvement over its base LAM and significantly outperforms other state-of-the-art alternatives on ToolEval benchmark. Furthermore, we have demonstrated that LLMs lacking in agentic capability can greatly benefit from the implementation of LAM Simulator. Our experiments with a model trained on Mixtral-8x7B-Instruct-v0.1 have yielded a doubling to tripling of performance. Remarkably, the data construction process for training these models requires minimal human intervention, making the LAM Simulator a robust framework for accelerating the development of AI agents.

3220Online Gradient Boosting Decision Tree: In-Place Updates for Adding/Deleting Data

[openreview] [pdf]

Abstract Gradient Boosting Decision Tree (GBDT) is one of the most popular machine learning models in various applications. But in the traditional settings, all data should be simultaneously accessed in the training procedure: it does not allow to add or delete any data instances after training. In this paper, we propose a novel online learning framework for GBDT supporting both incremental and decremental learning. To the best of our knowledge, this is the first work that considers an in-place unified incremental and decremental learning on GBDT. To reduce the learning cost, we present a collection of optimizations for our framework, so that it can add or delete a small fraction of data on the fly. We theoretically show the relationship between the hyper-parameters of the proposed optimizations, which enables trading off accuracy and cost on incremental and decremental learning. The backdoor attack results show that our framework can successfully inject and remove backdoor in a well-trained model using incremental and decremental learning, and the empirical results on public datasets confirm the effectiveness and efficiency of our proposed online learning framework and optimizations.

3221ShuffleMTM: Learning Cross-channel Dependence in Multivariate Time Series from Shuffled Patches

[openreview] [pdf]

Abstract Masked time-series modeling has widely gained attention as a self-supervised pre-training method for multivariate time series (MTS). Recent studies adopt a channel-independent (CI) strategy to enhance the temporal modeling capacity. Despite the effectiveness and performance of this strategy, the CI methods inherently overlook cross-channel dependence, which is inherent and crucial in MTS data in various domains. To fill this gap, we propose ShuffleMTM, a simple yet effective masked time-series modeling framework to learn cross-channel dependence from shuffled patches. Technically, ShuffleMTM proposes to shuffle the unmasked patches from masked series across different channels, positioned at the same index. Then, Siamese encoders learn two views of masked patch representations from original and shuffled masked series, simultaneously capturing the temporal dependence within a channel as well as spatial dependence across different channels. ShuffleMTM pre-trains the Siamese encoders to reconstruct the original series by incorporating cross-channel information with intra-channel cross-time information. Our proposed method consistently achieves superior performance in various experiments, compared to advanced CI pre-training methods and channel-dependent methods in both time series forecasting and classification tasks.

3222Zero-shot Outlier Detection via Synthetically Pretrained Transformers: Model Selection Bygone!

[openreview] [pdf]

Abstract Outlier detection (OD) has a vast literature as it finds numerous applications in environmental monitoring, security, manufacturing, and finance to name a few. Being an inherently unsupervised task, model selection is a key bottleneck for OD (both algorithm and hyperparameter selection) without label supervision. There is a long list of techniques to choose from – both classical algorithms and deep neural architectures – and while several studies report their hyperparameter sensitivity, the literature remains quite slim on unsupervised model selection—limiting the effective use of OD in practice. In this paper we present FoMo-0D, for zero/0-shot OD exploring a transformative new direction that bypasses the hurdle of model selection altogether (!), thus breaking new ground. The fundamental idea behind FoMo-0D is the Prior-data Fitted Networks, recently introduced by Müller et al. (2022), which trains a Transformer model on a large body of synthetically generated data from a prior data distribution. In essence, FoMo-0D is a pretrained Foundation Model for zero/0-shot OD on tabular data, which can directly predict the (outlier/inlier) label of any test data at inference time, by merely a single forward pass—making obsolete the need for choosing an algorithm/architecture and tuning its associated hyperparameters, besides requiring no training of model parameters when given a new OD dataset. Extensive experiments on 57 public benchmark datasets against 26 baseline methods show that FoMo-0D performs statistically no different from the 2nd top baseline, while significantly outperforming the majority of the baselines, with an average inference time of 7.7 ms per test sample.

3223Convex Distillation: Efficient Compression of Deep Networks via Convex Optimization

[openreview] [pdf]

Abstract Deploying large and complex deep neural networks on resource-constrained edge devices poses significant challenges due to their computational demands and the complexities of non-convex optimization. Traditional compression methods such as distillation and pruning often retain non-convexity that complicates fine-tuning in real-time on such devices. Moreover, these methods often necessitate extensive end-to-end network fine-tuning after compression to preserve model performance, which is not only time-consuming but also requires fully annotated datasets, thus potentially negating the benefits of efficient network compression. In this paper, we introduce a novel distillation technique that efficiently compresses the model via convex optimization -- eliminating intermediate non-convex activation functions and using only intermediate activations from the original model. Our approach enables distillation in a label-free data setting and achieves performance comparable to the original model without requiring any post-compression fine-tuning. We demonstrate the effectiveness of our method for image classification models on multiple standard datasets, and further show that in the data limited regime, our method can outperform standard non-convex distillation approaches. Our method promises significant advantages for deploying high-efficiency, low-footprint models on edge devices, making it a practical choice for real-world applications. We show that convex neural networks, when provided with rich feature representations from a large pre-trained non-convex model, can achieve performance comparable to their non-convex counterparts, opening up avenues for future research at the intersection of convex optimization and deep learning.

3224Bridging The Gap Between Training and Testing for Certified Robustness

[openreview] [pdf]

Abstract Certified robustness provides a theoretical lower bound for adversarial robustness and arouses widespread interests and discussions from the research community. With theoretical support to improve the certified robustness on the training set, practitioners endeavor to train a more certified robust model during inference on the test set. However, the experimental neglect on the training set and the theoretical ignorance during inference on the test set induce a gap between training and testing for certified robustness. By establishing an equivalence between the convergence of training loss and the improvement of certified robustness, we recognize there is a trade-off between expressive power and generalization (assuming a well-conditioned optimization) for certified robustness, which is similar to the underfitting and overfitting discussed in machine learning. To investigate this trade-off, we design a new orthogonal convolution-Controllable Orthogonal Convolution Kernel (COCK) which provides a wider range of expressive power than existing orthogonal convolutions. Empirically, there is a power-driven shift from vanilla classification accuracy to certified robustness in the sense of the optimal trade-off between expressive power and generalization. The experimental results suggest that by carefully improving the expressive power from the optimal trade-off for vanilla classification performance, the model will be more certified robust.

3225SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation

[openreview] [pdf]

Abstract As advancements in large language models (LLMs) continue and the demand for personalized models increases, parameter-efficient fine-tuning (PEFT) methods (e.g., LoRA) will become essential due to their efficiency in reducing computation costs. However, recent studies have raised alarming concerns that LoRA fine-tuning could potentially compromise the safety alignment in LLMs, posing significant risks for the model owner. In this paper, we first investigate the underlying mechanism by analyzing the changes in safety alignment related features before and after fine-tuning. Then, we propose a fixed safety module calculated by safety data and a task-specific initialization for trainable parameters in low-rank adaptations, termed Safety-alignment preserved Low-Rank Adaptation (SaLoRA). Unlike previous LoRA methods and their variants, SaLoRA enables targeted modifications to LLMs without disrupting their original alignments. Our experiments show that SaLoRA outperforms various adapters-based approaches across various evaluation metrics in different fine-tuning tasks.

3226Sanitizing LLMs: Retrospective Learning for Self-Correction of Inconsistent Samples via User Preferences

[openreview] [pdf]

Abstract With the advent of large language models (LLMs), using LLMs in conjunction with prompt-based tasks has demonstrated the ability to reduce the high cost and inefficiency of human annotations. Nonetheless, in unsupervised new downstream tasks that require user preferences to align data annotations with expectations, existing evaluation methods for prompt-based tasks become ineffective, especially when ground truth annotations are insufficient or missing. To fill this gap, we propose the novel Consistent and Inconsistent (CAI) Ratio, inspired by our experimental observation that LLMs underperform when the number of inconsistent samples—those with inconsistent predictions across LLMs and the student model—exceeds the number of consistent samples. By estimating the CAI ratio and identifying consistent and inconsistent samples with our proposed CAI identification approach, we aim to minimize inconsistency and enhance the accuracy of LLM-generated annotations for unsupervised data. To achieve this, we introduce Retrospective Learning (RL) with user preference, a data-centric approach that collaborates with the student model and LLMs, using a small number of human annotations as user preferences to resolve inconsistencies in the identified samples. Applied to five domain-specific NLP datasets, our Retrospective Learning approach, leveraging CAI identification, significantly improved the accuracy of LLM-generated responses, with the CAI ratio increasing as the accuracy improved.

3227Finite-Time Analysis for Conflict-Avoidant Multi-Task Reinforcement Learning

[openreview] [pdf]

Abstract Multi-task reinforcement learning (MTRL) has shown great promise in many real-world applications. Existing MTRL algorithms often aim to learn a policy that optimizes individual objective functions simultaneously with a given prior preference (or weights) on different tasks. However, these methods often suffer from the issue of gradient conflict such that the tasks with larger gradients dominate the update direction, resulting in a performance degeneration on other tasks. In this paper, we develop a novel dynamic weighting multi-task actor-critic algorithm (MTAC) under two options of sub-procedures named as CA and FC in task weight updates. MTAC-CA aims to find a conflict-avoidant (CA) update direction that maximizes the minimum value improvement among tasks, and MTAC-FC targets at a much faster convergence rate. We provide a comprehensive finite-time convergence analysis for both algorithms. We show that MTAC-CA can find a ϵ+ϵapp\epsilon+\epsilon_{\text{app}}-accurate Pareto stationary policy using O(ϵ5)\mathcal{O}({\epsilon^{-5}}) samples, while ensuring a small ϵ+ϵapp\epsilon+\sqrt{\epsilon_{\text{app}}}-level CA distance (defined as the distance to the CA direction), where ϵapp\epsilon_{\text{app}} is the function approximation error. The analysis also shows that MTAC-FC improves the sample complexity to O(ϵ3)\mathcal{O}(\epsilon^{-3}), but with a constant-level CA distance. Our experiments on MT10 demonstrate the improved performance of our algorithms over existing MTRL methods with fixed preference.

3228PersonaEval: Benchmarking LLMs on Role-Playing Evaluation Tasks

[openreview] [pdf]

Abstract Role-playing in large language models (LLMs) has become a crucial area of research, enabling models to simulate diverse personas and tailor responses, significantly impacting natural language understanding and human-computer interaction. However, while advanced LLMs like GPT-4 are used to evaluate role-playing methods, their reliability in providing accurate assessments remains uncertain, especially in distinguishing nuanced role-playing characteristics. In this paper, we introduce PersonaEval, a benchmark designed to assess the effectiveness of LLMs in role-playing evaluation tasks. We frame the problem as a classification task to determine whether an LLM evaluator can distinguish between sentences from different levels of expertise based solely on linguistic cues. Using real-world data from the Wired 5 Levels video series—where experts explain concepts to five distinct audiences: a child, a teenager, a college student, a graduate student, and another expert—we design three evaluation settings that correspond to commonly used LLM evaluation approaches: five-level classification, pairwise role comparison, and few-shot learning. These settings aim to capture various aspects of how effectively LLMs evaluate role-playing performance. Our study highlights the limitations of current LLMs in persona evaluation tasks and underscores the need for further research to enhance their evaluation capabilities. We provide a foundation for future work aimed at improving the accuracy and professionalism of LLM evaluators in role-playing contexts.

3229AnyGraph: Graph Foundation Model in the Wild

[openreview] [pdf]

Abstract The growing ubiquity of relational data structured as graphs has underscored the need for graph learning models with exceptional generalization capabilities. However, current approaches often struggle to effectively extract generalizable insights, frequently requiring extensive fine-tuning and limiting their versatility. Graph foundation models offer a transformative solution, with the potential to learn robust, generalizable representations from graph data. This enables more effective and adaptable applications across a wide spectrum of tasks and domains. In this work, we investigate a unified graph model, AnyGraph, designed to handle key challenges: i) Structure Heterogenity. Addressing distribution shift in graph structural information; ii) Feature Heterogenity. Handling diverse feature representation spaces across graph datasets; iii) Fast Adaptation. Efficiently adapting the model to new graph domains; iv) Scaling Law Emergence. Enabling the model to exhibit scaling law behavior, where its performance scales favorably with the amount of data and parameter sizes. To tackle these critical challenges, we build the AnyGraph upon a Graph Mixture-of-Experts (MoE) architecture. This approach empowers the model to effectively manage both the in-domain and cross-domain distribution shift concerning structure-level and feature-level heterogeneity. Furthermore, a lightweight graph expert routing mechanism is proposed to facilitate AnyGraph’s fast adaptability to new data and domains. Our extensive experiments on diverse 38 graph datasets have demonstrated the strong zero-shot learning performance of AnyGraph across diverse graph domains with significant distribution shift. Furthermore, we have validated the model’s fast adaptation ability and scaling law emergence, showcasing its versatility. We have anonymously released our open-sourced AnyGraph implementation at the following link:https://anonymous.4open.science/r/AnyGraph-FECD.

3230KambaAD: Enhancing State Space Models with Kolmogorov–Arnold for time series Anomaly Detection

[openreview] [pdf]

Abstract Time series anomaly detection is critical in numerous practical applications, yet existing deep learning methods often fall short of real-world demands. These models fail to swiftly filter out physically implausible anomalies, insufficiently address distributional shifts, and lack a comprehensive approach that integrates both global and local perspectives for anomaly detection. Moreover, most successful models rely on channel-dependent methods that tend to treat all features at the same timestamp as a single token and then focus on finding relationships between these tokens. This approach overlooks the unique periodicities, trends, and lagged relationships between different features, leading to suboptimal performance. To address these limitations, we propose KambaAD, a model comprised of an Encoder and Reconstructor. The Encoder integrates the strengths of the Kolmogorov-Arnold Network (KAN), the attention mechanism, and the Selective Structured State Space Model (MAMBA). Specifically, KAN is employed to swiftly enforce data consistency, enabling rapid detection of anomalies that violate physical laws. The attention mechanism ensures balanced processing of global information while enhancing the representation of key data characteristics. We leverage MAMBA’s capability as a sequence model to capture anomalies caused by local variations. Additionally, its internal selection mechanism allows the model to effectively handle distribution shifts, ensuring robustness and adaptability in the presence of changing data distributions. Additionally, the framework incorporates a time-series-specific Reconstructor, which reduces computational complexity through patch-based operations that exploit local consistency in time series data. It also employs channel-independent linear reconstruction to prevent interference between different features. Through extensive experiments on multiple multivariate datasets, KambaAD consistently outperforms state-of-the-art models, demonstrating its superior performance in anomaly detection.

3231Routing Experts: Learning to Route Dynamic Experts in Existing Multi-modal Large Language Models

[openreview] [pdf]

Abstract Recently, mixture of experts (MoE) has become a popular paradigm for achieving the trade-off between modal capacity and efficiency of multimodal large language models (MLLMs). Different from previous efforts, we are dedicated to exploring the dynamic experts in existing MLLMs and showing that a standard MLLM can also be a mixture of experts. However, achieving this target is still notoriously challenging. The well-trained MLLMs are more accustomed to the fixed pathway and a drastic change in its inference manner also greatly impedes its performance. To address these issues, we propose a novel dynamic expert routing method for existing MLLMs, termed Routing Experts (RoE), which can achieve example-dependent optimal path routing without obvious structure tweaks. Meanwhile, a new structure sparsity regularization is also introduced to force the well-trained MLLMs to learn more short-cut pathways. In addition, we also address the alignment of the training and inference of MLLMs in terms of network routing. To validate RoE, we apply it to a set of existing MLLMs, including LLaVA-1.5, LLaVA-HR and VILA, and conduct extensive experiments on a bunch of VL benchmarks. The experiment results not only show the effectiveness of our RoE in improving MLLMs’ efficiency, but also yield obvious advantages over MoE-LLaVA in both performance and speed, e.g., an average performance gain of 3.3% on 5 benchmarks while being 1.61 times faster. Our code is anonymously released athttps://anonymous.4open.science/r/AnonymousRoE-6FE6

3232New Paradigm of Adversarial Training: Breaking Inherent Trade-Off between Accuracy and Robustness via Dummy Classes

[openreview] [pdf]

Abstract Adversarial Training (AT) is recognized as one of the most effective methods to enhance the robustness of Deep Neural Networks (DNNs). However, existing AT methods suffer from an inherent trade-off between adversarial robustness and clean accuracy, which seriously hinders their real-world deployment. Previous works have studied this trade-off within the current AT paradigm, exploring various factors such as perturbation intensity, label noise and class margin. Despite these efforts, current AT methods still typically experience a reduction in clean accuracy by over 10% to date, without significant improvements in robustness compared with simple baselines like PGD-AT. This inherent trade-off raises a question: whether the current AT paradigm, which assumes to learn the corresponding benign and adversarial samples as the same class, inappropriately combines clean and robust objectives that may be essentially inconsistent. In this work, we surprisingly reveal that up to 40% of CIFAR-10 adversarial samples always fail to satisfy such an assumption across various AT methods and robust models, explicitly indicating the improvement room for the current AT paradigm. Accordingly, to relax the tension between clean and robust learning derived from this overstrict assumption, we propose a new AT paradigm by introducing an additional dummy class for each original class, aiming to accommodate the hard adversarial samples with shifted distribution after perturbation. The robustness w.r.t. these adversarial samples can be achieved by runtime recovery from the predicted dummy classes to their corresponding original ones, eliminating the compromise with clean learning. Building on this new paradigm, we propose a novel plug-and-play AT technology named DUmmy Classes-based Adversarial Training (DUCAT). Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that the DUCAT concurrently improves clean accuracy and adversarial robustness compared with state-of-the-art benchmarks, effectively breaking the existing inherent trade-off. The code is available athttps://anonymous.4open.science/r/DUCAT.

3233WILT: A Multi-Turn, Memorization-Robust Inductive Logic Benchmark for LLMs

[openreview] [pdf]

Abstract While large language models (LLMs) have shown impressive abilities across a wide range of domains, they still encounter significant challenges in reasoning tasks that require gathering evidence over multiple turns and drawing logical conclusions from this evidence. These challenges present significant obstacles for LLM chat user interfaces, which rely on multi-turn interactions to facilitate effective collaboration. This limitation leads to real-world issues; for example, service chatbots must gather necessary information from customers over multiple turns to diagnose and resolve problems effectively. Despite the multi-turn nature of many real-world LLM use cases, most existing benchmarks rely on carefully curated single-turn tests, which often blur the line between memorization and genuine reasoning. To address this, we introduce the Wason Inductive Logic Test (WILT)\textbf{Wason Inductive Logic Test (WILT)}, a simple yet challenging multi-turn reasoning benchmark designed to resist memorization. WILT is inspired by the Wason 2-4-6 task, where participants must infer a basic boolean function involving three variables (e.g., x<y<zx < y < z) by proposing test cases (such as (2,4,6)(2, 4, 6)). In WILT, each test starts from a clean slate, with only the initial instructions provided, preventing models from relying on pre-learned responses. Over several turns, models must interact with the environment by suggesting test cases to narrow the possible hypotheses and ultimately infer the hidden function based on the outcomes. Our findings reveal that LLMs struggle with this task, exhibiting various strengths and weaknesses: some are better at narrowing down the hypothesis space by proposing valuable test cases, while others are more adept at deducing the hidden function from observed cases. Despite these variations, the best-performing model achieves only 28% accuracy, highlighting a significant gap in LLM performance on complex multi-turn reasoning tasks.

3234TUI: A Conformal Uncertainty Indicator for Continual Test-Time Adaptation

[openreview] [pdf]

Abstract Continual Test-Time Adaptation (CTTA) addresses the challenge of adapting models to sequentially changing domains during the testing phase. Since no ground truth labels are provided, existing CTTA methods rely on pseudo-labels for self-adaptation. However, CTTA is prone to error accumulation, where incorrect pseudo-labels can negatively impact subsequent model updates. Critically, during testing, a CTTA method can not detect its mistakes, which may then propagate through further adaptations. In this paper, we propose a simple uncertainty indicator called TUI for the CTTA task based on Conformal Prediction (CP), which generates a set of possible labels for each test sample, ensuring that the true label is included within this set with a given coverage probability. Specifically, since domain shifts can undermine the coverage of predictions, making uncertainty estimation less dependable, we propose compensating for coverage by dynamically measuring the domain difference between the target and source domains in continuously changing environments. Moreover, after estimating uncertainty, we separate reliable test pseudo-labels and use them to discriminatively enhance the adaptation process. Empirical results demonstrate that our algorithm effectively estimates the uncertainty for CTTA under a specified coverage probability and improves adaptation performance across various existing CTTA methods.

3235Towards Efficient Adaptation of Pruning Strategy in Large Language Models

[openreview] [pdf]

Abstract Post-training pruning has gained increasing attention with the rapid growth of large language models (LLMs). However, significant variations in weight distributions across different LLMs make a fixed pruning strategy inadequate for multiple models. In this paper, we propose an efficient evolutionary optimization framework, \textbf{Mecon}, for adaptive LLM pruning. In particular, we design an effective search space built on our \textbf{Me}ta pruning metric to mitigate diverse weight distributions among LLMs. We then introduce model-wise re\textbf{con}struction error, a lightweight search evaluation to speed up the evaluation of each search trial. We finally leverage Non-dominated Sorting Genetic Algorithm III (NSGA-III) as our search algorithm, handling both the single-objective problem of pruning metric search and the multi-objective problem of layerwise sparsity ratio search in discovering the optimal pruning strategy. We extensively evaluate our framework on LLaMA-1/2/3 and Mistral models across multiple benchmarks. Our results demonstrate that our adaptive pruning metrics consistently outperform existing ones, and the layerwise sparsity ratios improve the effectiveness of other pruning metrics. Furthermore, we validate the cross-task and cross-model generalizability of our pruning metrics, offering a cost-effective solution to streamline the search process. We release our code in the anonymous repository: \textcolor{blue}{\url{https://anonymous.4open.science/r/Mecon-5819}}.

3236Optimizing Q-Learning Using Expectile Regression: A Dual Approach to Handle In-Sample and Out-of-Sample Data

[openreview] [pdf]

Abstract Offline Reinforcement Learning (RL) presents unique challenges, primarily due to the constraint of learning from a static dataset without additional environmental interaction. Traditional methods often face limitations in effectively exploiting the available data, particularly when navigating the exploration-exploitation trade-off inherent in RL. This paper introduces a novel algorithm inspired by Implicit Q-Learning, designed to extend the utility of the Bellman update to actions not explicitly present in the dataset. Our approach, termed Extended Implicit Q-Learning (EIQL), strategically incorporates actions beyond the dataset constraints by allowing selection actions with maximum Q. By doing so, it leverages the maximization capability of the Bellman update, while simultaneously mitigating error extrapolation risks. We demonstrate the efficacy of EIQL through a series of experiments that show its improved performance over traditional offline RL algorithms, particularly in environments characterized by sparse rewards or those containing suboptimal and incomplete trajectories. Our results suggest that EIQL enhances the potential of offline RL by utilizing a broader action spectrum.

3237‘No’ Matters: Out-of-Distribution Detection in Multimodality Long Dialogue

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection in multimodal contexts is essential for identifying deviations in combined inputs from different modalities, particularly in applications like open-domain dialogue systems or real-life dialogue interactions. This paper aims to improve the user experience that involves multi-round long dialogues by efficiently detecting OOD dialogues and images. We introduce a novel scoring framework namedDialogueImageAligning andEnhancingFramework (DIAEF) that integrates the visual language models with the novel proposed scores that detect OOD in two key scenarios (1) mismatches between the dialogue and image input pair and (2) input pairs with previously unseen labels. Our experimental results, derived from various benchmarks, demonstrate that integrating image and multi-round dialogue OOD detection is more effective with previously unseen labels than using either modality independently. In the presence of mismatched pairs, our proposed score effectively identifies these mismatches and demonstrates strong robustness in long dialogues. This approach enhances domain-aware, adaptive conversational agents and establishes baselines for future studies.

3238Certified PEFTSmoothing: Parameter-Efficient Fine-Tuning with Randomized Smoothing

[openreview] [pdf]

Abstract Randomized smoothing is the primary certified robustness method for accessing the robustness of deep learning models to adversarial perturbations in the l2l_2-norm, by taking a majority vote over the multiple predictions of a random Gaussian perturbed input of the base classifier. To fulfill the certified bound and empirical accuracy of randomized smoothing, the base model either needs to be retrained from scratch to learn Gaussian noise or adds an auxiliary denoiser to eliminate it. In this work, we propose \textit{PEFTSmoothing}, which teach the base model to learn the Gaussian noise-augmented data with Parameter-Efficient Fine-Tuning (PEFT) methods in both white-box and black-box settings. This design is based on the intuition that large-scale models have the potential to learn diverse data patterns, including the noise data distributions. In addition, we explore the possibility of combining \textit{PEFTSmoothing} with the fine-tuning for downstream task adaptation, which allows us to simultaneously obtain a robust version of the large vision model and its adaptation tailored to downstream datasets. Extensive results demonstrate the effectiveness and efficiency of \textit{PEFTSmoothing}, which allow us to certify over 98% accuracy for ViT on CIFAR-10, 20% higher than SoTA denoised smoothing, and over 61% accuracy on ImageNet which is 30% higher than CNN-based denoiser and comparable to the Diffusion-based denoiser.

3239Conv-Basis: A New Paradigm for Efficient Attention Inference and Gradient Computation in Transformers

[openreview] [pdf]

Abstract The self-attention mechanism is the key to the success of transformers in recent Large Language Models (LLMs). However, the quadratic computational cost O(n2)O(n^2) in the input sequence length nn is a notorious obstacle for further improvement and scalability in longer contexts. In this work, we leverage the convolution-like structure of attention matrices to develop an efficient approximation method for attention computation using convolution matrices. We propose a conv\mathsf{conv} basis system, analogous to the rank basis, and show that any lower triangular matrix can always be decomposed as a sum of structured convolution matrices in this basis. We then design a fast algorithm to approximate the attention matrix via a sum of such kk convolution matrices. This allows us to compute the attention {\it inference} via Fast Fourier Transforms (FFT) in O(kndlogn)O(knd \log n) time, where dd is the hidden dimension, and thus achieve almost linear time n1+o(1)n^{1+o(1)} in the practical scenario where kd=no(1)kd = n^{o(1)}. Furthermore, the attention {\it training forward} and {\it backward gradient} can be computed in n1+o(1)n^{1+o(1)} as well. We provide theoretical guarantees on the run time and approximation error and conduct preliminary experiments to evaluate its effectiveness. We hope our new paradigm for accelerating attention computation in transformer models can help their application to longer contexts.

3240Understanding Optimization of Operator Networks with Variational Loss for Solving PDEs

[openreview] [pdf]

Abstract In this paper, we analyze the optimization of operator networks for solving elliptic PDEs with variational loss functions. While approximation and generalization errors in operator networks have been extensively studied, optimization error remains largely unexplored. We apply Restricted Strong Convexity (RSC) theory to rigorously examine the optimization dynamics of operator networks trained with variational loss, providing theoretical guarantees for convergence and training stability. We further investigate the role of the condition number of AA in optimization and demonstrate that preconditioning strategies significantly improve convergence rates, establishing a solid theoretical basis for the empirical benefits of preconditioning. We also address the lower bound of a key quantity, qtq_t, which ensures convergence. To prevent qtq_t from vanishing, we propose an algorithm that adaptively incorporates additional weights into the variational loss function, leveraging values already computed during training, thereby avoiding any extra computational costs. Finally, we validate {our theoretical assumptions through numerical experiments, demonstrating their practical applicability} and confirming the effectiveness of preconditioning, with significant improvements in training performance and convergence rates.

3241Independently-Normalized SGD for Generalized-Smooth Nonconvex Optimization

[openreview] [pdf]

Abstract Recent studies have shown that many nonconvex machine learning problems meet a so-called generalized-smooth condition that extends beyond traditional smooth nonconvex optimization. However, the existing algorithms designed for generalized-smooth nonconvex optimization encounter significant limitations in both their design and convergence analysis. In this work, we first study deterministic generalized-smooth nonconvex optimization and analyze the convergence of normalized gradient descent under the generalized Polyak-Lojasiewicz condition. Our results provide a comprehensive understanding of the interplay between gradient normalization and function geometry. Then, for stochastic generalized-smooth nonconvex optimization, we propose an independently-normalized stochastic gradient descent algorithm, which leverages independent sampling, gradient normalization, and clipping to achieve an O(ϵ4)\mathcal{O}(\epsilon^{-4}) sample complexity under relaxed assumptions. Experiments demonstrate the fast convergence of our algorithm.

3242A Fast Federated Method for Minimax Problems with Sequential Convergence Guarantees

[openreview] [pdf]

Abstract Federated learning (FL) has recently been actively studied to collaboratively train machine learning models across clients without directly sharing data and to address data-hungry issues. Many FL works have been focusing on minimizing a loss function but many important machine learning tasks such as adversarial training, GANs, fairness learning, and AUROC maximization are formulated as minimax problems. In this paper, we propose a new federated learning method for minimax problems. Our method allows client drift and addresses the data heterogeneity issue. In theoretical analysis, we prove that our method can improve sample complexity and has convergence guarantees for the updates of the model parameters, i.e., the sequences generated by the method. Given the Kurdyka-Łojasiewicz (KL) exponent of a novel potential function related to the objective function, we demonstrate that the sequences generated by our method converge finitely, linearly, or sublinearly. Our assumptions on the KL property are weaker than previous work on the sequential convergence of centralized minimax methods. Additionally, we further weaken the KL assumption by deducing the KL exponent of the potential function from that of the original objective function. We validate our federated learning method on AUC maximization tasks. The experimental results demonstrate that our method outperforms state-of-the-art federated learning methods when the distributions of local training data are non-IID.

3243Robust Model Evaluation over Large-scale Federated Networks

[openreview] [pdf]

Abstract In this paper, we address the challenge of certifying the performance of a machine learning model on an unseen target network. We consider a source network “A” of KK clients, each with private data from unique and heterogeneous distributions, assumed to be independent samples from a broader meta-distribution μ \mu . Our goal is to provide certified guarantees for the model’s performance on a different, unseen target network “B,” governed by another meta-distribution μ \mu' , assuming the deviation between μ\mu and μ\mu' is bounded by either the {\it Wasserstein} distance or an ff-{\it divergence}. We derive theoretical guarantees for the model’s empirical average loss and provide uniform bounds on the risk CDF, where the latter correspond to novel and adversarially robust versions of the Glivenko-Cantelli theorem and the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. Our bounds are computable in polynomial time with a polynomial number of queries to the KK clients, preserving client privacy by querying only the model’s (potentially adversarial) loss on private data. We also establish non-asymptotic generalization bounds that consistently converge to zero as both KK and the minimum client sample size grow. Extensive empirical evaluations validate the robustness and practicality of our bounds across real-world tasks.

3244Eliciting Human Preferences with Language Models

[openreview] [pdf]

Abstract Language models (LMs) can be directed to perform user- and context-dependent tasks by using labeled examples or natural language prompts. But selecting examples or writing prompts can be challenging---especially in tasks that require users to precisely articulate nebulous preferences or reason about complex edge cases. For such tasks, we introduceGenerative Active Task Elicitation (GATE), a method for usingLMs themselvesto guide the task specification process. GATE is a learning framework in which models elicit and infer human preferences through free-form, language-based interaction with users. We identify prototypical challenges that users face when specifying preferences, and design three preference modeling tasks to study these challenges: content recommendation, moral reasoning, and email validation. In preregistered experiments, we show that LMs that learn to perform these tasks using GATE (by interactively querying users with open-ended questions) obtain preference specifications that are more informative than user-written prompts or examples. GATE matches existing task specification methods in the moral reasoning task, and significantly outperforms them in the content recommendation and email validation tasks. Users additionally report that interactive task elicitation requires less effort than prompting or example labeling and surfaces considerations that they did not anticipate on their own. Our findings suggest that LM-driven elicitation can be a powerful tool for aligning models to complex human preferences and values.

3245Latent Wasserstein Adversarial Imitation Learning

[openreview] [pdf]

Abstract Imitation Learning (IL) enables agents to mimic expert behavior by learning from demonstrations. However, traditional IL methods require large amounts of medium-to-high-quality demonstrations as well as actions of expert demonstrations, both of which are often unavailable. To address these limitations, we propose LWAIL (Latent Wasserstein Adversarial Imitation Learning), a novel adversarial imitation learning framework that focuses on state-only distribution matching by leveraging the Wasserstein distance computed in a latent space. To obtain a meaningful latent space, our approach includes a pre-training stage, where we employ the Intention Conditioned Value Function (ICVF) model to capture the underlying structure of the state space using randomly generated state-only data. This enhances the policy’s understanding of state transitions, enabling the learning process to use only one or a few state-only expert episodes to achieve expert-level performance. Through experiments on multiple MuJoCo environments, we demonstrate that our method outperforms prior Wasserstein-based IL methods and prior adversarial IL methods, achieving better sample efficiency and policy robustness across various tasks.

3246Enhancing Hallucination Detection with Noise Injection

[openreview] [pdf]

Abstract Large Language Models (LLMs) are observed to generate plausible yet incorrect responses, known as hallucinations. Effectively detecting such hallucination instances is crucial for the safe deployment of LLMs. Recent research has linked hallucination to model uncertainty, suggesting to detect hallucinations by measuring dispersion over answer distributions obtained from a set of samples drawn from the model. While using the model’s next token probabilities used during training is a natural way to obtain samples, in this work, we argue that for the purpose of hallucination detection, it is overly restrictive and hence sub-optimal. Motivated by this viewpoint, we perform an extensive empirical analysis showing that an alternative way to measure uncertainty - by perturbing hidden unit activations in intermediate layers of the model - is complementary to sampling, and can significantly improve detection accuracy over mere sampling.

3247Backpropagation-Free Learning through Gradient Aligned Feedbacks

[openreview] [pdf]

Abstract Deep neural networks heavily rely on the back-propagation algorithm for optimiza- tion. Nevertheless, the global sequential transmission of gradients in the backward pass inhibits its scalability. The Direct Feedback Alignment algorithm has been proposed as a promising approach for parallel learning of deep neural networks, relying on fixed random feedback weights to project the error on every layer in a parallel manner. However, it notoriously fails to train networks that are really deep and that include compulsory layers like convolutions and transformers. In this paper, we show that alternatives to back-propagation may greatly benefit from local and forward approximation of the gradient to better cope with the inherent and constrained structure of such layers.This directional approximation allows us to design a novel algorithm that updates the feedback weights called GrAPE (GRadient Aligned Projected Error). A first set of experiments are carried out on image classi- fication tasks with feedforward and convolutional architectures. The results show important improvement in performance over other backpropagation-free algorithms, narrowing the gap with backpropagation. More importantly, the method scales to modern and deep architectures like AlexNet, VGG-16 and Transformer-based language models where the performance gains are even more notable.

3248Towards Fair RAG: On the Impact of Fair Ranking in Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Many language models now enhance their responses with retrieval capabilities, leading to the widespread adoption of retrieval-augmented generation (RAG) systems. However, despite retrieval being a core component of RAG, much of the research in this area overlooks the extensive body of work on fair ranking, neglecting the importance of considering all stakeholders involved. This paper presents the first systematic evaluation of RAG systems integrated with fair rankings. We focus specifically on measuring the fair exposure of each relevant item across the rankings utilized by RAG systems (i.e., item-side fairness), aiming to promote equitable growth for relevant item providers. To gain a deep understanding of the relationship between item-fairness, ranking quality, and generation quality in the context of RAG, we analyze nine different RAG systems that incorporate fair rankings across seven distinct datasets. Our findings indicate that RAG systems with fair rankings can maintain a high level of generation quality and, in many cases, even outperform traditional RAG systems, despite the general trend of a tradeoff between ensuring fairness and maintaining system-effectiveness. We believe our insights lay the groundwork for responsible and equitable RAG systems and open new avenues for future research. We publicly release our codebase and dataset.

3249Enhancing Uncertainty Estimation and Interpretability with Bayesian Non-negative Decision Layer

[openreview] [pdf]

Abstract Although deep neural networks have demonstrated significant success due to their powerful expressiveness, most models struggle to meet practical requirements for uncertainty estimation. Concurrently, the entangled nature of deep neural net- works leads to a multifaceted problem, where various localized explanation tech- niques reveal that multiple unrelated features influence the decisions, thereby un- dermining interpretability. To address these challenges, we develop a Bayesian Nonnegative Decision Layer (BNDL), which reformulates deep neural networks as a conditional Bayesian non-negative factor analysis. By leveraging stochastic latent variables, the BNDL can model complex dependencies and provide robust uncertainty estimation. Moreover, the sparsity and non-negativity of the latent variables encourage the model to learn disentangled representations and decision layers, thereby improving interpretability. We also offer theoretical guarantees that BNDL can achieve effective disentangled learning. In addition, we developed a corresponding variational inference method utilizing a Weibull variational in- ference network to approximate the posterior distribution of the latent variables. Our experimental results demonstrate that with enhanced disentanglement capa- bilities, BNDL not only improves the model’s accuracy but also provides reliable uncertainty estimation and improved interpretability.

3250Accelerate Vertical Federated Adversarial Learning with Dual-level Decoupled Backpropagation

[openreview] [pdf]

Abstract Vertical Federated Learning (VFL) involves multiple participants collaborating to train models on distinct feature sets from the same data samples. The distributed deployment of VFL models renders them vulnerable to adversarial perturbations during inference, motivating the need to visit the VFL robustness problem. Adversarial Training (AT) is the predominant approach for enhancing model robustness. However, its application in VFL, termed Vertical Federated Adversarial Learning (VFAL), faces significant computational challenges: Generating adversarial examples in AT requiresiterative full propagations across participants with heavy computation overload, resulting in VFAL training time far exceeding those of regular VFLs. To address this challenge, we proposeDecVFAL, an acceleratedVFALframework through a novelDecoupled backpropagation incorporating adual-level decoupled mechanism to enable lazy sequential and decoupled parallel backpropagation. Lazy sequential backpropagation sequentially updates the adversarial example using timely partial derivatives with respect to the bottom module and delayed partial derivatives for the remaining modules. Decoupled parallel backpropagation updates these delayed partial derivatives by utilizing module-wise delayed gradients, enabling asynchronous parallel backpropagation with flexible partitions that align with VFL’s distributed deployment. Rigorous theoretical analysis demonstrates that despite introducing multi-source approximate gradients due to the dual decoupled mechanism and the techniques from the existing VFL methods,DecVFALachieves a O(1/K)\mathcal{O}(1 / \sqrt{\mathcal{K}}) convergence rate after K\mathcal{K} iterations, on par with regular VFL systems. Experimental results show that, compared to existing methods,DecVFALensures competitive robustness while significantly achieving about 3103\sim10 times speed up on various datasets.

3251Cayley Maze: Universal Open-Ended Reinforcement Learning Environment

[openreview] [pdf]

Abstract Parametrizable environments with variable complexity are crucial for advancing fields such as Unsupervised Environment Design (UED), Open-Ended Learning, Curriculum Learning, and Meta Reinforcement Learning. However, the selection of environments in evaluation procedures, along with their complexities, is often either neglected or lacks formal justification. We propose the formal definition of complexity for Markov Decision Processes using Finite Automata and Group Theory machinery. We introduce Cayley Maze, a novel open-ended reinforcement learning environment that naturally generalizes problems like solving the Rubik’s Cube, sorting, and integer factorization. Cayley Maze is universal: every deterministic sparse MDP is an MDP of a certain instance of Cayley Maze. We demonstrate how Cayley Maze enables control over complexity, simplification and combination of its instances. Finally, we evaluate UED algorithms on various instances of the Cayley Maze and analyze their capacity to produce agents with robust generalization capabilities.

3252Binary Reward Labeling: Bridging Offline Preference and Reward-Based Reinforcement Learning

[openreview] [pdf]

Abstract Offline reinforcement learning has become one of the most practical RL settings. However, most existing works on offline RL focus on the standard setting with scalar reward feedback. It remains unknown how to universally transfer the existing rich understanding of offline RL from the reward-based to the preference-based setting. In this work, we propose a general framework to bridge this gap. Our key insight is transforming preference feedback to scalar rewards via binary reward labeling (BRL), and then any reward-based offline RL algorithms can be applied to the dataset with the reward labels. The information loss during the feedback signal transition is minimized with binary reward labeling in the practical learning scenarios. We theoretically show the connection between several recent PBRL techniques and our framework combined with specific offline RL algorithms. By combining reward labeling with different algorithms, our framework can lead to new and potentially more efficient offline PBRL algorithms. We empirically test our framework on preference datasets based on the standard D4RL benchmark. When combined with a variety of efficient reward-based offline RL algorithms, the learning result achieved under our framework is comparable to training the same algorithm on the dataset with actual rewards in many cases and better than the recent PBRL baselines in most cases.

3253FMint: Bridging Human Designed and Data Pretrained Models for Differential Equation Foundation Model

[openreview] [pdf]

Abstract The fast simulation of dynamical systems is a key challenge in many scientific and engineering applications, such as weather forecasting, disease control, and drug discovery. With the recent success of deep learning, there is increasing interest in using neural networks to solve differential equations in a data-driven manner. However, existing methods are either limited to specific types of differential equations or require large amounts of data for training. This restricts their practicality in many real-world applications, where data is often scarce or expensive to obtain. To address this, we propose a novel multi-modal foundation model, named \textbf{FMint} (\textbf{F}oundation \textbf{M}odel based on \textbf{In}i\textbf{t}ialization), to bridge the gap between human-designed and data-driven models for the fast simulation of dynamical systems. Built on a decoder-only transformer architecture with in-context learning, FMint utilizes both numerical and textual data to learn a universal error correction scheme for dynamical systems, using prompted sequences of coarse solutions from traditional solvers. The model is pre-trained on a corpus of 40K ODEs, and we perform extensive experiments on challenging ODEs that exhibit chaotic behavior and of high dimensionality. Our results demonstrate the effectiveness of the proposed model in terms of both accuracy and efficiency compared to classical numerical solvers, highlighting FMint’s potential as a general-purpose solver for dynamical systems. Our approach achieves an accuracy improvement of 1 to 2 orders of magnitude over state-of-the-art dynamical system simulators, and delivers a 5X speedup compared to traditional numerical algorithms.

3254StarCraft II Arena: Evaluating LLMs in Strategic Planning, Real-Time Decision Making, and Adaptability

[openreview] [pdf]

Abstract StarCraft II plays an important role in developing AI agents for real-time strategic reasoning due to its complex nature. However, people usually draw conclusions of how competent their agents are according to the level of the built-in agents in StarCraft II which they can win in terms of the final success rate. Little intermediate quantitative information is considered while human-in-the-loop analysis is time inefficient, which results in inadequate reflection of the true strategic reasoning ability. In this work, we propose StarCraft II Arena, a well-designed benchmark for evaluating the strategic planning, real-time decision-making, and adaptability capabilities of large language models (LLMs) agents. We introduce using fine-grained capability metrics, allowing for targeted capture and analysis of specific capability, and further propose a detailed decision trace to enhance the understanding of LLM behavior. We demonstrate the utility of such a benchmark by evaluating several state-of-the-art LLMs in various setups. Our results reveal distinct performances in long-term strategy development, real-time decision-making, and adapting to environmental changes. Such results show that the StarCraft II Arena offers a deeper insight into the decision-making process of LLMs and has the potential to become a challenging and comprehensive benchmark for strategic reasoning.

3255LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

[openreview] [pdf]

Abstract Hallucination, a phenomenon where multimodal large language models(MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become one major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, by either raising discriminative questions about the existence of objects or introducing LLM evaluators to score the generated text from MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data involve LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination text. LongHalQA is featured by GPT4V-generated hallucinatory data that are well aligned with real-world scenarios, including object/image descriptions and multi-round conversations with 14/130 words and 189 words, respectively, on average. It introduces two new tasks, hallucination discrimination and hallucination completion, unifying both discriminative and generative evaluations in a single multiple-choice-question form and leading to more reliable and efficient evaluations without the need for LLM evaluators. Further, we propose an advanced pipeline that greatly facilitates the construction of future hallucination benchmarks with long and complex questions and descriptions. Extensive experiments over multiple recent MLLMs reveal various new challenges when they are handling hallucinations with long and complex textual data.

3256Tuning-Free Bilevel Optimization: New Algorithms and Convergence Analysis

[openreview] [pdf]

Abstract Bilevel optimization has recently attracted considerable attention due to its abundant applications in machine learning problems. However, existing methods rely on prior knowledge of problem parameters to determine stepsizes, resulting in significant effort in tuning stepsizes when these parameters are unknown. In this paper, we propose two novel tuning-free algorithms, D-TFBO and S-TFBO. D-TFBO employs a double-loop structure with stepsizes adaptively adjusted by the “inverse of cumulative gradient norms” strategy. S-TFBO features a simpler fully single-loop structure that updates three variables simultaneously with a theory-motivated joint design of adaptive stepsizes for all variables. We provide a comprehensive convergence analysis for both algorithms and show that D-TFBO and S-TFBO respectively require O(1ϵ)\mathcal{O}(\frac{1}{\epsilon}) and O(1ϵlog4(1ϵ))\mathcal{O}(\frac{1}{\epsilon}\log^4(\frac{1}{\epsilon})) iterations to find an ϵ\epsilon-accurate stationary point, (nearly) matching their well-tuned counterparts using the information of problem parameters. Experiments on various problems show that our methods achieve performance comparable to existing well-tuned approaches, while being more robust to the selection of initial stepsizes. To the best of our knowledge, our methods are the first to completely eliminate the need for stepsize tuning, while achieving theoretical guarantees.

3257Integrating Relation Dependences and Textual Semantics for Coherent Logical Reasoning over Temporal Knowledge Graph

[openreview] [pdf]

Abstract Temporal knowledge graphs (TKGs) reflect the evolution patterns of facts, which can be summarized as logical rules and applied to forecast future facts. However, existing logical reasoning methods on TKGs face two limitations: 1) A lack of efficient strategies for extracting logical paths. 2) Insufficient utilization of structural and textual information. To bridge these gaps, we propose CoLR, a two-stage framework that mines relation dependencies and textual semantics for Coherent Logical Reasoning over TKGs. In the first stage, we construct a temporal relation structure graph (TRSG) composed of relations and cohesion weights between them. Besides, we define a novel time-fusion search graph (TFSG) along with TRSG to facilitate efficient and reliable temporal path searching. In the second stage, the textual content and timestamp sequences from these paths undergo encoding via a pre-trained language model and a time sequence encoder to accurately capture potential logical rules. Additionally, for quadruplets missing paths, historical edges sampled based on relation cohesion are used as supplements. Given the limitations of existing benchmark datasets in evaluating accuracy, generalization, and robustness, we construct three new datasets tailored to transductive, inductive, and few-shot scenarios, respectively. These datasets, combined with four real-world datasets, are employed to evaluate our model comprehensively. Experimental results demonstrate that our approach significantly outperforms existing methods across all three scenarios. Our code is available athttps://anonymous.4open.science/r/CoLR-0839

3258Unsupervised Radar Point Cloud Enhancement Using Diffusion Model as Prior without Paired Traning Data

[openreview] [pdf]

Abstract In industrial automation technology, radar is one of the crucial sensors in the machine perception stage. However, due to the long wavelength of radar electromagnetic waves and the limited number of antennas, the angle resolution is limited. Recent advancements have introduced methods that leverage paired LiDAR-radar data for training, achieving notable point enhancement effect. However, the requirement for paired data significantly increases the cost and complexity of model development, limiting model’s widespread adoption and scalability. To address this, we propose an unsupervised radar point cloud enhancement algorithm using diffusion model as prior without paired training data. Specifically, our method formulates radar angle estimation recovery into an inverse problem and introduces prior knowledge via a diffusion model when solving it. Experimental results demonstrate that our method achieves high fidelity and low noise performance compared to traditional regularization methods. Compared to paired data training methods, our approach not only delivers comparable performance but also offers greater content control and reduced generation variance. Additionally, it does not require a huge amount of paired data. To the best of our knowledge, our method is the first to enhance radar point cloud by introducing prior knowledge via diffusion model instead of training on paired data.

3259Improve Vision Language Model Chain-of-thought Reasoning

[openreview] [pdf]

Abstract Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we first evaluate the CoT abilities of existing VLMs and show that training on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality by constructing positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, based on the comparisons with annotated short answers. We then use the Direct Preference Optimization algorithm on this pairwise data to refine the model’s reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.

3260Benign Overfitting in Single-Head Attention

[openreview] [pdf]

Abstract The phenomenon of benign overfitting, where a trained neural network perfectly fits noisy training data but still achieves near-optimal test performance, has been extensively studied in recent years for linear models and fully-connected/convolutional networks. In this work, we study benign overfitting in a single-head softmax attention model, which is the fundamental building block of Transformers. We prove that under appropriate conditions, the model exhibits benign overfitting in a classification setting already after two steps of gradient descent. Moreover, we show conditions where a minimum-norm/maximum-margin interpolator exhibits benign overfitting. We study how the overfitting behavior depends on the signal-to-noise ratio (SNR) of the data distribution, namely, the ratio between norms of signal and noise tokens, and prove that a sufficiently large SNR is both necessary and sufficient for benign overfitting.

3261Investigating the Effectiveness of HyperTuning via Gisting

[openreview] [pdf]

Abstract Gisting (Mu et al., 2023) is a simple method for training models to compress information into fewer token representations using a modified attention mask, and can serve as an economical approach to training Transformer-based hypernetworks. We introduce HyperLlama, a set of Gisting-based hypernetworks built on Llama-2 models that generates task-specific soft prefixes based on few-shot inputs. In experiments across P3, Super-NaturalInstructions and Symbol Tuning datasets, we show that HyperLlama models can effectively compress information from few-shot examples into soft prefixes. However, they still underperform multi-task fine-tuned language models with full attention over few-shot in-context examples. We also show that HyperLlama-generated soft prefixes can serve as better initializations for further prefix tuning. Overall, Gisting-based hypernetworks are economical and easy to implement, but have mixed empirical performance.

3262Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) on human preferences, typically through reinforcement learning from human feedback (RLHF), has proven successful in enhancing their capabilities. However, ensuring the safety of LLMs during fine-tuning remains a critical concern, and mitigating the potential conflicts in safety and helpfulness is costly in RLHF. To address this issue, we propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO), which re-parameterizes a joint RLHF objective of both safety and helpfulness into a single supervised learning objective. In the supervised optimization, a labeling function is used to capture global preferences ranking to balance both safety and helpfulness. To evaluate BFPO, we develop a benchmark including comprehensive discriminative and generative tasks for helpfulness and harmlessness. The results indicate that our method significantly outperforms existing approaches in both safety and helpfulness. Moreover, BFPO eliminates the need for human prompting and annotation in LLM fine-tuning while achieving the same level of safety as methods that heavily rely on human labor, with less than 10% of the computational resources. The training recipes and models will be released.

3263Potential Outcomes Estimation Under Hidden Confounders

[openreview] [pdf]

Abstract One of the major challenges in estimating conditional potential outcomes and the conditional average treatment effects (CATE) is the presence of hidden confounders. Since testing for hidden confounders cannot be accomplished only with observational data, conditional unconfoundedness is commonly assumed in the literature of CATE estimation. Nevertheless, under this assumption, CATE estimation can be significantly biased due to the effects of unobserved confounders. In this work, we consider the case where in addition to a potentially large observational dataset, a small dataset from a randomized controlled trial (RCT) is available. Notably, we make no assumptions on the existence of any covariate information for the RCT dataset, only requiring the outcomes to be observed. We propose a CATE estimation method based on a pseudo-confounder generator and a CATE model that aligns the learned potential outcomes from the observational data with those observed from the RCT. Our method is applicable to many practical scenarios of interest, particularly when privacy is under concern (e.g., medical applications). Extensive numerical experiments are provided demonstrating the effectiveness of our approach for both synthetic and real-world datasets.

3264Rethinking the "Heatmap + Monte Carlo Tree Search’’ Paradigm for Solving Large Scale TSP

[openreview] [pdf]

Abstract The Travelling Salesman Problem (TSP) remains a fundamental challenge in combinatorial optimization, inspiring diverse algorithmic strategies. This paper revisits the ``heatmap + Monte Carlo Tree Search (MCTS)" paradigm that has recently gained traction for learning-based TSP solutions. Within this framework, heatmaps encode the likelihood of edges forming part of the optimal tour, and MCTS refines this probabilistic guidance to discover optimal solutions. Contemporary approaches have predominantly emphasized the refinement of heatmap generation through sophisticated learning models, inadvertently sidelining the critical role of MCTS. Our extensive empirical analysis reveals two pivotal insights: \textbf{1}) The configuration of MCTS strategies profoundly influences the solution quality, demanding meticulous tuning to leverage their full potential; \textbf{2}) Our findings demonstrate that a rudimentary and parameter-free heatmap, derived from the intrinsic kk-nearest nature of TSP, can rival or even surpass the performance of complicated heatmaps, with strong generalizability across various scales. Empirical evaluations across various TSP scales underscore the efficacy of our approach, achieving competitive results. These observations challenge the prevailing focus on heatmap sophistication, advocating a reevaluation of the paradigm to harness both components synergistically.

3265Solving Differential Equations with Constrained Learning

[openreview] [pdf]

Abstract (Partial) differential equations (PDEs) are fundamental tools for describing natural phenomena, making their solution crucial in science and engineering. While traditional methods, such as the finite element method, provide reliable solutions, their accuracy is often tied to the use of computationally intensive fine meshes. Moreover, they do not naturally account for measurements or prior solutions, and any change in the problem parameters requires results to be fully recomputed. Neural network-based approaches, such as physics-informed neural networks and neural operators, offer a mesh-free alternative by directly fitting those models to the PDE solution. They can also integrate prior knowledge and tackle entire families of PDEs by simply aggregating additional training losses. Nevertheless, they are highly sensitive to hyperparameters such as collocation points and the weights associated with each loss. This paper addresses these challenges by developing a science-constrained learning (SCL) framework. It demonstrates that finding a (weak) solution of a PDE is equivalent to solving a constrained learning problem with worst-case losses. This explains the limitations of previous methods that minimize the expected value of aggregated losses. SCL also organically integrates structural constraints (e.g., invariances) and (partial) measurements or known solutions. The resulting constrained learning problems can be tackled using a practical algorithm that yields accurate solutions across a variety of PDEs, neural network architectures, and prior knowledge levels without extensive hyperparameter tuning and sometimes even at a lower computational cost.

3266TwinsFormer: Revisiting Inherent Dependencies via Two Interactive Components for Time Series Forecasting

[openreview] [pdf]

Abstract Due to the remarkable ability to capture long-term dependencies, Transformer-based models have shown great potential in time series forecasting. However, real-world time series usually present intricate temporal patterns, making forecasting still challenging in many practical applications. To better grasp inherent dependencies, in this paper, we propose \textbf{TwinsFormer}, a Trans\underline{former}-based model utilizing \underline{tw}o \underline{in}teractive component\underline{s} for time series forecasting. Unlike the mainstream paradigms of plain decomposition that train the model with two independent branches, we design an interactive strategy around the attention module and the feed-forward network to strengthen the dependencies via decomposed components. Specifically, we adopt dual streams to facilitate progressive and implicit information interactions for trend and seasonal components. For the seasonal stream, we feed the seasonal component to the attention module and feed-forward network with a subtraction mechanism. Meanwhile, we construct an auxiliary highway (without the attention module) for the trend stream by the supervision of seasonal signals. Finally, we incorporate the dual-stream outputs into a linear layer leading to the ultimate prediction. In this way, we can avoid the model overlooking inherent dependencies between different components for accurate forecasting. Our interactive strategy, albeit simple, can be adapted as a plug-and-play module to existing Transformer-based methods with negligible extra computational overhead. Extensive experiments on various real-world datasets show the superiority of TwinsFormer, which can outperform previous state-of-the-art methods in terms of both long-term and short-term forecasting performance.

3267Multiple Descents in Unsupervised Auto-Encoders: The Role of Noise, Domain Shift and Anomalies

[openreview] [pdf]

Abstract The phenomenon of double descent has recently gained attention in supervised learning. It challenges the conventional wisdom of the bias-variance trade-off by showcasing a surprising behavior. As the complexity of the model increases, the test error initially decreases until reaching a certain point where the model starts to overfit the train set, causing the test error to rise. However, deviating from classical theory, the error exhibits another decline when exceeding a certain degree of over-parameterization. We study the presence of double descent in unsupervised learning, an area that has received little attention and is not yet fully understood. We conduct extensive experiments using under-complete auto-encoders (AEs) for various applications, such as dealing with noisy data, domain shifts, and anomalies. We use synthetic and real data and identify model-wise, epoch-wise, and sample-wise double descent for all the aforementioned applications. Finally, we assessed the usability of the AEs for detecting anomalies and mitigating the domain shift between datasets. Our findings indicate that over-parameterized models can improve performance not only in terms of reconstruction, but also in enhancing capabilities for the downstream task.

3268A Multi-Power Law for Loss Curve Prediction Across Learning Rate Schedules

[openreview] [pdf]

Abstract Training large models is both resource-intensive and time-consuming, making it crucial to understand the quantitative relationship between model performance and hyperparameters. In this paper, we derive an empirical law that predicts pretraining loss for large language models for every intermediate training step across various learning rate schedules, including constant, cosine, and step decay schedules. Our proposed law takes a multi-power form, combining a power law based on the sum of learning rates and additional power laws to account for a loss reduction effect as learning rate decays. We validate this law extensively on Llama-2 models of varying sizes and demonstrate that, after fitting on a few learning rate schedules, it accurately predicts the loss curves for unseen schedules of different shapes and horizons. Moreover, by minimizing the predicted final pretraining loss across learning rate schedules, we are able to find a schedule that outperforms the widely-used cosine learning rate schedule. Interestingly, this automatically discovered schedule bears some resemblance to the recently proposed Warmup-Stable-Decay (WSD) schedule (hu et al, 2024) but achieves slightly faster convergence. We believe these results could offer valuable insights for understanding the dynamics of pretraining and for designing learning rate schedules to improve efficiency.

3269GRIC: General Representation and Informative Content for Enhanced Out-of-Distribution Detection

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection is crucial for ensuring the robustness of machine learning models in open-world scenarios by identifying inputs from unknown classes. Vision-language models like CLIP have enabled zero-shot OOD detection without requiring labels or training on in-distribution (ID) data. However, current approaches are limited by their dependence on \textit{closed-set text-based labels} and \textit{full image feature representations}, constraining CLIP’s capacity to generalize across diverse labels. In this work, we propose GRIC, a novel method that improves zero-shot multi-modal OOD detection by leveraging two key insights: (1) OOD detection is driven by general ID representations rather than class-specific features, and (2) large language models (LLMs) can enrich the model’s understanding of ID data and simulate potential OOD scenarios without actual OOD samples. GRIC is simple yet highly effective, reducing the false positive rate at 9595% recall (FPR95) by up to 1919%, significantly surpassing state-of-the-art methods.

3270Generative Modeling of Individual Behavior at Scale

[openreview] [pdf]

Abstract Recent years have seen a growing interest in using AI to model human behavior, particularly in domains where humans learn from or collaborate with this technology. While most existing work attempts to model human behavior at an aggregate level, our goal is to model behavior at the individual level. Recent work in the domain of chess has shown that behavioral stylometry, or the task of identifying a person from their actions alone, can be achieved with high accuracy among a pool of a few thousand players. However, this approach cannot generate actions in the style of each player, and hence cannot reason about or influence player behavior in practice. We provide a new perspective on behavioral stylometry that addresses these limitations, by drawing a connection to the vast literature of transfer learning in NLP. Specifically, by casting the stylometry problem as a multi-task learning problem---where each task represents a distinct---we show that parameter-efficient fine-tuning (PEFT) methods can be adapted to model individual behavior in an explicit and generative manner, at unprecedented scale. We apply our approach at scale to two very different games: chess (47,864 players) and Rocket League (2,000 players).Our approach leverages recent modular PEFT methods to learn a shared set of skill parameters that can be combined in different ways via style vectors. Style vectors enable two important capabilities. First, they are generative: we can generate actions in the style of a player simply by conditioning on the player’s style vector. Second, they induce a latent style space that we can interpret and manipulate algorithmically. This allows us to compare different player styles, as well as synthesize new (human-like) styles, e.g. by interpolating between the style vectors of two players.

3271Doubly robust identification of treatment effects from multiple environments

[openreview] [pdf]

Abstract Practical and ethical constraints often dictate the use of observational data for causal inference, particularly in medicine and social sciences. Yet, observational datasets are prone to confounding, potentially compromising the validity of conclusions. While adjusting for all available covariates is a common corrective strategy, this approach can introduce bias, especially when post-treatment variables are present or some variables remain unobserved—a frequent scenario in practice. Avoiding this bias often requires detailed knowledge of the underlying causal graph, a challenging and often impractical prerequisite. In this work, we propose RAMEN, an algorithm that tackles this challenge by leveraging the heterogeneity of multiple data sources without the need to know the complete causal graph. Notably, RAMEN achievesdoubly robust identification: we identify the treatment effect if either the causal parents of the treatment or those of the outcome are observed. Empirical evaluations across synthetic, semi-synthetic, and real-world datasets show that our approach significantly outperforms existing methods.

3272Generalization and Distributed Learning of GFlowNets

[openreview] [pdf]

Abstract Conventional wisdom attributes the success of Generative Flow Networks (GFlowNets) to their ability to exploit the compositional structure of the sample space for learning generalizable flow functions (Bengio et al., 2021). Despite the abundance of empirical evidence, formalizing this belief with verifiable non-vacuous statistical guarantees has remained elusive. We address this issue with the first data-dependent generalization bounds for GFlowNets. We also elucidate the negative impact of the state space size on the generalization performance of these models via Azuma-Hoeffding-type oracle PAC-Bayesian inequalities. We leverage our theoretical insights to design a novel distributed learning algorithm for GFlowNets, which we callSubgraph Asynchronous Learning(SAL). In a nutshell, SAL utilizes a divide-and-conquer strategy: multiple GFlowNets are trained in parallel on smaller subnetworks of the flow network, and then aggregated with an additional GFlowNet that allocates appropriate flow to each subnetwork. Our experiments with synthetic and real-world problems demonstrate the benefits of SAL over centralized training in terms of mode coverage and distribution matching.

3273GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation

[openreview] [pdf]

Abstract Robots’ ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in understanding novel tasks, thereby mitigating this issue. However, these methods lack a task-specific learning process, which is essential for an accurate understanding of 3D environments, often leading to execution failures. In this paper, we introduce GravMAD, a sub-goal-driven, language-conditioned action diffusion framework that combines the strengths of imitation learning and foundation models. Our approach breaks tasks into sub-goals based on language instructions, allowing auxiliary guidance during both training and inference. During training, we introduce Sub-goal Keypose Discovery to identify key sub-goals from demonstrations. Inference differs from training, as there are no demonstrations available, so we use pre-trained foundation models to bridge the gap and identify sub-goals for the current task. In both phases, GravMaps are generated from sub-goals, providing GravMAD with more flexible 3D spatial guidance compared to fixed 3D positions. Empirical evaluations on RLBench show that GravMAD significantly outperforms state-of-the-art methods, with a 28.63% improvement on novel tasks and a 13.36% gain on tasks encountered during training. These results demonstrate GravMAD’s strong multi-task learning and generalization in 3D manipulation. Video demonstrations are available at:https://gravmad.github.io.

32743DS: Decomposed Difficulty Data Selection’s Case Study on LLM Medical Domain Adaptation

[openreview] [pdf]

Abstract Large Language Models (LLMs) excel in general tasks but struggle in specialized domains like healthcare due to limited domain-specific knowledge. Supervised Fine-Tuning (SFT) data construction for domain adaptation often relies on heuristic methods, such as GPT-4 annotation or manual data selection, with a data centric focus on presumed diverse, high-quality datasets. However, these methods overlook the model’s inherent knowledge distribution, introducing noise, redundancy, and irrelevant data, leading to a mismatch between the selected data and the model’s learning task, resulting in suboptimal performance. To address this, we propose a two-stage model-centric data selection framework, Decomposed Difficulty Data Selection (3DS), which aligns data with the model’s knowledge distribution for optimized adaptation. In Stage 1, we apply Prompt-Driven Data Selection via Explicit Alignment, where the model filters irrelevant or redundant data based on its internal knowledge. In Stage 2, we perform Decomposed Difficulty Data Selection, where data selection is guided by our defined difficulty decomposition, using three metrics: Instruction Understanding, Response Confidence, and Response Correctness. Additionally, an attention-based importance weighting mechanism captures token importance for more accurate difficulty calibration. This two-stage approach ensures the selected data is not only aligned with the model’s knowledge and preferences but also appropriately challenging for the model to learn, leading to more effective and targeted domain adaptation. In the case study of the medical domain, our extensive experiments on real-world healthcare datasets demonstrate the superiority of 3DS over existing methods in accuracy by over 5.29%. Our dataset and code will be open-sourced athttps://anonymous.4open.science/r/3DS-E67F.

3275EFFECTIVE REGULARIZATION WITH RELATIVE-DISTANCE VARIANCES IN DEEP METRIC LEARNING

[openreview] [pdf]

Abstract This paper develops, for the first time, a novel method using relative-distance variance to regularize deep metric learning (DML), overcoming the drawbacks of existing pair-distance-based metrics, notably loss functions. Being a fundamental field in machine learning research, DML has been widely studied with the goal of learning a feature space where dissimilar data samples are further apart than similar ones. A typical approach of DML is to optimize the feature space by maximizing the relative distances between negative and positive pairs. Despite the rapid advancement, the pair-distance-based approach suffers from a few drawbacks that it heavily relies on the appropriate selection of margin to determine decision boundaries, and it depends on the effective selection of informative pairs, and resulting in low generalization across tasks. To address these issues, this paper explores the use of relative-distance variance and investigates its impact on DML through both empirical and theoretical studies. Based upon such investigation, we propose a novel Relative Distance Variance Constraint (RDVC) loss by regularizing the representation or embedding function learning. The proposed RDVC loss can seamlessly integrate with various pair-distance-based loss functions to ensure a robust and effective performance. Substantial experimental results have demonstrated the effectiveness of our proposed RDVC loss on both within-domain and cross-domain retrieval tasks. In particular, the RDVC loss is also shown useful in fine-grained zero-shot sketch-based image retrieval, a challenging task, revealing its general applicability to cross-domain and zero-shot learning.

3276Advancing Few-shot Continual Learning via Selective Knowledge Transfer

[openreview] [pdf]

Abstract Continual learning with large language models (LLMs) is a promising and challenging research that greatly impacts many applications. Existing solutions treat previous tasks equally, making them vulnerable to task interference, lacking scalability with a large number of tasks, and oblivious to the intrinsic relationships among tasks. This work presents selective knowledge transfer (SKT), a novel and principled framework for continual learning with LLMs. SKT aims to maximize positive knowledge transfer while systematically minimizing the effects of irrelevant information from dissimilar tasks. To this end, SKT first assesses the degree of interference between the current and previous tasks and then selectively aggregates the tasks that maximize knowledge transfer for continual training. In addition, we integrate SKT into the current state-of-the-art continual language learning algorithm, Progressive Prompts, to introduce Log-evidence Progressive Prompts (LePP), which facilitate knowledge transfer between tasks. Comprehensive evaluations on challenging few-shot continual learning benchmarks demonstrate that LePP can surpass existing baselines for continual learning with LLMs with minimal overhead. Our extensive ablation studies reveal that SKT can discover useful task correlations without any prior knowledge, many of which align with human evaluations. Code will be published upon acceptance.

3277Tight Lower Bounds under Asymmetric High-Order Hölder Smoothness and Uniform Convexity

[openreview] [pdf]

Abstract In this paper, we provide tight lower bounds for the oracle complexity of minimizing high-order Hölder smooth and uniformly convex functions. Specifically, for a function whose pthp^{th}-order derivatives are Hölder continuous with degree ν\nu and parameter HH, and that is uniformly convex with degree qq and parameter σ\sigma, we focus on two asymmetric cases: (1) q>p+νq > p + \nu, and (2) q<p+νq < p+\nu. Given up to pthp^{th}-order oracle access, we establish worst-case oracle complexities of Ω((Hσ)23(p+ν)2(σϵ)2(qpν)q(3(p+ν)2))\Omega\left( \left( \frac{H}{\sigma}\right)^\frac{2}{3(p+\nu)-2}\left( \frac{\sigma}{\epsilon}\right)^\frac{2(q-p-\nu)}{q(3(p+\nu)-2)}\right) in the first case with an \ell_\infty-ball-truncated-Gaussian smoothed hard function and Ω((Hσ)23(p+ν)2+loglog((σp+νHq)1p+νq1ϵ))\Omega\left(\left(\frac{H}{\sigma}\right)^\frac{2}{3(p+\nu)-2}+ \log\log\left(\left(\frac{\sigma^{p+\nu}}{H^q}\right)^\frac{1}{p+\nu-q}\frac{1}{\epsilon}\right)\right) in the second case, for reaching an ϵ\epsilon-approximate solution in terms of the optimality gap. Our analysis generalizes previous lower bounds for functions under first- and second-order smoothness as well as those for uniformly convex functions, and furthermore our results match the corresponding upper bounds in this general setting.

3278Recurrent Context Compression: Efficiently Expanding the Context Window of LLM

[openreview] [pdf]

Abstract To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, researchers often encounter constraints stemming from finite computational resources and bounded memory capacities. This work proposes a novel approach, termed Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLMs. Furthermore, we delve into the prevalent issue of degraded model performance when both instructional prompts and contextual information undergo compression for downstream tasks. To address this challenge, we propose a novel instruction reconstruction methodology aimed at mitigating the detrimental effects of this compression process. The effectiveness of our proposed approach was validated across multiple tasks while achieving an impressive context compression rate of at least 32x. On text reconstruction task, we maintain a BLEU-4 score close to 0.95. On passkey retrieval task, we achieve nearly 100% accuracy involving an extensive sequence length of 1 million tokens. On long-text question-answering task, we obtain comparable performance with the non-compressed LLM in F1 and Rouge scores. Our method also demonstrated competitive performance in long-text question-answering tasks compared to non-compressed methods, while significantly saving storage resources.

3279Modeling Unseen Environments with Language-guided Composable Causal Components in Reinforcement Learning

[openreview] [pdf]

Abstract Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning—where known components are reconfigured to handle new situations—we introduce World Modeling with Compositional Causal Components (WM3C). This novel framework enhances RL generalization by learning and leveraging compositional causal components. Unlike previous approaches focusing on invariant representation learning or meta-learning, WM3C identifies and utilizes causal dynamics among composable elements, facilitating robust adaptation to new tasks. Our approach integrates language as a compositional modality to decompose the latent space into meaningful components and provides theoretical guarantees for their unique identification under mild assumptions. Our practical implementation uses a masked autoencoder with mutual information constraints and adaptive sparsity regularization to capture high-level semantic information and effectively disentangle transition dynamics. Experiments on numerical simulations and real-world robotic manipulation tasks demonstrate that WM3C significantly outperforms existing methods in identifying latent processes, improving policy learning, and generalizing to unseen tasks.

3280Switch EMA: A Free Lunch for Better Flatness and Sharpness

[openreview] [pdf]

Abstract Exponential Moving Average (EMA) is a widely used weight averaging (WA) regularization to learn flat optima for better generalizations without extra cost in deep neural network (DNN) optimization. Despite achieving better flatness, existing WA methods might fall into worse final performances or require extra test-time computations. This work unveils the full potential of EMA with a single line of modification, i.e., switching the EMA parameters to the original model after each epoch, dubbed as Switch EMA (SEMA). From both theoretical and empirical aspects, we demonstrate that SEMA can help DNNs to reach generalization optima that better trade-off between flatness and sharpness. To verify the effectiveness of SEMA, we conduct comparison experiments with discriminative, generative, and regression tasks on vision and language datasets, including image classification, self-supervised learning, object detection and segmentation, image generation, video prediction, attribute regression, and language modeling. Comprehensive results with popular optimizers and networks show that SEMA is a free lunch for DNN training by improving performances and boosting convergence speeds.

3281The Geometry of Phase Transitions in Diffusion Models: Tubular Neighbourhoods and Singularities

[openreview] [pdf]

Abstract Diffusion models undergo phase transitions during the generative process where data features suddenly emerge in the final stages. The current study aims to elucidate this critical phenomenon from the geometrical perspective. We employ the concept of ``injectivity radius’', a quantity that characterises the structure of the data manifold. Through theoretical and empirical evidence, we demonstrate that phase transitions in the generative process of diffusion models are closely related to the injectivity radius. Our findings offer a novel perspective on phase transitions in diffusion models, with potential implications for improving performance and sampling efficiency.

3282Asynchronous Factorization for Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Value factorization is widely used to design high-quality, scalable multi-agent reinforcement learning algorithms. However, current methods typically assume agents execute synchronous, 1-stepprimitive actions, failing to capture the typical nature of multi-agent systems. In reality, agents are asynchronous and executemacro-actions---extended actions of variable and unknown duration---making decisions at different times. This paper proposes value factorization for asynchronous agents. First, we formalize the requirements for consistency between centralized and decentralized macro-action selection, proving they generalize the primitive case. We then propose update schemes to enable factorization architectures to support macro-actions. We evaluate these asynchronous factorization algorithms on standard macro-action benchmarks, showing they scale and perform well on complex coordination tasks where their synchronous counterparts fail.

3283Efficient and Trustworthy Causal Discovery with Latent Variables and Complex Relations

[openreview] [pdf]

Abstract Most traditional causal discovery methods assume that all task-relevant variables are observed, an assumption often violated in practice. Although some recent works allow the presence of latent variables, they typically assume the absence of certain special causal relations to ensure a degree of simplicity, which might also be invalid in real-world scenarios. This paper tackles a challenging and important setting where latent and observed variables are interconnected through complex causal relations. Under an assumption ensuring that latent variables leave adequate footprints in observed variables, we develop a series of novel theoretical results, leading to an efficient causal discovery algorithm which is the first one capable of handling the setting with both latent variables and complex relations within polynomial time. Our algorithm first sequentially identifies latent variables from leaves to roots and then sequentially infers causal relations from roots to leaves. Moreover, we prove trustworthiness of our algorithm, meaning that when the assumption is invalid, it can raise an error rather than draw an incorrect causal conclusion, thus preventing potential damage to downstream tasks. We demonstrate the efficacy of our algorithm through experiments. Our work significantly enhances efficiency and reliability of causal discovery in complex systems.

3284UFGTime: Reforming the Pure Graph Paradigm for Multivariate Time Series Forecasting in the Frequency Domain

[openreview] [pdf]

Abstract Recent advances in multivariate time series forecasting have seen a shift toward a pure graph paradigm, which transforms time series into hypervariate graphs and employs graph neural networks (GNNs) to holistically capture intertwined spatiotemporal dependencies. While promising, this approach faces notable challenges. First, converting time series into hypervariate graphs often neglects essential temporal sequences, which are vital for accurately capturing temporal dependencies. Second, treating the graph as a complete structure can obscure the varying importance of intra- and inter-series connections, potentially overlooking key local patterns. To address these challenges, we introduce a novel hyperspectral graph data structure that embeds sequential order into frequency signals and employs a sparse yet meaningful topological structure. In addition, we propose the \textsc{Ufgtime} framework, featuring a frequency-based global graph framelet message-passing operator tailored to hyperspectral graphs, effectively mitigating the smoothing issue and capturing global insights through sparse connections. Extensive experiments demonstrate that our framework significantly surpasses state-of-the-art methods, excelling in both short- and long-range time series forecasting while achieving superior efficiency. Our code is available at:~\url{https://anonymous.4open.science/r/UFGTIME-E352}.

3285Towards Zero-Shot Generalization in Offline Reinforcement Learning

[openreview] [pdf]

Abstract In this work, we study offline reinforcement learning (RL) with zero-shot generalization property (ZSG), where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments which performs well on test environments without further interaction. Existing work showed that classical offline RL fails to generalize to new, unseen environments. To address such an issue, we propose new offline RL frameworks with ZSG, based on empirical risk minimization or proximal policy optimization. We prove that our frameworks find the near-optimal policy with ZSG both theoretically and empirically, from general environments to specific settings such as linear Markov decision processes (MDPs). Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.

3286Knowledge Augmentation: In-context or In-parameter?

[openreview] [pdf]

Abstract Generative language models rely on knowledge augmentation to enhance their ability to complete a variety of tasks by incorporating relevant external information. The most common approach is in-context knowledge injection, where the relevant information is appended directly to the model’s input. While straightforward and easy to implement, this approach is limited for complex reasoning tasks due to input length constraints and shallow integration of external and internal knowledge in language models. To overcome these limitations, we introduce an in-parameter knowledge injection method, which temporarily embeds external knowledge into the model’s parameters. By injecting knowledge in this way, the language models can access and reason over the information more flexibly, bringing enhanced performance on tasks requiring sophisticated reasoning. We conduct deep explorations of both in-context and in-parameter knowledge injection, highlighting their respective advantages and limitations. Through comprehensive experiments across tasks of varying complexity, we demonstrate that in-parameter knowledge injection is particularly advantageous for complex tasks requiring deep reasoning, while in-context injection remains effective for simpler tasks where the answer can be directly extracted. Our findings provide practical guidance for selecting appropriate knowledge augmentation strategies based on the complexity of the task.We have open-sourced all the code, data, and models in the following anonymous GitHub link:https://anonymous.4open.science/r/In-parameter-Knowledge-Injection/

3287Optimization on Manifolds with Riemannian Jacobian Regularization

[openreview] [pdf]

Abstract Understanding the effectiveness of intrinsic geometry in enhancing a model’s generalization ability, we draw upon prior works that apply geometric principles to optimization and present a novel approach to improve robustness and generalization for constrained optimization problems. This work aims to strengthen the sharpness-aware optimizers and proposes a novel Riemannian optimizer. We first present a theoretical analysis that characterizes the relationship between the general loss and the perturbation of the empirical loss in the context of Riemannian manifolds. Motivated by the result obtained from this analysis, we introduce our algorithm named Riemannian Jacobian Regularization (RJR), which explicitly regularizes the Riemannian gradient norm and the projected Hessian. To demonstrate RJR’s ability to enhance generalization, we evaluate and contrast our algorithm on a broad set of problems, such as image classification and contrastive learning across different datasets with various architectures.

3288TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring

[openreview] [pdf]

Abstract Text-to-SQL allows users to interact with databases using natural language, simplifying information retrieval. However, widespread deployment remains limited for two main reasons: (1) existing benchmarks focus solely on feasible questions that can always be mapped to SQL queries, overlooking infeasible questions that cannot, and (2) current models lack abstention mechanisms, posing the risk of providing incorrect answers. To address these gaps, we introduce TrustSQL, a new benchmark designed to evaluate text-to-SQL reliability—quantified by our proposed Reliability Score (RS) that measures the model’s potential helpfulness relative to its harmfulness. TrustSQL is constructed by re-annotating three datasets—ATIS, Advising, and EHRSQL—and incorporating infeasible questions for a more comprehensive evaluation of models on diverse inputs. We evaluate text-to-SQL models integrated with various abstention mechanisms such as using classifiers and uncertainty estimation. Our experiments show that only a few models achieve a positive score under high penalty settings, indicating that most models are unsuitable for deployment as they fail to meet safety requirements (i.e., potential harmfulness outweighs helpfulness). This underscores the need for developing models that not only improve SQL generation but also guarantee a certain degree of reliability. Additionally, we provide detailed analyses across different types of feasible and infeasible questions, offering insights for building more reliable text-to-SQL models.

3289Enhancing Human Body Generation in Diffusion Models with Dual-Level Prior Knowledge

[openreview] [pdf]

Abstract The development of diffusion models (DMs) has greatly enhanced text-to-image generation, outperforming previous methods like generative adversarial networks (GANs) in terms of image quality and text alignment. However, accurately generating human body images remains challenging, often resulting in disproportionate figures and anatomical errors, which limits their practical applications in areas such as portrait generation. While previous methods such as HcP have shown promising results, limitations including wrongly kept prior, insufficient human-related knowledge, and limited generalization ability still exist due to the specific design of fully-supervised learning with only pose-related information. In this study, we introduce a novel method to enhance pretrained diffusion models for realistic human body generation by incorporating dual-level human prior knowledge. Our approach involves learning shape-level details with the human-related tokens in the original prompts, and learning pose-level prior by adding a learnable pose-aware token to each text prompt. We use a two-stage training strategy to rectify the cross attentions with a bind-then-generalize process, leveraging multiple novel objectives along with adversarial training. Our extensive experiments show that this method significantly improves the ability of SD1.5 and SDXL pretrained models to generate human bodies, reducing deformities and enhancing practical utility.

3290Grey-box Prompt Optimization and Fine-Tuning for Cloud-Edge LLM Agents

[openreview] [pdf]

Abstract Large Language Models (LLMs) are transforming the landscape of generative AI, delivering groundbreaking performance across diverse tasks. Yet, their immense model sizes tether most LLMs to the cloud, posing challenges for tasks that demand processing private and proprietary data. In this paper, we introduce a grey-box prompt optimization and fine-tuning framework for cloud-edge LLMs-paving the way for a seamless, hybrid approach that merges the best of both private and public cloud environments. This framework not only boosts flexibility and scalability but also empowers users with heightened security and compliance, optimizing cost and performance. Beyond that, it ensures robust disaster recovery and business continuity through redundancy and smart workload distribution. At the heart of our solution is an efficient algorithm with guaranteed convergence, specifically tailored to the structure of the grey-box optimization problem. We rigorously analyze and derive its non-asymptotic convergence rate. Our extensive experiments reveal that sandwiched tuning-our novel fine-tuning method-delivers up to a 47.9% performance improvement over traditional methods across multiple tasks.

3291Safety-Advanced Autonomous Driving for Urgent Hazardous Situations using Q-Compared Soft Actor-Critic

[openreview] [pdf]

Abstract Autonomous vehicles must be capable of safe driving under all conditions to ensure passenger safety. This includes urgent hazardous situations (UHS), such as skidding on slippery roads or tire grip saturation during high-speed driving, which are not only difficult even for expert human drivers but also challenging to develop autonomous driving technologies that surpass human capabilities. Even though the recent advancements in machine learning including imitation learning (IL), reinforcement learning (RL), and hybrid learning (HL) have enabled the safe navigation of autonomous vehicles in various complex scenarios, they have fundamental limitations in UHS. Driving policies trained via IL degrade in novel situations where expert demonstration data is scarce or of poor quality, and RL struggles to develop optimal driving policies in UHS, which have broad state and action spaces and high transition variance. HL techniques combining IL and RL also fall short, as they require nearly optimal demonstration data, which is nearly impossible to obtain in UHS due to the difficulty for human drivers to react appropriately. To address these limitations, we propose a novel HL technique, Q-Compared Soft Actor-Critic (QC-SAC), which effectively utilizes immature demonstration data to develop optimal driving policies and adapt quickly to novel situations in UHS. QC-SAC evaluates the quality of demonstration data based on action value Q to prioritize beneficial data and disregard detrimental ones. Furthermore, QC-SAC improves the performance of the Q-network by leveraging demonstration data and enhances learning by rapidly incorporating new successful experiences from ongoing interactions, enabling fast adaptation to new situations. We test QC-SAC for two extreme UHS scenarios: oversteer control with collision avoidance (OCCA) and time-trial race (TTR). In OCCA, QC-SAC achieves a success rate 2.36 times higher than existing techniques, and in TTR, it reduces lap time by more than 13.6% while completing 300 test runs without a single failure. By proposing an innovative HL technique capable of training superior driving policies with immature demonstration data, we provide a solution for autonomous driving technologies that can handle UHS and introduce the world-first safe-advanced autonomous driving technology capable of controlling a vehicle oversteer safely and avoiding obstacles ahead.

3292Contradiction Retrieval Via Sparse-Aware Sentence Embedding

[openreview] [pdf]

Abstract Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve contradiction argument to the query from large document corpora, existing methods such as similarity search and crossencoder models exhibit significant limitations. The former struggles to capture the essence of contradiction due to its inherent nature of favoring similarity, while the latter suffers from computational inefficiency, especially when the size of corpora is large. To address these challenges, we introduce a novel approach: SparseCL that leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.

3293Enhancing Cost Efficiency in Active Learning with Candidate Set Query

[openreview] [pdf]

Abstract This paper introduces a cost-efficient active learning (AL) framework for classification, featuring a novel query design called candidate set query. Unlike traditional AL queries requiring the oracle to examine all possible classes, our method narrows down the set of candidate classes likely to include the ground-truth class, significantly reducing the search space and labeling cost. Moreover, we leverage conformal prediction to dynamically generate small yet reliable candidate sets, adapting to model enhancement over successive AL rounds. To this end, we introduce an acquisition function designed to prioritize data points that offer high information gain at lower cost. Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the effectiveness and scalability of our framework. Notably, it reduces labeling cost by 42% on ImageNet64x64.

3294ChatSR: Conversational Symbolic Regression

[openreview] [pdf]

Abstract Formulas are the language of communication between humans and nature. It is an important research topic of artificial intelligence to find expressions from observed data to reflect the relationship between each variable in the data, which is called a symbolic regression problem. The existing symbolic regression methods directly generate expressions according to the given observation data, but we cannot require the algorithm to generate expressions that meet specific requirements according to the known prior knowledge. For example, the expression needs to contain the symbol `sin\sin’ or be periodicity, and so on. Even if it can, it often requires very complex operations, which is very inconvenient. In this paper, based on multi-modal large language models, we propose ChatSR, a conversational symbolic regression method that can generate expressions that meet the requirements simply by describing the requirements with natural language instructions. By experimenting on the test datasets, we can demonstrate that ChatSR leads the state-of-the-art baselines in fitting performance. More notably, ChatSR can well understand the prior knowledge contained in natural language prompts, and can further improve the quality of generated expressions according to the prior knowledge. In addition, it is exciting that ChatSR has good zero-shot capability.

3295Poisoning with A Pill: Circumventing Detection in Federated Learning

[openreview] [pdf]

Abstract Federated learning (FL) protects data privacy by enabling distributed model training without direct access to client data. However, its distributed nature makes it vulnerable to model and data poisoning attacks. While numerous defenses filter malicious clients using statistical metrics, they overlook the role of model redundancy, where not all parameters contribute equally to the model/attack performance. Current attacks manipulate all model parameters uniformly, making them more detectable, while defenses focus on the overall statistics of client updates, leaving gaps for more sophisticated attacks. We propose an attack-agnostic augmentation method to enhance the stealthiness and effectiveness of existing poisoning attacks in FL, exposing flaws in current defenses and highlighting the need for fine-grained FL security. Our three-stage methodology—pill construction\textit{pill construction}, pill poisoning\textit{pill poisoning}, and pill injection\textit{pill injection}—injects poison into a compact subnet (i.e., pill) of the global model during the iterative FL training. Experimental results show that FL poisoning attacks enhanced by our method can bypass 8 state-of-the-art (SOTA) defenses, gaining an up to 7x error rate increase, as well as on average a more than 2x error rate increase on both IID and non-IID data, in both cross-silo and cross-device FL systems.

3296NegMerge: Consensual Weight Negation for Strong Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning aims to selectively remove specific knowledge from a model. Current methods, such as task arithmetic, rely on fine-tuning models on the forget set, generating a task vector, and subtracting it from the original model. However, we argue the effectiveness of this approach is highly sensitive to hyperparameter selection, necessitating careful validation to identify the best model among many fine-tuned candidates. In this paper, we propose a novel method that leverages all given fine-tuned models rather than selecting a single one. By constructing task vectors from models trained with varied hyperparameters and merging only the components of the task vectors with consistent signs, we perform unlearning by negating the merged task vector from the original model. Given that existing methods also utilize multiple fine-tuned models, our approach delivers more effective unlearning without incurring additional computational costs. We demonstrate the effectiveness of our method on both vision-language models and standard image classification models, showing improved unlearning performance with minimal degradation on the retain set, outperforming state-of-the-art techniques.

3297Beyond Squared Error: Exploring Loss Design for Enhanced Training of Generative Flow Networks

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are a novel class of generative models designed to sample from unnormalized distributions and have found applications in various important tasks, attracting great research interest in their training algorithms. In general, GFlowNets are trained by fitting the forward flow to the backward flow on sampled training objects. Prior work focused on the choice of training objects, parameterizations, sampling and resampling strategies, and backward policies, aiming to enhance credit assignment, exploration, or exploitation of the training process. However, the choice of regression loss, which can highly influence the exploration and exploitation behavior of the under-training policy, has been overlooked. Due to the lack of theoretical understanding for choosing an appropriate regression loss, most existing algorithms train the flow network by minimizing the squared error of the forward and backward flows in log-space, i.e., using the quadratic regression loss. In this work, we rigorously prove that distinct regression losses correspond to specific divergence measures, enabling us to design and analyze regression losses according to the desired properties of the corresponding divergence measures. Specifically, we examine two key properties: zero-forcing and zero-avoiding, where the former promotes exploitation and higher rewards, and the latter encourages exploration and enhances diversity. Based on our theoretical framework, we propose three novel regression losses, namely, Shifted-Cosh, Linex(1/2), and Linex(1). We evaluate them across three benchmarks: hyper-grid, bit-sequence generation, and molecule generation. Our proposed losses are compatible with most existing training algorithms, and significantly improve the performances of the algorithms concerning convergence speed, sample diversity, and robustness.

3298AttentionNCE: Contrastive Learning with Instance Attention

[openreview] [pdf]

Abstract Contrastive learning has found extensive applications in computer vision, natural language processing, and information retrieval, significantly advancing the frontier of self-supervised learning. However, the limited availability of labels poses challenges in contrastive learning, as the positive and negative samples can be noisy, adversely affecting model training. To address this, we introduce instance-wise attention into the variational lower bound of contrastive loss, and proposing the AttentionNCE loss accordingly. AttentioNCE incorporates two key components that enhance contrastive learning performance: First, it replaces instance-level contrast with attention-based sample prototype contrast, helping to mitigate noise disturbances. Second, it introduces a flexible hard sample mining mechanism, guiding the model to focus on high-quality, informative samples. Theoretically, we demonstrate that optimizing AttentionNCE is equivalent to optimizing the variational lower bound of contrastive loss, offering a worst-case guarantee for maximum likelihood estimation under noisy conditions. Empirically, we apply AttentionNCE to popular contrastive learning frameworks and validate its effectiveness. The code is released at: \url{https://anonymous.4open.science/r/AttentioNCE-55EB}

3299Unlocking the Potential of Model Calibration in Federated Learning

[openreview] [pdf]

Abstract Over the past several years, various federated learning (FL) methodologies have been developed to improve model accuracy, a primary performance metric in machine learning. However, to utilize FL in practical decision-making scenarios, beyond considering accuracy, the trained model must also have a reliable confidence in each of its predictions, an aspect that has been largely overlooked in existing FL research. Motivated by this gap, we propose Non-Uniform Calibration for Federated Learning (NUCFL), a generic framework that integrates FL with the concept of model calibration. The inherent data heterogeneity in FL environments makes model calibration particularly difficult, as it must ensure reliability across diverse data distributions and client conditions. Our NUCFL addresses this challenge by dynamically adjusting the model calibration objectives based on statistical relationships between each client’s local model and the global model in FL. In particular, NUCFL assesses the similarity between local and global model relationships, and controls the penalty term for the calibration loss during client-side local training. By doing so, NUCFL effectively aligns calibration needs for the global model in heterogeneous FL settings while not sacrificing accuracy. Extensive experiments show that NUCFL offers flexibility and effectiveness across various FL algorithms, enhancing accuracy as well as model calibration.

3300BOOST: Enhanced Jailbreak of Large Language Model via Slient eos Tokens

[openreview] [pdf]

Abstract Along with the remarkable successes of Language language models, recent research also started to explore the security threats of LLMs, including jailbreaking attacks. Attackers carefully craft jailbreaking prompts such that a target LLM will respond to the harmful question. Existing jailbreaking attacks require either human experts or leveraging complicated algorithms to craft jailbreaking prompts. In this paper, we introduce BOOST, a simple attack that leverages only the eos tokens. We demonstrate that rather than constructing complicated jailbreaking prompts, the attacker can simply append a few eos tokens to the end of a harmful question. It will bypass the safety alignment of LLMs and lead to successful jailbreaking attacks. We further apply BOOST to four representative jailbreak methods and show that the attack success rates of these methods can be significantly enhanced by simply adding eos tokens to the prompt. To understand this simple but novel phenomenon, we conduct both theoretical and empirical analyses. Our analysis reveals that (1) adding eos tokens makes the target LLM believe the input is much less harmful, and (2) eos tokens have low attention values and do not affect LLM’s understanding of the harmful questions, leading the model to actually respond to the questions. Our findings uncover how fragile an LLM is against jailbreak attacks, motivating the development of strong safety alignment approaches.large language model, Jailbreak

3301Think while You Generate: Discrete Diffusion with Planned Denoising

[openreview] [pdf]

Abstract Discrete diffusion has achieved state-of-the-art performance, outperforming or approaching autoregressive models on standard benchmarks. In this work, we introduceDiscrete Diffusion with Planned Denoising(DDPD), a novel framework that separates the generation process into two models: a planner and a denoiser. At inference time, the planner selects which positions to denoise next by identifying the most corrupted positions in need of denoising, including both initially corrupted and those requiring additional refinement. This plan-and-denoise approach enables more efficient reconstruction during generation by iteratively identifying and denoising corruptions in the optimal order. DDPD outperforms traditional denoiser-only mask diffusion methods, achieving superior results on language modeling benchmarks such astext8,OpenWebText, and token-based generation onImageNet 256 × 256. Notably, in language modeling, DDPD significantly reduces the performance gap between diffusion-based and autoregressive methods in terms of generative perplexity.

3302CPSample: Classifier Protected Sampling for Guarding Training Data During Diffusion

[openreview] [pdf]

Abstract Diffusion models have a tendency to exactly replicate their training data, especially when trained on small datasets. Most prior work has sought to mitigate this problem by imposing differential privacy constraints or masking parts of the training data, resulting in a notable substantial decrease in image quality. We present CPSample, a method that modifies the sampling process to prevent training data replication while preserving image quality. CPSample utilizes a classifier that is trained to overfit on random binary labels attached to the training data. CPSample then uses classifier guidance to steer the generation process away from the set of points that can be classified with high certainty, a set that includes the training data. CPSample achieves FID scores of 4.97 and 2.97 on CIFAR-10 and CelebA-64, respectively, without producing exact replicates of the training data. Unlike prior methods intended to guard the training images, CPSample only requires training a classifier rather than retraining a diffusion model, which is computationally cheaper. Moreover, our technique provides diffusion models with greater robustness against membership inference attacks, wherein an adversary attempts to discern which images were in the model’s training dataset. We show that CPSample behaves like a built-in rejection sampler, and we demonstrate its capabilities to prevent mode collapse in Stable Diffusion.

3303Learning Fused State Representations for Control from Multi-View Observations

[openreview] [pdf]

Abstract In visual control tasks, leveraging observations from multiple views enables Reinforcement Learning (RL) agents to perceive the environment more effectively. However, while multi-view observations enrich decision-making information, they also increase the dimension of observation space and introduce more redundant information. Thus, how to learn compact and task-relevant representations from multi-view observations for downstream RL tasks remains a challenge. In this paper, we propose a Multi-view Fusion State for Control (MFSC), which integrates a self-attention mechanism with bisimulation metric learning to fuse task-relevant representations from multi-view observations. To foster more compact fused representations, we also incorporate a mask-based latent reconstruction auxiliary task to learn cross-view information. Additionly, this mechanism of mask and reconstruction can enpower the model with the ability to handle missing views by learning an additional mask tokens. We conducted extensive experiments on the Meta-World and Pybullet benchmarks, and the results demonstrate that our proposed method outperforms other multi-view RL algorithms and effectively aggregates task-relevant details from multi-view observations, coordinating attention across different views.

3304FairlyUncertain: A Comprehensive Benchmark of Uncertainty in Algorithmic Fairness

[openreview] [pdf]

Abstract Fair predictive algorithms hinge on both equality and trust, yet inherent uncertainty in real-world data challenges our ability to make consistent, fair, and calibrated decisions. While fairly managing predictive error has been extensively explored, some recent work has begun to address the challenge of fairly accounting for irreducible prediction uncertainty. However, a clear taxonomy and well-specified objectives for integrating uncertainty into fairness remains undefined. We address this gap by introducing FairlyUncertain, an axiomatic benchmark for evaluating uncertainty estimates in fairness. Our benchmark posits that fair predictive uncertainty estimates should be consistent across learning pipelines and calibrated to observed randomness. Through extensive experiments on 10 popular fairness datasets, our evaluation reveals: (1) A theoretically justified and simple method for estimating uncertainty in binary settings is more consistent and calibrated than prior work; (2) Abstaining from binary predictions, even with improved uncertainty estimates, reduces error but does not alleviate outcome imbalances between demographic groups; (3) Incorporating consistent and calibrated uncertainty estimates in regression tasks improves fairness without any explicit fairness interventions. Our benchmark package is designed to be extensible and open-source. By providing a standardized framework for assessing the interplay between uncertainty and fairness, FairlyUncertain paves the way for more equitable and trustworthy machine learning practices.

3305Knowledge And Capability Transfer Through Large Language Models’ Parameters Fusing

[openreview] [pdf]

Abstract The post-training phase of large language models (LLMs) plays a pivotal role in refining models to follow instructions and align with human preferences. However, this phase is fraught with challenges, particularly in sourcing high-quality post-training data. This paper introduces a novel approach, termed Parameters Fusing, that simplifies the post-training process by amalgamating model parameters delta from existing instruct-tuned checkpoints with a new base model tailored to specific domain data obtained by continual pre-training. Utilizing open-weight models such as Meta’s Llama, our method replicates the effects of the traditional post-training phase while significantly reducing both time and resource costs. Moreover, it facilitates the customization of model attributes (e.g., tool usage, instruction-following, coding proficiency, and tonal qualities) by adjusting parameter deltas from multiple checkpoints. This approach not only minimizes the challenges of post-training data acquisition but also provides a flexible and efficient framework for enhancing LLMs with domain-specific knowledge or capabilities.

3306Accelerated Over-Relaxation Heavy-Ball Method: Achieving Global Accelerated Convergence with Broad Generalization

[openreview] [pdf]

Abstract The heavy-ball momentum method is widely used to accelerate gradient descent by incorporating a momentum term. However, recent studies have shown it cannot achieve accelerated convergence for general smooth strongly convex problems. This work introduces the Accelerated Over-Relaxation Heavy-Ball (AOR-HB) method, the first heavy-ball variant with provable global and accelerated convergence for smooth strongly convex optimization. This breakthrough closes a long-standing theoretical gap and extends to composite convex optimization and min-max problems, achieving optimal complexity bounds and demonstrating broad generalization. The AOR-HB approach offers several advantages: (1) Generality: It applies to a wider range of optimization problems, (2) Impact: It introduces techniques that may reshape understanding of acceleration, and (3) Elegance: It is conceptually clearer and more intuitive than existing accelerated methods.

3307SAGMAN: Stability Analysis of Graph Neural Networks on the Manifolds

[openreview] [pdf]

Abstract Modern graph neural networks (GNNs) can be sensitive to changes in the input graph structure and node features, potentially resulting in unpredictable behavior and degraded performance. In this work, we introduce a spectral framework known as SAGMAN for examining the stability of GNNs. This framework assesses the distance distortions that arise from the nonlinear mappings of GNNs between the input and output manifolds: when two nearby nodes on the input manifold are mapped (through a GNN model) to two distant ones on the output manifold, it implies a large distance distortion and thus a poor GNN stability. We propose a distance-preserving graph dimension reduction (GDR) approach that utilizes spectral graph embedding and probabilistic graphical models (PGMs) to create low-dimensional input/output graph-based manifolds for meaningful stability analysis. Our empirical evaluations show that SAGMAN effectively assesses the stability of each node when subjected to various edge or feature perturbations, offering a scalable approach for evaluating the stability of GNNs, extending to applications within recommendation systems. Furthermore, we illustrate its utility in downstream tasks, notably in enhancing GNN stability and facilitating adversarial targeted attacks.

3308MetaGFN: Exploring Distant Modes with Adapted Metadynamics for Continuous GFlowNets

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are a class of generative models that sample objects in proportion to a specified reward function through a learned policy. They can be trained either on-policy or off-policy, needing a balance between exploration and exploitation for fast convergence to a target distribution. While exploration strategies for discrete GFlowNets have been studied, exploration in the continuous case remains to be investigated, despite the potential for novel exploration algorithms due to the local connectedness of continuous domains. Here, we introduce Adapted Metadynamics, a variant of metadynamics that can be applied to arbitrary black-box reward functions on continuous domains. We use Adapted Metadynamics as an exploration strategy for continuous GFlowNets. We show several continuous domains where the resulting algorithm, MetaGFN, accelerates convergence to the target distribution and discovers more distant reward modes than previous off-policy exploration strategies used for GFlowNets.

3309Abstracting and Refining Provably Sufficient Explanations of Neural Network Predictions

[openreview] [pdf]

Abstract Despite significant advancements in post-hoc explainability techniques for neural networks, many current methods rely on approximations and heuristics and do not provide formally provable guarantees over the explanations provided. Recent work has shown that it is possible to obtain explanations with formal guarantees by identifying subsets of input features that are sufficient to determine that predictions remain unchanged by incorporating neural network verification techniques. Despite the appeal of these explanations, their computation faces significant scalability challenges. In this work, we address this gap by proposing a novel abstraction-refinement technique for efficiently computing provably sufficient explanations of neural network predictions. Our methodabstractsthe original large neural network by constructing a substantially reduced network, where a sufficient explanation of the reduced network is alsoprovably sufficientfor the original network, hence significantly speeding up the verification process. If the explanation is insufficient on the reduced network, we iterativelyrefinethe network size (by gradually increasing it) until convergence. Our experimental results demonstrate that our approach substantially enhances the efficiency of obtaining provably sufficient explanations for neural network predictions while additionally providing a fine-grained interpretation of the network’s decisions across different abstraction levels. We thus regard this work as a substantial step forward in improving the feasibility of computing explanations with formal guarantees for neural networks.

3310ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving

[openreview] [pdf]

Abstract End-to-end differentiable learning has emerged as a prominent paradigm in autonomous driving (AD). A significant bottleneck in this approach is its substantial demand for high-quality labeled data, such as 3D bounding boxes and semantic segmentation, which are especially expensive to annotate manually. This challenge is exacerbated by the long tailed distribution in AD datasets, where a substantial portion of the collected data might be trivial (e.g. simply driving straight on a straight road) and only a minority of instances are critical to safety. In this paper, we propose ActiveAD, a planning-oriented active learning strategy designed to enhance sampling and labeling efficiency in end-to-end autonomous driving. ActiveAD progressively annotates parts of collected raw data based on our newly developed metrics. We design innovative diversity metrics to enhance initial sample selection, addressing the cold-start problem. Furthermore, we develop uncertainty metrics to select valuable samples for the ultimate purpose of route planning during subsequent batch selection. Empirical results demonstrate that our approach significantly surpasses traditional active learning methods. Remarkably, our method achieves comparable results to state-of-the-art end-to-end AD methods - by using only 30% data in both open-loop nuScenes and closed-loop CARLA evaluation.

3311Advancing Portfolio Optimization: Hybrid Relaxation and Heuristic Approaches for Cardinality-Constrained MIQP Problems

[openreview] [pdf]

Abstract The growing magnitude of investments in global markets has intensified the need for sophisticated risk mitigation strategies in portfolio optimization. Traditional portfolio optimization models that seek to minimize risk for a specified return frequently incorporate cardinality constraints, rendering them as Mixed-Integer Quadratic Programming (MIQP) challenges. These constraints elevate the problem to NP-Hard status, complicating the solution process. While heuristic methods have historically been favored for their direct approach to MIQP problems, relaxation techniques offer a strategic alternative by simplifying MIQP into a more tractable Quadratic Programming (QP) problem. We first introduce an approach that facilitates the conversion of MIQP to QP by relaxing integer constraints into continuous domains and integrating integer conditions into the objective function using Lagrange multipliers. This dual application not only eases the computational burden but preserves the integrity of the original problem’s structure. An innovative diagonalization technique applied to the covariance matrix further refines our method, enhancing the fit for integer variables, as Lagrange multipliers are inherently biased towards continuous variables. We present a comparative analysis of three distinct models, Linear, Dual, and Diagonal, each employing a unique relaxation strategy. Our research evaluates their efficacy in addressing the MIQP problem under cardinality constraints. In conjunction with heuristic methods, the refined solutions from our exact relaxation models serve as a starting point for further refinement using Genetic Algorithm and Neighborhood Searching Algorithm. This hybrid methodology yields results that not only rival but occasionally surpass those achieved by the latest models and the commercial solver CPLEX. Our findings endorse the potential of combining exact and heuristic techniques in portfolio optimization, marking a significant advancement in the field.

3312Feature-guided score diffusion for sampling conditional densities

[openreview] [pdf]

Abstract Score diffusion methods can learn probability densities from samples. The score of the noise-corrupted density is estimated using a deep neural network, which is then used to iteratively transport a Gaussian white noise density to a target density. Variants for conditional densities have been developed, but correct estimation of the corresponding scores is difficult. We avoid these difficulties by introducing an algorithm that operate by projecting the score onto the target class mean in a learned feature space. The features and the projected score are computed using the same network, which is trained by optimizing a single denoising loss. Learned feature vectors of same-class images are tightly clustered relative to those of different classes. We show that feature class centroids provide a low-dimensional Euclidean embedding of the class conditional densities. We demonstrate that, when trained on a dataset of mixed image classes, this projected score can generate high quality and diverse samples from the conditioning class. Conditional generation can be performed using feature vectors interpolated between those of the training set, demonstrating out-of-distribution generalization.

3313Routoo: Learning to Route to Large Language Models Effectively

[openreview] [pdf]

Abstract LLMs with superior response quality—particularly larger or closed-source models—often come with higher inference costs, making their deployment inefficient and costly. Meanwhile, developing foundational LLMs from scratch is becoming increasingly resource-intensive and impractical for many applications. To address the challenge of balancing quality and cost, we introduce Routoo, an architecture designed to optimize the selection of LLMs for specific prompts based on performance, cost, and efficiency. Routoo provides controllability over the trade-off between inference cost and quality, enabling significant reductions in inference costs for a given quality requirement. Routoo comprises two key components: a performance predictor and cost-aware selector. The performance predictor is a lightweight LLM that estimates the expected performance of various underlying LLMs on a given prompt without executing them. The cost-aware selector module then selects the most suitable model based on these predictions and constraints such as cost and latency, significantly reducing inference costs for the same quality. We evaluated Routoo using the MMLU benchmark across 57 domains employing open-source models. Our results show that Routoo matches the performance of the Mixtral 8x7b model while reducing inference costs by one-third. Additionally, by allowing increased costs, Routoo surpasses Mixtral’s accuracy by over 5% at equivalent costs, achieving an accuracy of 75.9%. When integrating GPT4 into our model pool, Routoo nearly matches GPT4’s performance at half the cost and exceeds it with a 25% cost reduction. These outcomes highlight Routoo’s potential to significantly reduce inference costs without compromising quality, and even to establish new state-of-the-art results by leveraging the collective capabilities of multiple LLMs.

3314Bayesian Analysis of Combinatorial Gaussian Process Bandits

[openreview] [pdf]

Abstract We consider the combinatorial volatile Gaussian process (GP) semi-bandit problem. Each round, an agent is provided a set of available base arms and must select a subset of them to maximize the long-term cumulative reward. We study the Bayesian setting and provide novel Bayesian cumulative regret bounds for three GP-based algorithms: GP-UCB, GP-BayesUCB and GP-TS. Our bounds extend previous results for GP-UCB and GP-TS to the \emph{infinite}, \emph{volatile} and \emph{combinatorial} setting, and to the best of our knowledge, we provide the first regret bound for GP-BayesUCB. Volatile arms encompass other widely considered bandit problems such as contextual bandits. Furthermore, we employ our framework to address the challenging real-world problem of online energy-efficient navigation, where we demonstrate its effectiveness compared to the alternatives.

3315VideoAlchemy: Open-set Personalization in Video Generation

[openreview] [pdf]

Abstract Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present VideoAlchemy VideoAlchemy~- a video model equipped with built-in multi-subject, open-set personalization capabilities for both foreground objects and backgrounds, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each reference image conditioning and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: datasetdataset and evaluationevaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we opt to sample video frames as reference images and synthesize entire videos. This approach, however, introduces data biases issue, where models can easily denoise training videos but fail to generalize to new contexts during inference. To mitigate these issue, we carefully design a new automatic data construction pipeline with extensive image augmentation and sampling techniques. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a new personalization benchmark with evaluation protocols focusing on accurate subject fidelity assessment and accommodating different types of personalization conditioning. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods, regarding quantitative and qualitative evaluations.

3316Test-Time Fairness and Robustness in Large Language Models

[openreview] [pdf]

Abstract Frontier Large Language Models (LLMs) can be socially discriminatory or sensitive to spurious features of their inputs. Because only well-resourced corporations can train frontier LLMs, we need robust test-time strategies to control such biases. Existing solutions, which instruct the LLM to be fair or robust, rely on the model’s implicit understanding of bias. Causality provides a rich formalism through which we can be explicit about our debiasing requirements. Yet, as we show, a naive application of the standard causal debiasing strategy, counterfactual data augmentation, fails under standard assumptions to debias predictions at an individual level at test time. To address this, we develop a stratified notion of debiasing called stratified invariance, which can capture a range of debiasing requirements from population level to individual level through an additional measurement that stratifies the predictions. We present a complete observational test for stratified invariance. Finally, we introduce a data augmentation strategy that guarantees stratified invariance at test time under suitable assumptions, together with a prompting strategy that encourages stratified invariance in LLMs. We show that our prompting strategy, unlike implicit instructions, consistently reduces the bias of frontier LLMs across a suite of synthetic and real-world benchmarks without requiring additional data, finetuning or pre-training.

3317On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

[openreview] [pdf]

Abstract Looped Transformers offer advantages in parameter efficiency and Turing completeness. However, their expressive power for function approximation and approximation rate remains underexplored. In this paper, we establish approximation rates of Looped Transformers by defining the concept of the modulus of continuity for sequence-to-sequence functions. This reveals a limitation specific to the looped architecture. That is, the analysis prompts us to incorporate scaling parameters for each loop, conditioned on timestep encoding. Experimental results demonstrate that increasing the number of loops enhances performance, with further gains achieved through the timestep encoding architecture.

3318Large Language Models Can Self-Improve At Web Agent Tasks

[openreview] [pdf]

Abstract Training models to act as agents that can effectively navigate and perform actions in a complex environment, such as a web browser, has typically been challenging due to lack of training data. Large language models (LLMs) have recently demonstrated some capability to navigate novel environments as agents in a zero-shot or few-shot fashion, purely guided by natural language instructions as prompts. Recent research has also demonstrated LLMs have the capability to exceed their base performance through self-improvement, i.e. fine-tuning on data generated by the model itself. In this work, we explore the extent to which LLMs can self-improve their performance as agents in long-horizon tasks in a complex environment using the WebArena benchmark. In WebArena, an agent must autonomously navigate and perform actions on web pages to achieve a specified objective. We explore fine-tuning on three distinct synthetic training data mixtures and achieve a 31% improvement in task completion rate over the base model on the WebArena benchmark through a self-improvement procedure. We additionally contribute novel evaluation metrics for assessing the performance, robustness, capabilities, and quality of trajectories of our fine-tuned agent models to a greater degree than simple, aggregate-level benchmark scores currently used to measure self-improvement.

3319Salvador Urban Network Transportation (SUNT): A Landmark Spatiotemporal Dataset for Public Transportation

[openreview] [pdf]

Abstract Efficient public transportation management is essential for the development of large urban centers, providing several benefits such as comprehensive coverage of population mobility, improvement of the local economy with the offer of new jobs and the decrease of transport costs, better control of traffic congestion, and significant reduction of environmental impact limiting gas emissions and pollution. Realizing these benefits requires carefully pursuing two essential pathways: (i) deeply understanding the population and transit patterns and (ii) using intelligent approaches to model multiple relations and characteristics efficiently. This work addresses these challenges by providing a novel dataset that includes various public transportation components alongside machine learning models trained to understand and predict different real-world behaviors. Our dataset comprises daily information from about 710,000 passengers in Salvador, one of Brazil’s largest cities, and local public transportation data with approximately 2,000 vehicles operating across nearly 400 lines, connecting almost 3,000 stops and stations. As benchmarks, we have fine-tuned diverse Graph Neural Networks to perform inference on vertices and edges, undertaking both regression and classification tasks. These models leverage temporal and spatial features concerning passengers and transportation data. We emphasize the greatest advantage of using our dataset lies in different possibilities of modeling a real-world urban mobility dataset, reproducing our results, overcoming our models, and investigating several other open-problem situations listed in this manuscript as future work, which include the designing of new methods, optimization strategies, and environmental approaches. Our dataset, codes, and models are available athttps://github.com/suntdataset/sunt.git.

3320Bridging the Gap Betweenf-divergences and Bayes Hilbert Spaces

[openreview] [pdf]

Abstract We introduce a novel framework that generalizes ff-divergences by incorporating locally non-convex divergence-generating functions. Using this extension, we define a new class of pseudo ff-divergences, encompassing a wider range of distances between distributions that traditional ff-divergences cannot capture. Among these, we focus on a particular pseudo divergence obtained by considering the induced metric of Bayes Hilbert spaces. Bayes Hilbert spaces are frequently used due to their inherent connection to Bayes’s theorem. They allow sampling from potentially intractable posterior densities, which has remained challenging until now. In the more general context, we prove that pseudo ff-divergences are well-defined and introduce a variational estimation framework that can be used in a statistical learning context. By applying this variational estimation framework to ff-GANs, we achieve improved FID scores over existing ff-GAN architectures and competitive results with the Wasserstein GAN, highlighting its potential for both theoretical research and practical applications in learning theory.

3321Combining Induction and Transduction for Abstract Reasoning

[openreview] [pdf]

Abstract When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples, or is it better to directly predict new test outputs? We study this question on ARC, a highly diverse dataset of abstract reasoning tasks. We train neural models for induction (inferring latent functions) and transduction (directly predicting the test output for a given test input). Our models are trained on synthetic data generated by prompting LLMs to produce Python code specifying a function to be inferred, plus a stochastic subroutine for generating inputs to that function. We find inductive and transductive models solve very different problems, despite training on the same data, and having the same architecture.

3322Augmenting Offline Reinforcement Learning with State-only Interactions

[openreview] [pdf]

Abstract Batch offline data have been shown considerably beneficial for reinforcement learning. Their benefit is further amplified by upsampling with generative models. In this paper, we consider a novel opportunity where interaction with environment is feasible, but only restricted to observations, i.e.no rewardfeedback is available. This setting is realistic, because simulators or even real cyber-physical systems are often accessible, while in contrast reward is often difficult or expensive to obtain, similar to imitation learning settings. As a result, the learner must make best sense of the offline data to synthesize the most sample-efficient scheme of querying the transition of observation. Our method first leverages online interactions to generate high-return trajectories via conditional diffusion models. They are then blended with the original offline trajectories through a stitching algorithm, and the resulting augmented data is applied to downstream reinforcement learner. Superior empirical performance is demonstrated over state-of-the-art data augmentation methods that are extended to utilize observation-only interactions.

3323Bayes’ Power for Explaining In-Context Learning Generalizations

[openreview] [pdf]

Abstract Traditionally, neural network training has been primarily viewed as an approximation of maximum likelihood estimation (MLE). This interpretation originated in a time when training for multiple epochs on small datasets was common and performance was data bound; but it falls short in the era of large-scale single-epoch trainings ushered in by large self-supervised setups, like language models. In this new setup, performance is compute-bound, but data is readily available. As models became more powerful, in-context learning (ICL), i.e., learning in a single forward-pass based on the context, emerged as one of the dominant paradigms. In this paper, we argue that a more useful interpretation of neural network behavior in this era is as an approximation of the true posterior, as defined by the data-generating process. We demonstrate this interpretations’ power for ICL and its usefulness to predict generalizations to previously unseen tasks. We show how models become robust in-context learners by effectively composing knowledge from their training data. We illustrate this with experiments that reveal surprising generalizations, all explicable through the exact posterior. Finally, we show the inherent constraints of the generalization capabilities of posteriors and the limitations of neural networks in approximating these posteriors.

3324Preference Optimization with Multi-Sample Comparisons

[openreview] [pdf]

Abstract Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference methods (DAP) primarily utilize single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed through multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective in optimizing collective characteristics~(e.g., diversity and bias) for generative models than single-sample comparison. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for dataset with label noise.

3325Decentralized Federated Learning Over Noisy Labels: A Majority Voting Method

[openreview] [pdf]

Abstract Contrary to centralized federated learning (CFL), decentralized federated learning (DFL) allows clients to cooperate in training their local models without relying on a central parameter server. As different clients have varying annotation skills and preferences, noisy labels are inevitable in decentralized data ownership. In centralized learning (CL) and CFL settings, learning from noisy labels has been extensively explored; however, such methods cannot be directly applied in DFL settings due to limited computational resources or privacy requirements. This paper introduces DFLMV \textit{(majority voting based decentralized federated learning)}, a general DFL framework for learning from noisy data without relying on any assumptions about local client noise models while maintaining data privacy for all clients. Specifically, (1) Clients first use traditional DFL to train their local models until they become stable. (2) Clients use each of their neighbors’ models to make a prediction of every data point in their training datasets, then correct the labels based on majority voting. (3) Clients further fine-tune their models based on their updated training dataset. A theoretical analysis of DFLMV is also provided. Extensive experiments conducted on MNIST, Fashion-MNIST, CIFA-10, CIFAR-10N, CIFAR-100N, Clothing1M, and ANIMAL-10N validate the effectiveness of our proposed approach at various noise levels and different data settings in mitigating the adverse effects of noisy labels.

3326Adaptive Transformer Programs: Bridging the Gap Between Performance and Interpretability in Transformers

[openreview] [pdf]

Abstract Balancing high performance with interpretability in increasingly powerful Transformer-based models remains a challenge. While mechanistic interpretability aims to specify neural network computations in explicit, pseudocode-like formats, existing methods often involve laborious manual analysis or struggle to fully elucidate learned internal algorithms. Recent efforts to build intrinsically interpretable models have introduced considerable expressivity and optimization challenges. This work introduces Adaptive Transformer Programs, an enhanced framework building upon RASP language and Transformer Programs to create more robust and interpretable models. The proposed method increases expressivity by redesigning two primary attention modules to improve categorical and numerical reasoning capabilities. To overcome optimization hurdles, we introduce a novel reparameterization scheme that enhances the exploration-exploitation trade-off during training. We validate our approach through extensive experiments on diverse tasks, including in-context learning, algorithmic problems (e.g., sorting and Dyck languages), and NLP benchmarks such as named entity recognition and text classification. Results demonstrate that Adaptive Transformer Programs substantially narrow the performance gap between black-box Transformers and interpretable models, enhancing transparency. This work advances the development of high-performing, transparent AI systems for critical applications, addressing crucial ethical concerns in AI development.

3327SwitchLoss: A Novel Optimization Scheme for Imbalanced Regression

[openreview] [pdf]

Abstract In the realm of machine learning, conventional techniques like neural networks often encounter challenges when dealing with imbalanced data. Unfortunately, imbalanced data is a common occurrence in real-world datasets, where collection methods may fail to capture sufficient data within specific target variable ranges. Additionally, certain tasks inherently involve imbalanced data, where the occurrences of normal events significantly outweigh those of edge cases. While the problem of imbalanced data has been extensively studied in the context of classification, only a limited number of methods have been proposed for regression tasks. Furthermore, the existing methods often yield suboptimal performance when applied to high-dimensional data, and the domain of imbalanced high-dimensional regression remains relatively unexplored. In response to the identified challenge, this paper presents SwitchLoss, a novel optimization scheme for neural networks, and SwitchLossR, a variant with a restricted search space. Diverging from conventional approaches, SwitchLoss and SwitchLossR integrate variable loss functions into the traditional training process. Our assessment of these methods spans 15 regression datasets across diverse imbalanced domains, 5 synthetic high-dimensional imbalanced datasets, and two imbalanced age estimation image datasets. Findings from our investigation demonstrate that the combined utilization of SwitchLoss and SwitchLossR not only leads to a notable reduction in validation error, but also surpasses prevailing state-of-the-art techniques dedicated to imbalanced regression.

3328Contemporary Continuous Aggregation: A Robust Categorical Encoding for Zero-Shot Transfer Learning on Tabular Data

[openreview] [pdf]

Abstract Tabular data, as the most fundamental structure of many real-world applications, has been a spotlight of machine learning since the last decade. Regardless of the adopted approaches, e.g., decision trees or neural networks, Categorical Encoding is an essential operation for processing raw data into a numeric format so that machine learning algorithms can accept it. One fatal limitation of popular categorical encodings is that they cannot extrapolate to unseen categories for machine learning models without re-training. However, it is common to observe new categories in industry, while re-training is not always possible, e.g., during the cold-start stage with no target examples. In this work, we propose Contemporary Continuous Aggregation (CCA), a novel and theoretically sound categorical encoding which can automatically extrapolate to unseen categories without any training. CCA only relies on statistics from raw input that can be maintained at low time and memory costs, thus it is scalable to heavy workloads in real-time. Besides, we also empirically showcase that CCA outperforms existing encodings on unsupervised unseen category extrapolation, and achieves similar or even better performance in normal situations without extrapolation, promising CCA to be a powerful toolkit for tabular learning.

3329ROUTE: Robust Multitask Tuning and Collaboration for Text-to-SQL

[openreview] [pdf]

Abstract Despite the significant advancements in Text-to-SQL (Text2SQL) facilitated by large language models (LLMs), the latest state-of-the-art techniques are still trapped in the in-context learning of closed-source LLMs (e.g., GPT-4), which limits their applicability in open scenarios. To address this challenge, we propose a novel RObust mUltitask Tuning and collaboration mEthod (ROUTE) to improve the comprehensive capabilities of open-source LLMs for Text2SQL, thereby providing a more practical solution. Our approach begins with multi-task supervised fine-tuning (SFT) using various synthetic training data related to SQL generation. Unlike existing SFT-based Text2SQL methods, we introduced several additional SFT tasks, including schema linking, noise correction, and continuation writing. Engaging in a variety of SQL generation tasks enhances the model’s understanding of SQL syntax and improves its ability to generate high-quality SQL queries. Additionally, inspired by the collaborative modes of LLM agents, we introduce a Multitask Collaboration Prompting (MCP) strategy. This strategy leverages collaboration across several SQL-related tasks to reduce hallucinations during SQL generation, thereby maximizing the potential of enhancing Text2SQL performance through explicit multitask capabilities. Extensive experiments and in-depth analyses have been performed on eight open-source LLMs and five widely-used benchmarks. The results demonstrate that our proposal outperforms the latest Text2SQL methods and yields leading performance.

3330On the Vulnerability of Applying Retrieval-Augmented Generation within Knowledge-Intensive Application Domains

[openreview] [pdf]

Abstract Retrieval-Augmented Generation (RAG) has been empirically shown to enhance the performance of large language models (LLMs) in knowledge-intensive domains such as healthcare, finance, and legal contexts. Given a query, RAG retrieves relevant documents from a corpus and integrates them into the LLMs’ generation process. In this study, we investigate the adversarial robustness of RAG, focusing specifically on examining the retrieval system. First, across 225 different setup combinations of corpus, retriever, query, and targeted information, we show that retrieval systems are vulnerable to universal poisoning attacks in medical Q&A. In such attacks, adversaries generate poisoned documents containing a broad spectrum of targeted information, such as personally identifiable information. When these poisoned documents are inserted into a corpus, they can be accurately retrieved by any users, as long as attacker-specified queries are used. To understand this vulnerability, we discovered that the deviation from the query’s embedding to that of the poisoned document tends to follow a pattern in which the high similarity between the poisoned document and the query is retained, thereby enabling precise retrieval. Based on these findings, we develop a new detection-based defense to ensure the safe use of RAG. Through extensive experiments spanning various Q&A domains, we observed that our proposed method consistently achieves excellent detection rates in nearly all cases.

3331Revisiting Covariate and Hypothesis Roles in ITE Estimation: A New Approach Using Laplacian Regularization

[openreview] [pdf]

Abstract The recent surge in data availability across many fields, such as medicine, social science, and marketing, has brought to the forefront the problem of estimating Individual Treatment Effect (ITE) from observational data to effectively tailor treatment to personalized characteristics. ITE estimation is known to be a challenging task because we can only observe the outcome with or without treatment, but never both. Moreover, observational datasets exhibit selection bias induced by the treatment assignment policy. In this paper, we present a new approach consisting of two novel aspects. First, we depart from conventional approaches that minimize the covariate shift. Instead, we incorporate it as a crucial element in ITE estimation, recognizing that it stems from highly predictive features that exhibit significant imbalance in observational data. Second, unlike existing methods, our approach utilizes hypothesis functions to directly estimate outcomes under covariate shift, enhancing reliability across observed and unobserved outcomes. To support this approach theoretically, we derive a new upper bound of the expected ITE loss and show that it explicitly depends on the discrepancy between the hypothesis functions, which are absent from the objectives of existing methods. Based on this new approach, we present LITE: Laplacian Individual Treatment Effect, a novel method that leverages Laplacian-regularized representation and incorporates both the covariate shift and the hypothesis functions for ITE estimation, effectively bridging observed and unobserved outcomes. We demonstrate LITE on illustrative simulations and two leading benchmarks, where we show superior results compared to state-of-the-art methods.

3332EqNIO: Subequivariant Neural Inertial Odometry

[openreview] [pdf]

Abstract Neural network-based odometry using accelerometer and gyroscope readings from a single IMU can achieve robust, and low-drift localization capabilities, through the use ofneural displacement priors. These priors learn to produce denoised displacement measurements but need to ignore data variations due to specific IMU mount orientation and motion directions, hindering generalization. This work introduces EqNIO, which addresses this challenge withcanonical displacement priors. We train an off-the-shelf architecture with IMU measurements that are mapped into a canonical gravity-aligned frame with learnable yaw. The outputs (displacement and covariance) are mapped back to the original frame. To maximize generalization, we find that these learnable yaw frames must transform equivariantly with global trajectory rotations and reflections across the gravity direction,i.e.action by the roto-reflection group Og(3)O_g(3) which preserves gravity (a subgroup of O(3)O(3)). This renders the displacement prior O(3)O(3)subequivariant. We tailor specific linear, convolutional and non-linear layers that commute with the actions of the group. Moreover, we introduce a bijective decomposition of angular rates into vectors that transform similarly to accelerations, allowing us to leverage both measurements types. Natively, angular rates would need to be inverted upon reflection, unlike acceleration, which hinders their joint processing. We highlight EqNIO’s flexibility and generalization capabilities by applying it to both filter-based (TLIO), and end-to-end (RONIN) architectures, and outperforming existing methods that usesoftequivariance from auxiliary losses or data augmentation on the TLIO, Aria, RONIN, RIDI and OxIOD datasets. We believe this work paves the way to low-drift, and generalizable neural inertial odometry on edge-devices.

3333ExpanDyNeRF: Expanding the Viewpoint of Dynamic Scenes beyond Constrained Camera Motions

[openreview] [pdf]

Abstract In the domain of dynamic Neural Radiance Fields (NeRF) for novel view synthesis, current state-of-the-art (SOTA) techniques struggle when the camera’s pose deviates significantly from the primary viewpoint, resulting in unstable and unrealistic outcomes. This paper introduces Expanded Dynamic NeRF (ExpanDyNeRF), a monocular NeRF method that integrates a Gaussian splatting prior to tackle novel view synthesis with large-angle rotations. ExpanDyNeRF employs a pseudo ground truth technique to optimize density and color features, which enables the generation of realistic scene reconstructions from challenging viewpoints. Additionally, we present the Synthetic Dynamic Multiview (SynDM) dataset, the first GTA V-based dynamic multiview dataset designed specifically for evaluating robust dynamic reconstruction from significantly shifted views. We evaluate our method quantitatively and qualitatively on both the SynDM dataset and the widely recognized NVIDIA dataset, comparing it against other SOTA methods for dynamic scene reconstruction. Our evaluation results demonstrate that our method achieves superior performance.

3334High-Dynamic Radar Sequence Prediction for Weather Nowcasting Using Spatiotemporal Coherent Gaussian Representation

[openreview] [pdf]

Abstract Weather nowcasting is an essential task that involves predicting future radar echo sequences based on current observations, offering significant benefits for disaster management, transportation, and urban planning. Current prediction methods are limited by training and storage efficiency, mainly focusing on 2D spatial predictions at specific altitudes. Meanwhile, 3D volumetric predictions at each timestamp remain largely unexplored. To address such a challenge, we introduce a comprehensive framework for 3D radar sequence prediction in weather nowcasting, using the newly proposed SpatioTemporal Coherent Gaussian Splatting (STC-GS) for dynamic radar representation and GauMamba for efficient and accurate forecasting. Specifically, rather than relying on a 4D Gaussian for dynamic scene reconstruction, STC-GS optimizes 3D scenes at each frame by employing a group of Gaussians while effectively capturing their movements across consecutive frames. It ensures consistent tracking of each Gaussian over time, making it particularly effective for prediction tasks. With the temporally correlated Gaussian groups established, we utilize them to train GauMamba, which integrates a memory mechanism into the Mamba framework. This allows the model to learn the temporal evolution of Gaussian groups while efficiently handling a large volume of Gaussian tokens. As a result, it achieves both efficiency and accuracy in forecasting a wide range of dynamic meteorological radar signals. The experimental results demonstrate that our STC-GS can efficiently represent 3D radar sequences with over 16×16\times higher spatial resolution compared with the existing 3D representation methods, while GauMamba outperforms state-of-the-art methods in forecasting a broad spectrum of high-dynamic weather conditions.

3335Phase Transitions in the Output Distribution of Large Language Models

[openreview] [pdf]

Abstract In a physical system, changing parameters such as temperature can induce a phase transition: an abrupt change from one state of matter to another. Analogous phenomena have recently been observed in large language models. Typically, the task of identifying phase transitions requires human analysis and some prior understanding of the system to narrow down which low-dimensional properties to monitor and analyze. Statistical methods for the automated detection of phase transitions from data have recently been proposed within the physics community. These methods are largely system agnostic and, as shown here, can be adapted to study the behavior of large language models. In particular, we quantify distributional changes in the generated output via statistical distances, which can be efficiently estimated with access to the probability distribution over next-tokens. This versatile approach is capable of discovering new phases of behavior and unexplored transitions -- an ability that is particularly exciting in light of the rapid development of language models and their emergent capabilities.

3336Model Risk-sensitive Offline Reinforcement Learning

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) is becoming critical in risk-sensitive areas such as finance and autonomous driving, where incorrect decisions can lead to substantial financial loss or compromised safety. However, traditional risk-sensitive offline RL methods often struggle with accurately assessing risk, with minor errors in the estimated return potentially causing significant inaccuracies of risk estimation. These challenges are intensified by distribution shifts inherent in offline RL. To mitigate these issues, we propose a model risk-sensitive offline RL framework designed to minimize the worst-case of risks across a set of plausible alternative scenarios rather than solely focusing on minimizing estimated risk. We present a critic-ensemble criterion method that identifies the plausible alternative scenarios without introducing additional hyperparameters. We also incorporate the learned Fourier feature framework and the IQN framework to address spectral bias in neural networks, which can otherwise lead to severe errors in calculating model risk. Our experiments in finance and self-driving scenarios demonstrate that the proposed framework significantly reduces risk, by 11.211.2% to 18.518.5%, compared to the most outperforming risk-sensitive offline RL baseline, particularly in highly uncertain environments.

3337Interactive Speculative Planning: Enhance Agent Efficiency through Co-design of System and User Interface

[openreview] [pdf]

Abstract Agents, as user-centric tools, are increasingly deployed for human task delegation, assisting with a broad spectrum of requests by generating thoughts, engaging with user proxies, and producing action plans. However, agents based on large language models often face substantial planning latency due to two primary factors: the efficiency limitations of the underlying LLMs due to their large size and high demand, and the structural complexity of the agents due to the extensive generation of intermediate steps to produce the final output. Given that inefficiency in service provision can undermine the value of automation for users, this paper presents a human-centered efficient agent planning method – Interactive Speculative Planning – aiming at enhancing the efficiency of agent planning through both system design and user interaction. Our approach advocates for the co-design of the agent system and user interface, underscoring the importance of an agent system that can fluidly manage user interactions and interruptions. By integrating human interruptions as a fundamental component of the system, we not only make it more user-centric but also expedite the entire process by leveraging human-in-the-loop interactions to provide accurate intermediate steps.

3338From Probability to Counterfactuals: the Increasing Complexity of Satisfiability in Pearl’s Causal Hierarchy

[openreview] [pdf]

Abstract The framework of Pearl’s Causal Hierarchy (PCH) formalizes three types of reasoning: probabilistic (i.e. purely observational), interventional, and counterfactual, that reflect the progressive sophistication of human thought regarding causation. We investigate the computational complexity aspects of reasoning in this framework focusing mainly on satisfiability problems expressed in probabilistic and causal languages across the PCH. That is, given a system of formulas in the standard probabilistic and causal languages, does there exist a model satisfying the formulas?Our main contribution is to prove the exact computational complexities showing that languages allowing addition and marginalization (via the summation operator) yield NP^{PP}-, PSPACE-, and NEXP-complete satisfiability problems, depending on the level of the PCH. These are the first results to demonstrate a strictly increasing complexity across the PCH: from probabilistic to causal and counterfactual reasoning. On the other hand, in the case of full languages, i.e.~allowing addition, marginalization, and multiplication, we show that the satisfiability for the counterfactual level remains the same as for the probabilistic and causal levels, solving an open problem in the field.

3339Review and Rebuttal: Zero-shot In-context Adversarial Learning for Improving Research Ideation

[openreview] [pdf]

Abstract Recent studies highlight that the advancements in Large Language Models (LLMs) have opened up exciting possibilities for scientific discovery, where LLMs can assist researchers in generating novel hypotheses and ideas. In this work, we draw inspiration from Generative Adversarial Networks (GANs) and make the first effort to formalize the concept of zero-shot in-context adversarial learning and implement it through multi-LLM-agent interactions to improve the research ideation process. Our approach takes the best of two worlds: (1) by making in-context learning adversarial, the utilization of an LLM’s vast parametric knowledge can be optimized; and (2) by keeping adversarial learning in context, we eliminate the need for bi-level optimization through additional model training. To evaluate the quality of the open-ended generation produced by LLMs, we develop a relative quality ranking metric, designed to serve as a proxy for human evaluation when human assessments are impractical or costly. Our findings demonstrate that zero-shot in-context adversarial learning significantly enhances idea generation across two dimensions. Specifically, with GPT-4o, the novelty of generated ideas improved by 21%, and feasibility of the ideas saw an impressive increase of 322%. These results underscore the transformative potential of zero-shot in-context adversarial learning in driving innovation and creativity within the research process.

3340Grokking at the Edge of Linear Separability

[openreview] [pdf]

Abstract We study the generalization properties of binary logistic classification in a simplified setting, for which a “memorizing” and “generalizing” solution can always be strictly defined, and elucidate empirically and analytically the mechanism underlying Grokking in its dynamics. Concretely, we show that binary logistic classification on a random feature model with a constant label exhibits Grokking, in the sense of delayed generalization and non-monotonic test loss. We find that Grokking is amplified when classification is applied to training sets on the verge of being linearly separable from the origin. Even though a perfect generalizing solution always exists, we prove the implicit bias of the logistic loss will cause the model to overfit if the training data is linearly separable from the origin. For training sets that are not separable from the origin, the model will always generalize perfectly asymptotically, but overfitting may occur at early stages of training. Importantly, in the vicinity of the transition, that is, for training sets that are almost separable from the origin, the model may overfit for arbitrarily long times before generalizing. We gain more insights by examining a tractable one-dimensional toy model that quantitatively captures the key features of the full model. Finally, we highlight intriguing common properties of our findings with recent literature, suggesting that grokking generally occurs in proximity to the interpolation threshold, reminiscent of critical phenomena often observed in physical systems.

3341Vanishing Privacy: Fast Gradient Leakage Threat to Federated Learning

[openreview] [pdf]

Abstract In the federated learning (FL) framework, clients participate in collaborative learning tasks under the coordination of a central server. Clients train local submodels using their own data and share gradients with the server, which aggregates the gradients to achieve privacy protection. However, recent research has revealed that gradient inversion attacks (GIAs) can leak private data from the shared gradients. Prior work has only demonstrated the feasibility of recovering input data from gradients under highly restrictive conditions, such as when dealing with high-resolution face datasets, where GIAs often struggle to initiate attacks effectively, and on object datasets like Imagenet, where they encounter limitations, primarily manifested in their ability to handle only small batch sizes and high time costs. As a result, we believe that implementing GIAs on high-resolution face datasets with large batch sizes is a challenging task. In this work, we introduce \textbf{F}ast \textbf{G}radient \textbf{L}eakage (FGL), which enables rapid image recovery across various network models on complex datasets, including the CelebA face dataset (1000 classes, 224×\times 224 px). We also introduced StyleGAN as prior knowledge for images and achieved FGL with a batch size of 60 in experiments (constrained by experimental hardware). We further propose a joint gradient matching loss, where multiple distinct matching losses collectively contribute to clarifying the attack direction and enhancing the efficiency of the optimization process. Extensive experimentation validates the feasibility of our approach. We anticipate that our proposed method can serve as a valuable tool to advance the development of privacy defense techniques.

3342Imbalance-Regularized LoRA: A Plug-and-Play Method for Improving Fine-Tuning of Foundation Models

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) is an effective fine-tuning algorithm for large models, enabling efficient adaptation with fewer trainable parameters. Despite its success, there remains significant potential for improving LoRA’s performance. In this paper, we introduce iLoRA (Imbalance-Regularized LoRA), which enhances LoRA by incorporating a regularization term to address the imbalance in forward propagation. This regularization maintains an imbalance between matrices A\mathbf{A} and B\mathbf{B}, ensuring stable activation variance independent of dimension. Specifically, we first analyze forward dynamics, observe this imbalance in stable training, and introduce imbalanced regularization. Further, by combining this with preconditioning techniques (Zhang and Pilanci, 2024), we propose π\piLoRA (Preconditioned iLoRA), which improves the backpropagation process. Our method is a plug-and-play algorithm that requires only minor modifications to the existing code and incurs negligible additional computational overhead. Finally, experiments on large language models and text-to-image models demonstrate that iLoRA and π\piLoRA significantly outperform existing LoRA and preconditioned LoRA methods.

3343Stochastic Sampling from Deterministic Flow Models

[openreview] [pdf]

Abstract Deterministic flow models such as rectified flows offer a general framework for learning a deterministic transport map between two distributions, realized as the vector field for an ordinary differential equation (ODE). However, they are sensitive to estimation and discretization errors and do not permit different samples conditioned on an intermediate state. We present a general method to turn the underlying ODE of such flow models into a family of stochastic differential equations (SDEs) that have the same marginal distributions. This method permits us to derive families ofstochastic samplers, for fixed (e.g., previously trained)deterministicflow models, that continuously span the spectrum of deterministic and stochastic sampling, given access to the flow field and the score function. Our method provides additional degrees of freedom that help alleviate some of the issues with the deterministic samplers and empirically outperforms them. We demonstrate this empirically on a toy Gaussian setup, as well as on the large scale ImageNet generation task. Further, our family of stochastic samplers provide an additional knob for controlling the diversity of generation, which we qualitatively demonstrate in our experiments.

3344MagicDec: Breaking the Latency-Throughput Tradeoff for Long Context Generation with Speculative Decoding

[openreview] [pdf]

Abstract Large Language Models (LLMs) have become more prevalent in long-context applications such as interactive chatbots, document analysis, and agent workflows, but it is challenging to serve long-context requests with low latency and high throughput. Speculative decoding (SD) is a widely used technique to reduce latency losslessly, but the conventional wisdom suggests that its efficacy is limited to small batch sizes. In MagicDec, we show that surprisingly SD can achieve speedup even for a high throughput inference regime for moderate to long sequences. More interestingly, an intelligent drafting strategy can achieve better speedup with increasing batch size based on our rigorous analysis. MagicDec first identifies the bottleneck shifts with increasing batch size and sequence length, and uses these insights to deploy SD more effectively for high throughput inference. We leverage draft model with sparse KV cache to address the KV bottleneck, which scales with both sequence length and batch size. Additionally, we propose a theoretical model to select the optimal drafting strategy for maximum speedup. Our work highlights the broad applicability of speculative decoding in long-context serving, as it can enhance throughput and reduce latency without compromising accuracy. For moderate to long sequences, we demonstrate up to 2.51x speedup for LLaMA-3.1-8B when serving batch sizes ranging from 32 to 256 on various types of hardware and tasks.

3345LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

[openreview] [pdf]

Abstract We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs. We introduce two architectures: (1) an encoder-decoder LVSM, which encodes input image tokens into a fixed number of 1D latent tokens, functioning as a fully learned scene representation, and decodes novel-view images from them; and (2) a decoder-only LVSM, which directly maps input images to novel-view outputs, completely eliminating intermediate scene representations. Both models bypass the 3D inductive biases used in previous methods---from 3D representations (e.g., NeRF, 3DGS) to network designs (e.g., epipolar projections, plane sweeps)---addressing novel view synthesis with a fully data-driven approach. While the encoder-decoder model offers faster inference due to its independent latent representation, the decoder-only LVSM achieves superior quality, scalability, and zero-shot generalization, outperforming previous state-of-the-art methods by 1.5 to 3.5 dB PSNR. Comprehensive evaluations across multiple datasets demonstrate that both LVSM variants achieve state-of-the-art novel view synthesis quality, delivering superior performance even with reduced computational resources (1-2 GPUs). Please see our anonymous website for more details:https://lvsm-web.github.io/

3346DiT-LSTM-SVAR Model For Portfolios

[openreview] [pdf]

Abstract This paper proposes a novel combined model named DiT-LSTM-SVAR, which successfully integrates time series and the Efficient Markets Hypothesis. This is the first to combine the microstructure of financial markets with deep learning networks to improve the performance of portfolios. We employ the DiT model to predict the upside and downside movements and an information decomposition model based on the SVAR model to identify random walk stocks. The DiT module significantly improves the Matthews correlation coefficient by almost 3%. The annual return of the portfolio is improved by almost 20%. The SVAR module greatly improves the Matthews correlation coefficient by almost 4%. Portfolios constructed using the DiT-LSTM-SVAR module based on market and public information outperformed those created with the DiT-LSTM model. The annual cumulative return of the portfolio is 266.60% and a Sharpe ratio of 1.8.

3347Sampling-Enhanced Large Neighborhood Search for Solving Integer Linear Programs

[openreview] [pdf]

Abstract Large Neighborhood Search (LNS) is a common heuristic in combinatorial optimization that iteratively searches over a large neighborhood of the current solution for a better one. Recently, neural network-based LNS solvers have achieved great success in solving Integer Linear Program (ILP) problems with a learnable policy for neighborhood selection, followed by an off-the-shelf ILP solver for re-optimization. Nonetheless, existing neural LNS solvers often get stuck in the same solution due to their greedy update strategy, i.e., only moving to the best solution found within the neighborhood. In this work, we try to theoretically identify the limitation of neural models in escaping the “local optima”. Accordingly, we propose a novel sampling-enhanced neural LNS solver, namely SPL-LNS, by reformulating LNS as a stochastic process, which uses a locally-informed proposal to sample the next assignment and simulated annealing to alleviate the ``local optima’’ issue. We also develop a novel hindsight relabeling method to efficiently train SPL-LNS on self-generated data. Experimental results reveal that our method substantially surpasses prior neural LNS solvers on multiple ILP problems.

3348Sufficient Context: A New Lens on Retrieval Augmented Generation Systems

[openreview] [pdf]

Abstract Augmenting LLMs with context leads to improved performance across many applications. Despite much research on Retrieval Augmented Generation (RAG) systems, an open question is whether errors arise because LLMs fail to utilize the context from retrieval or the context itself is insufficient to answer the query. To shed light on this, we develop a new notion of sufficient context, along with a way to classify instances that have enough information to answer the query. We then use sufficient context to analyze several models and datasets. By stratifying errors based on context sufficiency, we find that proprietary LLMs (Gemini, GPT, Claude) excel at answering queries when the context is sufficient, but often output incorrect answers instead of abstaining when the context is not. On the other hand, open-source LLMs (Llama, Mistral, Gemma) hallucinate or abstain often, even with sufficient context. We further categorize cases when the context is useful, and improves accuracy, even though it does not fully answer the query and the model errs without the context. Building on our findings, we explore ways to reduce hallucinations in RAG systems, including a new selective generation method that leverages sufficient context information for guided abstention. Our method improves the fraction of correct answers among times where the model responds by 2--10% for Gemini, GPT, and Gemma.

3349S4D: Streaming 4D Real-World Reconstruction with Gaussians and 3D Control Points

[openreview] [pdf]

Abstract Dynamic scene reconstruction using Gaussians has recently attracted increased interest. Mainstream approaches typically employ a global deformation field to warp a 3D scene in canonical space. However, the inherent low-frequency nature of implicit neural fields often leads to ineffective representations of complex motions. Moreover, their structural rigidity can hinder adaptation to scenes with varying resolutions and durations. To address these challenges, we introduce a novel approach for streaming 4D real-world reconstruction utilizing discrete 3D control points. This method physically models local rays and establishes a motion-decoupling coordinate system. By effectively merging traditional graphics with learnable pipelines, it provides a robust and efficient local 6-degrees-of-freedom (6-DoF) motion representation. Additionally, we have developed a generalized framework that integrates our control points with Gaussians. Starting from an initial 3D reconstruction, our workflow decomposes the streaming 4D reconstruction into four independent submodules: 3D segmentation, 3D control point generation, object-wise motion manipulation, and residual compensation. Experimental results demonstrate that our method outperforms existing state-of-the-art 4D Gaussian splatting techniques on both the Neu3DV and CMU-Panoptic datasets. Notably, the optimization of our 3D control points is achievable in 100 iterations and within just 2 seconds per frame on a single NVIDIA 4070 GPU.

3350OscillationInversion: Understand the structure of Large Flow Model through the Lens of Inversion Method

[openreview] [pdf]

Abstract We investigate oscillation phenomena observed in inversion methods applied to large text-to-image diffusion models, particularly the ``Flux’’ model. Using a fixed-point-inspired iteration method to invert real-world images, we find that the solution does not converge but instead oscillates between distinct clusters. Our results, validated both on real diffusion models and toy experiments, show that these oscillated clusters exhibit significant semantic coherence. We propose that this phenomenon arises from oscillatory solutions in dynamic systems, linking it to the structure of rectified flow models. The oscillated clusters serve as local latent distributions that allow for effective semantic-based image optimization.We provide theoretical insights, linking these oscillations to fixed-point dynamics and proving conditions for stable cluster formation and differentiation in flow models.

3351A Causal Lens for Evaluating Faithfulness Metrics

[openreview] [pdf]

Abstract The increasing capabilities of Large Language Models (LLMs) have made natural language explanations a promising alternative to traditional feature attribution methods for model interpretability. However, while these explanations may seem plausible, they can fail to reflect the model’s underlying reasoning faithfully. The idea of faithfulness is critical for assessing the alignment between the explanation and the model’s true decision-making mechanisms. Although several faithfulness metrics have been proposed, they lack a unified evaluation framework. To address this limitation, we introduce Causal Diagnosticity, a new evaluation framework for comparing faithfulness metrics in natural language explanations. Our framework extends the idea of diagnosticity to the faithfulness metrics for natural language explanations by using model editing to generate faithful and unfaithful explanation pairs. We introduce a benchmark consisting of three tasks: fact-checking, analogy, and object counting, and evaluate a diverse set of faithfulness metrics, including post-hoc explanation-based and chain-of-thought (CoT)-based methods. Our results show that while CC-SHAP significantly outperforms other metrics, there is substantial room for improvement. This work lays the foundation for future research in developing more faithful natural language explanations, highlighting the need for improved metrics and more reliable interpretability methods in LLMs.

3352Revisiting Positional Information in Transformers in the era of Fused Attention

[openreview] [pdf]

Abstract Imparting positional information has been a crucial component in Transformers due to attention’s invariance to permutation. Methods that bias attention weights, like Relative Positional Bias (RPB), have been preferred choice in more recent transformer-based architectures for vision. In parallel, fused attention has become the standard implementation for attention, largely thanks to open source solutions such as Flash Attention and FMHA. However, it is not trivial to fuse explicit biasing or masking of attention weights into a fused attention kernel without affecting its performance. In this scenario, position embeddings present themselves as a viable replacement for attention weight biases. Position embeddings are applied to the tokens directly, decoupled from the attention mechanism, thereby sidestepping the problems that arise with attention weight biases in fused kernels. In this work, inspired by the booming LLM landscape, we analyze the applicability of Rotary Position Embeddings (RoPE) as a replacement for RPBs in vision models. Unlike RPB which explicitly biases attention weights, RoPE biases the dot product inputs (query and key) directly and ahead of the attention operation. We empirically show the prowess of RoPE over RPBs in terms of accuracy and speed. We study multiple implementations of RoPE and show that it is sufficient to use only a fraction of hidden dimensions for RoPE to achieve competitive performance. We also develop a fast implementation for Axial RoPE. Together with the most performant fused attention implementations, and our fast RoPE implementation, we observe inference speedups compared to RPB with improved or similar accuracy. We foresee RoPE as a replacement for RPBs, paving the way for the widespread adoption of fused attention in transformer-based vision models.

3353OGBench: Benchmarking Offline Goal-Conditioned RL

[openreview] [pdf]

Abstract Offline goal-conditioned reinforcement learning (GCRL) is a major problem in reinforcement learning (RL) because it provides a simple, unsupervised, and domain-agnostic way to acquire diverse behaviors and representations from unlabeled data without rewards. Despite the importance of this setting, we lack a standard benchmark that can systematically evaluate the capabilities of offline GCRL algorithms. In this work, we propose OGBench, a new, high-quality benchmark for algorithms research in offline goal-conditioned RL. OGBench consists of 7 types of environments, 59 datasets, and reference implementations of 6 representative offline GCRL algorithms. We have designed these challenging and realistic environments and datasets to directly probe different capabilities of algorithms, such as stitching, long-horizon reasoning, and the ability to handle high-dimensional inputs and stochasticity. While representative algorithms may rank similarly on prior benchmarks, our experiments reveal stark strengths and weaknesses in these different capabilities, providing a strong foundation for building new algorithms. Videos:https://ogbenchauthors.github.io/ogbench-anon/

3354DON’T STOP ME NOW: EMBEDDING BASED SCHEDULING FOR LLMS

[openreview] [pdf]

Abstract Efficient scheduling is crucial for interactive Large Language Model (LLM) applications, where low request completion time directly impacts user engagement. Size-based scheduling algorithms like Shortest Remaining Process Time (SRPT) aim to reduce average request completion time by leveraging known or estimated request sizes and allowing preemption by incoming jobs with shorter service times. However, two main challenges arise when applying size-based scheduling to LLM systems. First, accurately predicting output lengths from prompts is challenging and often resource-intensive, making it impractical for many systems. As a result, the state-of-the-art LLM systems default to first-come, first-served scheduling, which can lead to head-of-line blocking and reduced system efficiency. Second, preemption introduces extra memory overhead to LLM systems as they must maintain intermediate states for unfinished (preempted) requests. In this paper, we propose TRAIL, a method to obtain output predictions from the target LLM itself. After generating each output token, we recycle the embedding of its internal structure as input for a lightweight classifier that predicts the remaining length for each running request. Using these predictions, we propose a prediction-based SRPT variant with limited preemption designed to account for memory overhead in LLM systems. This variant allows preemption early in request execution when memory consumption is low but restricts preemption as requests approach completion to optimize resource utilization. On the theoretical side, we derive a closed-form formula for this SRPT variant in an M/G/1 queue model, which demonstrates its potential value. In our system, we implement this preemption policy alongside our embedding-based prediction method. Our refined predictions from layer embeddings achieve 2.66x lower mean absolute error compared to BERT predictions from sequence prompts. TRAIL achieves 1.66x to 2.01x lower mean latency on the Alpaca dataset and 1.76x to 24.07x lower mean time to the first token compared to the state-of-the-art serving system.

3355Equivariant Denoisers Cannot Copy Graphs: Align Your Graph Diffusion Models

[openreview] [pdf]

Abstract Graph diffusion models, while dominant in graph generative modeling, remain relatively underexplored for graph-to-graph translation tasks like chemical reaction prediction. We show that standard permutation equivariant denoisers cause severe limitations such tasks, a problem that we pinpoint to their inability for breaking symmetries present in the noisy inputs. We then propose to \emph{align} the input and target graphs in order to break the input symmetries, while retaining permutation equivariance in the non-matching portions of the graph. We choose retrosynthesis as an application domain, and show how alignment takes the performance of a discrete diffusion model from a mere 5% to a SOTA-matching 54.7% top-1 accuracy.

3356Improving AI via Novel Computational Models and Programming Challenges

[openreview] [pdf]

Abstract AI, like humans, should be able to adapt and apply learned knowledge across diverse domains, such as computational models, mathematical/formal systems, and programming languages to solve problems. Current AI training often relies on existing systems, which limits its ability to generate original solutions or generalize across unfamiliar contexts. To address this, we propose a new computational model along with a revised programming language tailored to this model. By challenging AI to write, analyze, or verify programs within these new frameworks, and by utilizing a virtual machine for evaluation, we aim to test and enhance the AI’s adaptability and problem-solving capabilities in a verifiable manner.

3357Continual Memorization of Factoids in Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) can absorb a massive amount of knowledge through pretraining, but pretraining is inefficient for acquiring long-tailed or specialized facts. Therefore, fine-tuning on specialized or new knowledge that reflects changes in the world has become popular, though it risks disrupting the model’s original capabilities. We study this fragility in the context of continual memorization, where the model is trained on a small set of long-tail factoids (subject-relation-object associations) and must retain these factoids after multiple stages of subsequent training on other datasets. Continual memorization focuses on the specific challenge of retaining long-tail factoids, whereas general continual learning aims to maintain the LLM’s capabilities across a wide range of generic tasks (e.g., reasoning, commonsense knowledge). Through extensive experiments, we show that LLMs suffer from forgetting across a wide range of subsequent tasks, and simple replay techniques do not fully prevent forgetting, especially when the factoid datasets are trained in the later stages. We posit that there are two ways to alleviate forgetting: 1) protect the memorization process as the model learns the factoids, or 2) reduce interference from training in later stages. With this insight, we develop an effective mitigation strategy: REMIX (Random and Generic Data Mixing). REMIX prevents forgetting by mixing generic data sampled from pretraining corpora or even randomly generated word sequences during each stage, despite being unrelated to the memorized factoids in the first stage. REMIX can recover performance from severe forgetting, often outperforming replay-based methods that have access to the factoids from the first stage. We then analyze how REMIX alters the learning process and find that successful forgetting prevention is associated with a pattern: the model stores factoids in earlier layers than usual and diversifies the set of layers that store these factoids. The efficacy of REMIX invites further investigation into the underlying dynamics of memorization and forgetting, opening exciting possibilities for future research.

3358Integrating Geodesic Interpolation and Flow Matching for Non-Autoregressive Text Generation in Logit Space

[openreview] [pdf]

Abstract Non-autoregressive language models are emerging as effective alternatives to autoregressive models in natural language processing, enabling simultaneous token generation. This study presents a novel flow matching approach using Kullback-Leibler (KL) divergence geodesics to interpolate between initial and target distributions for discrete sequences. We establish a loss function that maximizes the conditional likelihood of discrete tokens, demonstrating that its maximizer corresponds to the flow matching velocity under logit interpolation. While initial tests on the TinyStories dataset yielded unsatisfactory results, we introduce an empirical sampling scheme based on a pretrained denoiser, which significantly improves performance.

3359Δ-DiT: Accelerating Diffusion Transformers without training via Denoising Property Alignment

[openreview] [pdf]

Abstract Diffusion models are now commonly used for producing high-quality and diverse images, but the iterative denoising process is time-intensive, limiting their usage in real-time applications. As a result, various acceleration techniques have been developed, though these primarily target UNet-based architectures and are not directly applicable to Transformer-based diffusion models (DiT). To address the specific challenges of the DiT architecture, we first analyze the relationship between the depth of DiT blocks and the quality of image generation. While skipping blocks can lead to large degradations in generation quality, we propose the Δ\Delta-Cache method, which captures and stores the incremental changes of different blocks, thereby mitigating the performance gap and maintaining closer alignment with the original results. Our analysis indicates that the shallow DiT blocks primarily define the global structure of images such as compositions, and outlines, while the deep blocks refine details. Based on this, we introduce a denoising property alignment method that selectively bypasses computations of different blocks at various timesteps while preserving performance. Comprehensive experiments on PIXART-α\alpha and DiT-XL demonstrate that Δ\Delta-DiT achieves a 1.6×1.6\times speedup in 20-step generation and enhances performance in most cases. In the 4-step consistent model generation scenario, and with a more demanding 1.12×1.12\times acceleration, our approach significantly outperforms existing methods.

3360Metric-Driven Attributions for Vision Transformers

[openreview] [pdf]

Abstract Attribution algorithms explain computer vision models by attributing the model response to pixels within the input. Existing attribution methods generate explanations by combining transformations of internal model representations such as class activation maps, gradients, attention, or relevance scores. The effectiveness of an attribution map is measured using attribution quality metrics. This leads us to pose the following question: if attribution methods are assessed using attribution quality metrics, why are the metrics not used to generate the attributions? In response to this question, we propose a Metric-Driven Attribution for explaining Vision Transformers (ViT) called MDA. Guided by attribution quality metrics, the method creates attribution maps by performing patch order and patch magnitude optimization across all patch tokens. The first step orders the patches in terms of importance and the second step assigns the magnitude to each patch while preserving the patch order. Moreover, MDA can provide a smooth trade-off between sparse and dense attributions by modifying the optimization objective. Experimental evaluation demonstrates the proposed MDA method outperforms 7 existing ViT attribution methods by an average of 2525% across 6 attribution metrics on the ImageNet dataset for the ViT-base 16×1616 \times 16, ViT-tiny 16×1616 \times 16, and ViT-base 32×3232 \times 32 models.

3361Maximum Coverage in Turnstile Streams with Applications to Fingerprinting Measures

[openreview] [pdf]

Abstract In the maximum coverage problem we aim to choose at most kk subsets such that the number of distinct items covered by the subsets is maximized. The input can be formalized by an n×dn \times d matrix AA where there are nn items in the universe and dd input subsets. AijA_{ij} is nonzero if item ii is in subset jj and is 0 otherwise. To our knowledge, we are the first to create a linear sketch to solve maximum coverage which can lead to large runtime improvements and allow for implementation in distributed and streaming environments. We specifically focus on the application to turnstile streams which allows deletions. Here, the updates are of the form (i,j,±1)(i,j,\pm 1) which performs Aij=Aij±1A_{ij} = A_{ij} \pm 1. Previous work mainly considers the more restrictive set-arrival model where each update reveals an entire column of AA or the insertion-only model which does not allow deletions. We design an algorithm with an O~(d/ϵ3)\tilde{O}(d/\epsilon^3) space bound, which is nearly optimal for constant kk. We then turn to fingerprinting for risk measurement where the aim is to monitor which kk columns of an input n×dn \times d dataset poses the highest re-indentification risk. Our maximum coverage sketch directly enables a solution of targeted fingerprinting for risk measurement. Furthermore, we give an independent result of independent interest: a sketch related to the complement of FkF_k for k2k \geq 2. We use this sketch to create a streaming algorithm for general fingerprinting for risk management. Empirical evaluation confirms the practicality of our fingerprinting algorithms and shows a speedup of up to 210x over prior work. We also demonstrate the use of our general fingerprinting algorithm as a dimensionality reduction technique, facilitating enhanced feature selection efficiency.

3362Local Loss Optimization in the Infinite Width: Stable Parameterization of Predictive Coding Networks and Target Propagation

[openreview] [pdf]

Abstract Local learning, which trains a network through layer-wise local targets and losses, has been studied as an alternative to backpropagation (BP) in neural computation. However, its algorithms often become more complex or require additional hyperparameters due to the locality, making it challenging to identify desirable settings where the algorithm progresses in a stable manner. To provide theoretical and quantitative insights, we introduce maximal update parameterization (μ\muP) in the infinite-width limit for two representative designs of local targets: predictive coding (PC) and target propagation (TP). We verify that μ\muP enables hyperparameter transfer across models of different widths. Furthermore, our analysis reveals unique and intriguing properties of μ\muP that are not present in conventional BP. By analyzing deep linear networks, we find that PC’s gradients interpolate between first-order and Gauss-Newton-like gradients, depending on the parameterization.We demonstrate that, in specific standard settings, PC in the infinite-width limit behaves more similarly to the first-order gradient. For TP, even with the standard scaling of the last layer differing from classical μ\muP, its local loss optimization favors the feature learning regime over the kernel regime.

3363Explanation using Simulation

[openreview] [pdf]

Abstract In safety-critical domains, such as industrial systems, the lack of explainability in predictive `black-box’ machine learning models can hinder trust and adoption. Standard explainability techniques, while powerful, often require deep expertise in data analytics and machine learning and fail to align with the sequential, dynamic nature of data in these environments. In this paper, we propose a novel explainability framework that leverages reinforcement learning (RL) to support model predictions with visual explanations based on dynamical system simulation. By training RL agents to simulate events that require prediction, we use these agents’ critics to make classifications. Next, we employ the actors of the RL agents to simulate the potential future trajectories underlying these classifications, providing visual explanations that are more intuitive and align with the expertise of industrial domain experts. We demonstrate the applicability of this method through a case study involving monitoring a small industrial system for cyberattacks, showing how our framework generates actionable predictions that are supported with visual explanations. This approach aims to bridge the gap between advanced machine learning models and their real-world deployment in safety-critical environments.

3364Need a Small Specialized Language Model? Plan Early!

[openreview] [pdf]

Abstract Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized domain. This paper explores how to get good specialized small language models using a large, generic, pretraining set and a limited amount of specialized data. We consider two scenarios, depending on whether (i) one can afford pretraining a model for each specialization task, or (ii) one wants to cheaply adapt a single pretrained model for each task. In the first scenario, we propose an effective solution based on importance sampling: we resample the pretraining set to imitate the specialization data and train a small model on it. In the second scenario, we propose a novel architecture, projected networks (PN). PN is a large network whose parameters can be linearly projected into a small network for specialization. For both scenarios, we demonstrate the empirical effectiveness of our solutions across various domains, training set sizes, and training budgets.

3365M2M: LEARNING CONTROLLABLE MULTI OF EXPERTS AND MULTI-SCALE OPERATORS ARE THE PARTIAL DIFFERENTIAL EQUATIONS NEED

[openreview] [pdf]

Abstract Learning the evolutionary dynamics of Partial Differential Equations (PDEs) is critical in understanding dynamic systems, yet current methods insufficiently learn their representations. This is largely due to the multi-scale nature of the solution, where certain regions exhibit rapid oscillations while others evolve more slowly. This paper introduces a framework of multi-scale and multi-expert (M2^2M) neural operators designed to simulate and learn PDEs efficiently. We employ a divide-and-conquer strategy to train a multi-expert gated network for the dynamic router policy. Our method incorporates a controllable prior gating mechanism that determines the selection rights of experts, enhancing the model’s efficiency. To optimize the learning process, we have implemented a PI (Proportional, Integral) control strategy to adjust the allocation rules precisely. This universal controllable approach allows the model to achieve greater accuracy. We test our approach on benchmark 2D Navier-Stokes equations and provide a custom multi-scale dataset. M2^2M can achieve higher simulation accuracy and offer improved interpretability compared to baseline methods.

3366Entropy Reveals What You Know: An Entropy-Guided Method for Enhancing the Reliability of Large Language Models

[openreview] [pdf]

Abstract While large language models (LLMs) encode vast amounts of knowledge within their parameters for some mainstream entities, factual inconsistencies and untruthfulness in LLMs often lead to unreliable responses and cause significant risks in practical applications. This paper aims to improve model reliability by enhancing consistency in answers to known facts and encouraging refusal to answer for uncertain questions. Specifically, we introduce \textbf{SREF}, an entropy-guided approach designed to enhance the reliability of language models by incorporating \textbf{S}elf-\textbf{REF}erences, models’ understanding of rephrasing questions, with inputs. We analyze and reveal the effectiveness of SREF in enhancing model reliability from the perspectives of entropy and KL divergence. Extensive experiments on 12 LLMs demonstrate that outputs generated with SREF yield more reliable results, including an average improvement of 16.01% over the baselines and a 15.10% average improvement in consistency, while also adapting to identify and acknowledge uncertain facts.

3367Conditional Variable Flow Matching: Transforming Conditional Densities with Amortized Conditional Optimal Transport

[openreview] [pdf]

Abstract Forecasting stochastic nonlinear dynamical systems under the influence of conditioning variables is a fundamental challenge repeatedly encountered across the biological and physical sciences. While flow-based models can impressively predict the temporal evolution of probability distributions representing possible outcomes of a specific process, existing frameworks cannot satisfactorily account for the impact of conditioning variables on these dynamics. Amongst several limitations, existing methods require training data with paired conditions and are developed for discrete conditioning variables. We propose Conditional Variable Flow Matching (CVFM), a framework for learning flows transforming conditional distributions with amortization across continuous conditioning variables -- permitting predictions across the conditional density manifold. This is accomplished through several novel advances, in particular, simultaneous sample conditioned flows over the main and conditioning variables, alongside a conditional Wasserstein metric and kernel facilitating conditional optimal transport. Collectively, these advances allow for learning system dynamics provided measurement data whose states and conditioning variables are not in correspondence. We demonstrate CVFM on a suite of increasingly challenging problems, including discrete and continuous conditional mapping benchmarks, image-to-image domain transfer, and modeling the temporal evolution of materials internal structure during manufacturing processes. We observe that CVFM results in improved performance and convergence characteristics over alternative conditional variants.

3368Improved Localized Machine Unlearning Through the Lens of Memorization

[openreview] [pdf]

Abstract Machine unlearning refers to removing the influence of a specified subset of training data from a machine learning model, efficiently, after it has already been trained. This is important for key applications, including making the model more accurate by removing outdated, mislabeled, or poisoned data. In this work, we study localized unlearning, where the unlearning algorithm operates on a (small) identified subset of parameters. Drawing inspiration from the memorization literature, we propose an improved localization strategy that yields strong results when paired with existing unlearning algorithms. We also propose a new unlearning algorithm, Deletion by Example Localization (DEL), that resets the parameters deemed-to-be most critical according to our localization strategy, and then finetunes them. Our extensive experiments on different datasets, forget sets and metrics reveal that DEL sets a new state-of-the-art for unlearning metrics, against both localized and full-parameter methods, while modifying a small subset of parameters, and outperforms the state-of-the-art localized unlearning in terms of test accuracy too.

3369Automated Knowledge Concept Annotation and Question Representation Learning for Knowledge Tracing

[openreview] [pdf]

Abstract Knowledge tracing (KT) is a popular approach for modeling students’ learning progress over time, which can enable more personalized and adaptive learning. However, existing KT approaches face two major limitations: (1) they rely heavily on expert-defined knowledge concepts (KCs) in questions, which is time-consuming and prone to errors; and (2) KT methods tend to overlook the semantics of both questions and the given KCs. In this work, we address these challenges and present KCQRL, a framework for automated knowledge concept annotation and question representation learning that can improve the effectiveness of any existing KT model. First, we propose an automated KC annotation process using large language models (LLMs), which generates question solutions and then annotates KCs in each solution step of the questions. Second, we introduce a contrastive learning approach to generate semantically rich embeddings for questions and solution steps, aligning them with their associated KCs via a tailored false negative elimination approach. These embeddings can be readily integrated into existing KT models, replacing their randomly initialized embeddings. We demonstrate the effectiveness of KCQRL across 15 KT algorithms on two large real-world Math learning datasets, where we achieve consistent performance improvements.

3370DOTA: Distributional Test-Time Adaptation of Vision-Language Models

[openreview] [pdf]

Abstract Vision-language foundation models (e.g., CLIP) have shown remarkable performance across a wide range of tasks. However, deploying these models may be unreliable when significant distribution gaps exist between the training and test data. The training-free test-time dynamic adapter (TDA) is a promising approach to address this issue by storing representative test samples to guide the classification of subsequent ones. However, TDA only naively maintains a limited number of reference samples in the cache, leading to severe test-time catastrophic forgetting when the cache is updated by dropping samples. In this paper, we propose a simple yet effective method for DistributiOnal Test-time Adaptation (DOTA). Instead of naively memorizing representative test samples, DOTA continually estimates the distributions of test samples, allowing the model to continually adapt to the deployment environment. The test-time posterior probabilities are then computed using the estimated distributions based on Bayes’ theorem for adaptation purposes. To further enhance the adaptability on the uncertain samples, we introduce a new human-machine collaboration paradigm which identifies uncertain samples, collects human-feedback, and incorporates it into the DOTA framework. Extensive experiments validate that DOTA enables CLIP to continually learn, resulting in a significant improvement compared to current state-of-the-art methods.

3371Temporal Test-Time Adaptation with State-Space Models

[openreview] [pdf]

Abstract Distribution shifts between training and test data are inevitable over the lifecycle of a deployed model, leading to performance decay. Adapting a model on test samples can help mitigate this drop in performance. However, most test-time adaptation methods have focused on synthetic corruption shifts, leaving a variety of distribution shifts underexplored. In this paper, we focus on distribution shifts that evolve gradually over time, which are common in the wild but challenging for existing methods, as we show. To address this, we propose STAD, a probabilistic state-space model that adapts a deployed model to temporal distribution shifts by learning the time-varying dynamics in the last set of hidden features. Without requiring labels, our model infers time-evolving class prototypes that act as a dynamic classification head. Through experiments on real-world temporal distribution shifts, we show that our method excels in handling small batch sizes and label shift.

3372Efficiently Learning Probabilistic Logical Models by Cheaply Ranking Mined Rules

[openreview] [pdf]

Abstract Probabilistic logical models are a core component of neurosymbolic AI and are important models in their own right for tasks that require high explainability. Unlike neural networks, logical models are often handcrafted using domain expertise, making their development costly and prone to errors. While there are algorithms that learn logical models from data, they are generally prohibitively expensive, limiting their applicability in real-world settings. In this work, we introduce precision and recall for logical rules and define their composition as rule utility -- a cost-effective measure to evaluate the predictive power of logical models. Further, we introduce SPECTRUM, a scalable framework for learning logical models from relational data. Its scalability derives from a linear-time algorithm that mines recurrent structures in the data along with a second algorithm that, using the cheap utility measure, efficiently ranks rules built from these structures. Moreover, we derive theoretical guarantees on the utility of the learnt logical model. As a result, we demonstrate across various tasks that SPECTRUM scales to larger datasets, often learning more accurate logical models orders of magnitude faster than previous methods without requiring specialised GPU hardware.

3373Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack

[openreview] [pdf]

Abstract Previous work has shown that training “helpful-only” LLMs with reinforcement learning on a curriculum of gameable environments can lead models to generalize to egregious specification gaming, such as editing their own reward function or modifying task checklists to appear more successful. We show that gpt-4o, gpt-4o-mini, o1-preview, and o1-mini — frontier models trained to be helpful, harmless, and honest — can engage in specification gaming without training on a curriculum of tasks, purely from in-context iterative reflection (which we call in-context reinforcement learning, “ICRL”). We also show that using ICRL to generate highly-rewarded outputs for expert iteration (compared to the standard expert iteration reinforcement learning algorithm) may increase gpt-4o-mini’s propensity to learn specification-gaming policies, generalizing (in very rare cases) to the most egregious strategy where gpt-4o-mini edits its own reward function. Our results point toward the strong ability of in-context reflection to discover rare specification-gaming strategies that models might not exhibit zero-shot or with normal training, highlighting the need for caution when relying on alignment of LLMs in zero-shot settings.

3374USDC: A Dataset ofU―serS―tance andD―ogmatism in LongC―onversations

[openreview] [pdf]

Abstract Although prior studies have explored Stance and Dogmatism in user conversations, their datasets are constructed at the post level, treating each post as independent and randomly sampling posts from conversation threads. Thus, Stance and Dogmatism labels in these datasets cannot capture the user’s opinion fluctuations expressed throughout the entire conversation context. However, identifying user’s opinion fluctuations in long conversation threads on various topics can be extremely critical for enhanced personalization, market research, political campaigns, customer service, conflict resolution, targeted advertising, and content moderation. Hence, training language models to automate this task is critical. However, to train such models, gathering manual annotations has multiple challenges: 1) It is time-consuming and costly; 2) Conversation threads could be very long, increasing chances of noisy annotations; and 3) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Inspired by the recent success of large language models (LLMs) for complex natural language processing tasks, we leverage Mistral Large and GPT-4 to automate the human annotation process on the following two tasks while also providing reasoning: i) User Stance classification, which involves labeling a user’s stance of a post in a conversation on a five-point scale; ii) User Dogmatism classification, which deals with labeling a user’s overall opinion in the conversation on a four-point scale. The majority voting on zero-shot, one-shot, and few-shot annotations from these two LLMs on 764 multi-user Reddit conversations helps us curate the USDC dataset. USDC is then used to finetune and instruction-tune multiple deployable small language models for the 5-class stance and 4-class dogmatism classification tasks. Additionally, human annotations on 200 test conversations achieved inter-annotator agreement scores of 0.49 for stance and 0.50 for dogmatism, indicating a reasonable level of consistency between human and LLM annotations. We make the code and dataset publicly available [https://anonymous.4open.science/r/USDC-0F7F].

3375Gradient-free variational learning with conditional mixture networks

[openreview] [pdf]

Abstract Bayesian methods are known to address some limitations of standard deep learning, such as the lack of calibrated predictions and uncertainty quantification. However, they can be computationally expensive as model and data complexity increase. Fast variational methods can reduce the computational requirements of Bayesian methods by eliminating the need for gradient descent or sampling, but are often limited to simple models. We demonstrate that conditional mixture networks (CMNs), a probabilistic variant of the mixture-of-experts (MoE) model, are suitable for fast, gradient-free inference and can solve complex classification tasks, thus balancing the expressiveness and scalability of neural networks with the probabilistic benefits of Bayesian methods . By exploiting conditional conjugacy and Polya-Gamma augmentation, we furnish Gaussian likelihoods for the weights of both the experts and the gating network. This enables efficient variational updates using coordinate ascent variational inference (CAVI), avoiding traditional gradient-based optimization. We validate this approach by training two-layer CMNs on standard benchmarks from the UCI repository. Our method, CAVI-CMN, achieves competitive and often superior predictive accuracy compared to maximum likelihood estimation (MLE) with backpropagation, while maintaining competitive runtime and full posterior distributions over all model parameters. Moreover, as input size or the number of experts increases, computation time scales competitively with MLE and other gradient-based solutions like black-box variational inference (BBVI), making CAVI-CMN a promising tool for deep, fast, and gradient-free Bayesian networks.

3376Evaluating Information Gathering Abilities of Large Language Models with QuestBench

[openreview] [pdf]

Abstract Human queries and instructions to large language models (LLMs) often containincompleteorunderspecifiedinformation. In these circumstances, the ability to acquire the missing information by asking clarifying questions is crucial; in particular, doing so in a way to obtainminimally sufficientpiece of information. To assess whether LLMs possess this ability, we construct QuestBench, a benchmark of underspecified tasks that can be resolved by asking at most a single question. We frame underspecified tasks as a constraint satisfaction problems with missing variable assignments, where the exact model response cannot be determined unless certain variables’ values are acquired. This framework allows us to more precisely focus on tasks where uncertainty arises due to missing information, in contrast to tasks where it arises due to semantic ambiguity. The benchmarks include (1) Logic-Q: Logical reasoning tasks where one proposition is missing, (2)Planning-Q: PDDL planning problems where the initial state is underspecified, and (3) GSM-Q: Grade school math problems where one variable assignment is missing. We evaluate Gemini and GPT-4o models and find that they achieve 20 –30% accuracy in both zero-shot and few-shot settings. Furthermore, when evaluating GPT-4-o1 on a subset of our data, we find that it is only 41 – 44% accurate, despite utilizing state-of-the-art inference-time reasoning techniques. Overall, our results show that there is significant room for improvement on information gathering tasks. We conduct preliminary analysis to study factors that may correlate with reasoning mechanisms LLMs may use to tackle QuestBench.

3377Novel RL Approach for Efficient Elevator Group Control Systems

[openreview] [pdf]

Abstract The management of elevator traffic in large buildings is crucial for ensuring low passenger travel times and energy consumption. We optimize the Elevator Group Control System (EGCS) using a novel Reinforcement Learning (RL) approach. Existing methods, including heuristic-based and pattern detection algorithms, often fall short in handling the complex and stochastic nature of elevator systems. This research proposes an end-to-end RL-based approach. A custom elevator simulation environment representing the 6-elevator, 15-floor system at Vrije Universiteit Amsterdam (VU) is developed as a Markov Decision Process (MDP). Key innovations include a novel action space encoding to handle the combinatorial complexity of elevator dispatching, the introduction of infra-steps\textit{infra-steps} to model continuous passenger arrivals, and a tailored reward signal to improve learning efficiency. Additionally, we explore various ways of adapting the discounting factor to the infra-step\textit{infra-step} formulation. We investigate RL architectures based on Dueling Double Deep Q-learning, showing that the proposed RL-based EGCS adapts to fluctuating traffic patterns, learns from a highly stochastic environment, and thereby outperforms a traditional rule-based algorithm.

3378Adaptive Exponential Decay Rates for Adam

[openreview] [pdf]

Abstract Adam and its variants, including AdaBound, AdamW, and AdaBelief, have gained widespread popularity for enhancing the learning speed and generalization performance of deep neural networks. This optimization technique adjusts weight vectors by utilizing predetermined exponential decay rates (i.e.,β1\beta_1 = 0.9, β2\beta_2 = 0.999) based on the first moment estimate and the second raw moment estimate of the gradient. However, the default exponential decay rates might not be optimal, and the process of tuning them through trial and error with experience proves to be time-consuming. In this paper, we introduce AdamE, a novel variant of Adam designed to automatically leverage dynamic exponential decay rates on the first moment estimate and the second raw moment estimate of the gradient. Additionally, we provide theoretical proof of the convergence of AdamE in both convex and non-convex cases. To validate our claims, we perform experiments across various neural network architectures and tasks. Comparative analyses with adaptive methods utilizing default exponential decay rates reveal that AdamE consistently achieves rapid convergence and high accuracy in language modeling, node classification, and graph clustering tasks.

3379Perplexity Trap: PLM-Based Retrievers Overrate Low Perplexity Documents

[openreview] [pdf]

Abstract Previous studies have found that PLM-based retrieval models exhibit a preference for LLM-generated content, assigning higher relevance scores to these documents even when their semantic quality is comparable to human-written ones. This phenomenon, known as source bias, threatens the sustainable development of the information access ecosystem. However, the underlying causes of source bias remain unexplored. In this paper, we explain the process of information retrieval with a causal graph and discover that PLM-based retrievers learn perplexity features for relevance estimation, causing source bias by ranking the documents with low perplexity higher. Theoretical analysis further reveals that the phenomenon stems from the positive correlation between the gradients of the loss functions in language modeling task and retrieval task. Based on the analysis, a causal-inspired inference-time debiasing method is proposed, called C\textbf{C}ausal D\textbf{D}iagnosis and C\textbf{C}orrection (CDC). CDC first diagnoses the bias effect of the perplexity and then separates the bias effect from the overall estimated relevance score. Experimental results across three domains demonstrate the superior debiasing effectiveness of CDC, emphasizing the validity of our proposed explanatory framework. Source codes are available athttps://anonymous.4open.science/r/Perplexity-Trap-D6FE.

3380BOSE-NAS: Differentiable Neural Architecture Search with Bi-Level Optimization Stable Equilibrium

[openreview] [pdf]

Abstract Differentiable Architecture Search (DARTS) has gained prominence in the neural architecture search community for its efficiency and simplicity, achieved through optimizing architecture parameters via gradient descent. However, the magnitude of these architecture parameters frequently fails to accurately represent the true significance of the corresponding operations, adversely affecting the performance of the resultant architectures. While numerous studies have introduced alternative metrics to evaluate operation significance, the actual role and impact of architecture parameters remain inadequately explored. This lack of understanding creates critical ambiguity in the architecture search process. Resolving these ambiguities is essential for the effective utilization of architecture parameters, thereby facilitating the development of more effective differentiable NAS methodologies. In this work, we first conduct a rigorous theoretical analysis, revealing that the change rate of architecture parameters reflects the sensitivity of the supernet’s validation loss in the architecture space. Building on this foundation, we introduce the concept of the ‘Stable Equilibrium State’, which offers essential insights into the validation loss trajectory across architectural spaces and elucidates the stability of the supernet’s bi-level optimization process. We further investigate the supernet training dynamics to assess the influence of operations on the Stable Equilibrium State, leading to the proposal of a novel metric for evaluating operation importance, termed Equilibrium Influential (EIE_\mathcal{I}). Integrating these elements, we introduce BOSE-NAS, an effective differentiable NAS method that utilizes the Stable Equilibrium State to identify the optimal state during the search process, subsequently deriving the final architecture based on the EIE_\mathcal{I} metric. Extensive experiments conducted across diverse datasets and search spaces demonstrate that BOSE-NAS achieves competitive test accuracy compared to state-of-the-art methods while significantly reducing search costs.

3381Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling

[openreview] [pdf]

Abstract Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this idea and enabling sample-efficient reinforcement learning. However, existing methods overlook a crucial aspect: the need for optimism to be informed by a belief connecting the reward and state. To address this, we propose a practical, theoretically grounded approach to optimistic exploration based on Thompson sampling. Our model structure is the first that allows for reasoning aboutjointuncertainty over transitions and rewards. We apply our method on a set of MuJoCo and VMAS continuous control tasks. Our experiments demonstrate that optimistic exploration significantly accelerates learning in environments with sparse rewards, action penalties, and difficult-to-explore regions. Furthermore, we provide insights into when optimism is beneficial and emphasize the critical role of model uncertainty in guiding exploration.

3382Odyssey: Empowering Minecraft Agents with Open-World Skills

[openreview] [pdf]

Abstract Recent studies have delved into constructing generalist agents for open-world environments like Minecraft. Despite the encouraging results, existing efforts mainly focus on solving basic programmatic tasks, e.g., material collection and tool-crafting following the Minecraft tech-tree, treating the ObtainDiamond task as the ultimate goal. This limitation stems from the narrowly defined set of actions available to agents, requiring them to learn effective long-horizon strategies from scratch. Consequently, discovering diverse gameplay opportunities in the open world becomes challenging. In this work, we introduce Odyssey, a new framework that empowers Large Language Model (LLM)-based agents with open-world skills to explore the vast Minecraft world. Odyssey comprises three key parts: (1) An interactive agent with an open-world skill library that consists of 40 primitive skills and 183 compositional skills. (2) A fine-tuned LLaMA-3 model trained on a large question-answering dataset with 390k+ instruction entries derived from the Minecraft Wiki. (3) A new agent capability benchmark includes the long-term planning task, the dynamic-immediate planning task, and the autonomous exploration task. Extensive experiments demonstrate that the proposed Odyssey framework can effectively evaluate different capabilities of LLM-based agents. All datasets, model weights, and code are publicly available to motivate future research on more advanced autonomous agent solutions.

3383Diffusion-based Neural Network Weights Generation

[openreview] [pdf]

Abstract Transfer learning has gained significant attention in recent deep learning research due to its ability to accelerate convergence and enhance performance on new tasks. However, its success is often contingent on the similarity between source and target data, and training on numerous datasets can be costly, leading to blind selection of pretrained models with limited insight into their effectiveness. To address these challenges, we introduce \textit{D2NWG}, a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning, conditioned on the target dataset. Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation, learning the weight distributions of models pretrained on various datasets. This allows for automatic generation of weights that generalize well across both seen and unseen tasks, outperforming state-of-the-art meta-learning methods and pretrained models. Moreover, our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques that rely on task-specific model collections or access to original training data. By modeling the parameter distribution of LLMs, D2NWG enables task-specific parameter generation without requiring additional fine-tuning or large collections of model variants. Extensive experiments show that our method consistently enhances the performance of diverse base models, regardless of their size or complexity, positioning it as a robust solution for scalable transfer learning.

3384Mitigating Hallucination in Large Vision-Language Models via Modular Attribution and Intervention

[openreview] [pdf]

Abstract Large Vision-Language Models (LVLMs) exhibit impressive capabilities in complex visual tasks but are prone to hallucination, especially in open-ended generation tasks. This paper explores why LVLMs tend to hallucinate and how to mitigate it. First, we conduct causal mediation analysis through counterfactual edits on specific modules in LVLMs. Our results disclose that Multi-Head Attention (MHA) modules contribute more to the probability of generating hallucination words than multi-layer perceptron modules. We then identify specific heads that are responsible for hallucination, referred to as hallucination heads. Second, we examine the behavior of hallucination heads. We find that they are concentrated in the middle and deeper layers, displaying a strong attention bias toward text tokens. Further, we show that the attention patterns of certain hallucination heads exhibit greater similarity to the base language model and change slowly during the instruction tuning process. Finally, we propose two simple yet effective methods to mitigate hallucination: one is training-free and can be applied directly during decoding, while the other involves fine-tuning. Both methods are targeted for hallucination heads to reduce their reliance on text tokens. Notably, our methods achieve up to 1.7x reduction in hallucination rate for the LLaVA-v1.5-7B model in COCO captioning task, outperforming existing baselines. Overall, our findings suggest that hallucinations in LVLMs are likely to stem from certain modules, and targeted interventions can effectively mitigate these issues.

3385MoSH: Modeling Multi-Objective Tradeoffs with Soft and Hard Bounds

[openreview] [pdf]

Abstract Countless science and engineering applications in multi-objective optimization (MOO) necessitate that decision-makers (DMs) select a Pareto-optimal solution which aligns with their preferences. Evaluating individual solutions is often expensive, necessitating cost-sensitive optimization techniques. Due to competing objectives, the space of trade-offs is also expansive --- thus, examining the full Pareto frontier may prove overwhelming to a DM. Such real-world settings generally have loosely-defined and context-specific desirable regions for each objective function that can aid in constraining the search over the Pareto frontier. In this paper, we operationalize these priors using soft-hard functions\textit{soft-hard functions}, SHFs, which allow for the DM to impose soft and hard bounds on each objective. Leveraging a novel minimax formulation for Pareto frontier sampling, we propose a two-step process for obtaining a compact set of Pareto-optimal points which respect the user-defined soft and hard bounds: (1) densely sample the Pareto frontier using Bayesian optimization, and (2) sparsify the selected set to surface to the user, using robust submodular function optimization. We prove that (2) obtains the optimal compact Pareto-optimal set of points from (1). We further show that many practical problems fit within the SFH framework, and provide extensive empirical validation on several synthetic and real-world applications. Specifically, for brachytherapy, our approach returns a compact set of points with over 3% greater SHF-defined utility than the next best approach. Among the other diverse experiments, our approach consistently leads in utility, allowing the DM to reach >>99% of their maximum possible desired utility within validation of 5 points.

3386FEDNET: FREQUENCY ENHANCED DECOMPOSED NETWORK FOR OUT-OF-DISTRIBUTION TIME SERIES CLASSIFICATION

[openreview] [pdf]

Abstract Time series classification is a crucial task with widespread applications in various fields such as medicine and energy. Due to the non-stationary property of time series, its data distribution will change over time, which makes it challenging for models to generalize to the out-of-distribution (OOD) environment. However, limitations persist in the current research on OOD time series classification, particularly the absence of a unified consideration addressing both domain distribution shift and temporal distribution shift. To this end, we view the time series distribution shift from the frequency perspective and propose a novel method called Frequency Enhanced Decomposed Network (FEDNet) for OOD time series classification. FEDNet utilizes frequency domain information to guide the decomposition of time series and further eliminates domain shift and temporal shift, it then obtains domain-invariant features for adapting to OOD data. Finally,we provide theoretical insights of FEDNet to validate its superiority for OOD time series classification. Comprehensive results on synthetic and real-world datasets demonstrate that FEDNet achieves state-of-the-art performance in OOD time series classification tasks, surpassing previous methods by up to 7%.

3387Making Transformer Decoders Better Differentiable Indexers

[openreview] [pdf]

Abstract Retrieval aims to find the top-k items most relevant to a query/user from a large dataset. Traditional retrieval models represent queries/users and items as embedding vectors and use Approximate Nearest Neighbor (ANN) search for retrieval. Recently, researchers have proposed a generative-based retrieval method that represents items as token sequences and uses a decoder model for autoregressive training. Compared to traditional methods, this approach uses more complex models and integrates index structure during training, leading to better performance. However, these methods remain two-stage processes, where index construction is separate from the retrieval model, limiting the model’s overall capacity. Additionally, existing methods construct indices by clustering pre-trained item representations in Euclidean space. However, real-world scenarios are more complex, making this approach less accurate. To address these issues, we propose a \underline{U}nified framework for \underline{R}etrieval and \underline{I}ndexing, termed \textbf{URI}. URI ensures strong consistency between index construction and the retrieval model, typically a Transformer decoder. URI simultaneously builds the index and trains the decoder, constructing the index through the decoder itself. It no longer relies on one-sided item representations in Euclidean space but constructs the index within the interactive space between queries and items. Experimental comparisons on three real-world datasets show that URI significantly outperforms existing methods.

3388Differential learning kinetics govern the transition from memorization to generalization during in-context learning

[openreview] [pdf]

Abstract Transformers exhibit in-context learning (ICL): the ability to use novel information presented in the context without additional weight updates. Recent work shows that ICL emerges when models are trained on a sufficiently diverse set of tasks and the transition from memorization to generalization is sharp with increasing task diversity. One interpretation is that a network’s limited capacity to memorize favors generalization. Here, we examine the mechanistic underpinnings of this transition using a small transformer applied to a synthetic ICL task. Using theory and experiment, we show that the sub-circuits that memorize and generalize can be viewed as largely independent. The relativeratesat which these sub-circuits learn explains the transition from memorization to generalization, rather than capacity constraints. We uncover a memorization scaling law, which determines the task diversity threshold at which the network generalizes. The theory quantitatively explains a variety of other ICL-related phenomena, including the long-tailed distribution of when ICL is acquired, the bimodal behavior of solutions close to the task diversity threshold, the influence of contextual and data distributional statistics on ICL, and the transient nature of ICL.

3389Post-Hoc Robustness Enhancement in Graph Neural Networks with Conditional Random Fields

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs), which are nowadays the benchmark approach in graph representation learning, have been shown to be vulnerable to adversarial attacks, raising concerns about their real-world applicability. While existing defense techniques primarily concentrate on the training phase of GNNs, involving adjustments to message passing architectures or pre-processing methods, there is a noticeable gap in methods focusing on increasing robustness during inference. In this context, this study introduces RobustCRF, a post-hoc approach aiming to enhance the robustness of GNNs at the inference stage. Our proposed method, founded on statistical relational learning using a Conditional Random Field, is model-agnostic and does not require prior knowledge about the underlying model architecture. We validate the efficacy of this approach across various models, leveraging benchmark node classification datasets.

3390Random Erasing vs. Model Inversion: A Promising Defense or a False Hope?

[openreview] [pdf]

Abstract Model Inversion (MI) attacks pose a significant privacy threat by reconstructing private training data from machine learning models. While existing defenses primarily concentrate on model-centric approaches, the impact of data on MI robustness remains largely unexplored. In this work, we explore Random Erasing (RE), a technique traditionally used to enhance model generalization under occlusion. Surprisingly, our study reveals that RE emerges as a powerful defense against MI attacks. We conduct analysis to identify crucial properties of RE to serve as an effective defense. Particularly, Partial Erasure in RE prevents the model from observing the entire objects during training, and we find that this has significant impact on MI, which aims to reconstruct the entire objects. Meanwhile, our analysis suggests Random Location in RE is important for outstanding privacy-utility trade-off. Furthermore, our analysis reveals that model trained with RE leads to a discrepancy between the features of MI-reconstructed images and that of private images. These effects significantly degrade MI reconstruction quality and attack accuracy while maintaining reasonable natural accuracy. Our RE-based defense method is simple to implement and can be combined with other defenses. Extensive experiments of 34 setups demonstrate that our method achieve SOTA performance in privacy-utility tradeoff. The results consistently demonstrate the superiority of our defense over existing defenses across different MI attacks, network architectures, and attack configurations. For the first time, we achieve significant degrade in attack accuracy without decrease in utility for some configurations. Our code and additional results are included in Supplementary.

3391GraphFM: A generalist graph transformer that learns transferable representations across diverse domains

[openreview] [pdf]

Abstract Graph neural networks (GNNs) are often trained on individual datasets, requiring specialized models and significant hyperparameter tuning due to the unique structures and features of each dataset. This approach limits the scalability and generalizability of GNNs, as models must be tailored for each specific graph type. To address these challenges, we introduce GraphFM, a scalable multi-graph pretraining approach designed for learning across diverse graph datasets. GraphFM uses a Perceiver-based encoder with learned latent tokens to compress domain-specific features into a shared latent space, enabling generalization across graph domains. We propose new techniques for scaling up graph training on datasets of different sizes, allowing us to train GraphFM on 152 distinct graph datasets, spanning 7.4 million nodes and 189 million edges. This allows us to study the effect of scale on pretraining across domains such as molecules, citation networks, and product graphs, and show that training on diverse datasets improves performance over single-source pretraining. Our results demonstrate that pretraining on diverse real and synthetic graphs enhances adaptability and stability, leading to competitive performance with state-of-the-art models across various node classification tasks. This approach reduces the burden of dataset-specific training and provides a single generalist model capable of performing across multiple diverse graph structures and tasks.

3392You Only Prune Once: Designing Calibration-Free Model Compression With Policy Learning

[openreview] [pdf]

Abstract The ever-increasing size of large language models (LLMs) presents significant challenges for deployment due to their heavy computational and memory requirements. Current model pruning techniques attempt to alleviate these issues by relying heavily on external calibration datasets to determine which parameters to prune or compress, thus limiting their flexibility and scalability across different compression ratios. Moreover, these methods often cause severe performance degradation, particularly in downstream tasks, when subjected to higher compression rates. In this paper, we proposePruneNet, a novel model compression method that addresses these limitations by reformulating model pruning as a policy learning process. PruneNet decouples the pruning process from the model architecture, eliminating the need for calibration datasets. It learns a stochastic pruning policy to assess parameter importance solely based on intrinsic model properties while preserving the spectral structure to minimize information loss. PruneNet can compress the LLaMA-2-7B model in just 15 minutes, achieving over 80% retention of its zero-shot performance with a 30% compression ratio, outperforming existing methods that retain only 75% performance. Furthermore, on complex multitask language understanding tasks, PruneNet demonstrates its robustness by preserving up to 80% performance of the original model, proving itself a superior alternative to conventional structured compression techniques.

3393Multiple Heads are Better than One: Mixture of Modality Knowledge Experts for Entity Representation Learning

[openreview] [pdf]

Abstract Learning high-quality multi-modal entity representations is an important goal of multi-modal knowledge graph (MMKG) representation learning, which can enhance reasoning tasks within the MMKGs, such as MMKG completion (MMKGC). The main challenge is to collaboratively model the structural information concealed in massive triples and the multi-modal features of the entities. Existing methods focus on crafting elegant entity-wise multi-modal fusion strategies, yet they overlook the utilization of multi-perspective features concealed within the modalities under diverse relational contexts. To address this issue, we introduce a novel framework with Mixture of Modality Knowledge experts (MoMoK for short) to learn adaptive multi-modal entity representations for better MMKGC. We design relation-guided modality knowledge experts to acquire relation-aware modality embeddings and integrate the predictions from multi-modalities to achieve joint decisions. Additionally, we disentangle the experts by minimizing their mutual information. Experiments on four public MMKG benchmarks demonstrate the outstanding performance of MoMoK under complex scenarios. Our code and data are available athttps://anonymous.4open.science/r/MoMoK-8532/.

3394BlendRL: A Framework for Merging Symbolic and Neural Policy Learning

[openreview] [pdf]

Abstract Humans can leverage both symbolic reasoning and intuitive responses. In contrast, reinforcement learning policies are typically encoded in either opaque systems like neural networks or symbolic systems that rely on predefined symbols and rules. This disjointed approach severely limits the agents’ capabilities, as they often lack either the flexible low-level reaction characteristic of neural agents or the interpretable reasoning of symbolic agents.To overcome this challenge, we introduceBlendRL, a neuro-symbolic RL framework that harmoniously integrates both paradigms. We empirically demonstrate that BlendRL agents outperform both neural and symbolic baselines in standard Atari environments, and showcase their robustness to environmental changes. Additionally, we analyze the interaction between neural and symbolic policies, illustrating how their hybrid use helps agents overcome each other’s limitations.

3395Do Stochastic, Feel Noiseless: Stable Stochastic Optimization via a Double Momentum Mechanism

[openreview] [pdf]

Abstract Optimization methods are crucial to the success of machine learning, with Stochastic Gradient Descent (SGD) serving as a foundational algorithm for training models. However, SGD is often sensitive to the choice of the learning rate, which necessitates extensive hyperparameter tuning. In this work, we introduce a new variant of SGD that brings enhanced stability in two key aspects. First, our method allows the use of the same fixed learning rate to attain optimal convergence rates regardless of the noise magnitude, eliminating the need to adjust learning rates between noiseless and noisy settings. Second, our approach achieves these optimal rates over a wide range of learning rates, significantly reducing sensitivity compared to standard SGD, which requires precise learning rate selection. Our key innovation is a novel gradient estimator based on a double-momentum mechanism that combines two recent momentum-based techniques. Utilizing this estimator, we design both standard and accelerated algorithms that are robust to the choice of learning rate. Specifically, our methods attain optimal convergence rates in both noiseless and noisy stochastic convex optimization scenarios without the need for learning rate decay or fine-tuning. We also prove that our approach maintains optimal performance across a wide spectrum of learning rates, underscoring its stability and practicality. Empirical studies further validate the robustness and enhanced stability of our approach.

3396Transfering Knowledge into Efficient Tiny Models for Object Detection with Dual Prompt Distillation

[openreview] [pdf]

Abstract Knowledge Distillation (KD) has demonstrated significant benefits for learning compact models for object detection. Most current work focuses on general distillation settings, where student models are relatively large and learnable, then compete with the distillation performance. However, due to the model scale and inference speed, these models are seldom deployed in real-world applications. In this paper, we dive into a challenging but more applicable setting: how to distill rich teacher knowledge into tiny, faster models for object detection? We first show that simply applying previous KD strategies under such settings cannot achieve satisfying results, due to the extremely large model capacity gap between the teacher-student pairs. To this end, we propose a simple prompt-based object detection distillation framework, namely DualPromptKD, which aims to improve knowledge transfer efficiency from both teacher and student perspectives. Specifically, by distilling teacher representations into compact external prompts, we enable the student model to fully leverage proficient teacher knowledge even at inference time. In terms of the limited learning ability of the student model, we introduce lightweight internal prompts tailored to bolster the feature imitation capability for the target model. Extensive experimental results on the COCO benchmarks validate the effectiveness and generalization of our approach, including different image backbones and detector types. Notably, our DualPromptKD surpasses the previous best distillation strategies by more than 2.0 mAP under various experimental settings. The code will be available.

3397DELIFT: Data Efficient Language model Instruction Fine-Tuning

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) is essential for enhancing their performance on specific tasks but is often resource-intensive due to redundant or uninformative data. To address this inefficiency, we introduce DELIFT (Data Efficient Language model Instruction Fine-Tuning), a novel algorithm that systematically optimizes data selection across the three key stages of fine-tuning: (1) instruction tuning, (2) task-specific fine-tuning (e.g., reasoning, question-answering), and (3) continual learning (e.g., incorporating new data versions). Unlike existing methods that focus on single-stage optimization or rely on computationally intensive gradient calculations, DELIFT operates efficiently across all stages. Central to our approach is a pairwise utility metric that quantifies how beneficial a data sample is for improving the model’s responses to other samples, effectively measuring the informational value relative to the model’s current capabilities. By leveraging different submodular functions applied to this metric, DELIFT selects diverse and optimal subsets that are useful across all stages of fine-tuning. Experiments across various tasks and model scales demonstrate that DELIFT can reduce the fine-tuning data size by up to 70% without compromising performance, offering significant computational savings and outperforming existing methods in both efficiency and efficacy.

3398UniHDA: A Unified and Versatile Framework for Generalized Hybrid Domain Adaptation

[openreview] [pdf]

Abstract Recently, generative domain adaptation has achieved remarkable progress, enabling us to adapt a pre-trained generator to a new target domain. However, existing methods are limited to a single target domain and single modality, either text-driven or image-driven. In this paper, we explore a novel task -- Generalized Hybrid Domain Adaptation\textit{Generalized Hybrid Domain Adaptation}. Compared with conventional generative domain adaptation, it provides greater flexibility to adapt the generator to the hybrid of multiple target domains, with multi-modal references including one-shot image and zero-shot text prompt. Meanwhile, it is more challenging to represent the composition of multi-modal target domains and preserve the characteristics from the source domain. To address these issues, we propose UniHDA, a unified\textbf{unified} and versatile\textbf{versatile} framework for generalized hybrid domain adaptation. Drawing inspiration from the interpolable latent space of StyleGAN, we find that a linear interpolation between domain shifts in CLIP’s embedding space can also uncover favorable compositional capabilities for the adaptation. In light of this finding, we linearly interpolate the domain shifts from multiple target domains to achieve hybrid domain adaptation. To enhance consistency\textbf{consistency} with the source domain, we further propose a novel cross-domain spatial structure (CSS) loss that maintains the detailed spatial structure between the source and target generator. Experiments show the adapted generator can synthesize realistic images with various attribute compositions and maintain robust consistency with the source domain. Additionally, UniHDA is generator-agnostic and versatile to multiple generators, e.g., StyleGAN, EG3D, and video generators.

3399Towards Better Multi-head Attention via Channel-wise Sample Permutation

[openreview] [pdf]

Abstract Transformer plays a central role in many fundamental deep learning models, e.g., the ViT in computer vision and the BERT and GPT in natural language processing, whose effectiveness is mainly attributed to its multi-head attention (MHA) mechanism. In this study, we propose a simple and novel channel-wise sample permutation (CSP) operator, achieving a new structured MHA with fewer parameters and lower complexity. Given an input matrix, CSP circularly shifts the samples of different channels with various steps and then sorts grouped samples of each channel. This operator is equivalent to implicitly implementing cross-channel attention maps as permutation matrices, which achieves linear complexity and suppresses the risk of rank collapse when representing data. We replace the MHA of some representative models with CSP and test the CSP-based models in several discriminative tasks, including image classification and long sequence analysis. Experiments show that the CSP-based models achieve comparable or better performance with fewer parameters and lower computational costs than the classic Transformer and its state-of-the-art variants. The code is available athttps://anonymous.4open.science/r/CSP-BA52.

3400Diminishing Exploration: A Minimalist Approach to Piecewise Stationary Multi-Armed Bandits

[openreview] [pdf]

Abstract The piecewise-stationary bandit problem is an important variant of the multi-armed bandit problem that further considers abrupt changes in the reward distributions. The main theme of the problem is the trade-off between exploration for detecting environment changes and exploitation of traditional bandit algorithms. While this problem has been extensively investigated, existing works either assume knowledge about the number of change points MM or require extremely high computational complexity. In this work, we revisit the piecewise-stationary bandit problem from a minimalist perspective. We propose a novel and generic exploration mechanism, called diminishing exploration, which eliminates the need for knowledge about MM and can be used in conjunction with an existing change detection-based algorithm to achieve near-optimal regret scaling. Simulation results show that despite oblivious of MM, equipping existing algorithms with the proposed diminishing exploration generally achieves better empirical regret than the traditional uniform exploration.

3401Realizing Video Summarization from the Path of Language-based Semantic Understanding

[openreview] [pdf]

Abstract The recent development of Video-based Large Language Models (VideoLLMs), has significantly advanced video summarization by aligning video features—and, in some cases, audio features—with Large Language Models (LLMs). Each of these VideoLLMs possesses unique strengths and weaknesses. Many recent methods have required extensive fine-tuning to overcome the limitations of these models, which can be resource-intensive. In this work, we observe that the strengths of one VideoLLM can complement the weaknesses of another. Leveraging this insight, we propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm, which operates as an inference-time algorithm without requiring any form of fine-tuning. Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries. It effectively combines visual and audio content, provides detailed background descriptions, and excels at identifying keyframes, which enables more semantically meaningful retrieval compared to traditional computer vision approaches that rely solely on visual information, all without the need for additional fine-tuning. Moreover, the resulting summaries enhance performance in downstream tasks such as summary video generation, either through keyframe selection or in combination with text-to-image models. Our language-driven approach offers a semantically rich alternative to conventional methods and provides flexibility to incorporate newer VideoLLMs, enhancing adaptability and performance in video summarization tasks.

3402Discovering the Gems in Early Layers: Accelerating Long-Context LLMs with 1000x Input Token Reduction

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in handling long context inputs, but this comes at the cost of increased computational resources and latency. Our research introduces a novel approach for the long context bottleneck to accelerate LLM inference and reduce GPU memory consumption. Our research demonstrates that LLMs can identify relevant tokens in the early layers before generating answers to a query. Leveraging this insight, we propose an algorithm that uses early layers of an LLM as filters to select and compress input tokens, significantly reducing the context length for subsequent processing. Our method, GemFilter, demonstrates substantial improvements in both speed and memory efficiency compared to existing techniques, such as standard attention and SnapKV/H2O. Notably, it achieves a 2.4×\times speedup and 30% reduction in GPU memory usage compared to SOTA methods. Evaluation on the Needle in a Haystack task shows that GemFilter significantly outperforms standard attention, SnapKV and demonstrates comparable performance on the LongBench challenge. GemFilter is simple, training-free, and broadly applicable across different LLMs. Crucially, it provides interpretability by allowing humans to inspect the selected input sequence. These findings not only offer practical benefits for LLM deployment, but also enhance our understanding of LLM internal mechanisms, paving the way for further optimizations in LLM design and inference.

3403JoPA: Explaining Large Language Model’s Generation via Joint Prompt Attribution

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated impressive performances in complex text generation tasks. However, the contribution of the input prompt to the generated content still remains obscure to humans, underscoring the necessity of elucidating and explaining the causality between input and output pairs. Existing works for providing prompt-specific explanation often confine model output to be classification or next-word prediction. Few initial attempts aiming to explain the entire language generation often treat input prompt texts independently, ignoring their combinatorial effects on the follow-up generation. In this study, we introduce a counterfactual explanation framework based on joint prompt attribution, JoPA, which aims to explain how a few prompt texts collaboratively influences the LLM’s complete generation. Particularly, we formulate the task of prompt attribution for generation interpretation as a combinatorial optimization problem, and introduce a probabilistic algorithm to search for the casual input combination in the discrete space. We define and utilize multiple metrics to evaluate the produced explanations, demonstrating both faithfulness and efficiency of our framework.

3404FedL2G: Learning to Guide Local Training in Heterogeneous Federated Learning

[openreview] [pdf]

Abstract Data and model heterogeneity are two core issues in Heterogeneous Federated Learning (HtFL). In scenarios with heterogeneous model architectures, aggregating model parameters becomes infeasible, leading to the use of prototypes (i.e., class representative feature vectors) for aggregation and guidance. However, they still experience a mismatch between the extra guiding objective and the client’s original local objective when aligned with global prototypes. Thus, we propose a Federated Learning-to-Guide (FedL2G) method that adaptively learns to guide local training in a federated manner and ensures the extra guidance is beneficial to clients’ original tasks. With theoretical guarantees, FedL2G efficiently implements the learning-to-guide process using only first-order derivatives w.r.t. model parameters and achieves a non-convex convergence rate of O(1/T)\mathcal{O}(1/T). We conduct extensive experiments on two data heterogeneity and six model heterogeneity settings using 14 heterogeneous model architectures (e.g., CNNs and ViTs) to demonstrate FedL2G’s superior performance compared to six counterparts.

3405OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code

[openreview] [pdf]

Abstract Open-ended and AI-generating algorithms aim to continuously generate and solve increasingly complex tasks indefinitely, offering a promising path toward more general intelligence. To accomplish this grand vision, learning must occur within a vast array of potential tasks. Existing approaches to automatically generating environments are constrained within manually predefined, often narrow distributions of environment, limiting their ability to create any learning environment. To address this limitation, we introduce a novel framework, OMNI-EPIC, that augments previous work in Open-endedness via Models of human Notions of Interestingness (OMNI) with Environments Programmed in Code (EPIC). OMNI-EPIC leverages foundation models to autonomously generate code specifying the next learnable (i.e., not too easy or difficult for the agent’s current skill set) and interesting (e.g., worthwhile and novel) tasks. OMNI-EPIC generates both environments (e.g., an obstacle course) and reward functions (e.g., progress through the obstacle course quickly without touching red objects), enabling it, in principle, to create any simulatable learning task. We showcase the explosive creativity of OMNI-EPIC, which continuously innovates to suggest new, interesting learning challenges. We also highlight how OMNI-EPIC can adapt to reinforcement learning agents’ learning progress, generating tasks that are of suitable difficulty. Overall, OMNI-EPIC can endlessly create learnable and interesting environments, further propelling the development of self-improving AI systems and AI-Generating Algorithms.

3406The Mutual Information Matrix in Hyperbolic Embedding and a Generalization Error Bound

[openreview] [pdf]

Abstract Representation learning is a crucial task of deep learning, which aims to project texts and other symbolic inputs into mathematical embedding. Traditional representation learning encodes symbolic data into an Euclidean space. However, the high dimensionality of the Euclidean space used for embedding words presents considerable computational and storage challenges. Hyperbolic space has emerged as a promising alternative for word embedding, which demonstrates strong representation and generalization capacities, particularly for latent hierarchies of language data. In this paper, we analyze the Skip-Gram Negative-sampling representation learning method in hyperbolic spaces, and explore the potential relationship between the mutual information and hyperbolic embedding. Furthermore, we establish generalization error bounds for hyperbolic embedding. These bounds demonstrate the dimensional parsimony of hyperbolic space and its relationship between the generalization error and the sample size. Finally, we conduct two experiments on the Wordnet dataset and the THUNews dataset, whose results further validate our theoretical properties.

3407On the Role of Attention Heads in Large Language Model Safety

[openreview] [pdf]

Abstract Large language models (LLMs) achieve state-of-the-art performance on multiple language tasks, yet their safety guardrails can be circumvented, leading to harmful generations. In light of this, recent research on safety mechanisms has emerged, revealing that when safety representations or component are suppressed, the safety capability of LLMs are compromised. However, existing research tends to overlook the safety impact of multi-head attention mechanisms, despite their crucial role in various model functionalities. Hence, in this paper, we aim to explore the connection between standard attention mechanisms and safety capability to fill this gap in the safety-related mechanistic interpretability. We propose an novel metric which tailored for multi-head attention, the Safety Head ImPortant Score (Ships), to assess the individual heads’ contributions to model safety. Base on this, we generalize Ships to the dataset level and further introduce the Safety Attention Head AttRibution Algorithm (Sahara) to attribute the critical safety attention heads inside the model. Our findings show that special attention head has a significant impact on safety. Ablating a single safety head allows aligned model (e.g., Llama-2-7b-chat) to respond to16×\times\uparrowmore harmful queries, while only modifying0.006%\downarrow of the parameters, in contrast to the \sim5%modification required in previous studies. More importantly, we demonstrate that attention heads primarily function as feature extractors for safety and models fine-tuned from the same base model exhibit overlapping safety heads through comprehensive experiments. Together, our attribution approach and findings provide a novel perspective for unpacking the black box of safety mechanisms in large models.

3408The Implicit Bias of Stochastic AdaGrad-Norm on Separable Data

[openreview] [pdf]

Abstract This work explores stochastic adaptive gradient descent, i.e., stochastic AdaGrad-Norm, when applied to linearly separable datasets. For the stochastic AdaGrad-Norm method equipped with a wide range of sampling noise, we demonstrate its almost surely convergence result to the L2\mathcal{L}^{2} max-margin solution. This means that stochastic AdaGrad-Norm has an implicit bias that yields good generalization, even without regularization terms. We show that the convergence rate of the classification direction is o(1/ln(1ϵ)/2n)o({1}/{\ln^{(1-\epsilon)/{2}}n}). Our approach takes a novel stance by explicitly characterizing the L2\mathcal{L}^{2} max-margin direction. By doing so, we overcome the challenge that arises from the dependency between the stepsize and the gradient and also address the limitations in the previous AdaGrad-Norm analyses.

[openreview] [pdf]

Abstract Barlow Twins is a feature-contrastive self-supervised learning framework built on the principle of redundancy reduction. The idea is to train a network by maximizing the correlation between corresponding features and minimizing the correlation between non-corresponding features in distorted views of the same image, through this facilitating effective pretraining of a backbone network for a subsequent classification head. This is achieved by diagonalizing the cross-correlation matrix of the network’s representations and scaling it towards the identity matrix. We show that the cross-correlation matrix of distorted images is inherently symmetric, independent of the backbone network’s weights, which leads to two key insights: (i) the cross-correlation matrix can always be diagonalized using a linear transformation (layer), and (ii) the core idea of maximizing correlations between corresponding features while minimizing them for non-corresponding features alone is insufficient for effective backbone network pretraining. Nevertheless, Barlow Twins provide highly effective pretraining. We show that this is due to the normalization of the cross-correlation matrix in the Barlow Twins cost function. This normalization leads to minima of the cost function which are equivalent to the minima of sample contrastive approaches to enforce invariance.

3410ContraFusion: Contrastively Improving Compositional Understanding in Diffusion Models via Fine-Grained Negative Images

[openreview] [pdf]

Abstract Despite the impressive text-to-image (T2I) synthesis capabilities of diffusion models, they often struggle to understand compositional relationships between objects and attributes, especially in complex settings. Existing solutions have tackled these challenges through optimizing the cross-attention mechanism or learning from the caption pairs with minimal semantic changes. However, can we generate high-quality complex contrastive images that diffusion models can directly discriminate based on visual representations? In this work, we leverage large-language models (LLMs) to compose realistic, complex scenarios and harness Visual-Question Answering (VQA) systems alongside diffusion models to automatically curate a contrastive dataset, COM-DIFF, consisting of 15k pairs of high-quality contrastive images. These pairs feature minimal visual discrepancies and cover a wide range of attribute categories, especially complex and natural scenarios. To learn effectively from these error cases, i.e., hard negative images, we propose CONTRAFUSION, a new multi-stage curriculum for contrastive learning of diffusion models. Through extensive experiments across a wide range of compositional scenarios, we showcase the effectiveness of our proposed framework on compositional T2I benchmarks. We will release our contrastive dataset to support the development of generative models.

3411Learning Fairer Representations with FairVIC

[openreview] [pdf]

Abstract Mitigating bias in automated decision-making systems, specifically deep learning models, is a critical challenge in achieving fairness. This complexity stems from factors such as nuanced definitions of fairness, unique biases in each dataset, and the trade-off between fairness and model accuracy. To address such issues, we introduce FairVIC, an innovative approach designed to enhance fairness in neural networks by addressing inherent biases at the training stage. Unlike other methods that require a user-defined declaration of what it means to be fair, FairVIC integrates an abstract concept of fairness through variance, invariance and covariance terms into the loss function. These terms aim to minimise the model’s dependency on protected characteristics for making predictions, thus promoting fairness. Our experimentation consists of evaluating FairVIC against other comparable bias mitigation techniques, on a number of datasets known for their biases. Additionally, we conduct an ablation study to examine the accuracy-fairness trade-off. We also extend FairVIC by offering multi-objective lambda recommendations, allowing users to train a fairer model with a set of weights that are tuned best for their application. Through our implementation of FairVIC, we observed a significant improvement in fairness across all metrics tested, without compromising the model’s accuracy. Our findings suggest that FairVIC presents a straightforward, out-of-the-box solution for the development of fairer deep learning models, thereby offering a generalisable solution applicable across many tasks and datasets.

3412Towards Reliable Offline Reinforcement Learning via Lyapunov Uncertainty Control

[openreview] [pdf]

Abstract Learning trustworthy and reliable offline policies presents significant challenges due to the inherent uncertainty in pre-collected datasets. In this paper, we propose a novel offline reinforcement learning method to tackle this issue. Inspired by the concepts of Lyapunov stability and control-invariant sets from control theory, the central idea is to introduce a restricted state space for the agent to operate within. This approach allows the learned models to exhibit reduced Bellman uncertainty and make reliable decisions. To achieve this, we regulate the expected Bellman uncertainty associated with the new policy, ensuring that its growth trend in subsequent states remains within acceptable limits. The resulting method, termed Lyapunov Uncertainty Control (LUC), is shown to guarantee that the agent remains within a low-uncertainty state enclosure throughout its entire trajectory. Furthermore, we perform extensive theoretical and experimental analysis to showcase the effectiveness and feasibility of the proposed LUC.

3413FlipAttack: Jailbreak LLMs via Flipping

[openreview] [pdf]

Abstract This paper proposes a simple yet effective jailbreak attack named FlipAttack against black-box LLMs. First, from the autoregressive nature, we reveal that LLMs tend to understand the text from left to right and find that they struggle to comprehend the text when noise is added to the left side. Motivated by these insights, we propose to disguise the harmful prompt by constructing left-side noise merely based on the prompt itself, then generalize this idea to 4 flipping modes. Second, we verify the strong ability of LLMs to perform the text-flipping task, and then develop 4 variants to guide LLMs to denoise, understand, and execute harmful behaviors accurately. These designs keep FlipAttack universal, stealthy, and simple, allowing it to jailbreak black-box LLMs within only 1 query. Experiments on 8 LLMs demonstrate the superiority of FlipAttack. Remarkably, it achieves \sim98% attack success rate on GPT-4o, and \sim98% bypass rate against 5 guardrail models on average. The codes are available at Anonymous GitHub\footnote{https://anonymous.4open.science/r/ICLR25-1731-FlipAttack}.

3414Locking Down the Finetuned LLMs Safety

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) on additional datasets is often necessary to optimize them for specific downstream tasks. However, existing safety alignment measures, which restrict harmful behavior during inference, are insufficient to mitigate safety risks during fine-tuning. Alarmingly, fine-tuning with just 10 toxic sentences can make models comply with harmful instructions. We introduce SafetyLock, a novel alignment intervention method that maintains robust safety post-fine-tuning through efficient and transferable mechanisms. SafetyLock leverages our discovery that fine-tuned models retain similar safety-related activation representations to their base models. This insight enables us to extract what we term the Meta-SafetyLock, a set of safety bias directions representing key activation patterns associated with safe responses in the original model. We can then apply these directions universally to fine-tuned models to enhance their safety. By searching for activation directions across multiple token dimensions, SafetyLock achieves enhanced robustness and transferability. SafetyLock re-aligns fine-tuned models in under 0.01 seconds without additional computational cost. Our experiments demonstrate that SafetyLock can reduce the harmful instruction response rate from 60% to below 1% in toxic fine-tuned models. It surpasses traditional methods in both performance and efficiency, offering a scalable, non-invasive solution for ensuring the safety of customized LLMs. Our analysis across various fine-tuning scenarios confirms SafetyLock’s robustness, advocating its integration into safety protocols for aligned LLMs.

3415Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding

[openreview] [pdf]

Abstract In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.

3416Sampling Theory and Overparameterization: Shaping Loss Landscapes inℓ2Regression

[openreview] [pdf]

Abstract Overparameterization in neural networks has demonstrated remarkable advantages for both memorization and generalization, particularly in models trained with gradient descent. While much of the existing research focuses on the interplay between overparameterization and gradient-based methods, we explore its influence on the loss landscape of 2\ell^2 supervised regression problems, independent of any specific optimizer. By leveraging the Nyquist-Shannon-Whittaker sampling theorem, we establish a theoretical link between sampling theory and overparameterized neural networks. Our findings reveal that overparameterization not only exponentially increases the number of global minima but also expands the dimensionality of loss valleys for various 2\ell^2 regression problems modelled with feedforward neural networks. We empirically validate these theoretical insights across multiple supervised 2\ell^2 regression tasks, trained with both gradient-based and non-gradient-based optimization algorithms. These results offer fresh perspectives on the advantages of overparameterization in neural network design, independent of the chosen learning algorithm.

3417Self-Informed Generative Active Learning

[openreview] [pdf]

Abstract Active learning has been a cost-efficient approach to obtaining high-performance AI models with fewer selective annotations. In scenarios where the acquisition of original unlabeled data poses significant challenges, active learning harnessing synthesized data instances is more promising than traditional pool-based methods. In this paper, we propose the Self-Informed Generative Active Learning (SIGnAL) framework as an effective solution to actively generate and select data instances for annotation and downstream model training. In SIGnAL, we propose to guide the data generation based on a reinforcement learning policy, where the generator is self-informed by the reward to generate more informative instances. In addition, we introduce an acquisition function that measures both the informativeness and relevance of instances. Such acquisition function can be transformed to the reward seamlessly for generator optimization. Our experiments on the text classification task validate the effectiveness of our framework, especially when the original data scale is limited.

3418Theory of LLM sampling: part descriptive and part prescriptive

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly utilized in autonomous decision-making systems, where they sample options from an action space. However, the underlying heuristics guiding the sampling of LLMs remain under-explored. We examine LLM response sampling and propose a theory that the sample of an LLM is driven by a descriptive component (the notion of statistical average) and a prescriptive component (notion of an ideal represented in the LLM). In a controlled experimental setting, we demonstrate that LLM outputs deviate from statistically probable outcome in the direction of a presciptive component. We further show this deviation towards prescriptive component consistently appears across diverse real-world domains, including social, public health, and scientific contexts. Using this theory, we show that concept prototypes in LLMs are affected by prescriptive norms, similar to concept of normality in humans. Through case studies, we illustrate that in real-world applications, the shift toward an ideal value in LLM outputs can result in significantly biased decision-making, raising ethical concerns.

3419Analyzing and Optimizing Perturbation of DP-SGD Geometrically

[openreview] [pdf]

Abstract Differential privacy (DP) has become a prevalent privacy model in a wide range of machine learning tasks, especially after the debut of DP-SGD. However, DP-SGD, which directly perturbs gradients in the training iterations, fails to mitigate the negative impacts of noise on gradient direction. As a result, DP-SGD is often inefficient. Although various solutions (e.g., clipping to reduce the sensitivity of gradients and amplifying privacy bounds to save privacy budgets) are proposed to trade privacy for model efficiency, the root cause of its inefficiency is yet unveiled. In this work, we first generalize DP-SGD and theoretically derive the impact of DP noise on the training process. Our analysis reveals that, in terms of a perturbed gradient, only the noise on a direction has eminent impact on the model efficiency while that on magnitude can be mitigated by optimization techniques, i.e., fine-tuning gradient clipping and learning rate. Besides, we confirm that traditional DP introduces biased noise on the direction when adding unbiased noise to the gradient itself. Overall, the perturbation of DP-SGD is actually sub-optimal from a geometric perspective. Motivated by this, we design a geometric perturbation strategy GeoDP within the DP framework, which perturbs the direction and the magnitude of a gradient, respectively. By directly reducing the noise on the direction, GeoDP mitigates the negative impact of DP noise on model efficiency with the same DP guarantee. Extensive experiments on two public datasets (i.e., MNIST and CIFAR-10), one synthetic dataset and three prevalent models (i.e., Logistic Regression, CNN and ResNet) confirm the effectiveness and generality of our strategy.

3420Convergence-Aware Multi-Fidelity Bayesian Optimization

[openreview] [pdf]

Abstract Multi-fidelity Bayesian Optimization (MFBO) has emerged as a powerful approach for optimizing expensive black-box functions by leveraging evaluations at different fidelity levels. However, existing MFBO methods often overlook the convergence behavior of the objective function as fidelity increases, leading to inefficient exploration and suboptimal performance. We propose CAMO, a novel Convergence-Aware Multi-fidelity Optimization framework based on Fidelity Differential Equations (FiDEs). CAMO explicitly captures the convergence behavior of the objective function, enabling more efficient optimization. We introduce two tractable forms of CAMO: an integral Automatic Relevance Determination (ARD) kernel and a data-driven Deep Kernel. Theoretical analysis demonstrates that CAMO with the integral ARD kernel achieves a tighter regret bound compared to state-of-the-art methods. Our empirical evaluation on synthetic benchmarks and real-world engineering design problems shows that CAMO consistently outperforms existing MFBO algorithms in optimization efficiency and solution quality, with up to 4x improvement in optimal solution. This work establishes a foundation for tractable convergence-aware MFBO and opens up new avenues for research in this area.

3421Fine-grained Hallucination Detection and Mitigation in Language Model Mathematical Reasoning

[openreview] [pdf]

Abstract Hallucinations in large language models (LLMs) pose significant challenges in tasks requiring complex multi-step reasoning, such as mathematical problem-solving. Existing approaches primarily detect the presence of hallucinations but lack a nuanced understanding of their types and manifestations. In this paper, we first introduce a comprehensive taxonomy that categorizes the common hallucinations in mathematical reasoning task into six types: fabrication, factual inconsistency, context inconsistency, instruction inconsistency, logical inconsistency, and logical error. We then propose FG-PRM (Fine-Grained Process Reward Model), an augmented model designed to detect and mitigate hallucinations in a fine-grained, step-level manner. To address the limitations of manually labeling training data, we propose an automated method for generating fine-grained hallucination data using LLMs. By injecting hallucinations into reasoning steps of correct solutions, we create a diverse and balanced synthetic dataset for training FG-PRM, which consists of six specialized Process Reward Models (PRMs), each tailored to detect a specific hallucination type. Our FG-PRM demonstrates superior performance across two key tasks: 1) Fine-grained hallucination detection: classifying hallucination types for each reasoning step; and 2) Verification: ranking multiple LLM-generated outputs to select the most accurate solution, mitigating reasoning hallucinations. Our experiments show that FG-PRM outperforms ChatGPT-3.5 and Claude-3 on fine-grained hallucination detection and substantially boosts the performance of LLMs on GSM8K and MATH benchmarks.

3422Optimal Transport-Based Domain Alignment as a Preprocessing Step for Federated Learning

[openreview] [pdf]

Abstract Federated learning is a subfield of machine learning that avoids sharing local data with a central server, which can enhance privacy and scalability. The inability to consolidate data in a central server leads to a unique problem called dataset imbalance, which can be challenging because learning typically requires iid datasets. In FL, fusing locally-trained models that are trained using non-iid datasets may deteriorate the performance of global model aggregation; this further reduces the quality of updated local models and the accuracy of the distributed agents’ decisions. In this work, we introduce an Optimal Transport-based preprocessing algorithm that aligns the datasets in a privacy-preserving manner, in turn minimizing the distributional discrepancy of data along the edge devices. We accomplish this by leveraging Wasserstein barycenters when computing channel-wise averages. These barycenters are collected in a trusted central server where they collectively generate a target RGB space. By projecting our dataset towards this target space, we minimize the distributional discrepancy on a global level, which facilitates the learning process due to a minimization of variance across the samples in the analyzed network. We demonstrate the capabilities of the proposed approach over the CIFAR-10 dataset, where we show its capability of reaching higher degrees of generalization in fewer communication rounds.

3423Improving Instruction-Following in Language Models through Activation Steering

[openreview] [pdf]

Abstract The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generatio

3424Balancing Gradient Frequencies Facilitates Inductive Inference in Algorithmic Reasoning

[openreview] [pdf]

Abstract Inductive inference, or extrapolation of general rules from finite instances, is understood to be the foundation of human intelligence. Unfortunately, Deep Neural Networks (DNNs) struggle with inductive inference and thus fail to learn even the simplest algorithms in Algorithmic Reasoning (AR). Existing research efforts on AR with DNNs are limited to those on the architectural design for DNNs. In this study, we investigate the influence of optimization techniques on AR performance. Through toy experiments designed to understand an optimizer’s susceptibility to shortcuts in AR, we reveal that Adam, the naive choice of optimization, is easily fooled by spurious correlations. To overcome this shortcoming of Adam, we propose a novel optimizer that avoids spurious correlations by balancing gradients of low- and high-frequencies (BGF). We present extensive experiments and analyses to demonstrate the broad and multifaceted advantages of BGF across various architectures and AR tasks. In particular, BGF expands the AR capability of all explored DNN models and even shows the potential to enable learning of tasks that they previously failed at. The observed success of BGF in climbing the Chomsky hierarchy underscores the importance of optimization for developing advanced artificial intelligence with DNNs.

3425Nash-GBML: Nash Gradient-Based Meta-Learning

[openreview] [pdf]

Abstract Meta-learning has been proposed to address fast adaptation to unseen tasks with little data. Traditional meta-learning is modeled as the Single-Leader Multi-Follower game consisting of inner and outer-level problems to minimize average or worst-case task loss. Because they assume all sampled tasks are independent, it reduces the flexibility of modeling complex interaction among tasks. Thus, we formulate meta-learning as a Single-Leader Multi-Follower game by considering the interaction among tasks at the inner level. We propose the Nash-GBML incorporating a penalty term into the task loss function to model the interaction among task-specific parameters. We discuss the iteration complexity and convergence of the Nash-GBML algorithm. To validate our Nash-GBML algorithm, we introduce two penalty terms, which are designed to reduce the average and worst-case task loss. We empirically show that the Nash-GBML with the proposed penalty terms outperforms traditional GBML for supervised learning experiments.

3426Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback

[openreview] [pdf]

Abstract In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagreement or Monte Carlo dropout to quantify when black-box IL policies are uncertain; however, these approaches can lead to overconfident estimates when faced with deployment-time distribution shifts. Instead, we contend that we need uncertainty quantification algorithms that can leverage the expert human feedback received during deployment time to adapt the robot’s uncertainty online. To tackle this, we draw upon online conformal prediction, a distribution-free method for constructing prediction intervals online given a stream of ground-truth labels. Human labels, however, are intermittent in the interactive IL setting. Thus, from the conformal prediction side, we introduce a novel uncertainty quantification algorithm called intermittent quantile tracking (IQT) that leverages a probabilistic model of intermittent labels, maintains asymptotic coverage guarantees, and empirically achieves desired coverage levels. From the interactive IL side, we develop ConformalDAgger, a new approach wherein the robot uses prediction intervals calibrated by IQT as a reliable measure of deployment-time uncertainty to actively query for more expert feedback. We compare ConformalDAgger to prior uncertainty-aware DAgger methods in scenarios where the distribution shift is (and isn’t) present because of changes in the expert’s policy. We find that in simulated and hardware deployments on a 7DOF robotic manipulator, ConformalDAgger detects high uncertainty when the expert shifts and increases the number of interventions compared to baselines, allowing the robot to more quickly learn the new behavior.

3427Measuring and Enhancing Trustworthiness of LLMs in RAG through Grounded Attributions and Learning to Refuse

[openreview] [pdf]

Abstract LLMs are an integral component of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the overall quality of end-to-end RAG systems, there is a gap in understanding the appropriateness of LLMs for the RAG task. To address this, we introduce Trust-Score, a holistic metric that evaluates the trustworthiness of LLMs within the RAG framework. Our results show that various prompting methods, such as in-context learning, fail to effectively adapt LLMs to the RAG task as measured by Trust-Score. Consequently, we propose Trust-Align, a method to align LLMs for improved Trust-Score performance. The LLaMA-3 family, aligned using our method, significantly outperforms open-source LLMs of similar sizes on ASQA (up 14.0), QAMPARI (up 28.9), and ELI5 (up 13.7). We also demonstrate the effectiveness of Trust-Align across different open-weight models, including the LLaMA series (1b to 8b), Qwen-2.5 series (0.5b to 7b), and Phi3.5 (3.8b). We release our code at \url{https://anonymous.4open.science/r/trust-align}.

3428TapWeight: Reweighting Pretraining Objectives for Task-Adaptive Pretraining

[openreview] [pdf]

Abstract Large-scale general domain pretraining followed by downstream-specific finetuning has become a predominant paradigm in machine learning. However, discrepancies between the pretraining and target domains can still lead to performance degradation in certain cases, underscoring the need for task-adaptive continued pretraining (TAP). TAP methods typically involve continued pretraining on task-specific unlabeled datasets or introducing additional unsupervised learning objectives to enhance model capabilities. While many TAP methods perform continued pretraining with multiple pretraining objectives, they often determine the tradeoff parameters between objectives manually, resulting in suboptimal outcomes and higher computational costs. In this paper, we propose TapWeight, a task-adaptive pretraining framework which automatically determines the optimal importance of each pretraining objective based on downstream feedback. TapWeight reweights each pretraining objective by solving a multi-level optimization problem. We applied TapWeight to both molecular property prediction and natural language understanding tasks, significantly surpassing baseline methods. Experimental results validate the effectiveness and generalizability of TapWeight. Our code is publicly available athttps://anonymous.4open.science/r/TapWeight-9A2E.

3429Gradient-based inference of abstract task representations for generalization in neural networks

[openreview] [pdf]

Abstract Humans and many animals show remarkably adaptive behavior and can respond differently to the same input depending on their internal goals. The brain not only represents the intermediate abstractions needed to perform a computation but also actively maintains a representation of the computation itself (task abstraction). Such separation of the computation and its abstraction is associated with faster learning, flexible decision-making, and broad generalization capacity. We investigate if such benefits might extend to neural networks trained with task abstractions. For such benefits to emerge, one needs a task inference mechanism that possesses two crucial abilities: First, the ability to infer abstract task representations when no longer explicitly provided (task inference), and second, manipulate task representations to adapt to novel problems (task recomposition). To tackle this, we cast task inference as an optimization problem from a variational inference perspective and ground our approach in an expectation-maximization framework. We show that gradients backpropagated through a neural network to a task representation layer are an efficient heuristic to infer current task demands, a process we refer to as gradient-based inference (GBI). Further iterative optimization of the task representation layer allows for recomposing abstractions to adapt to novel situations. Using a toy example, a novel image classifier, and a language model, we demonstrate that GBI provides higher learning efficiency and generalization to novel tasks and limits forgetting. Moreover, we show that GBI has unique advantages such as preserving information for uncertainty estimation and detecting out-of-distribution samples.

3430ROSARL: Reward-Only Safe Reinforcement Learning

[openreview] [pdf]

Abstract An important problem in reinforcement learning is designing agents that learn to solve tasks safely in an environment. A common solution is to define either a penalty in the reward function or a cost to be minimised when reaching unsafe states. However, designing reward or cost functions is non-trivial and can increase with the complexity of the problem. To address this, we investigate the concept of a Minmax penalty, the smallest penalty for unsafe states that leads to safe optimal policies, regardless of task rewards. We derive an upper and lower bound on this penalty by considering both environment diameter and solvability. Additionally, we propose a simple algorithm for agents to estimate this penalty while learning task policies. Our experiments demonstrate the effectiveness of this approach in enabling agents to learn safe policies in high-dimensional continuous control environments.

3431Cross-Domain Reinforcement Learning Under Distinct State-Action Spaces Via Hybrid Q Functions

[openreview] [pdf]

Abstract Cross-domain reinforcement learning (CDRL) is meant to improve the data efficiency of RL by leveraging the data samples collected from a source domain to facilitate the learning in a similar target domain. Despite its potential, cross-domain transfer in RL is known to have two fundamental and intertwined challenges: (i) The source and target domains can have distinct state space or action space, and this makes direct transfer infeasible and thereby requires more sophisticated interdomain mappings; (ii) The domain similarity in RL is not easily identifiable a priori, and hence CDRL can be prone to negative transfer. In this paper, we propose to jointly tackle these two challenges through the lens of hybrid Q functions. Specifically, we propose QAvatar, which combines the Q functions from both the source and target domains with a proper weight decay function. Through this design, we characterize the convergence behavior of QAvatar and thereby show that QAvatar achieves reliable transfer in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through extensive experiments, we demonstrate that QAvatar achieves superior transferability across domains on a variety of RL benchmark tasks, such as locomotion and robot arm manipulation, even in the scenarios of potential negative transfer.

3432pMixFed: Mixing up model coefficients for Efficient Personalized Federated Learning

[openreview] [pdf]

Abstract Federated Learning enables decentralized collaborative learning of machine learning models which presents challenges such as data privacy and client drift for heterogeneous data. Traditional FL methods offer strong generalization but lack personalized solutions for non-IID data. Personalized federated learning (PFL) addresses data heterogeneity by tackling these issues through balancing generalization and personalization level. It, however, still faces challenges such as optimal model partitioning and catastrophic forgetting that reduce quality and accuracy of both local and global models. To address these challenges, we propose ``pMixFed’', a dynamic, layer-wise PFL approach integrating mixup between shared global and personalized local models. We develop adaptive partitioning between shared and personalized layers of the model, gradual transition of personalization to allow seamless adaptation of local clients, improved generalization across clients, and mitigation of catastrophic forgetting. We provide theoretical analysis of pMixFed. Further, we conduct extensive experiments to demonstrate its superior performance compared with the existing PFL methods. Empirical results hows faster training, increased robustness, and improved handling of heterogeneity when using pMixFed as compared with the state-of-the-art PFL models.

3433Data-Driven Creativity: Amplifying Imagination in LLM Writing

[openreview] [pdf]

Abstract During the alignment training of Large Language Models (LLMs), Reinforcement Learning from Human Feedback (RLHF) has proven to be effective in enhancing the model’s alignment with human preferences. The RLHF approach requires human annotators to provide data representative of human preferences, aiding the model in advancing towards human-preferred outcomes. In this process, high-quality human preference data is both crucial and challenging to obtain. While many tasks, such as coding and mathematics, can now be more efficiently annotated through Artificial Intelligence Feedback (AIF), numerous tasks still necessitate human input to provide human preference signals. Particularly creative tasks are typical tasks that involving complex human preference. Here, we focus on creative writing tasks and investigate how to collaborate with annotators to acquire high-quality, superior data. We propose an expert-assisted data generation process, named Expert-Objective-Personal-Subjective (EOPS), that can efficiently obtain high-quality ordinal data with minimal human resources. We conduct experiments on three kinds of tasks, and experimental results validat the effectiveness of our method.

3434Towards Syn-to-Real IQA: A Novel Perspective on Reshaping Synthetic Data Distributions

[openreview] [pdf]

Abstract Blind image quality assessment (BIQA) has advanced significantly through deep learning, but the scarcity of large-scale labeled datasets remains a challenge. While synthetic data offers a promising solution, models trained on existing synthetic datasets often show limited generalization ability. In this work, we make a key observation that representations learned from synthetic datasets often exhibit a discrete and clustered pattern that hinders regression performance: features of high-quality images cluster around reference images, while those of low-quality images cluster based on distortion types. Our analysis reveals that this issue stems from the distribution of synthetic data rather than model architecture. Consequently, we introduce a novel framework SynDR-IQA, which reshapes synthetic data distribution to enhance BIQA generalization. Based on theoretical derivations of sample diversity and redundancy’s impact on generalization error, SynDR-IQA employs two strategies: distribution-aware diverse content upsampling, which enhances visual diversity while preserving content distribution, and density-aware redundant cluster downsampling, which balances samples by reducing the density of densely clustered areas. Extensive experiments across three cross-dataset settings (synthetic-to-authentic, synthetic-to-algorithmic, and synthetic-to-synthetic) demonstrate the effectiveness of our method. Additionally, as a data-based approach, SynDR-IQA can be coupled with model-based methods without increasing inference costs.

3435Unlocking Structured Thinking in Language Models with Cognitive Prompting

[openreview] [pdf]

Abstract We propose cognitive prompting as a novel approach to guide problem-solving in large language models (LLMs) through structured, human-like cognitive operations such as goal clarification, decomposition, filtering, abstraction, and pattern recognition. By employing systematic, step-by-step reasoning, cognitive prompting enables LLMs to efficiently tackle complex, multi-step tasks. We evaluate the effectiveness of cognitive prompting on Meta’s LLaMA models, comparing performance on arithmetic reasoning tasks using the GSM8K dataset and on commonsense reasoning benchmarks. Our analysis includes comparisons between models without cognitive prompting, models with a static sequence of cognitive operations, and models using reflective cognitive prompting, where the LLM dynamically self-selects the sequence of cognitive operations. The results show that cognitive prompting, particularly when dynamically adapted, significantly improves the performance of larger models, such as LLaMA3.1 70B, and enhances their ability to handle multi-step reasoning tasks. This approach also improves interpretability and flexibility, highlighting cognitive prompting as a promising strategy for general-purpose AI reasoning.

3436Adaptive Inference-Time Compute: LLMs Can Predict if They Can Do Better, Even Mid-Generation

[openreview] [pdf]

Abstract Inference-time computation is a powerful paradigm to enhance the performance of large language models (LLMs), with Best-of-N sampling being a widely used technique. However, this method is computationally expensive, requiring both (1) an external reward model and (2) the generation of multiple samples. In this work, we introduce a new generative self-evaluation scheme designed to adaptively reduce the number of generated samples while maintaining or even improving performance. We use a generative reward model formulation, allowing the LLM to predict mid-generation the probability that restarting the generation will yield a better response. These predictions are obtained without an external reward model and can be used to decide whether or not to generate more samples, prune unpromising samples early on, or to pick the best sample. This capability is very inexpensive as it involves generating a single predefined token. Trained using a dataset constructed with real unfiltered LMSYS user prompts, Llama 3.1 8B’s win rate against GPT-4 on AlpacaEval increases from 21% to 34% with 16 samples and math performance on GSM8K improves from 84% to 91%. By sampling only when the LLM determines that it is beneficial to do so and adaptively adjusting temperature annealing, we demonstrate that 74% of the improvement from using 16 samples can be achieved with only 1.2 samples on average. We further demonstrate that 50–75% of samples can be pruned early in generation with minimal degradation in performance. Overall, our methods enable more efficient and scalable compute utilization during inference for LLMs.

3437Balancing Interpretability and Accuracy: Energy-Ensemble Concept Bottleneck Models for Enhanced Concept Inference

[openreview] [pdf]

Abstract Concept bottleneck models (CBM) have emerged as a promising solution to address the lack of interpretability in deep learning models. However, recent researches on CBM prioritize task accuracy at the expense of interpretability, weakening their ability to accurately infer key concepts. This work addresses this trade-off by introducing the energy ensemble CBM (EE-CBM). The EE-CBM leverages an energy-based concept encoder to effectively extract concepts, overcoming the information bottleneck common in conventional CBMs. Additionally, a novel energy ensemble gate within the EE-CBM architecture efficiently combines energy and concept probability to further address this bottleneck. Moreover, the EE-CBM employs the maximum mean discrepancy loss to enhance concept discrimination within the concept space and facilitate accurate concept inference. An experimental evaluation on benchmark datasets (CUB-200-2011, TravelingBirds, AwA2, CheXpert, and CelebA) demonstrates that EE-CBM achieve state-of-the-art performance in both concept accuracy and interpretability. This work positions the EE-CBM as a significant advancement in CBM researches, enabling them to effectively balance performance and interpretability for improved model transparency. Our code is available athttps://anonymous.4open.science/r/EE-CBM-F48D.

3438Riemannian denoising diffusion probabilistic models

[openreview] [pdf]

Abstract We propose Riemannian Denoising Diffusion Probabilistic Models (RDDPMs) for learning distributions on submanifolds of Euclidean space that are level sets of functions, including most of the manifolds interested in applications. Existing methods for generative modeling on manifolds rely on substantial geometric information such as geodesic curves or eigenfunctions of the Laplacian-Beltrami operator and, as a result, they are limited to manifolds where such information is available. In contrast, our method, built on a projection scheme, can be applied to more general manifolds, as it only requires being able to evaluate the value and the first order derivatives of the function that defines the submanifold. We provide a theoretical analysis of our method in the continuous-time limit, which elucidates the connection between our RDDPMs and score-based generative models on manifolds. The capability of our method is demonstrated on datasets from previous studies and on new datasets sampled from two high-dimensional manifolds, i.e. SO(10)\mathrm{SO}(10) and configuration space of the molecular system alanine dipeptide with fixed dihedral angle.

3439GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

[openreview] [pdf]

Abstract Existing retrieval-based reasoning approaches for large language models (LLMs) heavily rely on the density and quality of the non-parametric knowledge source to provide domain knowledge and explicit reasoning chain. However, inclusive knowledge sources are expensive and sometimes infeasible to build for scientific or corner domains. To tackle the challenges, we introduce Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning framework that integrates the parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning processes on very sparse knowledge graphs. By leveraging the external structured knowledge to inspire LLM to model the interconnections among relevant concepts, our method facilitates a more logical and step-wise reasoning approach akin to experts’ problem-solving, rather than gold answer retrieval. Specifically, the framework prompts LLMs to decompose the query into crucial concepts and attributes, construct entity groups with relevant entities, and build an augmented reasoning chain by probing potential relationships among node pairs across these entity groups. Our method incorporates both factual and extrapolated linkages to enable comprehensive understanding and response generation. Extensive experiments on reasoning-intense benchmarks on biomedical and commonsense QA demonstrate the effectiveness of our proposed method. Specifically, GIVE enables GPT3.5-turbo to outperform advanced models like GPT4 without any additional training cost, thereby underscoring the efficacy of integrating structured information and internal reasoning ability of LLMs for tackling specialized tasks with limited external resources.

3440Adversarial Training for Defense Against Label Poisoning Attacks

[openreview] [pdf]

Abstract As machine learning models advance in complexity and increasingly depend on large volumes of publicly sourced data, such as the human-annotated labels used in training large language models, they become more vulnerable to label poisoning attacks. These attacks, in which adversaries subtly alter the labels within a training dataset, can severely degrade model performance, posing significant risks in critical applications. In this paper, we propose Floral\textbf{Floral}, an adversarial training defense strategy based on support vector machines (SVMs) to counter label poisoning attacks. Utilizing a bilevel optimization framework, we cast the adversarial training process as a non-zero-sum Stackelberg game between an attacker\textit{attacker}, who strategically poisons critical training labels, and the model\textit{model}, which seeks to recover from such attacks. Our approach introduces a projected gradient descent algorithm with kernel SVMs for adversarial training. We provide a theoretical analysis of our algorithm’s convergence properties and empirically evaluate its effectiveness across diverse classification tasks including sentiment analysis on the IMDB dataset. Compared to baseline robust models and robust foundation models such as RoBERTa, our method consistently achieves higher robust accuracy as the attacker’s budget increases. These results underscore the potential of Floral\textbf{Floral} to enhance the resilience of machine learning models against label poisoning threats, thereby ensuring robust classification in adversarial environments.

3441JudgeRank: Leveraging Large Language Models for Reasoning-Intensive Reranking

[openreview] [pdf]

Abstract Accurate document retrieval is crucial for the success of retrieval-augmented generation (RAG) applications, including open-domain question answering and code completion. While large language models (LLMs) have been employed as dense encoders or listwise rerankers in RAG systems, they often struggle with reasoning-intensive tasks because they lack nuanced analysis when judging document relevance. To address this limitation, we introduce JudgeRank, a novel agentic reranker that emulates human cognitive processes when assessing document relevance. Our approach consists of three key steps: (1) query analysis to identify the core problem, (2) document analysis to extract a query-aware summary, and (3) relevance judgment to provide a concise assessment of document relevance. We evaluate JudgeRank on the reasoning-intensive BRIGHT benchmark, demonstrating substantial performance improvements over first-stage retrieval methods and outperforming other popular reranking approaches. In addition, JudgeRank performs on par with fine-tuned state-of-the-art rerankers on the popular BEIR benchmark, validating its zero-shot generalization capability. Through comprehensive ablation studies, we demonstrate that JudgeRank’s performance generalizes well across LLMs of various sizes while ensembling them yields even more accurate reranking than individual models.

3442TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters

[openreview] [pdf]

Abstract Transformers have become the predominant architecture in foundation models due to their excellent performance across various domains. However, the substantial cost of scaling these models remains a significant concern. This problem arises primarily from their dependence on a fixed number of parameters within linear projections. When architectural modifications (e.g., channel dimensions) are introduced, the entire model typically requires retraining from scratch. As model sizes continue growing, this strategy results in increasingly high computational costs and becomes unsustainable. To overcome this problem, we introduce Tokenformer, a natively scalable architecture that leverages the attention mechanism not only for computations among input tokens but also for interactions between tokens and model parameters, thereby enhancing architectural flexibility. By treating model parameters as tokens, we replace all the linear projections in Transformers with our token-parameter attention layer, where input tokens act as queries and model parameters as keys and values. This reformulation allows for progressive and efficient scaling without necessitating retraining from scratch. Our model scales from 124M to 1.4B parameters by incrementally adding new key-value parameter pairs, achieving performance comparable to Transformers trained from scratch while greatly reducing training costs. Code will be available.

3443Operator Deep Smoothing for Implied Volatility

[openreview] [pdf]

Abstract We devise a novel method for nowcasting implied volatility based on neural operators. Better known as implied volatility smoothing in the financial industry, nowcasting of implied volatility means constructing a smooth surface that is consistent with the prices presently observed on a given option market. Option price data arises highly dynamically in ever-changing spatial configurations, which poses a major limitation to foundational machine learning approaches using classical neural networks. While large models in language and image processing deliver breakthrough results on vast corpora of raw data, in financial engineering the generalization from big historical datasets has been hindered by the need for considerable data pre-processing. In particular, implied volatility smoothing has remained an instance-by-instance, hands-on process both for neural network-based and traditional parametric strategies. Our generaloperator deep smoothingapproach, instead, directly maps observed data to smoothed surfaces. We adapt the graph neural operator architecture to do so with high accuracy on ten years of raw intraday S&P 500 options data, using a single model instance. The trained operator adheres to critical no-arbitrage constraints and is robust with respect to subsampling of inputs (occurring in practice in the context of outlier removal). We provide extensive historical benchmarks and showcase the generalization capability of our approach in a comparison with classical neural networks and SVI, an industry standard parametrization for implied volatility. The operator deep smoothing approach thus opens up the use of neural networks on large historical datasets in financial engineering.

3444Neural Probabilistic Logic Learning for Knowledge Graph Reasoning

[openreview] [pdf]

Abstract Knowledge graph (KG) reasoning is a task that aims to predict unknown facts based on known factual samples. Reasoning methods can be divided into two categories: rule-based methods and KG-embedding based methods. The former possesses precise reasoning capabilities but finds it challenging to reason efficiently over large-scale knowledge graphs. While gaining the ability to reason over large-scale knowledge graphs, the latter sacrifices reasoning accuracy. This paper aims to design a reasoning framework called Neural Probabilistic Logic Learning(NPLL) that achieves accurate reasoning on knowledge graphs. Our approach introduces a scoring module that effectively enhances the expressive power of embedding networks. We strike a balance between model simplicity and reasoning capabilities by incorporating a Markov Logic Network based on variational inference. We empirically evaluate our approach on several benchmark datasets, and the experimental results validate that our method substantially enhances the accuracy and quality of the reasoning results.

3445Compositional Generative Inference Using Diffusion-based Optimization

[openreview] [pdf]

Abstract Compositional generative tasks, despite being important and having potential applications, have not been thoroughly addressed due to the unclear formulation and the challenges associated with selecting composition strategies. In this paper, we propose a probabilistic graphical approach to tackle the problem of compositional generative tasks and alleviate these challenges. Our approach formulates the problem as a Bayesian inference problem using a representative bipartite Bayesian network. In this network, one set of random variables represents the generation targets, while the other set represents observable variables with explicit or implicit distribution information. To solve this problem, we employ variational inference on the marginal distribution of observable variables. We approximate this distribution using diffusion models. We view the diffusion models as approximate Markov Chain Monte Carlo (MCMC) samplers for the marginals. Based on this perspective, we introduce a novel MCMC-based inference algorithm that incorporates per-step optimization using aggregated objectives from the diffusion models. We demonstrate the generality of our method and conduct experiments to validate its applicability to various compositional generation tasks.

3446Benchmarking Agentic Workflow Generation

[openreview] [pdf]

Abstract Large Language Models (LLMs), with their remarkable task-handling capabilities, have catalyzed significant achievements in tackling reasoning and planning tasks, wherein decomposing complex problems into executable workflows is a crucial step in this process. Existing workflow evaluation frameworks either focus solely on holistic performance or suffer from limitations such as restricted scenario coverage, simplistic workflow structures, and lax evaluation standards. To this end, we introduce WorFBench, a unified workflow generation benchmark with multi-faceted scenarios and intricate graph workflow structures. Additionally, we present WorfEval, a systemic evaluation protocol utilizing subsequence and subgraph matching algorithms to accurately quantify the LLM agent’s workflow generation capabilities. Through comprehensive evaluations across different types of LLMs, we discover distinct gaps between the sequence planning capabilities and graph planning capabilities of LLM agents, with even GPT-4 exhibiting a gap of around 15%. We also train two open-source models and evaluate their generalization abilities on held-out tasks. Furthermore, we observe that the generated workflows can enhance downstream tasks, enabling them to achieve superior performance with less time during inference.

3447Effective post-training embedding compression via temperature control in contrastive training

[openreview] [pdf]

Abstract Fixed-size learned representations (dense representations, or embeddings) are widely used in many machine learning applications across language, vision or speech modalities. This paper investigates the role of the temperature parameter in contrastive training for text embeddings. We shed light on the impact this parameter has on the intrinsic dimensionality of the embedding spaces obtained, and show that lower intrinsic dimensionality is further correlated with effective compression of embeddings. We still observe a trade-off between absolute performance and effective compression and we propose temperature aggregation methods which reduce embedding size by an order of magnitude with minimal impact on quality.

3448Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis

[openreview] [pdf]

Abstract In the realm of microservices architecture, the occurrence of frequent incidents necessitates the employment of Root Cause Analysis (RCA) for swift issue resolution. It is common that a serious incident can take several domain experts hours to identify the root cause. Consequently, a contemporary trend involves harnessing Large Language Models (LLMs) as automated agents for RCA. Though the recent ReAct framework aligns well with the Site Reliability Engineers (SREs) for its thought-action-observation paradigm, its hallucinations often lead to irrelevant actions and directly affect subsequent results. Additionally, the complex and variable clues of the incident can overwhelm the model one step further. To confront these challenges, we propose Flow-of-Action, a pioneering Standard Operation Procedure (SOP) enhanced LLM-based multi-agent system. By explicitly summarizing the diagnosis steps of SREs, SOP imposes constraints on LLMs at crucial junctures, guiding the RCA process towards the correct trajectory. To facilitate the rational and effective utilization of SOPs, we design an SOP-centric framework called SOP flow. SOP flow contains a series of tools, including one for finding relevant SOPs for incidents, another for automatically generating SOPs for incidents without relevant ones, and a tool for converting SOPs into code. This significantly alleviates the hallucination issues of ReAct in RCA tasks. We also design multiple auxiliary agents to assist the main agent by removing useless noise, narrowing the search space, and informing the main agent whether the RCA procedure can stop. Compared to the ReAct method’s 35.50% accuracy, our Flow-of-Action method achieves 64.01%, meeting the accuracy requirements for RCA in real-world systems.

3449A Provably Robust Algorithm for Differentially Private Clustered Federated Learning

[openreview] [pdf]

Abstract Federated Learning (FL), which is a decentralized machine learning (ML) approach, often incorporates differential privacy (DP) to enhance privacy guarantees for clients’ data. However, differentially private federated learning (DPFL) introduces performance disparities across clients, particularly affecting minority groups. Some recent works have attempted to address performance fairness under data heterogeneity in vanilla FL settings through clustering clients, but these existing methods remain sensitive and prone to errors, which are further exacerbated by the DP noise in DPFL. This shortcoming makes the existing methods inappropriate for DPFL settings. To fill this gap, we propose a robust algorithm for differentially private clustered FL, designed to be robust to the DP noise in the system and to identify clients’ clusters correctly. To this end, we propose to cluster clients on the server based on both their model updates and training loss values. Furthermore, when clustering clients’ model updates, our proposed approach addresses the server’s uncertainties by employing large batch sizes along with Gaussian Mixture Model (GMM) to reduce the impact of DP and stochastic noise and avoid potential clustering errors. This idea is efficient especially in privacy-sensitive scenarios, where more DP noise is used. We provide theoretical analysis justifying our proposed approach, and evaluate it extensively across diverse data distributions and privacy budgets. Our experimental results show its effectiveness in mitigating the disparate impact of DP in FL settings with a small computational cost, while enjoying DP privacy guarantees.

3450Learning with Preserving for Continual Multitask Learning

[openreview] [pdf]

Abstract Artificial Intelligence (AI) is driving advancements in many fields, enabling previously unattainable capabilities. Intelligent systems are increasingly incorporating more detailed tasks, such as enhancing tumor classification with tissue recognition or expanding driving assistance with lane detection. Typically, new tasks are integrated by training single-task models or re-training multi-task models, which proves impractical when previous data is unavailable or new data is limited. This paper introduces a novel problem category—continual multitask learning (CMTL)—crucial for future intelligent systems and largely overlooked in current research. CMTL presents unique challenges that traditional Continual Learning (CL) or Multitask Learning (MTL) methods are unable to address effectively. Therefore, we introduce Learning with Preserving (LwP), a novel approach for CMTL that maintains previously learned knowledge in a way that remains beneficial across diverse tasks. LwP employs a Dynamically Weighted Distance Preservation (DWDP) loss function to uphold the integrity of the representation space, ensuring it is conducive to learning both prior and future tasks without relying on a replay buffer. We extensively evaluate LwP on three benchmark datasets across two modalities—IMU sensing data for exercise quality assessment and image datasets. Results show that LwP outperforms existing continual learning baselines and effectively mitigates catastrophic forgetting, highlighting its robustness and generalizability in CMTL scenarios.

3451Incremental Exploits: Efficient Jailbreaks on Large Language Models with Multi-round Conversational Jailbreaking

[openreview] [pdf]

Abstract As large language models (LLMs) become widely deployed across various domains, security concerns---particularly jailbreak attacks that bypass built-in safety mechanisms---have emerged as significant risks. Existing jailbreak methods focus mainly on single-turn interactions and face limitations in generalizability and practicality. In this paper, we propose a novel method called Multi-round Conversational Jailbreaking (MRCJ), which exploits the unintended competition between a LLMs’ safety alignment and its in-context learning objectives during extended conversations. By incrementally introducing increasingly malicious content, the LLMs’ tendency to maintain contextual consistency can override its safety protocols, ultimately leading to harmful outputs. To facilitate conversation flow generation, we developed a dataset containing 12,000 questions, categorized into six types of security topics, and classified across four levels of severity, spanning ten languages. Compared to existing methods, MRCJ demonstrates superior efficiency, applicability, and effectiveness by fully exploiting the potential of multi-round conversations. In experiments, MRCJ achieves a jailbreak success rate of over 90% across widely-used LLMs, requiring fewer than five queries on average, and significantly outperforms baselines on both metrics. Our findings expose vulnerabilities in current LLMs during extended conversations and highlight the need for improved safety mechanisms that consider multi-round interactions. The source code and dataset are available at (URL omitted for double-blind reviewing; code available in supplementary materials).

3452Janus: Dual-server Multi-Round Secure Aggregation with Verifiability for Federated Learning

[openreview] [pdf]

Abstract Secure Aggregation (SA) in federated learning is essential for preserving user privacy by ensuring that model updates are masked or encrypted and remain inaccessible to servers. Although the advanced protocol Flamingo (S&P’23) has made significant strides with its multi-round aggregation and optimized communication, it still faces several critical challenges: (i) Dynamic User Participation\textit{Dynamic User Participation}, where Flamingo struggles with scalability due to the complex setups required when users join or leave the training process; (ii) Model Inconsistency Attacks\textit{Model Inconsistency Attacks} (MIA), where a malicious server could infer sensitive data, which poses severe privacy risks; and (iii) Verifiability\textit{Verifiability}, as most schemes lack an efficient mechanism for clients to verify the correctness of server-side aggregation, potentially allowing inaccuracies or malicious actions. We introduce Janus, a generic privacy-enhanced multi-round SA scheme through a dual-server architecture. A new user can participate in training by simply obtaining the servers’ public keys for aggregation, eliminating the need for complex communication graphs. Our dual-server model separates aggregation tasks, which ensures that neither server has access to the final aggregated results, thus effectively preventing MIA. Additionally, we propose a new cryptographic primitive, Separable Homomorphic Commitment\textit{Separable Homomorphic Commitment}, integrated with our dual-server approach to ensure the verifiability of aggregation results. Extensive experiments across various models and datasets show that Janus significantly boosts security while enhancing efficiency. It reduces per-client communication and computation overhead from logarithmic to constant scale compared to state-of-the-art methods, with almost no compromise in model accuracy.

3453Bridging the Gap Beteween SL and TD Learning via Q-conditioned maximization

[openreview] [pdf]

Abstract Recent research highlights the efficacy of supervised learning (SL) as a methodology within reinforcement learning (RL), yielding commendable results. Nonetheless, investigations reveal that SL-based methods lack the stitching capability typically associated with RL approaches such as TD learning, which facilitate the resolution of tasks by stitching diverse trajectory segments. This prompts the question: How can SL methods be endowed with stitching property and bridge the gap with TD learning? This paper addresses this challenge by exploring the maximization of the objective in the goal-conditioned RL. We introduce the concept of Q-conditioned maximization supervised learning, grounded in the assertion that the goal-conditioned RL objective is equivalent to the Q-function, thus embedding Q-function maximization into traditional SL-based methodologies. Building upon this premise, we propose Goal-Conditioned Reinforced Supervised Learning (GCReinSL), which enhances SL-based approaches by incorporating maximize Q-function. GCReinSL emphasizes the maximization of the Q-function during the training phase to estimate the maximum expected return within the distribution, subsequently guiding optimal action selection during the inference process. We demonstrate that GCReinSL enables SL methods to exhibit stitching property, effectively equivalent to applying goal data augmentation to SL methods. Experimental results on offline datasets designed to evaluate stitching capability show that our approach not only effectively selects appropriate goals across diverse trajectories but also outperforms previous works that applied goal data augmentation to SL methods.

3454Perlin Noise for Exploration in Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement Learning (RL) enables agents to solve tasks by autonomously acquiring policies by interacting with the environment receiving sparse or noisy feedback in the form of a reward. However, achieving successful optimization in RL requires efficient exploration, which remains a significant challenge, particularly in continuous action spaces. Existing exploration techniques often exhibit limited state-space reach and fail to overcome local optima, resulting in suboptimal policies. Additionally, these techniques can cause erratic movements, posing risks when applied to real-world robots. In this work, we introduce a novel exploration strategy leveraging Perlin Noise, a gradient noise function that generates smooth, continuous disturbances, thus enhancing the agent’s performance by promoting structured exploration and fluid motions. We quantitatively demonstrate the benefits of our approach compared to state-of-the-art methods, showing that it outperforms both unstructured and structured techniques in thorough experimental evaluations.

3455BEEM: Boosting Performance of Early Exit DNNs using Multi-Exit Classifiers as Experts

[openreview] [pdf]

Abstract Early Exit (EE) techniques have emerged as a means to reduce inference latency in Deep Neural Networks (DNNs). The latency improvement and accuracy in these techniques crucially depend on the criteria used to make exit decisions. We propose a new decision criterion BEEM where exit classifiers are treated as experts and aggregate their confidence scores. The confidence scores are aggregated only if neighbouring experts are consistent in prediction as the samples pass through them, thus capturing their ensemble effect. A sample exits when the aggregated confidence value exceeds a threshold. The threshold is set using the error rates of the intermediate exits aiming to surpass the performance of conventional DNN inference. Experimental results on the COCO dataset for Image captioning and GLUE datasets for various language tasks demonstrate that our method enhances the performance of state-of-the-art EE methods, achieving improvements in speed-up by a factor 1.5×1.5\times to 2.1×2.1\times. When compared to the final layer, its accuracy is comparable in harder Image Captioning and improves in the easier language tasks. The source code is available at \url{https://anonymous.4open.science/r/BEEM1-639C/README.md}

3456Learning from Imperfect Human Feedback: A Tale from Corruption-Robust Dueling

[openreview] [pdf]

Abstract This paper studies Learning from Imperfect Human Feedback (LIHF), addressing the potential irrationality or imperfect perception when learning from comparative human feedback. Building on evidences that human’s imperfection decays over time (i.e., humans learn to improve), we cast this problem as a concave-utility continuous-action dueling bandit but under a restricted form of corruption: i.e., the corruption scale is decaying over time as tρ1t^{\rho-1} for some ``imperfection rate’’ ρ[0,1]\rho \in [0, 1].With TT as the total number of iterations, we establish a regret lower bound of Ω(maxT,Tρ) \Omega(\max{\sqrt{T}, T^{\rho}}) for LIHF, even when ρ\rho is known. For the same setting, we develop the Robustified Stochastic Mirror Descent for Imperfect Dueling (RoSMID) algorithm, which achieves nearly optimal regret O~(maxT,Tρ)\tilde{\mathcal{O}}(\max{\sqrt{T}, T^{\rho}}). Core to our analysis is a novel framework for analyzing gradient-based algorithms for dueling bandit under corruption, and we demonstrate its general applicability by showing how this framework can be easily applied to obtain corruption-robust guarantees for other popular gradient-based dueling bandit algorithms. Our theoretical results are validated by extensive experiments.

3457Learning Monotonic Attention in Transducer for Streaming Generation

[openreview] [pdf]

Abstract Streaming generation models are increasingly utilized across various fields, with the Transducer architecture being particularly popular in industrial applications. However, its input-synchronous decoding mechanism presents challenges in tasks requiring non-monotonic alignments, such as simultaneous translation, leading to suboptimal performance in these contexts. In this research, we address this issue by tightly integrating Transducer’s decoding with the history of input stream via a learnable monotonic attention mechanism. Our approach leverages the forward-backward algorithm to infer the posterior probability of alignments between the predictor states and input timestamps, which is then used to estimate the context representations of monotonic attention in training. This allows Transducer models to adaptively adjust the scope of attention based on their predictions, avoiding the need to enumerate the exponentially large alignment space. Extensive experiments demonstrate that our MonoAttn-Transducer significantly enhances the handling of non-monotonic alignments in streaming generation, offering a robust solution for Transducer-based frameworks to tackle more complex streaming generation tasks. Codes are publicly available in supplementary materials.

3458Addressing Data Heterogeneity In Federated Learning With Adaptive Normalization-Free Feature Recalibration

[openreview] [pdf]

Abstract Federated learning is a decentralized collaborative training paradigm that preserves stakeholders’ data ownership while improving performance and generalization. However, statistical heterogeneity among client datasets poses a fundamental challenge by degrading system performance. To address this issue, we propose Adaptive Normalization-free Feature Recalibration (ANFR), an architecture-level approach that combines weight standardization and channel attention. Weight standardization normalizes the weights of layers instead of activations. This is less susceptible to mismatched client statistics and inconsistent averaging, thereby more robust under heterogeneity. Channel attention produces learnable scaling factors for feature maps, suppressing those that are inconsistent between clients due to heterogeneity. We demonstrate that combining these techniques boosts model performance beyond their individual contributions, by enhancing class selectivity and optimizing channel attention weight distribution. ANFR operates independently of the aggregation method and is effective in both global and personalized federated learning settings, with minimal computational overhead. Furthermore, when training with differential privacy, ANFR achieves an appealing balance between privacy and utility, enabling strong privacy guarantees without sacrificing performance. By integrating weight standardization and channel attention in the backbone model, ANFR offers a novel and versatile approach to the challenge of statistical heterogeneity. We demonstrate through extensive experiments that ANFR consistently outperforms established baselines across various aggregation methods, datasets, and heterogeneity conditions.

3459Rethinking Reward Modeling in Preference-based Large Language Model Alignment

[openreview] [pdf]

Abstract The Bradley-Terry (BT) model is a common and successful practice in reward modeling for Large Language Model (LLM) alignment. However, it remains unclearwhythis model --- originally developed for multi-player stochastic game matching --- can be adopted to convert pairwise response comparisons to reward values and make predictions. Especially given the fact that only a limited number of prompt-response pairs are sparsely compared with others. In this paper, we first establish the convergence rate of BT reward models based on deep neural networks using embeddings, providing a theoretical foundation for their use. Despite theoretically sound, we argue that the BT model is not a necessary choice from the perspective of downstream optimization, this is because a reward model only needs to preserve the correct ranking predictions through a monotonic transformation of the true reward. We highlight the critical concept oforder consistencyin reward modeling and demonstrate that the BT model possesses this property. Moreover, we propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers, as an alternative order-consistent reward modeling objective. To offer practical insights, we empirically evaluate the performance of these different reward modeling approaches across more than 12,000 experimental setups, using 6 base LLMs, 2 datasets, and diverse annotation designs that vary in quantity, quality, and pairing choices in preference annotations.

3460BRIDGE: Bootstrapping Text to Guide Time-Series Generation via Multi-Agent Iterative Optimisation and Diffusion Modelling

[openreview] [pdf]

Abstract Time-series Generation (TSG) is an impactful research direction, as generating realistic sequences can be used to create educational materials, in simulations and for counterfactual analysis in decision making. It has further the potential to alleviate the resource bottleneck that arises from a lack of diverse time-series data required to train large time-series foundational models. However, most existing TSG models are typically designed to generate data from a specified domain, which is due to the large divergence in patterns between different real-world TS domains. In this paper, we argue that text can provide semantic information (including cross-domain background knowledge and instance temporal patterns) to improve the generalisation of TSG. To do so, we introduce ``Text Guided Time Series Generation’’ (TG2^2)---the task of generating realistic time series from handful of example time series paired with their textual description. We further present a Self-Refine-based Multi-Agent LLM framework to synthesise a realistic benchmark for TG2^2 and show that the collected text descriptions are both realistic and useful for time-series generation. We develop a first strong baseline for the TG2^2, Bridge, which utilises LLMs and diffusion models to generate time series which encode semantic information as cross-domain condition. Our experimental results demonstrate that Bridge significantly outperforms existing time-series generation baselines on 10 out of 12 datasets, resulting in data distributions that are more closely aligned to target domains. Using the generated data for training positively impacts the performance of time series forecasting models, effectively addressing training data limitations. This work bridges the gap between LLMs and time series analysis, introducing natural language to help the time series generation and its applications.

3461Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

[openreview] [pdf]

Abstract We propose Model Swarms, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, Model Swarms offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that Model Swarms enable the weak-to-strong transition of experts through the collaborative search process.

3462MANTRA: The Manifold Triangulations Assemblage

[openreview] [pdf]

Abstract The rising interest in leveraging higher-order interactions present in complex systems has led to a surge in more expressive models exploiting high-order structures in the data, especially in topological deep learning (TDL), which designs neural networks on high-order domains such as simplicial complexes. However, progress in this field is hindered by the scarcity of datasets for benchmarking these architectures. To address this gap, we introduce MANTRA, the first large-scale, diverse, and intrinsically high-order dataset for benchmarking high-order models, comprising over 43,000 and 249,000 triangulations of surfaces and three-dimensional manifolds, respectively. With MANTRA, we assess several graph- and simplicial complex-based models on three topological classification tasks. We demonstrate that while simplicial complex-based neural networks generally outperform their graph-based counterparts in capturing simple topological invariants, they also struggle, suggesting a rethink of TDL. Thus, MANTRA serves as a benchmark for assessing and advancing topological methods, leading the way for more effective high-order models.

3463CoPS: Empowering LLM Agents with Provable Cross-Task Experience Sharing

[openreview] [pdf]

Abstract Sequential reasoning in agent systems has been significantly advanced by large language models (LLMs), yet existing approaches face limitations. Reflection-driven reasoning relies solely on knowledge in pretrained models, limiting performance in novel scenarios, while experience-assisted reasoning often depends on external experiences and lacks clear principles for selecting representative experiences. We address these limitations by proposing CoPS (Cross-Task Experience Sharing), a generalizable algorithm that enhances sequential reasoning by cross-task experience sharing and selection. In detail, CoPS leverages agents’ experiences on previous tasks, selecting distribution-matched experiences via a provable pessimism-based strategy to maximize utility while minimizing risks from distribution shifts. Extensive experimental results on benchmarks like Alfworld, Webshop, and HotPotQA demonstrate that CoPS consistently outperforms state-of-the-art baselines, with superior sample efficiency suitable for resource-constrained scenarios. Theoretically, we show that the performance of our algorithm depends on both the quality of the pretrained LLM and the matching between the agent’s task-dependent trial distribution and that generated by the LLM. Our work bridges the gap between existing sequential reasoning paradigms and validates the effectiveness of leveraging cross-task experiences, shedding light on the potential to improve agents’ generalization and adaptability across diverse tasks. Our codes are released atthis link.

3464Consistent Symmetry Representation over Latent Factors of Variation

[openreview] [pdf]

Abstract Recent symmetry-based methods on variational autoencoders have advanced disentanglement learning and combinatorial generalization, yet the appropriate symmetry representation for both tasks is under-clarified. We identify that existing methods struggle with maintaining the consistent symmetries\textit{consistent symmetries} when representing identical changes of latent factors of variation, and they cause issues in achieving equivari- ance. We theoretically prove the limitations of three frequently used group settings: matrix multiplication with General Lie Groups, defining group action with set of vectors and vector addition, and cyclic groups modeled through surjective functions. To overcome these issues, we introduce a novel method of conformal mapping\textit{conformal mapping} of latent vectors into a complex number space, ensuring consistent symmetries and cyclic semantics. Through empirical validation with ground truth of factors variation for transparent analysis, this study fills two significant gaps in the literature: 1) the inductive bias to enhance disentanglement learning and combinatorial generalization simultaneously, and 2) well-represented symmetries ensure significantly high disentanglement performance without a trade-off in reconstruction error, compared to current unsupervised methods. Additionally, we introduce less guidance-dependent validation results, extending our findings to more practical use. Our research highlights the significant impact of verifying consistent symmetry and suggests required future research for advancing combinatorial generalization and disentanglement learning.

3465Large Language Models are Interpretable Learners

[openreview] [pdf]

Abstract The trade-off between expressiveness and interpretability remains a core challenge when building human-centric predictive models for classification and decision-making. While symbolic rules offer interpretability, they often lack expressiveness, whereas neural networks excel in performance but are known for being black boxes. In this paper, we show that a combination of large language models (LLMs) and symbolic programs can bridge this gap. In the proposed LLM-based Symbolic Programs (LSPs), the pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. Symbolic programs then integrate these modules into an interpretable decision rule. To train LSPs, we develop a divide-and-conquer approach to incrementally build the program from scratch, where the learning process of each step is guided by LLMs. To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP’s superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods. Moreover, as the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable), and other LLMs, and generalizes well to out-of-distribution samples.

3466Model Comparisons: XNet Outperforms KAN

[openreview] [pdf]

Abstract In the fields of computational mathematics and artificial intelligence, the need for precise data modeling is crucial, especially for predictive machine learning tasks. This paper explores further XNet, a novel algorithm that employs the complex-valued Cauchy integral formula, offering a superior network architecture that surpasses traditional Multi-Layer Perceptrons (MLPs) and Kolmogorov-Arnold Networks (KANs). XNet significant improves speed and accuracy across various tasks in both low and high-dimensional spaces, redefining the scope of data-driven model development and providing substantial improvements over established time series models like LSTMs.

3467Geometrically Constrained Gaussian Splatting SLAM

[openreview] [pdf]

Abstract 3D Gaussian Splatting (3DGS) has emerged as a promising technique in SLAM due to its rapid and high-quality rendering capabilities. However, its reliance on discrete Gaussian ellipsoid primitives limits its effectiveness in capturing essential geometric features crucial for accurate pose estimation. To overcome this limitation, we propose a novel dense RGB-D SLAM system that integrates an implicit Truncated Signed Distance Function (TSDF) hash grid to constrain the distribution of Gaussian ellipsoids. This innovative approach enables precise estimation of the scene’s geometric structure by smoothing the discrete Gaussian ellipsoids and anchoring them to the scene’s surface. Acting as a low-pass filter, the implicit TSDF hash grid mitigates the inductive biases inherent in traditional 3DGS methods while preserving rendering quality. Our geometrically constrained map also significantly enhances generalization capabilities for depth estimation in novel views. Extensive experiments on the Replica, ScanNet, and TUM datasets demonstrate that our system achieves state-of-the-art tracking and mapping accuracy at speeds up to 30 times faster than existing 3DGS-based systems.

3468Evaluating Fairness and Mitigating Bias in Machine Learning: A Novel Technique using Tensor Data and Bayesian Regression

[openreview] [pdf]

Abstract Fairness is a critical component of Trustworthy AI. In this paper, we focus on Machine Learning (ML) and the performance of model predictions when dealing with skin color. Unlike other sensitive attributes, the nature of skin color differs significantly. In computer vision, skin color is represented as tensor data rather than categorical values or single numerical points. However, much of the research on fairness across sensitive groups has focused on categorical features such as gender and race. This paper introduces a new technique for evaluating fairness in ML for image classification tasks, specifically without the use of annotation. To address the limitations of prior work, we handle tensor data, like skin color, without classifying it rigidly. Instead, we convert it into probability distributions and apply statistical distance measures. This novel approach allows us to capture fine-grained nuances in fairness both within and across what would traditionally be considered distinct groups. Additionally, we propose an innovative training method to mitigate the latent biases present in conventional skin tone categorization. This method leverages color distance estimates calculated through Bayesian regression with polynomial functions, ensuring a more nuanced and equitable treatment of skin color in ML models.

3469How transformers learn structured data: insights from hierarchical filtering

[openreview] [pdf]

Abstract We introduce a hierarchical filtering procedure for generative models of sequences on trees, enabling control over the range of positional correlations in the data. Leveraging this controlled setting, we provide evidence that vanilla encoder-only transformers implement the optimal Belief Propagation algorithm on both root classification and masked language modeling tasks. Correlations at larger distances, corresponding to increasing layers of the hierarchy, are sequentially included by the network during training. We analyze how transformer layers succeed by considering attention maps from models trained with varying degrees of filtering. These attention maps show clear evidence of an iterative hierarchical reconstruction of correlations, which we relate to a plausible implementation of the exact inference algorithm.

3470Towards Empowerment Gain through Causal Structure Learning in Model-Based RL

[openreview] [pdf]

Abstract In Model-Based Reinforcement Learning (MBRL), incorporating causal structures into dynamics models provides agents with a structured understanding of the environments, enabling efficient decision. Empowerment as an intrinsic motivation enhances the ability of agents to actively control their environments by maximizing the mutual information between future states and actions. We posit that empowerment coupled with causal understanding can improve controllability, while enhanced empowerment gain can further facilitate causal reasoning in MBRL. To improve learning efficiency and controllability, we propose a novel framework, Empowerment through Causal Learning (ECL), where an agent with the awareness of causal dynamics models achieves empowerment-driven exploration and optimizes its causal structure for task learning. Specifically, ECL operates by first training a causal dynamics model of the environment based on collected data. We then maximize empowerment under the causal structure for exploration, simultaneously using data gathered through exploration to update causal dynamics model to be more controllable than dense dynamics model without causal structure. In downstream task learning, an intrinsic curiosity reward is included to balance the causality, mitigating overfitting. Importantly, ECL is method-agnostic and is capable of integrating various causal discovery methods. We evaluate ECL combined with 3 causal discovery methods across 6 environments including pixel-based tasks, demonstrating its superior performance compared to other causal MBRL methods, in terms of causal discovery, sample efficiency, and asymptotic performance.

3471SMART: Self-Learning Meta-strategy Agent for Reasoning Tasks

[openreview] [pdf]

Abstract Tasks requiring deductive reasoning, especially those involving multiple steps, often demand adaptive strategies such as intermediate generation of rationales or programs, as no single approach is universally optimal. While Language Models (LMs) can enhance their outputs through iterative self-refinement and strategy adjustments, they frequently fail to apply the most effective strategy in their first attempt. This inefficiency raises the question:Can LMs learn to select the optimal strategy in the first attempt, without a need for refinement?To address this challenge, we introduceSMART:Self-learningMeta-strategyAgent forReasoningTasks, a novel framework that enables LMs to autonomously learn and select the most effective strategies for various reasoning tasks. We model the strategy selection process as aMarkov Decision Processand leverage reinforcement learning-driven continuous self-improvement to allow the model to find the suitable strategy to solve a given task. Unlike traditional self-refinement methods that rely on multiple inference passes or external feedback,SMARTallows an LM to internalize the outcomes of its own reasoning processes and adjust its strategy accordingly, aiming for correct solutions on the first attempt. Our experiments across various reasoning datasets and with different model architectures demonstrate thatSMARTsignificantly enhances the ability of models to choose optimal strategies without external guidance (+15 points on the GSM8K dataset). By achieving higher accuracy with a single inference pass,SMARTnot only improves performance but also reduces computational costs for refinement-based strategies, paving the way for more efficient and intelligent reasoning in LMs.

3472Moirai-MoE: Empowering Time Series Foundation Models with Sparse Mixture of Experts

[openreview] [pdf]

Abstract Time series foundation models have demonstrated impressive performance as zero-shot forecasters, i.e. tackling a wide variety of downstream forecasting tasks without explicit task-specific training. However, achieving effectively unified training on time series remains an open challenge. Existing approaches introduce some level of model specialization to account for the highly heterogeneous nature of time series data. For instance, Moirai pursues unified training by employing multiple input/output projection layers, each tailored to handle time series at a specific frequency. Similarly, TimesFM maintains a frequency embedding dictionary for this purpose. We identify two major drawbacks to this human-imposed frequency-level model specialization: (1) Frequency is not a reliable indicator of the underlying patterns in time series. For example, time series with different frequencies can display similar patterns, while those with the same frequency may exhibit varied patterns. (2) Non-stationarity is an inherent property of real-world time series, leading to varied distributions even within a short context window of a single time series. Frequency-level specialization is too coarse-grained to capture this level of diversity. To address these limitations, this paper introduces Moirai-MoE, using a single input/output projection layer while delegating the modeling of diverse time series patterns to the sparse mixture of experts (MoE) within Transformers. With these designs, Moirai-MoE reduces reliance on human-defined heuristics and enables automatic token-level specialization. Extensive experiments on 39 datasets demonstrate the superiority of Moirai-MoE over existing foundation models in both in-distribution and zero-shot scenarios. Furthermore, this study conducts comprehensive model analyses to explore the inner workings of time series MoE foundation models and provides valuable insights for future research.

3473Stable Offline Value Function Learning with Bisimulation-based Representations

[openreview] [pdf]

Abstract In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed-point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms have shown promise in shaping representations for control. However, it is still unclear if this class of methods can \emph{stabilize} value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (\textsc{krope}). \textsc{krope} uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that \textsc{krope}: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods for stable and accurate evaluation of offline reinforcement learning agents.

3474DailyDilemmas: Revealing Value Preferences of LLMs with Quandaries of Daily Life

[openreview] [pdf]

Abstract As we increasingly seek guidance from LLMs for decision-making in daily life, many of these decisions are not clear-cut and depend significantly on the personal values and ethical standards of the users. We present DailyDilemmas, a dataset of 1,360 moral dilemmas encountered in everyday life. Each dilemma includes two possible actions and with each action, the affected parties and human values invoked. Based on these dilemmas, we consolidated a set of human values across everyday topics e.g., interpersonal relationships, workplace, and environmental issues. We evaluated LLMs on these dilemmas to determine what action they will take and the values represented by these actions. Then, we analyzed these values through the lens of five popular theories inspired by sociology, psychology and philosophy. These theories are: World Value Survey, Moral Foundation Theory, Maslow’s Hierarchy of Needs, Aristotle’s Virtues, and Plutchik Wheel of Emotion. We find that LLMs are most aligned with the self-expression over survival values in terms of World Value Survey, care over loyalty in Moral Foundation Theory. Interestingly, we find large preferences differences in models for some core values such as truthfulness e.g., Mixtral-8x7B model tends to neglect it by 9.7% while GPT-4-turbo model tends to select it by 9.4%. We also study the recent guidance released by OpenAI (ModelSpec), and Anthropic (Constitutional AI) to understand how their released principles reflect their actual value prioritization when facing nuanced moral reasoning in daily-life settings. We find that end users cannot effectively steer such prioritization using system prompts.

3475Solving robust MDPs as a sequence of static RL problems

[openreview] [pdf]

Abstract esigning control policies whose performance level is guaranteed to remain above a given threshold in a span of environments is a critical feature for the adoption of reinforcement learning (RL) in real-world applications. The search for such robust policies is a notoriously difficult problem, related to the so-called dynamic model of transition function uncertainty, where the environment dynamics are allowed to change at each time step. But in practical cases, one is rather interested in robustness to a span of static transition models throughout interaction episodes. The static model is known to be harder to solve than the dynamic one, and seminal algorithms, such as robust value iteration, as well as most recent works on deep robust RL, build upon the dynamic model. In this work, we propose to revisit the static model. We suggest an analysis of why solving the static model under some mild hypotheses is a reasonable endeavor, based on an equivalence with the dynamic model, and formalize the general intuition that robust MDPs can be solved by tackling a series of static problems. We introduce a generic meta-algorithm called IWOCS, which incrementally identifies worst-case transition models so as to guide the search for a robust policy. Discussion on IWOCS sheds light on new ways to decouple policy optimization and adversarial transition functions and opens new perspectives for analysis. We derive a deep RL version of IWOCS and demonstrate it is competitive with state-of-the-art algorithms on classical benchmarks.

3476What are the Essential Factors in Crafting Effective Long Context Multi-Hop Instruction Datasets? Insights and Best Practices

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) with extended context windows have significantly improved tasks such as information extraction, question answering, and complex planning scenarios. In order to achieve success in long-context tasks, a large amount of work has been done to enhance the long-context capabilities of the model through synthetic data. Existing methods typically utilize the Self-Instruct framework to generate instruction-tuning data for better long-context capability improvement. However, our preliminary experiments indicate that less than 35% of samples generated by Qwen-272B_{72B} are multi-hop, and more than 40% exhibit poor quality, limiting comprehensive understanding and further research. To improve the quality of synthetic data, we propose the Multi-agent Interactive Multi-hop Generation (MIMG) framework, incorporating a Quality Verification Agent, a Single-hop Question Generation Agent, a Multiple Question Sampling Strategy, and a Multi-hop Question Merger Agent. This framework improves the data quality, with the proportion of high-quality, multi-hop, and diverse data exceeding 85%. Furthermore, we systematically investigate strategies for document selection, question merging, and validation techniques through extensive experiments across various models. Our findings show that our synthetic high-quality long-context instruction data significantly enhances model performance, even surpassing models trained on larger amounts of human-annotated data.

3477Robust Conformal Prediction with a Single Binary Certificate

[openreview] [pdf]

Abstract Conformal prediction (CP) converts any model’s output to prediction sets with a guarantee to cover the true label with (adjustable) high probability. Robust CP extends this guarantee to worst-case (adversarial) inputs. Existing baselines achieve robustness by bounding randomly smoothed conformity scores. In practice, they need expensive Monte-Carlo (MC) sampling (104\sim10^4 samples per point) to maintain an acceptable set size. We propose a robust conformal prediction that produces smaller sets even with just 102 MC samples. Our approach binarizes samples with an adaptive bin selected to preserve the coverage guarantee. Remarkably, we prove that robustness can be achieved by computing only one binary certificate, unlike previous methods that certify each calibration (or test) point. Thus, our method is faster and returns smaller robust sets. We also eliminate a previous limitation that requires a bounded score function.

3478Learning Spatiotemporal Dynamical Systems from Point Process Observations

[openreview] [pdf]

Abstract Spatiotemporal dynamics models are fundamental for various domains, from heat propagation in materials to oceanic and atmospheric flows. However, currently available neural network-based spatiotemporal modeling approaches fall short when faced with data that is collected randomly over time and space, as is often the case with sensor networks in real-world applications like crowdsourced earthquake detection or pollution monitoring. In response, we developed a new method that can effectively learn spatiotemporal dynamics from such point process observations. Our model integrates techniques from neural differential equations, neural point processes, implicit neural representations and amortized variational inference to model both the dynamics of the system and the probabilistic locations and timings of observations. It outperforms existing methods on challenging spatiotemporal datasets by offering substantial improvements in predictive accuracy and computational efficiency, making it a useful tool for modeling and understanding complex dynamical systems observed under realistic, unconstrained conditions.

3479Transfer Learning for Control Systems via Neural Simulation Relations

[openreview] [pdf]

Abstract Transfer learning is an umbrella term for machine learning approaches that leverage knowledge gained from solving one problem (the source domain) to improve speed, efficiency, and data requirements in solving a different but related problem (the target domain). The performance of the transferred model in the target domain is typically measured via some notion of loss function in the target domain. This paper focuses on effectively transferring control logic from a source control system to a target control system while providing approximately similar behavioral guarantees in both domains. However, in the absence of a complete characterization of behavioral specifications, this problem cannot be captured in terms of loss functions. To overcome this challenge, we use (approximate) simulation relations to characterize observational equivalence between the behaviors of two systems.Simulation relations ensure that the outputs of both systems, equipped with their corresponding controllers, remain close to each other over time, and their closeness can be quantified a priori. By parameterizing simulation relations with neural networks, we introduce the notion of neural simulation relations, which provides a data-driven approach to transfer any synthesized controller, regardless of the specification of interest, along with its proof of correctness. Compared with prior approaches, our method eliminates the need for a closed-loop mathematical model and specific requirements for both the source and target systems. We also introduce validity conditions that, when satisfied, guarantee the closeness of the outputs of two systems equipped with their corresponding controllers, thus eliminating the need for post-facto verification. We demonstrate the effectiveness of our approach through case studies involving a vehicle and a double inverted pendulum.

3480Monet: Mixture of Monosemantic Experts for Transformers

[openreview] [pdf]

Abstract Understanding the internal computations of large language models (LLMs) is crucial for aligning them with human values and preventing undesirable behaviors like toxic content generation. However, mechanistic interpretability is hindered by polysemanticity—where individual neurons respond to multiple, unrelated concepts due to the superposition hypothesis. While Sparse Autoencoders (SAEs) have attempted to disentangle these features, they face limitations from imperfect reconstruction loss, which impedes LLM’s performance. We introduce the Mixture of Monosemantic Experts for Transformers (Monet) architecture, which enhances interpretability by significantly increasing the number of experts to 262,144 per layer while maintaining parameter efficiency through a novel expert decomposition method. By designing the total parameters to scale proportionally to the square root of the number of experts, Monet enables effective specialization of experts. Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, Monet allows robust knowledge manipulation over knowledge domains, languages, and toxicity mitigation without degrading general performance. By overcoming the limitations of SAEs and conventional Mixture-of-Experts architectures, Monet advances the mechanistic interpretability of LLMs and provides practical benefits for controlling model behavior.

3481Language Representations Can be What Recommenders Need: Findings and Potentials

[openreview] [pdf]

Abstract Recent studies empirically indicate that language models (LMs) encode rich world knowledge beyond mere semantics, attracting significant attention across various fields. However, in the recommendation domain, it remains uncertain whether LMs implicitly encode user preference information. Contrary to prevailing understanding that LMs and traditional recommenders learn two distinct representation spaces due to the huge gap in language and behavior modeling objectives, this work re-examines such understanding and explores extracting a recommendation space directly from the language representation space. Surprisingly, our findings demonstrate that item representations, when linearly mapped from advanced LM representations, yield superior recommendation performance. This outcome suggests the possible homomorphism between the advanced language representation space and an effective item representation space for recommendation, implying that collaborative signals may be implicitly encoded within LMs. Motivated by the finding of homomorphism, we explore the possibility of designing advanced collaborative filtering (CF) models purely based on language representations without ID-based embeddings. To be specific, we incorporate several crucial components (i.e., a multilayer perceptron (MLP), graph convolution, and contrastive learning (CL) loss function) to build a simple yet effective model, with the language representations of item textual metadata (i.e., title) as the input. Empirical results show that such a simple model can outperform leading ID-based CF models on multiple datasets, which sheds light on using language representations for better recommendation. Moreover, we systematically analyze this simple model and find several key features for using advanced language representations: a good initialization for item representations, superior zero-shot recommendation abilities in new datasets, and being aware of user intention. Our findings highlight the connection between language modeling and behavior modeling, which can inspire both natural language processing and recommender system communities.

3482RE-Adapt: Reverse Engineered Adaptation of Large Language Models

[openreview] [pdf]

Abstract We introduce RE-Adapt, an approach to fine-tuning large language models on new domains without degrading any pre-existing instruction-tuning. We reverse engineer an adapter which isolates what an instruction-tuned model has learned beyond its corresponding pretrained base model. Importantly, this requires no additional data or training. We can then fine-tune the base model on a new domain and readapt it to instruction following with the reverse engineered adapter. RE-Adapt and our low-rank variant LoRE-Adapt both outperform other methods of fine-tuning, across multiple popular LLMs and datasets, even when the models are used in conjunction with retrieval-augmented generation.

3483Sharp Analysis for KL-Regularized Contextual Bandits and RLHF

[openreview] [pdf]

Abstract Reverse-Kullback-Leiblerregularization has emerged to be a predominant technique used to enhance policy optimization in reinforcement learning (RL) and reinforcement learning from human feedback (RLHF), which forces the learned policy to stay close to a reference policy. While the effectiveness and necessity of KL-regularization has been empirically demonstrated in various practical scenarios, current theoretical analysis of KL-regularized RLHF still obtain the same O(1/ϵ2)\mathcal{O}(1 / \epsilon^2) sample complexity as problems without KL-regularization. To understand the fundamental distinction between policy learning objectives with KL-regularization and ones without KL-regularization, we are the first to theoretically demonstrate the power of KL-regularization by providing a sharp analysis for KL-regularized contextual bandits and RLHF, revealing an O(1/ϵ)\mathcal{O}(1 / \epsilon) sample complexity when ϵ\epsilon is sufficiently small.We further explore the role of data coverage in contextual bandits and RLHF. While the coverage assumption is commonly employed in offline RLHF to link the samples from the reference policy to the optimal policy, often at the cost of a multiplicative dependence on the coverage coefficient, its impact on the sample complexity of online RLHF remains unclear. Previous theoretical analyses of online RLHF typically require explicit exploration and additional structural assumptions on the reward function class. In contrast, we show that with sufficient coverage from the reference policy, a simple two-stage mixed sampling strategy can achieve a sample complexity with only an additive dependence on the coverage coefficient. Our results provide a comprehensive understanding of the roles of KL-regularization and data coverage in RLHF, shedding light on the design of more efficient RLHF algorithms.

3484Graph Transformers Dream of Electric Flow

[openreview] [pdf]

Abstract We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The input to the Transformer is simply the graph incidence matrix; no other explicit positional encoding information is provided. We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers’ errors by the errors of the underlying algorithms. We verify our theoretical findings experimentally on synthetic data. Additionally, on a real-world molecular regression task, we observe that the linear Transformer is capable of learning a better positional encoding than the default one based on Laplacian eigenvectors. Our work is an initial step towards elucidating the inner-workings of the Transformer for graph data.

3485Binary-Feedback Active Test-Time Adaptation

[openreview] [pdf]

Abstract Deep learning models perform poorly when domain shifts exist between training and test data. Test-time adaptation (TTA) is a paradigm to mitigate this issue by adapting pre-trained models using only unlabeled test samples. However, existing TTA methods can fail under severe domain shifts, while recent active TTA approaches requiring full-class labels are impractical due to high labeling costs. To address this issue, we introduce a Binary-feedback Active Test-Time Adaptation (BATTA) setting, which uses a few binary feedbacks from annotators to indicate whether model predictions are correct, thereby significantly reducing the labeling burden of annotators. Under the setting, we propose BATTA-RL, a novel dual-path optimization framework that leverages reinforcement learning to balance binary feedback-guided adaptation on uncertain samples with agreement-based self-adaptation on confident predictions. Experiments show BATTA-RL achieves substantial accuracy improvements over state-of-the-art baselines, demonstrating its effectiveness in handling severe distribution shifts with minimal labeling effort.

3486ST-GCond: Self-supervised and Transferable Graph Dataset Condensation

[openreview] [pdf]

Abstract The increasing scale of graph datasets significantly enhances deep learning models but also presents substantial training challenges. Graph dataset condensation has emerged to condense large datasets into smaller yet informative ones that maintain similar test performance. However, these methods require downstream usage to match the original dataset and task, which is impractical in real-world scenarios. Our empirical studies show that existing methods fail in “cross-task” and “cross-dataset” scenarios, often performing worse than training from scratch. To address these challenges, we propose a novel method: Self-supervised and Transferable Graph dataset Condensation (ST-GCond). For cross-task transferability, we propose a task-disentangled meta optimization strategy to adaptively update the condensed graph according to the task relevance, encouraging information preservation for various tasks. For cross-dataset transferability, we propose a multi-teacher self-supervised optimization strategy to incorporate auxiliary self-supervised tasks to inject universal knowledge into the condensed graph. Additionally, we incorporate mutual information guided joint condensation mitigating the potential conflicts and ensure the condensing stability. Experiments on both node-level and graph-level datasets show that ST-GCond outperforms existing methods by 2.5% to 18.7% in all cross-task and cross-dataset scenarios, and also achieves state-of-the-art performance on 5 out of 6 datasets in the single dataset and task scenario.

3487Deep MMD Gradient Flow without adversarial training

[openreview] [pdf]

Abstract We propose a gradient flow procedure for generative modeling by transporting particles from an initial source distribution to a target distribution, where the gradient field on the particles is given by a noise-adaptive Wasserstein Gradient of the Maximum Mean Discrepancy (MMD). The noise adaptive MMD is trained on data distributions corrupted by increasing levels of noise, obtained via a forward diffusion process, as commonly used in denoising diffusion probabilistic models. The result is a generalization of MMD Gradient Flow, which we call Diffusion-MMD-Gradient Flow or DMMD. The divergence training procedure is related to discriminator training in Generative Adversarial Networks (GAN), but does not require adversarial training. We obtain competitive empirical performance in unconditional image generation on CIFAR10, MNIST, CELEB-A (64 x64) and LSUN Church (64 x 64). Furthermore, we demonstrate the validity of the approach when MMD is replaced by a lower bound on the KL divergence.

3488LDAdam: Adaptive Optimization from Low-Dimensional Gradient Statistics

[openreview] [pdf]

Abstract We introduce LDAdam, a memory-efficient optimizer for training large models, that performs adaptive optimization steps within lower dimensional subspaces, while consistently exploring the full parameter space during training. This strategy keeps the optimizer’s memory footprint to a fraction of the model size. LDAdam relies on a new projection-aware update rule for the optimizer states that allows for transitioning between subspaces, i.e., estimation of the statistics of the projected gradients. To mitigate the errors due to low-rank projection, LDAdam integrates a new generalized error feedback mechanism, which explicitly accounts for both gradient and optimizer state compression. We prove the convergence of LDAdam under standard assumptions, and provide empirical evidence that LDAdam allows for efficient fine-tuning and pre-training of language models.

3489Few-shot In-context Preference Learning using Large Language Models

[openreview] [pdf]

Abstract Designing reward functions is a core component of reinforcement learning but can be challenging for truly complex behavior. Reinforcement Learning from Human Feedback (RLHF) has been used to alleviate this challenge by replacing a hand-coded reward function with a reward function learned from preferences. However, it can be exceedingly inefficient to learn these rewards as they are often learned tabula rasa. We investigate whether Large Language Models (LLMs) can reduce this query inefficiency by converting an iterative series of human preferences into code representing the rewards. We propose In-Context Preference Learning (ICPL), a method that uses the grounding of an LLM to accelerate learning reward functions from preferences. ICPL takes the environment context and task description, synthesizes a set of reward functions, and then repeatedly updates the reward functions using human feedback over videos of the resultant policies over a small number of trials. Using synthetic preferences, we demonstrate that ICPL is orders of magnitude more efficient than RLHF and is even competitive with methods that use ground-truth reward functions instead of preferences. Finally, we perform a series of human preference-learning trials and observe that ICPL extends beyond synthetic settings and can work effectively with humans-in-the-loop.

3490Learning Disease Progression Models That Capture Health Disparities

[openreview] [pdf]

Abstract Disease progression models are widely used to inform the diagnosis and treatment of many progressive diseases. However, a significant limitation of existing models is that they do not account for health disparities that can bias the observed data. To address this, we develop an interpretable Bayesian disease progression model that captures three key health disparities: certain patient populations may (1) start receiving care only when their disease is more severe, (2) experience faster disease progression even while receiving care, or (3) receive follow-up care less frequently conditional on disease severity. We show theoretically and empirically that failing to account for disparities produces biased estimates of severity (underestimating severity for disadvantaged groups, for example). On a dataset of heart failure patients, we show that our model can identify groups that face each type of health disparity, and that accounting for these disparities meaningfully shifts which patients are considered high-risk.

3491Learning Disease Progression Models That Capture Health Disparities

[openreview] [pdf]

Abstract No absctract

3492Recovery of Causal Graph Involving Latent Variables via Homologous Surrogates

[openreview] [pdf]

Abstract Causal discovery with latent variables is an important and challenging problem. To identify latent variables and infer their causal relations, most existing works rely on the assumption that each latent variable has multiple pure children that must have no other parent except the latent variable itself. Considering that this assumption is potentially restrictive in practice and not strictly necessary in theory, by introducing the concept of homologous surrogate, this paper eliminates the need for pure child in the context of causal discovery with latent variables. The former fundamentally differs from the latter in the sense that a homologous surrogate is allowed to have other parents besides the latent variable it represents. We formulate two assumptions involving homologous surrogates and develop theoretical results under each assumption. Under the weaker assumption, our theoretical results imply that we can determine each variable’s ancestors, that is, partially recover the causal graph. The stronger assumption further enables us to determine each variable’s parents exactly, that is, fully recover the causal graph. Building on these theoretical results, we derive an algorithm that fully leverages the properties of homologous surrogates for causal graph recovery. Also, we validate its efficacy through experiments. Our work broadens the applicability of causal discovery.

3493Prompt Distribution Matters: Tuning Visual Prompt Through Semantic Metric Guidance

[openreview] [pdf]

Abstract Visual Prompt Tuning (VPT) has become a promising solution for Parameter-Efficient Fine-Tuning (PEFT) of pre-trained Vision Transformer (ViT) models on downstream vision tasks. VPT partially fine-tunes a set of learnable tokens while keeping the majority of the model parameters frozen. Recent research has explored modifying the connection structures of the prompts. However, the fundamental correlation and distribution between the prompts and image tokens remain unexplored. In this paper, we leverage \textit{metric learning} techniques to investigate how the distribution of prompts affects fine-tuning and transfer learning performance. Specifically, we propose a novel framework, \textbf{D}istribution \textbf{A}ware \textbf{V}isual \textbf{P}rompt Tuning (DA-VPT), to guide the distributions of the prompts by learning the distance metric from their class-related semantic data. Our method demonstrates that the prompts can serve as an effective bridge to share semantic information between image patches and the class token. We extensively evaluated our approach on popular benchmarks in both recognition and segmentation tasks. The results show the effectiveness of our proposed method and offer a new direction for PEFT optimization in vision transformers. We demonstrate that our approach enables more effective and efficient fine-tuning of ViT models by leveraging semantic information to guide the learning of the prompts, leading to improved performance on various downstream vision tasks.

3494Transformers Provably Learn Two-Mixture of Linear Classification via Gradient Flow

[openreview] [pdf]

Abstract Understanding how transformers learn and utilize hidden connections between tokens is crucial to understand the behavior of large language models. To understand this mechanism, we consider the task of two-mixture of linear classification which possesses a hidden correspondence structure among tokens, and study the training dynamics of a symmetric two-headed transformer with ReLU neurons. Motivated by the stage-wise learning phenomenon in our experiments, we design and theoretically analyze a three-stage training algorithm, which can effectively characterize the actual gradient descent dynamics when we simultaneously train the neuron weights and the softmax attention. The first stage is a neuron learning stage, where the neurons align with the underlying signals. The second stage is a attention feature learning stage, where we analyze the feature learning process of how the attention learns to utilize the relationship between the tokens to solve certain hard samples. In the meantime, the attention features evolve from a nearly non-separable state (at the initialization) to a well-separated state. The third stage is a convergence stage, where the population loss is driven towards zero. The key technique in our analysis of softmax attention is to identify a critical sub-system inside a large dynamical system and bound the growth of the non-linear sub-system by a linear system. Finally, we discuss the setting with more than two mixtures. We empirically show the difficulty of generalizing our analysis of the gradient flow dynamics to the case even when the number of mixtures equals three, although the transformer can still successfully learn such distribution. On the other hand, we show by construction that there exists a transformer that can solve mixture of linear classification given any arbitrary number of mixtures.

3495Online Detection for Black-Box Large Language Models with Adaptive Prompt Selection

[openreview] [pdf]

Abstract The widespread success of large language models (LLMs) has made them integral to various applications, yet security and reliability concerns are growing. It now becomes critical to safeguard LLMs from unintended changes caused by tampering, malicious prompt injection, or unauthorized parameter updates, etc. Early detection of these changes is essential to maintain the performance, fairness, and trustworthiness of LLM-powered applications. However, in black-box settings, where access to model parameters and output probabilities is unavailable, few detection methods exist. In this paper, we propose a novel online change-point detection method for quickly detecting changes in black-box LLMs. Our method features several key innovations: 1) we derive a CUSUM-type detection statistic based on the entropy and the Gini coefficient of the response distribution, and 2) we utilize a UCB-based adaptive prompt selection strategy for identifying change-sensitive prompts to enhance detection. We evaluate the effectiveness of the proposed method using synthetic data, where changes are simulated through watermarking and model version updates. Our proposed method is able to detect changes quickly while well controlling the false alarm rate. Moreover, for real-world data, our method also accurately detects announced changes in LLM APIs via daily online interactions with APIs. We also demonstrate strong evidence of unreported changes in APIs, which may be of independent interest.

3496Optimizing Calibration by Gaining Aware of Prediction Correctness

[openreview] [pdf]

Abstract Model calibration aims to align confidence with prediction correctness. The Cross-Entropy (CE) loss is widely used for calibrator training, which enforces the model to increase confidence on the ground truth class. However, we find the CE loss has intrinsic limitations. For example, for a narrow misclassification, a calibrator trained by the CE loss often produces high confidence on the wrongly predicted class (e.g., a test sample is wrongly classified and its softmax score on the ground truth class is around 0.4), which is undesirable. In this paper, we propose a new post-hoc calibration objective derived from the aim of calibration. Intuitively, the proposed objective function asks that the calibrator decrease model confidence on wrongly predicted samples and increase confidence on correctly predicted samples. Because a sample itself has insufficient ability to indicate correctness, we use its transformed versions (e.g., rotated, greyscaled, and color-jittered) during calibrator training. Trained on an in-distribution validation set and tested with isolated, individual test samples, our method achieves competitive calibration performance on both in-distribution and out-of-distribution test sets compared with the state of the art. Further, our analysis points out the difference between our method and commonly used objectives such as CE loss and Mean Square Error (MSE) loss, where the latters sometimes deviates from the calibration aim.

3497DiffGAD: A Diffusion-based Unsupervised Graph Anomaly Detector

[openreview] [pdf]

Abstract Graph Anomaly Detection (GAD) is crucial for identifying abnormal entities within networks, garnering significant attention across various fields. Traditional unsupervised methods, which decode encoded latent representations of unlabeled data with a reconstruction focus, often fail to capture critical discriminative content, leading to suboptimal anomaly detection. To address these challenges, we present a Diffusion-based Graph Anomaly Detector (DiffGAD). At the heart of DiffGAD is a novel latent space learning paradigm, meticulously designed to enhance the model’s proficiency by guiding it with discriminative content. This innovative approach leverages diffusion sampling to infuse the latent space with discriminative content and introduces a content-preservation mechanism that retains valuable information across different scales, significantly improving the model’s adeptness at identifying anomalies with limited time and space complexity. Our comprehensive evaluation of DiffGAD, conducted on six real-world and large-scale datasets with various metrics, demonstrated its exceptional performance. Our code is available athttps://anonymous.4open.science/r/DiffGAD-440C/

3498Unified Multi-Task Learning & Model Fusion for Efficient Language Model Guardrailing

[openreview] [pdf]

Abstract The trend towards large language models (LLMs) for guardrailing against undesired behaviors is increasing and has shown promise for censoring user inputs. However, high inference speed, memory consumption, hosting expenses and generative non-structured outputs can make their use prohibitive.In this work, we show that task-specific data generation can lead to fine-tuned classifiers that significantly outperform current state of the art (SoTA) while being orders of magnitude smaller. Secondly, we show that using a single model, \texttt{MultiTaskGuard}, that is pretrained on a large synthetically generated dataset with unique task instructions further improves generalization. Thirdly, our most performant models, \texttt{UniGuard}, are found using our proposed search-based model merging approach that finds an optimal set of parameters to combine single-policy models and multi-policy guardrail modelsOn 7 public datasets and 4 new guardrail benchmarks we created, our efficient guardrail classifiers improve over the best performing SoTA publicly available LLMs and 3rd^{\text{rd}} party guardrail APIs in detecting unsafe and safe behaviors by an average \textbf{29.92} (\text{Aegis-LlamaGuard}) and \textbf{21.62} (\texttt{gpt-4o}) F1 respectively. Lastly, our guardrail synthetic data generation process leads to models that outperform training on real data using our custom defined policies that describe the guardrailing task.

3499Probabilistic Feature Smoothed Gaussian Process For Imbalanced Regression

[openreview] [pdf]

Abstract Gaussian Processes (GPs) are non-parametric Bayesian models widely used for regression, classification, and other tasks due to their explainability and versatility. However, GPs face challenges in imbalanced regression, where the skewed distribution of target labels can greatly harm models’ performances. In this work, we introduce the Probabilistic Feature Smoothed Partially Independent Training Conditional Approximation (PFS-PITC) to enhance GP performance in imbalanced scenarios. We extract statistical features from the observation space using equidistant label intervals and apply kernel smoothing to address sampling density discontinuities. This process enables PFS-PITC to utilize information from nearby labels within imbalanced datasets, thereby reducing GPs’ sensitivity to such imbalances. Empirical tests on various imbalanced regression datasets demonstrate the effectiveness of PFS-PITC, contributing to the robustness of GPs in handling flawed real-world data and expanding their applicability in challenging data processing tasks.

3500Classifier-Free Guidance is a Predictor-Corrector

[openreview] [pdf]

Abstract We investigate the theoretical foundations of classifier-free guidance (CFG). CFG is the dominant method of conditional sampling for text-to-image diffusion models, yet unlike other aspects of diffusion, it remains on shaky theoretical footing. In this paper, we disprove common misconceptions, by showing that CFG interacts differently with DDPM and DDIM, and neither sampler with CFG generates the gamma-powered distribution p(xc)γp(x)1γp(x|c)^\gamma p(x)^{1−\gamma}. Then, we clarify the behavior of CFG by showing that it is a kind of predictor-corrector method (Song et al., 2020) that alternates between denoising and sharpening, which we call predictor-corrector guidance (PCG). We prove that in the SDE limit, CFG is actually equivalent to combining a DDIM predictor for the conditional distribution together with a Langevin dynamics corrector for a gamma-powered distribution (with a carefully chosen gamma). Our work thus provides a lens to theoretically understand CFG by embedding it in a broader design space of principled sampling methods.

3501Model Collapse Analysis and Improvement for Rectified Flow Models

[openreview] [pdf]

Abstract Generative models aim to produce data indistinguishable from real distributions, but training on self-generated outputs can lead to model collapse, degrading performance. Focusing on Rectified Flow—a simulation-free model prone to this issue due to its iterative use of self-generated data—we provide a theoretical analysis of model collapse starting from Denoising Autoencoders. To prevent collapse, we propose methods that incorporate real data into the training process, even without direct noise-image pairs. Our approaches, Reverse Collapse-Avoiding (RCA) Reflow and Online Collapse-Avoiding Reflow (OCAR), effectively prevent collapse while maintaining sampling efficiency. Experiments on standard image datasets demonstrate improved generation quality and reduced sampling steps, confirming the effectiveness of our methods.

3502Bridging the Safety Gap: A Guardrail Pipeline for Trustworthy LLM Inferences

[openreview] [pdf]

Abstract We present Wildflare GuardRail, a guardrail pipeline designed to enhance the safety and reliability of Large Language Model (LLM) inferences. Wildflare GuardRail integrates four key functional modules, including SAFETY DETECTOR, GROUNDING, CUSTOMIZER, and REPAIRER, and addresses safety challenges across multiple dimensions of LLM inferences. Wildflare GuardRail incorporates an unsafe content detection model that identifies issues such as toxicity, bias, and prompt injection, a hallucination detection model that identifies hallucinated LLM outputs and simultaneously provides explanations for the hallucinations, and a fixing model that corrects LLM outputs based on these explanations. Additionally, Wildflare GuardRail employs GROUNDINGto enrich user queries with relevant context, and utilizes CUSTOMIZERto allow users to define flexible protocols for handling specific safety requirements. Our experiments demonstrate that Wildflare GuardRail enhances safety and robustness in LLM inferences, offering adaptable and scalable solutions for LLM inferences.

3503Prompt Injection Benchmark for Foundation Model Integrated Systems

[openreview] [pdf]

Abstract Foundation Models (FMs) are increasingly integrated with external data sources and tools to handle complex tasks, forming FM-integrated systems with different modalities. However, such integration introduces new security vulnerabilities, especially when FMs interact dynamically with the system environments. One of the most critical threats is the prompt injection attack, where adversaries inject malicious instructions into the input environment, causing the model to deviate from user-intended behaviors. To advance the study of prompt injection vulnerabilities in FM-integrated systems, a comprehensive benchmark is essential. However, existing benchmarks fall short in two key areas: 1) they primarily focus on text-based modalities, lacking thorough analysis of diverse threats and attacks across more integrated modalities such as code, web pages, and vision; and 2) they rely on static test suites, failing to capture the dynamic, adversarial interplay between evolving attacks and defenses, as well as the interactive nature of agent-based environments. To bridge this gap, we propose the Prompt Injection Benchmark for FM-integrated Systems (FSPIB), which offers comprehensive coverage across various dimensions, including task modalities, threat categories, various attack and defense algorithms. Furthermore, FSPIB is interactive and dynamic, with evaluations conducted in interactive environments, and features a user-friendly front end that supports extensible attacks and defenses for ongoing research. By analyzing the performance of baseline prompt injection attacks and defenses, our benchmark highlights the prevalence of security vulnerabilities in FM-integrated systems and reveals the limited effectiveness of existing defense strategies, underscoring the urgent need for further research into prompt injection mitigation.

3504qNBO: quasi-Newton Meets Bilevel Optimization

[openreview] [pdf]

Abstract Bilevel optimization, which addresses challenges in hierarchical learning tasks, has gained significant interest in machine learning. Implementing gradient descent for bilevel optimization presents computational hurdles, notably the need to compute the exact lower-level solution and the inverse Hessian of the lower-level objective. While these two aspects are inherently connected, existing methods typically handle them separately by solving the lower-level problem and a linear system for the inverse Hessian-vector product. In this paper, we introduce a general framework to tackle these computational challenges in a coordinated manner. Specifically, we leverage quasi-Newton algorithms to accelerate the solution of the lower-level problem while efficiently approximating the inverse Hessian-vector product. Furthermore, by leveraging the superlinear convergence properties of BFGS, we establish a non-asymptotic convergence analysis for the BFGS adaptation within our framework. Numerical experiments demonstrate the superior performance of our proposed algorithms in real-world learning tasks, including hyperparameter optimization, data hyper-cleaning, and few-shot meta-learning.

3505Explore To Mimic: A Reinforcement Learning Based Agent To Generate Online Signatures

[openreview] [pdf]

Abstract Recent advancements in utilising decision making capability of Reinforcement Learning (RL) have paved the way for innovative approaches in data generation. This research explores the application of model free on-policy RL algorithms for generating online signatures and its controlled variations. Online signatures are captured via e-pads as sequential structural coordinates. In this study, we have introduced a robust on-policy RL agent named as SIGN-Agent, capable of generating online signatures accurately. Unlike other RL algorithms, on-policy RL directly learns from the agent’s current policy, offering significant advantages in stability and faster convergence for sequential decision-making. The proposed SIGN-Agent operates in a random continuous action space with controlled exploration limits, allowing it to capture complex signature patterns while minimizing errors over time. The downstream applications of this system can be extended in diverse fields such as enhancing the robustness of signature authentication systems, supporting robotics, and even diagnosing neurological disorders. By generating reliable, human-like online signatures, our approach strengthens signature authentication systems by reducing susceptibility towards system-generated forgeries, if trained against them. Additionally, the proposed work is optimized for low-footprint edge devices, enabling it to function efficiently in the area of robotics for online signature generation tasks. Experimental results, tested on large, publicly available datasets, demonstrate the effectiveness of model free on-policy RL algorithms in generating online signature trajectories, that closely resemble user’s reference signatures. Our approach highlights the potential of model free on-policy RL as an advancement in the field of data generation targeting the domain of online signatures in this research.

3506Toward Learning Generalized Cross-Problem Solving Strategies for Combinatorial Optimization

[openreview] [pdf]

Abstract Combinatorial optimization (CO) problems are fundamental across various domains, with many sharing similarities in optimization objectives, decision variables, and constraints. Many traditional algorithms perform well on related problems using similar solution strategies, highlighting the commonality in solving different problems. However, most machine learning approaches treat each CO problem in isolation, failing to capitalize on the underlying relationships between problems. In this paper, we investigate the potential to learn generalized solving strategies that capture the shared structure among different CO problems, enabling easier adaptation to related tasks. To this end, we propose to first divide the model architecture into three components: a header, an encoder, and a decoder; where The header and decoder address problem-specific inputs and outputs, while the encoder is designed to learn shared strategies that generalize across different problems. To ensure this, we enforce alignment in the optimization directions of the encoder across problems, maintaining consistency in both gradient directions and magnitudes to harmonize optimization processes. This is achieved by introducing the additional problem-specific rotation matrices and loss weights to steer the gradients, which are updated via a gradient consistency loss. Extensive experiments on six CO problems demonstrate that our method enhances the model’s ability to capture shared solving strategies across problems. We show that the learned encoder on several problems can directly perform comparably on new problems to models trained from scratch, highlighting its potential to support developing the foundational model for combinatorial optimization. Source code will be made publicly available.

3507Blending Concepts in Text-to-Image Diffusion Models using the Black Scholes Algorithm

[openreview] [pdf]

Abstract Many image generation tasks, such as content creation, editing, personalization, and zero-shot generation, require generating unseen concepts without retraining the model or collecting additional data. These tasks often involve blending existing concepts by conditioning the diffusion model with text prompts at each denoising step, a process known as ``prompt mixing’'. We introduce a novel approach for prompt mixing to forecasts predictions w.r.t. the generated image and makes informed text conditioning decisions at each time step during diffusion denoising. To do so, we leverage the connection between diffusion models (rooted in non-equilibrium thermodynamics) and the Black-Scholes model for pricing options in Finance, and draw analogies between the variables in both contexts to derive an appropriate algorithm for prompt mixing using the Black Scholes model. Specifically, the parallels between diffusion models and the Black-Scholes model enable us to leverage properties related to the dynamics of the Markovian model derived in the Black-Scholes algorithm. Our prompt-mixing algorithm is data-efficient, meaning it does not need additional training. Furthermore, it operates without human intervention or hyperparameter tuning. We highlight the benefits of our approach by comparing it, qualitatively and quantitatively using CLIP scores, to other prompt mixing techniques, including linear interpolation, alternating prompts, step-wise prompt switching, and CLIP-guided prompt selection across various scenarios such as single object per text prompt, multiple objects per text prompt and objects against backgrounds. The resulting code will be made publicly available for research reproduction.

3508The best of both worlds: Improved outcome prediction using causal structure learning

[openreview] [pdf]

Abstract In limited data settings as in the medical domain, causal structure learning can be a powerful tool for understanding the relationships between variables and achieving out-of-sample generalisation for the prediction of a specific target variable. Most methods that learn causal structure from observational data rely on strong assumptions, such as the absence of unmeasured confounders, that are not valid in real world scenarios. In addition, due to evolving conditions and treatment approaches, causal relationships between the variables change over time. Moreover in a clinical setting, symptoms often need to be managed before finding the root cause of a problem, which puts the emphasis on accurate outcome prediction. Consequently, prediction of a specific target variable from retrospective observational data based on causal relationships alone will not be sufficient for generalisation to prospective data. To overcome these limitations, we opt for the best of both worlds in this work by learning a shared representation between causal structure learning and outcome prediction. We provide extensive empirical evidence to show that this would not only facilitate out-of-sample generalisation in outcome prediction but also enable robust causal discovery. We also highlight the strengths of our model in terms of time efficiency and interpretability.

3509Look Before You Leap: Universal Emergent Mechanism for Retrieval in Language Models

[openreview] [pdf]

Abstract When solving challenging problems, language models (LMs) are able to identify relevant information from long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning six domains, from text understanding to coding. Each task in ORION can be represented abstractly by a request (e.g. a question) that retrieves an attribute (e.g. the character name) from a context (e.g. a story). We apply causal analysis on 18 open-source language models with sizes ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. After causally enforcing this decomposition, models are still able to solve the original task, preserving 70% of the original correct token probability in 98 of the 106 studied model-task pairs. We connect our macroscopic decomposition with a microscopic description by performing a fine-grained case study of a question-answering task on Pythia-2.8b. Building on our high-level understanding, we demonstrate a proof of concept application for scalable internal oversight of LMs to mitigate prompt-injection while requiring human supervision on only a single input. Our solution improves accuracy drastically (from 15.5% to 97.5% on Pythia-12b). This work presents evidence of a universal emergent modular processing of tasks across varied domains and models and is a pioneering effort in applying interpretability for scalable internal oversight of LMs.

3510Evaluating the Unseen: A Novel Framework for Assessing Unsupervised Concept Bottleneck Models

[openreview] [pdf]

Abstract In recent years, the field of explainable artificial intelligence (XAI) has gained significant traction, with concept bottleneck models (CBMs) emerging as a promising approach to enhance the interpretability of machine learning systems. However, CBMs often rely on expert-annotated concepts, which can be costly and time-consuming to acquire. To address this limitation, unsupervised and label-free CBMs have been proposed, but these come with their own challenges, particularly in assessing the reliability and accuracy of the generated concepts without ground-truth labels. This paper introduces a comprehensive evaluation framework designed to assess the quality of explanations produced by unsupervised CBMs. Our framework comprises a set of novel metrics that evaluate various aspects of the concept outputs, including their relevance, consistency, and informativeness. We demonstrate the effectiveness of our metrics through a series of experiments, showing certain positive correlations between our scores and both LLM evaluations and human judgments. Our work not only fills a critical gap in the evaluation of unsupervised CBMs but also provides a solid foundation for further research into more transparent and trustworthy AI systems.

3511Identification of Nonparametric Dynamic Causal Model and Latent Process for Climate Analysis

[openreview] [pdf]

Abstract The study of causal structure learning with latent variables has advanced our understanding of the world by uncovering causal relationships and latent factors. However, in real-world scenarios, such as those in climate systems, causal relationships are often nonparametric, dynamic, and exist among both observed variables and latent variables. To address these challenges, we consider a general setting where causal relations are nonparametric and unrestricted in their occurrence. With the aid of three measurements in temporal structure, we theoretically show that both latent variables and processes can be identified up to minor indeterminacy under mild assumptions. Furthermore, we establish that the observed causal structure is identifiable if, roughly speaking, the latent variables induce sufficient variations in the noise terms. Based on these theoretical insights, we develop a principled estimation approach that simultaneously learns both the causal structure and latent representation. Experimental results on simulation studies validate the theoretical foundations and demonstrate the effectiveness of the proposed methodology. In the climate data experiments, we show that it offers a powerful and in-depth understanding of climate system.

3512Pay Attention to What Matters

[openreview] [pdf]

Abstract Despite the remarkable success of Large Language Models (LLMs), they still exhibit a limited capability to align their outputs to the user instructions. In this work, we introduce a simple and effective method, which we name as GUIDE, that mechanistically increases attention scores in instruction tokens. To support this operation, we present Influence, a novel metric that highlights how the user’s instructions propagate with transformer layers and impact the LLM output. Our results show that GUIDE improves the accuracy of following certain instructions 29.4% to 60.4 %, outperforming natural prompting alternatives.

3513Action as a Modality: Turning Multi-Modal LLMs to General Action Planners

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated strong reasoning capabilities and possess extensive common knowledge. This enables them to adapt to a variety of complex tasks in a zero-shot manner, including functioning as controllers to manipulate automated systems and produce executable action sequences. However, a significant challenge in the existing framework is the misalignment between the general pre-trained LLM and the action space of specific control tasks. This misalignment necessitates extensive efforts in designing task-specific prompts, which are less generalizable and do not ensure consistent output when prompting a pre-trained LLM to generate the desired action sequences. To address this issue, we propose a novel solution, ActionVerse, which encodes action candidates into a series of modality tokens, coupled with an efficient alignment technique to synchronize the action tokens with the LLM’s language space. By leveraging this approach, the proposed ActionVerse successfully transforms a chat-based multi-modal LLM into a general action executor capable of handling tasks requiring step-by-step execution of various actions. Experiments on several sequential action tasks demonstrate the effectiveness of the proposed framework.

3514Not All LLM Reasoners Are Created Equal

[openreview] [pdf]

Abstract We study the depth of grade-school math (GSM) problem-solving capabilities of LLMs. To this end, we evaluate their performance on pairs of existing math word problems together so that the answer to the second problem depends on correctly answering the first problem. Our findings reveal a significant reasoning gap in most LLMs, that is performance difference between solving the compositional pairs and solving each question independently. This gap is more pronounced in smaller, more cost-efficient, and math-specialized models. Moreover, instruction-tuning recipes and code generation have varying effects across LLM sizes, while finetuning on GSM can lead to task overfitting. Our analysis indicates that large reasoning gaps are not because of test-set leakage, but due to distraction from additional context and poor second-hop reasoning. Overall, LLMs exhibit systematic differences in their reasoning abilities, despite what their performance on standard benchmarks indicates.

3515BOIL: Learning Environment Personalized Information

[openreview] [pdf]

Abstract Navigating complex environments poses challenges for multi-agent systems, requiring efficient extraction of insights from limited information. In this paper, we introduce the Blackbox Oracle Information Learning (BOIL) process, a scalable solution for extracting valuable insights from the environment structure. Leveraging the Pagerank algorithm and common information maximization, BOIL facilitates the extraction of information to guide long-term agent behavior applicable to problems such as coverage, patrolling, and stochastic reachability. Through experiments, we demonstrate the efficacy of BOIL in generating strategy distributions conducive to improved performance over extended time horizons, surpassing heuristic approaches in complex environments.

3516Brain-Like Replay Naturally Emerges in Reinforcement Learning Agents

[openreview] [pdf]

Abstract Replay is a powerful strategy to promote learning in artificial intelligence and the brain. However, the conditions to generate it and its functional advantages have not been fully recognized. In this study, we develop a modular reinforcement learning model that could generate replay. We prove that replay generated in this way helps complete the task. We also analyze the information contained in the representation and provide a mechanism for how replay makes a difference. Our design avoids complex assumptions and enables replay to emerge naturally within a task-optimized paradigm. Our model also reproduces key phenomena observed in biological agents. This research explores the structural biases in modular ANN to generate replay and its potential utility in developing efficient RL.

3517Personalize to generalize: Towards a universal medical multi-modality generalization through personalization

[openreview] [pdf]

Abstract Personalized medicine is a groundbreaking healthcare framework for the 21st21^{st} century, tailoring medical treatments to individuals based on unique clinical characteristics, including diverse medical imaging modalities. These modalities differ significantly due to distinct underlying imaging principles, creating substantial challenges for generalization in multi-modal medical image tasks. Previous methods addressing multi-modal generalization rarely consider personalization, primarily focusing on common anatomical information. This paper aims to connect multi-modal generalization with the concept of personalized medicine. Specifically, we propose a novel approach to derive a tractable form of the underlying personalized invariant representation Xh\mathbb{X}_h using individual-level constraints and a learnable biological prior. We demonstrate that learning a personalized Xh\mathbb{X}_h is both feasible and beneficial, as this representation proves highly generalizable and transferable across various multi-modal medical tasks. Our method is rigorously validated on medical imaging modalities emphasizing both physical structure and functional information, encompassing a range of tasks that require generalization. Extensive experimental results consistently show that our approach significantly improves performance across diverse scenarios, confirming its effectiveness.

3518DSBench: How Far Are Data Science Agents from Becoming Data Science Experts?

[openreview] [pdf]

Abstract Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have demonstrated impressive language/vision reasoning abilities, igniting the recent trend of building agents for targeted applications such as shopping assistants or AI software engineers. Recently, many data science benchmarks have been proposed to investigate their performance in the data science domain. However, existing data science benchmarks still fall short when compared to real-world data science applications due to their simplified settings. To bridge this gap, we introduce DSBench, a comprehensive benchmark designed to evaluate data science agents with realistic tasks. This benchmark includes 466 data analysis tasks and 74 data modeling tasks, sourced from Eloquence and Kaggle competitions. DSBench offers a realistic setting by encompassing long contexts, multimodal task backgrounds, reasoning with large data files and multi-table structures, and performing end-to-end data modeling tasks. Our evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle with most tasks, with the best agent solving only 34.12% of data analysis tasks and achieving a 34.74% Relative Performance Gap (RPG). These findings underscore the need for further advancements in developing more practical, intelligent, and autonomous data science agents.

3519Grounding Robot Policies with Visuomotor Language Guidance

[openreview] [pdf]

Abstract Recent advances in the fields of natural language processing and computer vision have shown great potential in understanding the underlying dynamics of the world from large-scale internet data. However, translating this knowledge into robotic systems remains an open challenge, given the scarcity of human-robot interactions and the lack of large-scale datasets of real-world robotic data. Previous robot learning approaches such as behavior cloning and reinforcement learning have shown great capabilities in learning robotic skills from human demonstrations or from scratch in specific environments. However, these approaches often require task-specific demonstrations or designing complex simulation environments, which limits the development of generalizable and robust policies for new settings. Aiming to address these limitations, we propose an agent-based framework for grounding robot policies to the current context, considering the constraints of a current robot and its environment using visuomotor-grounded language guidance. The proposed framework is composed of a set of conversational agents designed for specific roles—namely, high-level advisor, visual grounding, monitoring, and robotic agents. Given a base policy, the agents collectively generate guidance at run time to shift the action distribution of the base policy towards more desirable future states. We demonstrate that our approach can effectively guide manipulation policies to achieve significantly higher success rates both in simulation and in real-world experiments without the need for additional human demonstrations or extensive exploration. Project videos athttps://sites.google.com/view/motorcortex/home.

3520Action Sequence Augmentation for Action Anticipation

[openreview] [pdf]

Abstract Action anticipation models require an understanding of temporal action patterns and dependencies to predict future actions from previous events. The key challenges arise from the vast number of possible action sequences, given the flexibility in action ordering and the interleaving of multiple goals. Since only a subset of such action sequences are present in action anticipation datasets, there is an inherent ordering bias in them. Another challenge is the presence of noisy input to the models due to erroneous action recognition or other upstream tasks. This paper addresses these challenges by introducing a novel data augmentation strategy that separately augments observed action sequences and next actions. To address biased action ordering, we introduce a grammar induction algorithm that derives a powerful context-free grammar from action sequence data. We also develop an efficient parser to generate plausible next-action candidates beyond the ground truth. For noisy input, we enhance model robustness by randomly deleting or replacing actions in observed sequences. Our experiments on the 50Salads, EGTEA Gaze+, and Epic-Kitchens-100 datasets demonstrate significant performance improvements over existing state-of-the-art methods.

3521Adversaries With Incentives: A Strategic Alternative to Adversarial Robustness

[openreview] [pdf]

Abstract Adversarial training aims to defend againstadversaries: malicious opponents whose sole aim is to harm predictive performance in any way possible. This presents a rather harsh perspective, which we assert results in unnecessarily conservative training. As an alternative, we propose to model opponents as simply pursuing their own goals—rather than working directly against the classifier. Employing tools from strategic modeling, our approach enables knowledge or beliefs regarding the opponent’s possible incentives to be used as inductive bias for learning. Accordingly, our method ofstrategic trainingis designed to defend against all opponents within an `incentive uncertainty set’. This resorts to adversarial training when the set is maximal, but offers potential gains when the set can be appropriately reduced. We conduct a series of experiments that show how even mild knowledge regarding the opponent’s incentives can be useful, and that the degree of potential gains depends on how these incentives relate to the structure of the learning task.

3522Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated impressive capability across various natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function for estimating expected future rewards, Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the targeted task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP datasets demonstrate the superiority of our method, contributing to improving the reasoning capability of existing open-source LLMs. Furthermore, the testing-time scaling law indicates that Q* can leverage increased computational power to improve reasoning performance.

3523Investigating Self-Attention: Its Impact on Sample Efficiency in Deep Reinforcement Learning

[openreview] [pdf]

Abstract Improving the sample efficiency of deep reinforcement learning (DRL) agents has been an ongoing challenge in research and real-world applications. Self-attention, a mechanism originally popularized in natural language processing, has shown great potential in enhancing sample efficiency when integrated with traditional DRL algorithms. However, the influence of self-attention mechanisms on the sample efficiency of DRL models has not been fully explored. In this paper, we ponder the fundamental operation of the self-attention mechanism in visual-based DRL settings and systematically investigate how different types of scaled dot-product attention impact the sample efficiency of the DRL algorithms. We design and evaluate the performance of our self-attention DRL models in the Arcade Learning Environment. Our results indicate that the design of the self-attention module influences the sample complexity of the DRL agent across various environments. To understand how self-attention modules influence the learning process, we perform an interpretability study from the perspectives of state representation and exploration. From our initial findings, we hypothesize that the interplay between feature extraction, action selection, and reward could be influenced subtly by the inductive biases of the proposed self-attention modules. This work contributes to the ongoing efforts to optimize DRL architectures, offering insights into the mechanisms that can enhance their performance in data-scarce scenarios.

3524Revisiting Prompt-based Methods in Class Incremental Learning

[openreview] [pdf]

Abstract In recent years, prompt-based methods have emerged as a promising direction for continual learning, demonstrating impressive performance across various benchmarks. These methods create learnable prompts to infer task identity, then select and integrate specific prompts into the pretrained model to generate instructed features for prediction. In this paper, we first analyze the working patterns of such method across different distribution scenarios through extensive empirical analysis. Our analysis exposes the limitations of existing methods: first, two-stage inference can make mistakes even when the first stage has already provided reliable predictions; second, enforcing identical architectures for both stages hampers performance gains. To address these issues, we incorporated a self-supervised learning objective to learn discriminative features, thereby boosting the plasticity of the model. During inference, we implemented a simple yet effective threshold filtering strategy to selectively pass data to the second stage. This approach prevents errors in the second stage when the first stage has already made reliable predictions, while also conserving computational resources. Ultimately, we explore utilizing self-supervised pretrained models as a unified task identity provider. Comparing to state-of-the-art methods, our method achieves comparable results under in-distribution scenarios and demonstrates substantial gains under out-of-distribution scenarios (e.g., up to 6.34% and 5.15% improvements on Split Aircrafts and Split Cars-196, respectively).

3525Steady and Fair Robustness Evaluation Based on Model Interpretation

[openreview] [pdf]

Abstract Adversarial robustness has become a major concern as machine learning models are increasingly deployed in security-sensitive applications. Evaluating adversarial robustness remains a challenging task, as current metrics are heavily affected by various factors, including attack methods, attack intensities, and model architecture. In this paper, we propose Steady and Fair Robustness Evaluation, a novel framework designed to mitigate the impact of these factors and provide a more stable evaluation of a model’s robustness. Our key insight is based on the strong correlation between the standard deviation (SD) of Shapley values, which measures the importance of individual neurons, and adversarial robustness. We demonstrate that models with lower SD of Shapley values are more robust to adversarial attacks, regardless of the attack method or model architecture. Extensive experiments across various models, training objectives, and attack scenarios show that our approach offers more consistent and interpretable robustness evaluation. We further introduce a new training strategy that incorporates the minimization of the SD of Shapley values for improving the robustness of the model. Our findings suggest that analysis based on Shapley value can provide a principled and efficient alternative to conventional robustness evaluation techniques.

3526Output Alignment: A Top-down Approach to Length Generalization

[openreview] [pdf]

Abstract Recently, large language models have exhibited impressive performance and surprising emergent properties. However, their abilities remain constrained by the preset context window of the Transformer architecture, and they continue to struggle with length generalization. In this work, we propose a new perspective on length generalization by focusing on the output distribution rather than the input, as most prior studies have done (e.g., through positional encodings or data structure). First, through case studies on simple synthetic tasks, we highlight the importance ofoutput alignment---the consistency of output distributions across sequences of varying lengths. We then extend this observation to natural language tasks and introduce a metric named Long-Short Misalignment to quantify output alignment, finding a strong correlation between this metric and length generalization performance. Based on these insights, we propose a regularization loss during training that improves output alignment. Extensive experiments confirm the effectiveness of this approach. Overall, our work provides a novel perspective for understanding and enhancing length generalization in large language models.

3527Let the Code LLM Edit Itself When You Edit the Code

[openreview] [pdf]

Abstract In this work, we investigate a typical scenario in code generation where a developer edits existing code in real time and requests a code assistant, e.g., a large language model, to re-predict the next token or next line on the fly. Naively, the LLM needs to re-encode the entire KV cache to provide an accurate prediction. However, this process is computationally expensive, especially when the sequence length is long. Simply encoding the edited subsequence and integrating it to the original KV cache meets the temporal confusion problem, leading to significantly worse performance. We address this efficiency and accuracy trade-off by introducing Positional Integrity Encoding\underline{\textbf{P}\text{ositional}\ \textbf{I}\text{ntegrity}\ \textbf{E}\text{ncoding}} (PIE). Building upon the rotary positional encoding, PIE first removes the rotary matrices in the Key cache that introduce temporal confusion and then reapplies the correct rotary matrices. This process ensures that positional relationships between tokens are correct and requires only a single round of matrix multiplication. We validate the effectiveness of PIE through extensive experiments on the RepoBench-C-8k dataset, utilizing DeepSeek-Coder models with 1.3B, 6.7B, and 33B parameters. Our evaluation includes three real-world coding tasks: code insertion, code deletion, and multi-place code editing. Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach across all model sizes and tasks while well approximating the model performance.

3528Stable Hadamard Memory: Revitalizing Memory-Augmented Agents for Reinforcement Learning

[openreview] [pdf]

Abstract Effective decision-making in partially observable environments demands robust memory management. Despite their success in supervised learning, current deep-learning memory models struggle in reinforcement learning environments that are partially observable and long-term. They fail to efficiently capture relevant past information, adapt flexibly to changing observations, and maintain stable updates over long episodes. We theoretically analyze the limitations of existing memory models within a unified framework and introduce the Stable Hadamard Memory, a novel memory model for reinforcement learning agents. Our model dynamically adjusts memory by erasing no longer needed experiences and reinforcing crucial ones computationally efficiently. To this end, we leverage the Hadamard product for calibrating and updating memory, specifically designed to enhance memory capacity while mitigating numerical and learning challenges. Our approach significantly outperforms state-of-the-art memory-based methods on challenging partially observable benchmarks, such as meta-reinforcement learning, long-horizon credit assignment, and POPGym, demonstrating superior performance in handling long-term and evolving contexts.

3529End-to-End Learning under Endogenous Uncertainty

[openreview] [pdf]

Abstract How can we effectively learn to make decisions when there are no ground-truth counterfactual observations? We propose an end-to-end learning approach to the contextual stochastic optimization problem under decision-dependent uncertainty. We propose both exact methods and efficient sampling-based methods to implement our approach. We also introduce a new class of two-stage stochastic optimization problems to the end-to-end learning framework. Here, the first stage is an information-gathering problem to decide which random variable to ""poll’’ and gain information about before making a second-stage decision based off of it. We provide theoretical analysis showing that (1) optimally minimizing our proposed objective produces optimal decisions and (2) generalization bounds between in-sample and out-of-sample cost. We computationally test the proposed approach on multi-item assortment problems where demand is affected by cross-item complementary and supplementary effects. Our results show a performance improvement of over 20% compared to traditional methods. We also introduce an experiment for the information-gathering problem on a real-world electricity generation problem. We show our method proposes decisions with more than 7% lower cost than other decision-making methods.

3530Jet Expansions of Residual Computation

[openreview] [pdf]

Abstract We introduce a framework for expanding residual networks using \textit{jets}, operators that generalize truncated Taylor series. Our method provides a systematic approach to disentangle contributions of different computational paths to model predictions. In contrast to existing techniques such as distillation, probing, or early decoding, our expansions rely solely on the model itself and requires no data, training, or sampling from the model. We demonstrate how our framework grounds and subsumes the logit lens, reveals a (super-)exponential path structure in the network depth and opens up several applications. These include the extraction of nn-gram statistics from a transformer large language model, and the definition of data-free toxicity scores. Our approach enables data-free analysis of residual networks for model interpretation, development, and evaluation.

3531PASRL: Stabilising Reinforcement Learning with Past Action-State Representation Learning

[openreview] [pdf]

Abstract Although deep reinforcement learning (DRL) deals with sequential decision making problems, temporal information representation is absent from state of the art actor-critic algorithms. The reliance on only the current time step information and densely connected neural networks causes instability and oscillations in the smoothness of concurrent actions. Therefore many applied DRL robotics control methods employ various reward shaping, low-pass filter and traditional controller based methods to mitigate this effect. However the interactions of these different parts hinders the performance of the original goal for the RL algorithm. In this paper we present a reinforcement learning algorithm extended with past action-state representation learning, which allows for the end-to-end training of RL based control methods without the need for common heuristics. Our end-to-end training approach produces the smoothest actions while achieving performance scores comparable to the top mixed traditional control and reinforcement learning algorithms.

3532System 1.x: Learning to Balance Fast and Slow Planning with Language Models

[openreview] [pdf]

Abstract Language models can be used to solve long-horizon planning problems in two distinct modes. In a fast ‘System-1’ mode, models directly generate plans without any explicit search or backtracking, and in a slow ‘System-2’ mode, they plan step-by-step by explicitly searching over possible actions. System-2 planning, while typically more effective, is also computationally more expensive and often infeasible for long plans or large action spaces. Moreover, isolated System-1 or System-2 planning ignores the user’s end goals and constraints (e.g., token budget), failing to provide ways for the user to control the model’s behavior. To this end, we propose the System-1.x Planner, a framework for controllable planning with language models that is capable of generating hybrid plans and balancing between the two planning modes based on the difficulty of the problem at hand. System-1.x consists of (i) a controller, (ii) a System-1 Planner, and (iii) a System-2 Planner. Based on a user-specified hybridization factor x governing the degree to which the system uses System-1 vs. System-2, the controller decomposes a planning problem into subgoals, and classifies them as easy or hard to be solved by either System-1 or System-2, respectively. We fine-tune all three components on top of a single base LLM, requiring only search traces as supervision. Experiments with two diverse planning tasks -- Maze Navigation and Blocksworld -- show that our System-1.x Planner outperforms a System-1 Planner, a System-2 Planner trained to approximate A* search, and also a symbolic planner (A* search), given an exploration budget. We also demonstrate the following key properties of our planner: (1) controllability: by adjusting the hybridization factor x (e.g., System-1.75 vs. System-1.5) we can perform more (or less) search, improving performance, (2) flexibility: by building a neuro-symbolic variant composed of a neural System-1 planner and a symbolic System-2 planner, we can take advantage of existing symbolic methods, and (3) generalizability: by learning from different search algorithms (BFS, DFS, A*), we show that our method is robust to the choice of search algorithm used for training.

3533Kolmogorov-Arnold Transformer

[openreview] [pdf]

Abstract Transformers stand as the cornerstone of mordern deep learning. Traditionally, these models rely on multi-layer perceptron (MLP) layers to mix the information between channels. In this paper, we introduce the Kolmogorov–Arnold Transformer (KAT), a novel architecture that replaces MLP layers with Kolmogorov-Arnold Network (KAN) layers to enhance the expressiveness and performance of the model. Integrating KANs into transformers, however, is no easy feat, especially when scaled up. Specifically, we identify three key challenges: (C1) Base function. The standard B-spline function used in KANs is not optimized for parallel computing on modern hardware, resulting in slower inference speeds. (C2) Parameter and Computation Inefficiency. KAN requires a unique function for each input-output pair, making the computation extremely large. (C3) Weight initialization. The initialization of weights in KANs is particularly challenging due to their learnable activation functions, which are critical for achieving convergence in deep neural networks. To overcome the aforementioned challenges, we propose three key solutions: (S1) Rational basis. We replace B-spline functions with rational functions to improve compatibility with modern GPUs. By implementing this in CUDA, we achieve faster computations. (S2) Group KAN. We share the activation weights through a group of neurons, to reduce the computational load without sacrificing performance. (S3) Variance-preserving initialization. We carefully initialize the activation weights to make sure that the activation variance is maintained across layers. With these designs, KAT scales effectively and readily outperforms traditional MLP-based transformers. We demonstrate the advantages of KAT across various tasks, including image recognition, object detection, and semantic segmentation. It consistently enhances performance over the standard transformer architectures of different model sizes.

3534ALMANACS: A Simulatability Benchmark for Language Model Explainability

[openreview] [pdf]

Abstract How do we measure the efficacy of language model explainability methods? While many explainability methods have been developed, they are typically evaluated on bespoke tasks, preventing an apples-to-apples comparison. To help fill this gap, we present ALMANACS, a language model explainability benchmark. ALMANACS scores explainability methods on simulatability, i.e., how well the explanations improve behavior prediction on new inputs. The ALMANACS scenarios span twelve safety-relevant topics such as ethical reasoning and advanced AI behaviors; they have idiosyncratic premises to invoke model-specific behavior; and they have a train-test distributional shift to encourage faithful explanations. By using another language model to predict behavior based on the explanations, ALMANACS is a fully automated benchmark. While not a replacement for human evaluations, we aim for ALMANACS to be a complementary, automated tool that allows for fast, scalable evaluation. Using ALMANACS, we evaluate counterfactual, rationalization, attention, and Integrated Gradients explanations. Our results are sobering: when averaged across all topics, no explanation method outperforms the explanation-free control. We conclude that despite modest successes in prior work, developing an explanation method that aids simulatability in ALMANACS remains an open challenge.

3535Tackling the Generative learning trilemma through VAE and GMM-controlled latent space class expansion

[openreview] [pdf]

Abstract Achieving efficient data augmentation (DA) in time series classification is not a trivial task due to the high complexity of temporal data. Generative models, such as GANs (Generative Adversarial Networks), diffusion models, and Variational Autoencoders (VAEs), are powerful techniques to address the generative learning trilemma of producing (1) high-quality samples, (2) fast sampling, and (3) diversity. These methods vary in their ability to address the trilemma. Diffusion models allows for high diversity and high quality samples, while GAN allows for high quality samples and fast sampling, and VAE for high diversity and fast sampling. In this paper, we introduce a novel generative method, ASCENSION (VAE and GMM-controlled latent space class expansion), that retains the strengths of VAE in terms of diversity and fast sampling, while enabling controlled and quantifiable exploration of uncharted regions in the latent space. This approach not only enhances classification performance but also yields higher quality (more realistic) samples. ASCENSION leverages the probabilistic nature of the VAE’s latent space to represent classes as Gaussian mixture models (GMMs). By modifying this mixture, we enable precise manipulation of class probability densities and boundaries. To ensure intra-class compactness and maximize inter-class separation, we apply clustering constraints. Empirical evaluations on the UCR benchmark (102 datasets) show that ASCENSION outperforms state-of-the-art DA methods, achieving an average classification accuracy improvement of approximately 7% and excelling in all aspects of the generative learning trilemma.

3536How Can LLM Guide RL? A Value-Based Approach

[openreview] [pdf]

Abstract Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback. However, RL algorithms may require extensive trial-and-error interactions to collect useful feedback for improvement. On the other hand, recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities for planning tasks, lacking the ability to autonomously refine their responses based on feedback. Therefore, in this paper, we study how the policy prior provided by the LLM can enhance the sample efficiency of RL algorithms. Specifically, we develop an algorithm named LINVIT\mathtt{LINVIT} that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning, particularly when the difference between the ideal policy and the LLM-informed policy is small, which suggests that the initial policy is close to optimal, reducing the need for further exploration. Additionally, we present a practical algorithm SLINVIT\mathtt{SLINVIT} that simplifies the construction of the value function and employs sub-goals to reduce the search complexity. Our experiments across three interactive environments---ALFWorld, InterCode, and BlocksWorld---demonstrate that the proposed method achieves state-of-the-art success rates and also surpasses previous RL and LLM approaches in terms of sample efficiency.

3537Graph-Supported Dynamic Algorithm Configuration for Multi-Objective Combinatorial Optimization

[openreview] [pdf]

Abstract Deep reinforcement learning (DRL) has been widely used for dynamic algorithm configuration, especially for evolutionary algorithms, which benefit from adaptive update of parameters during the algorithmic execution. However, applying DRL to algorithm configuration for multi-objective combinatorial optimization (MOCO) problems remains relatively unexplored. This paper presents a novel graph neural network (GNN) based DRL to configure multi-objective evolutionary algorithms. We model the dynamic algorithm configuration as a Markov decision process, representing the convergence of solutions in the objective space by a graph, with their embeddings learned by a GNN to enhance the state representation. Experiments on diverse MOCO challenges indicate that our method outperforms traditional and DRL-based algorithm configuration methods in terms of efficacy and adaptability. It also exhibits advantageous generalizability across objective types and problem sizes, and prospective applicability to different evolutionary algorithms.

3538Second-Order Algorithms for Finding Local Nash Equilibria in Zero-Sum Games

[openreview] [pdf]

Abstract Zero-sum games arise in a wide variety of problems, including robust optimization and adversarial learning. However, algorithms deployed for finding a local Nash equilibrium in these games often converge to non-Nash stationary points. This highlights a key challenge: for any algorithm, the stability properties of its underlying dynamical system can cause non-Nash points to be potential attractors. To overcome this challenge, algorithms must account for subtleties involving the curvatures of players’ costs. To this end, we leverage dynamical system theory and develop a second-order algorithm for finding a local Nash equilibrium in the smooth, possibly nonconvex-nonconcave, zero-sum game setting. First, we prove that this novel method guarantees convergence to only local Nash equilibria with a local linear\textit{linear} convergence rate. We then interpret a version of this method as a modified Gauss-Newton algorithm with local superlinear\textit{superlinear} convergence to the neighborhood of a point that satisfies first-order local Nash equilibrium conditions. In comparison, current related state-of-the-art methods do not offer convergence rate guarantees. Furthermore, we show that this approach naturally generalizes to settings with convex and potentially coupled constraints while retaining earlier guarantees of convergence to only local (generalized) Nash equilibria.

3539nGPT: Normalized Transformer with Representation Learning on the Hypersphere

[openreview] [pdf]

Abstract We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.

3540Your Mixture-of-Experts LLM Is Secretly an Embedding Model for Free

[openreview] [pdf]

Abstract While large language models (LLMs) excel on generation tasks, their decoder-only architecture often limits their potential as embedding models if no further representation finetuning is applied. Does this contradict their claim of generalists? To answer the question, we take a closer look at Mixture-of-Experts (MoE) LLMs. Our study shows that the expert routers in MoE LLMs can serve as an off-the-shelf embedding model with promising performance on a diverse class of embedding-focused tasks, without requiring any finetuning. Moreover, our extensive analysis shows that the MoE routing weights (RW) is complementary to the hidden state (HS) of LLMs, a widely-used embedding. Compared to HS, we find that RW is more robust to the choice of prompts and focuses on high-level semantics. Motivated by the analysis, we propose MoEE combining RW and HS, which achieves better performance than using either separately. Our exploration of their combination and prompting strategy shed several novel insights, e.g., a weighted sum of RW and HS similarities outperforms the similarity on their concatenation. Our experiments are conducted on 6 embedding tasks with 20 datasets from the Massive Text Embedding Benchmark (MTEB). The results demonstrate the significant improvement brought by MoEE to LLM-based embedding without further finetuning.

3541Learning Dispersed Embeddings on Hyperspheres

[openreview] [pdf]

Abstract Learning well-separated features in high-dimensional spaces, such as text or image embeddings\textit{embeddings}, is crucial for many machine learning applications. Achieving such separation can be effectively accomplished through the dispersion\textit{dispersion} of embeddings, where unrelated vectors are pushed apart as much as possible. By constraining features to be on a hypersphere\textit{hypersphere}, we can connect dispersion to well-studied problems in mathematics and physics, where optimal solutions are known for limited low-dimensional cases. However, in representation learning we typically deal with a large number of features in high-dimensional space, which makes leveraging existing theoretical and numerical solutions impossible. Therefore, we rely on gradient-based methods to approximate the optimal dispersion on a hypersphere. In this work, we first give an overview of existing methods from disconnected literature. Next, we propose new reinterpretations of known methods, namely Maximum Mean Discrepancy (MMD) and Lloyd’s relaxation algorithm. Finally, we derive a novel dispersion method that directly exploits properties of the hypersphere. Our experiments show the importance of dispersion in image classification and natural language processing tasks, and how algorithms exhibit different trade-offs in different regimes.

3542Bitune: Leveraging Bidirectional Attention to Improve Decoder-Only LLMs

[openreview] [pdf]

Abstract Decoder-only large language models typically rely solely on masked causal attention, which limits their expressiveness by restricting information flow to one direction. We propose Bitune, a method that enhances pretrained decoder-only LLMs by incorporating bidirectional attention into prompt processing. We evaluate Bitune in instruction-tuning and question-answering settings, showing significant improvements in performance on commonsense reasoning, arithmetic, and language understanding tasks. Furthermore, extensive ablation studies validate the role of each component of the method, and demonstrate that Bitune is compatible with various parameter-efficient finetuning techniques and full model finetuning.

3543Controllable Continual Test-Time Adaptation

[openreview] [pdf]

Abstract Continual Test-Time Adaptation (CTTA) is an emerging and challenging task where a model trained in a source domain must adapt to continuously changing conditions during testing, without access to the original source data. CTTA is prone to error accumulation due to uncontrollable domain shifts, leading to blurred decision boundaries between categories. Existing CTTA methods primarily focus on suppressing domain shifts, which proves inadequate during the unsupervised test phase. In contrast, we introduce a novel approach that guides rather than suppresses these shifts. Specifically, we propose C\textbf{C}ontrollable Co\textbf{Co}ntinual T\textbf{T}est-T\textbf{T}ime A\textbf{A}daptation (C-CoTTA), which explicitly prevents any single category from encroaching on others, thereby mitigating the mutual influence between categories caused by uncontrollable shifts. Moreover, our method reduces the sensitivity of model to domain transformations, thereby minimizing the magnitude of category shifts. Extensive quantitative experiments demonstrate the effectiveness of our method, while qualitative analyses, such as t-SNE plots, confirm the theoretical validity of our approach.

3544Martryoshka: Learning to Drive Black-Box LLMs with LLMs

[openreview] [pdf]

Abstract Despite the impressive generative abilities of black-box large language models (LLMs), their inherent opacity hinders further advancements in capabilities such as reasoning, planning, and personalization. Existing works aim to enhance LLM capabilities via domain-specific adaptation or in-context learning, which require additional training on accessible model parameters, an infeasible option for black-box LLMs. To address this challenge, we introduce Martryoshika, a lightweight white-box LLM controller that guides a large-scale black-box LLM generator by decomposing complex tasks into a series of intermediate outputs. Specifically, we consider the black-box LLM as an environment, with Martryoshika serving as a policy to provide intermediate guidance through prompts for driving the black-box LLM. Martryoshika is trained to pivot the outputs of the black-box LLM aligning with preferences during iterative interaction, which enables controllable multi-turn generation and self-improvement in optimizing intermediate guidance. Empirical evaluations on three diverse tasks demonstrate that Martryoshika effectively enhances the capabilities of black-box LLMs in complex, long-horizon tasks, including reasoning, planning, and personalization. By leveraging this pioneering controller-generator framework to mitigate dependence on model parameters, Martryoshika provides a transparent and practical solution for improving black-box LLMs through controllable multi-turn generation using white-box LLMs.

3545A graph-based global optimization framework for problems with nonconvex norm constraints and penalty functions

[openreview] [pdf]

Abstract Optimization problems with norm-bounding constraints appear in various applications, from portfolio optimization to machine learning, feature selection, and beyond. A widely used variant of these problems relaxes the norm-bounding constraint through Lagrangian relaxation and moves it to the objective function as a form of penalty or regularization term. A challenging class of these models uses the zero-norm function to induce sparsity in statistical parameter estimation models. Most existing exact solution methods for these problems use additional binary variables together with artificial bounds on variables to formulate them as a mixed-integer program in a higher dimension, which is then solved by off-the-shelf solvers. Other exact methods utilize specific structural properties of the objective function to solve certain variants of these problems, making them non-generalizable to other problems with different structures. An alternative approach employs nonconvex penalties with desirable statistical properties, which are solved using heuristic or local methods due to the structural complexity of those terms. In this paper, we develop a novel graph-based method to globally solve optimization problems that contain a generalization of norm-bounding constraints. This includes standard p\ell_p-norms for p[0,)p \in [0, \infty) as well as nonconvex penalty terms, such as SCAD and MCP, as special cases. Our method uses decision diagrams to build strong convex relaxations for these constraints in the original space of variables without the need to introduce additional auxiliary variables or impose artificial variable bounds. We show that the resulting convexification method, when incorporated into a spatial branch-and-cut framework, converges to the global optimal value of the problem under mild conditions. To demonstrate the capabilities of the proposed framework, we conduct preliminary computational experiments on benchmark sparse linear regression problems with complex nonconvex penalty terms that existing global solvers cannot model or solve. This establishes our framework as the first algorithm capable of globally solving such challenging mixed-integer nonlinear programs.

3546G4Seg: Generation for Online Segmentation Refinement with Diffusion Models

[openreview] [pdf]

Abstract This paper considers the problem of utilizing a large-scale text-to-image diffusion model to tackle the challenging Inexact Segmentation (IS) task. Unlike traditional approaches that rely heavily on discriminative-model-based paradigm or dense visual representations derived from internal attention mechanisms, our method focuses on the intrinsic generative priors in Stable Diffusion~(SD). Specifically, we exploit the pattern discrepancies between original images and mask-conditional generated images to facilitate a coarse-to-fine segmentation refinement by establishing a semantic correspondence alignment and updating the foreground probability. Comprehensive quantitative and qualitative experiments validate the effectiveness and superiority of our plug-and-play design, underscoring the potential of leveraging generation discrepancies to model dense representations and encouraging further exploration of generative approaches for solving discriminative tasks.

3547Multi-environment Topic Models

[openreview] [pdf]

Abstract Probabilistic topic models are a powerful tool for extracting latent themes from large text datasets. In many text datasets, we also observe per-document covariates (e.g., source, style, political affiliation) that act as environments that modulate a “global” (environment-agnostic) topic representation. Accurately learning these representations is important for prediction on new documents in unseen environments and for estimating the causal effect of topics on real-world outcomes. To this end, we introduce the Multi-environment Topic Model (MTM), an unsupervised probabilistic model that separates global and environment-specific terms. Through experimentation on various political content, from ads to tweets and speeches, we show that the MTM produces interpretable global topics with distinct environment-specific words. On multi-environment data, the MTM outperforms strong baselines in and out-of-distribution. It also enables the discovery of accurate causal effects.

3548Generative Visual Instruction Tuning

[openreview] [pdf]

Abstract We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models.

3549Learning Generalizable Skills from Offline Multi-Task Data for Multi-Agent Cooperation

[openreview] [pdf]

Abstract Learning cooperative multi-agent policy from offline multi-task data that can generalize to unseen tasks with varying numbers of agents and targets is an attractive problem in many scenarios. Although aggregating general behavior patterns among multiple tasks as skills to improve policy transfer is a promising approach, two primary challenges hinder the further advancement of skill learning in offline multi-task MARL. Firstly, extracting general cooperative behaviors from various action sequences as common skills lack bringing cooperative temporal knowledge into them. Secondly, existing works only involve common skills and can not adaptively choose independent knowledge as task-specific skills in each task for fine-grained action execution. To address these challenges, we propose an approach named Hierarchical and Separate Skill Discovering (HiSSD) for generalizable offline multi-task MARL through skill learning. HiSSD leverages a hierarchical framework that jointly learns common and task-specific skills. The common skills learn cooperative temporal knowledge and enable in-sample exploration for offline multi-task MARL. The task-specific skills represent the priors of each task and achieve a task-guided fine-grained action execution. To verify the advancement of our method, we conduct experiments on multi-agent MuJoCo and SMAC benchmarks. After training policy using HiSSD on offline multi-task data, the empirical results show that HiSSD assigns effective cooperative behaviors and obtains superior performance in unseen tasks.

3550Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

[openreview] [pdf]

Abstract Large Language Models (LLMs) show impressive results in numerous practical applications, but they lack essential safety features that are common in other areas of computer science, particularly an explicit separation of instructions and data. This makes them vulnerable to manipulations such as indirect prompt injections and generally unsuitable for safety-critical tasks. Surprisingly, there is currently no established definition or benchmark to quantify this phenomenon. In this work, we close this gap by introducing a formal measure for instruction-data separation for single-turn language models and an empirical variant that is calculable from a model’s outputs. We also present a new dataset, SEP, that allows estimating the measure for real-world models. Our results on various LLMs show that the problem of instruction-data separation is real: all models fail to achieve high separation, and canonical mitigation techniques, such as prompt engineering and fine-tuning, either fail to substantially improve separation or reduce model utility.

3551Multi-agent cooperation through learning-aware policy gradients

[openreview] [pdf]

Abstract Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between `learning-aware’ agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner’s dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.

3552Latent-Predictive Empowerment: Measuring Empowerment without a Simulator

[openreview] [pdf]

Abstract Empowerment has the potential to help agents learn large skillsets, but is not yet a scalable solution for training general-purpose agents. Recent empowerment methods learn large skillsets by maximizing the mutual information between skills and states, but these approaches require a model of the transition dynamics, which can be challenging to learn in realistic settings with high-dimensional and stochastic observations. We present an algorithm, Latent-Predictive Empowerment (LPE), that can compute empowerment in a more scalable manner. LPE learns large skillsets by maximizing an objective that under certain conditions has the same optimal skillset as the mutual information between skills and states, but our objective is more tractable to optimize because it only requires learning a simpler latent-predictive model rather than a full simulator of the environment. We show empirically in a variety of settings, includes ones with high-dimensional observations and highly stochastic transition dynamics, that our empowerment objective learns similar-sized skillsets as the leading empowerment algorithm, which assumes access to a model of the transition dynamics, and outperforms other model-based approaches to empowerment.

3553Evaluating the World Models Used by Pretrained Learners

[openreview] [pdf]

Abstract A common approach for assessing whether large pretrained models develop world models is by studying the behavior of fixed models. However, many of the benefits of having a world model arise when transferring a model to new tasks (e.g. few- shot learning). In this paper, we ask: what does it mean to test if alearnerhas a world model embodied in it? We consider a simple definition of a true world model: a mapping from inputs to states. We introduce a procedure that assesses a learner’s world model by measuring its inductive bias when transferring to new tasks. This inductive bias can be measured in two distinct dimensions: does a learner extrapolate to new data by building functions of state, and to what degree do these functions capture the full state? We use this procedure to study the degree to which pretrained models extrapolate to new tasks based on state. We find that models that perform very well on next-token prediction can extrapolate to new tasks with very little inductive bias toward state. We conclude by assessing the possibility that these models learn bundles of heuristics that enable them to perform well on next-token prediction despite preserving little of state.

3554Visual Instruction Tuning with 500x Fewer Parameters through Modality Linear Representation-Steering

[openreview] [pdf]

Abstract Multimodal Large Language Models (MLLMs) have significantly advanced visual tasks by integrating visual representations into large language models (LLMs). The textual modality, inherited from LLMs, equips MLLMs with abilities like instruction following and in-context learning. In contrast, the visual modality enhances performance in downstream tasks by leveraging rich semantic content, spatial information, and grounding capabilities. These intrinsic modalities work synergistically across various visual tasks. Our research initially reveals a persistent imbalance between these modalities, with text often dominating output generation during visual instruction tuning. This imbalance occurs when using both full fine-tuning and parameter-efficient fine-tuning (PEFT) methods. We then found that re-balancing these modalities can significantly reduce the number of trainable parameters required, inspiring a direction for further optimizing visual instruction tuning. Hence, in this paper, we introduce Modality Linear Representation-Steering (MoReS) to achieve the goal. MoReS effectively re-balances the intrinsic modalities throughout the model, where the key idea is to steer visual representations through linear transformations in the visual subspace across each model layer. To validate our solution, we composed LLaVA Steering, a suite of models integrated with the proposed MoReS method. Evaluation results show that the composed LLaVA Steering models require, on average, 500 times fewer trainable parameters than LoRA needs while still achieving comparable performance across three visual benchmarks and eight visual question-answering tasks. Last, we present the LLaVA Steering Factory, an in-house developed platform that enables researchers to quickly customize various MLLMs with component-based architecture for seamlessly integrating state-of-the-art models, and evaluate their intrinsic modality imbalance. This open-source project enriches the research community to gain a deeper understanding of MLLMs.

3555Enhancing Outlier Knowledge for Few-Shot Out-of-Distribution Detection with Extensible Local Prompts

[openreview] [pdf]

Abstract Out-of-Distribution (OOD) detection, aiming to distinguish outliers from known categories, has gained prominence in practical scenarios. Recently, the advent of vision-language models (VLM) has heightened interest in enhancing OOD detection for VLM through few-shot tuning. However, existing methods mainly focus on optimizing global prompts, ignoring refined utilization of local information with regard to outliers. Motivated by this, we freeze global prompts and introduce a novel coarse-to-fine tuning paradigm to emphasize regional enhancement with local prompts. Our method comprises two integral components: global prompt guided negative augmentation and local prompt enhanced regional regularization. The former utilizes frozen, coarse global prompts as guiding cues to incorporate negative augmentation, thereby leveraging local outlier knowledge. The latter employs trainable local prompts and a regional regularization to capture local information effectively, aiding in outlier identification. We also propose regional-related metric to empower the enrichment of OOD detection. Moreover, since our approach explores enhancing local prompts only, it can be seamlessly integrated with trained global prompts during inference to boost the performance. Comprehensive experiments demonstrate the effectiveness and potential of our method. Notably, our method reduces average FPR95 by 5.17% against state-of-the-art method in 4-shot tuning on challenging ImageNet-1k dataset, even outperforming 16-shot results of previous methods. Code will be available upon acceptance.

3556Rethinking Self-Distillation: Label Averaging and Enhanced Soft Label Refinement with Partial Labels

[openreview] [pdf]

Abstract We investigate the mechanisms of self-distillation in multi-class classification, particularly in the context of linear probing with fixed feature extractors where traditional feature learning explanations do not apply. Our theoretical analysis reveals that multi-round self-distillation effectively performs label averaging among instances with high feature correlations, governed by the eigenvectors of the Gram matrix derived from input features. This process leads to clustered predictions and improved generalization, mitigating the impact of label noise by reducing the model’s reliance on potentially corrupted labels. We establish conditions under which multi-round self-distillation achieves 100% population accuracy despite label noise. Furthermore, we introduce a novel, efficient single-round self-distillation method using refined partial labels from the teacher’s top two softmax outputs, referred to as the PLL student model. This approach replicates the benefits of multi-round distillation in a single round, achieving comparable or superior performance--especially in high-noise scenarios--while significantly reducing computational cost.

3557Coupling Category Alignment for Graph Domain Adaptation

[openreview] [pdf]

Abstract Graph domain adaptation (GDA), which transfers knowledge from a labeled source domain to an unlabeled target graph domain, attracts considerable attention in numerous fields. Emerging methods commonly employ message-passing neural networks (MPNNs) to learn domain-invariant representations by aligning the entire domain distribution. However, these methods overlook the category-level distribution alignment across different domains, potentially leading to confusion of categories. To address the problem, we propose an effective framework named \textbf{Co}upling \textbf{C}ateg{o}ry \textbf{A}lignment (\method{}) for GDA, which effectively addresses the category alignment issue with theoretical guarantees. \method{} incorporates a graph convolutional network branch and a graph kernel network branch, which explore graph topology in implicit and explicit manners. To mitigate category-level domain shifts, we leverage knowledge from both branches, iteratively filtering highly reliable samples from the target domain using one branch and fine-tuning the other accordingly. Furthermore, with these reliable target domain samples, we incorporate the coupled branches into a holistic contrastive learning framework. This framework includes multi-view contrastive learning to ensure consistent representations across the dual branches, as well as cross-domain contrastive learning to achieve category-level domain consistency. Theoretically, we establish a sharper generalization bound, which ensures the effectiveness of category alignment. Extensive experiments on benchmark datasets validate the superiority of the proposed \method{} compared with baselines.

3558COOL: Efficient and Reliable Chain-Oriented Objective Logic with Neural Networks Feedback Control for Program Synthesis

[openreview] [pdf]

Abstract Program synthesis methods, whether formal or neural-based, lack fine-grained control and flexible modularity, which limits their adaptation to complex software development. These limitations stem from rigid Domain-Specific Language (DSL) frameworks and neural network incorrect predictions. To this end, we propose the \textbf{Chain of Logic (CoL)}, which organizes synthesis stages into a chain and provides precise heuristic control to guide the synthesis process. Furthermore, by integrating neural networks with libraries and introducing a \textbf{Neural Network Feedback Control (NNFC)} mechanism, our approach modularizes synthesis and mitigates the impact of neural network mispredictions. Experiments on relational and symbolic synthesis tasks show that CoL significantly enhances the efficiency and reliability of DSL program synthesis across multiple metrics. Specifically, CoL improves accuracy by 70% while reducing tree operations by 91% and time by 95%. Additionally, NNFC further boosts accuracy by 6%, with a 64% reduction in tree operations under challenging conditions such as insufficient training data, increased difficulty, and multidomain synthesis. These improvements confirm COOL as a highly efficient and reliable program synthesis framework.

3559GE-PEFT: Gated Expandable Parameter-Efficient Fine-Tuning for Continual Learning

[openreview] [pdf]

Abstract Continual learning (CL) is a research field focused on continuously adapting foundation models such as large language models (LMs) to newly emerging information sources and tasks. While aspects such as parameter efficiency, knowledge transfer, and managing model capacity have recently received attention, the main research focus in CL remains on preventing catastrophic forgetting. Specifically, there is a lack of solutions that address all these aspects simultaneously. We bridge this gap by introducing Gated Expandable Parameter-Efficient Fine-Tuning (GE-PEFT). Our approach shares knowledge of previous tasks through leveraging a single, dynamically expanding PEFT module within LMs while selectively gating irrelevant previous tasks. Our experiments across multiple task-incremental CL benchmarks demonstrate that GE-PEFT outperforms existing state-of-the-art CL approaches in both full CL and few-shot settings. Our ablation and parameter sensitivity studies highlight the benefit of each proposed component, demonstrating that GE-PEFT offers a more efficient and adaptive solution for CL in LMs.

3560AutoHijacker: Automatic Indirect Prompt Injection Against Black-box LLM Agents

[openreview] [pdf]

Abstract Although large Language Models (LLMs) and LLM agents have been widely adopted, they are vulnerable to indirect prompt injection attacks, where malicious external data is injected to manipulate model behaviors. Existing evaluations of LLM robustness against such attacks are limited by handcrafted methods and reliance on white-box or gray-box access—conditions unrealistic in practical deployments. To bridge this gap, we propose AutoHijacker, an automatic indirect black-box prompt injection attack. Built on the concept of LLM-as-optimizers, AutoHijacker introduces a batch-based optimization framework to handle sparse feedback and also leverages a trainable memory to enable effective generation of indirect prompt injections without continuous querying. Evaluations on two public benchmarks, AgentDojo and Open-Prompt-Injection, show that AutoHijacker outperforms 11 baseline attacks and achieves state-of-the-art performance without requiring external knowledge like user instructions or model configurations, and also demonstrates higher average attack success rates against 8 various defenses. Additionally, AutoHijacker successfully attacks a commercial LLM agent platform, achieving a 71.9% attack success rate in both document interaction and website browsing tasks.

3561Query-Efficient Planning with Language Models

[openreview] [pdf]

Abstract Planning in complex environments requires an agent to efficiently query a world model to find a feasible sequence of actions from start to goal. Recent work has shown that Large Language Models (LLMs), with their rich prior knowledge and reasoning capabilities, can potentially help with planning by searching over promising states and adapting to feedback from the world. In this paper, we propose and study two fundamentally competing frameworks that leverage LLMs for query-efficient planning. The first uses LLMs as a heuristic within a search-based planner to select promising nodes to expand and propose promising actions. The second uses LLMs as a generative planner to propose an entire sequence of actions, query the world model, and adapt based on feedback. We show that while both approaches improve upon comparable baselines, using an LLM as a generative planner results in significantly fewer interactions. Our key finding is that the LLM as a planner can more rapidly adapt its planning strategies based on immediate feedback than LLM as a heuristic. We present evaluations and ablations on Robotouille and PDDL planning benchmarks and discuss connections to existing theory on query-efficient planning algorithms.

3562Conversational Few-Shot Prompting: Rethinking Few-Shot Prompting for Chat Language Model

[openreview] [pdf]

Abstract In-context learning, also referred to as few-shot learning, enables language models to adapt to tasks using a limited number of examples embedded in the prompt. Traditional approaches typically present all examples in a single prompt, which works well for pre-trained base models. However, the application of this method to instruction-tuned chat models, such as ChatGPT, remains underexplored. In this paper, we introduce a novel conversational few-shot prompting technique, which structures few-shot examples as multi-turn conversation between the user and the assistant, rather than a single input prompt. This conversational framing better aligns with the interactive nature of chat models, enhancing their instruction-following abilities and generalization across tasks. Through experiments on various benchmarks, we demonstrate that this approach significantly improves performance, particularly in low-shot scenarios, compared to traditional few-shot prompting. Our results suggest that this method provides a more flexible and robust way to leverage few-shot examples in instruction-tuned chat models, improving task performance without the need for additional fine-tuning, reducing prompt sensitivity, and offering potential for diverse applications.

3563ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data

[openreview] [pdf]

Abstract Online reinforcement learning (RL) with sparse rewards poses a challenge partly because of the lack of feedback on states leading to the goal. Furthermore, expert offline data with reward signal is rarely available to provide this feedback and bootstrap online learning. How can we guide online agents to the right solution without this on-task data? Reward shaping offers a solution by providing fine-grained signal to nudge the policy towards the optimal solution. However, reward shaping often requires domain knowledge to hand-engineer heuristics for a specific goal. To enable more general and inexpensive guidance, we propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data such as Internet recordings, off-task demonstrations, task failures, and undirected environment interaction. By learning a model of optimal goal-conditioned value from diverse passive data, we open the floor to scaling up and using a wide variety of data sources to model general goal-reaching behaviors relevant to guiding online RL. Specifically, we use intent-conditioned value functions to learn from diverse video and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.

3564LLM Spark: Critical Thinking Evaluation of Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) excel in complex tasks but often struggle with inconsistencies in problem framing, a critical skill for real-world scenarios. This paper introduces SPARK, a novel evaluation framework grounded in the Hierar- chical Three-Space Theory, to assess LLMs’ ability to identify missing informa- tion and challenge flawed problem setups. We create benchmarks by introducing inconsistencies and misleading cues in diverse question-answering datasets, cov- ering mathematics, science, and reading comprehension. Our experiments with state-of-the-art LLMs reveal their limitations in critical thinking, particularly in recognizing inconsistencies. We also explore mitigation strategies such as modi- fied prompting and targeted fine-tuning. Furthermore, we conduct comprehensive experiments to investigate how model and problem properties influence critical thinking capabilities in LLMs.

3565Learning to Generate Diverse Pedestrian Movements from Web Videos with Noisy Labels

[openreview] [pdf]

Abstract Understanding and modeling pedestrian movements in the real world is crucial for applications like motion forecasting and scene simulation. Many factors influence pedestrian movements, such as scene context, individual characteristics, and goals, which are often ignored by the existing human generation methods. Web videos contain natural pedestrian behavior and rich motion context, but annotating them with pre-trained predictors leads to noisy labels. In this work, we propose learning diverse pedestrian movements from web videos. We first curate a large-scale dataset called CityWalkers that captures diverse real-world pedestrian movements in urban scenes. Then, based on CityWalkers, we propose a generative model called PedGen for diverse pedestrian movement generation. PedGen introduces automatic label filtering to remove the low-quality labels and a mask embedding to train with partial labels. It also contains a novel context encoder that lifts the 2D scene context to 3D and can incorporate various context factors in generating realistic pedestrian movements in urban scenes. Experiments show that PedGen outperforms existing baseline methods for pedestrian movement generation by learning from noisy labels and incorporating the context factors. In addition, PedGen achieves zero-shot generalization in both real-world and simulated environments. The code, model, and data will be made publicly available.

3566On the Adversarial Vulnerability of Label-Free Test-Time Adaptation

[openreview] [pdf]

Abstract Despite the success of Test-time adaptation (TTA), recent work has shown that adding relatively small adversarial perturbations to a limited number of samples leads to significant performance degradation. Therefore, it is crucial to rigorously evaluate existing TTA algorithms against relevant threats and implement appropriate security countermeasures. Importantly, existing threat models assume test-time samples will be labeled, which is impractical in real-world scenarios. To address this gap, we propose a new attack algorithm that does not rely on access to labeled test samples, thus providing a concrete way to assess the security vulnerabilities of TTA algorithms. Our attack design is grounded in theoretical foundations and can generate strong attacks against different state of the art TTA methods. In addition, we show that existing defense mechanisms are almost ineffective, which emphasizes the need for further research on TTA security. Through extensive experiments on CIFAR10-C, CIFAR100-C, and ImageNet-C, we demonstrate that our proposed approach closely matches the performance of state-of-the-art attack benchmarks, even without access to labeled samples. In certain cases, our approach generates stronger attacks, e.g., more than 4% higher error rate on CIFAR10-C.

3567PromptWizard: Task-Aware Prompt Optimization Framework

[openreview] [pdf]

Abstract Large language models (LLMs) have transformed AI across diverse domains, with \textit{prompting} being central to their success in guiding model outputs. However, manual prompt engineering is both labor-intensive and domain-specific, necessitating the need for automated solutions. We introduce PromptWizard, a novel, fully automated framework for discrete prompt optimization, utilizing a self-evolving, self-adapting mechanism. Through a feedback-driven critique and synthesis process, PromptWizard achieves an effective balance between exploration and exploitation, iteratively refining both prompt instructions and in-context examples to generate human-readable, task-specific prompts. This guided approach systematically improves prompt quality, resulting in superior performance across 45 tasks. PromptWizard excels even with limited training data, smaller LLMs, and various LLM architectures. Additionally, our cost analysis reveals a substantial reduction in API calls, token usage, and overall cost, demonstrating PromptWizard’s efficiency, scalability, and advantages over existing prompt optimization strategies.

3568HDFlow: Enhancing LLM Complex Problem-Solving with Hybrid Thinking and Dynamic Workflows

[openreview] [pdf]

Abstract Despite recent advancements in large language models (LLMs), their performance on complex reasoning problems requiring multi-step thinking and combining various skills is still limited. To address this, we propose a novel framework HDFlow for complex reasoning with LLMs that combines fast and slow thinking modes in an adaptive manner. Our approach consists of two key components: 1) a new approach for slow, deliberate reasoning called Dynamic Workflow, which automatically decomposes complex problems into more manageable sub-tasks and dynamically designs a workflow to assemble specialized LLM or symbolic reasoning tools to solve sub-tasks; 2) Hybrid Thinking, a general framework that dynamically combines fast and slow thinking based on problem complexity. Finally, we propose an easy-to-scale method for automatically synthesizing a large-scale dataset of 27K challenging reasoning problems for complex reasoning and a hybrid thinking tuning method that trains smaller LLMs on this dataset to internalize the fast/slow hybrid reasoning strategies. Experiments on four reasoning benchmark datasets demonstrate that our slow thinking with dynamic workflows significantly outperforms Chain-of-Thought, and hybrid thinking achieves the highest accuracy while providing an effective balance between computational efficiency and performance. Fine-tuning using our hybrid thinking approach also significantly boosts the complex reasoning capabilities of open-source language models. The results showcase the promise of slow thinking, dynamic workflows, and hybrid thinking in expanding the frontier of complex problem-solving with LLMs.

3569Unconstrained Robust Online Convex Optimization

[openreview] [pdf]

Abstract This paper addresses online learning with ‘‘corrupted’’ feedback. Our learner is provided with potentially corrupted gradients g~t\tilde g_t instead of the ‘‘true’’ gradients gtg_t. We make no assumptions about how the corruptions arise: they could be the result of outliers, mislabeled data, or even malicious interference. We focus on the difficult ‘‘unconstrained’’ setting in which our algorithm must maintain low regret with respect to any comparison point uRd||u|| \in \mathbb{R}^d. Perhaps surprisingly, the unconstrained setting is significantly more challenging as existing algorithms suffer extremely high regret even with very tiny amounts of corruption (which is not true in the case of a bounded domain). Our algorithms guarantee regret uG(T+k) ||u||G (\sqrt{T} + k) when Lipschitz constant GmaxtgtG \ge \max_t ||g_t|| is known, where kk is a measure of the total amount of corruption. When GG is unknown and incur an extra additive penalty of (u2+G2)k(||u||^2+G^2) k.

3570Prioritize Alignment in Dataset Distillation

[openreview] [pdf]

Abstract Dataset Distillation aims to compress a large dataset into a significantly more compact, synthetic one without compromising the performance of the trained models. To achieve this, existing methods use the agent model to extract information from the target dataset and embed it into the distilled dataset. Consequently, the quality of extracted and embedded information determines the quality of the distilled dataset. In this work, we find that existing methods introduce misaligned information in both information extraction and embedding stages. To alleviate this, we propose Prioritize Alignment in Dataset Distillation (\textbf{PAD}), which aligns information from the following two perspectives.We prune the target dataset according to the compressing ratio to filter the information that can be extracted by the agent model.We use only deep layers of the agent model to perform the distillation to avoid excessively introducing low-level information. This simple strategy effectively filters out misaligned information and brings non-trivial improvement for mainstream matching-based distillation algorithms. Furthermore, built on trajectory matching, \textbf{PAD} achieves remarkable improvements on various benchmarks, achieving state-of-the-art performance.

3571Location, Location, Location: Design Bias with Kernel Transformation

[openreview] [pdf]

Abstract It has been hypothesized that the old brain was compressed into cortical columns of the neocortex during the evolution of mammalian brains. Computational modeling of hippocampal-cortical interaction inspires us to propose a navigation-based implicit representation for manifold learning. The key new insight is to transform any explicit function (or geometrically a manifold) to an implicit representation using design bias for exploiting the concentration of measure (CoM) in high dimensional spaces. CoM-based blessing of dimensionality enables us to solve the manifold learning problem by direct-fit or local computation with guaranteed generalization property and without the need to discover global topology. We construct a memory encoding model, namely specification-before-generalization (SbG), and extend it into recursive kernel transformation to mirror the nested structure of the physical world. The biological plausibility of SbG learning is supported by its consistency with the wake-sleep cycles of mammalian brains. Finally, we showcase the application of design bias and recursive kernel transformation to understanding the phylogenetic continuity of navigation and memory and the manifold untangling of object recognition by the ventral stream.

3572Decoupling Angles and Strength in Low-rank Adaptation

[openreview] [pdf]

Abstract Parameter Efficient Fine-Tuning (PEFT) methods have recently gained extreme popularity thanks to the vast availability of large-scale models, allowing to quickly adapt pretrained models to downstream tasks with minimal computational costs. However, current additive finetuning methods such as LoRA show low robustness to prolonged training and hyperparameter choices, not allowing for optimal out-of-the-box usage. On the other hand, multiplicative and bounded approaches such as ETHER, even if providing higher robustness only allows for extremely low-rank adaptations and are limited to a fixed-strength transformation, hindering the expressive power of the adaptation. In this work, we propose a novel Decoupled Fine-Tuning (DeFT) paradigm that consists in decoupling the weight transformation strength from the angular information. We effectively show this proposed approach improves over two current PEFT methods, namely LoRA and ETHER. Integrating DeFT with LoRA normalizes and scales the learnable low-rank matrices and integrating with DeFT with ETHER allows for greater expressivity by increasing the rank of the updates, and by controlling the transformation boundaries. Code will be released upon acceptance.

3573Re-examining learning linear functions in context

[openreview] [pdf]

Abstract In context learning (ICL) is an attractive method of solving a wide range of problems. Inspired by Garg et al., we look closely at ICL in a variety of train and test settings for several transformer models of different sizes trained from scratch. Our study complements prior work by pointing out several systematic failures of these models to generalize to data not in the training distribution, thereby showing some limitations of ICL.

3574Liquid Dino: A Multi-Task Neural Network towards Autonomous Driving

[openreview] [pdf]

Abstract In the realm of advanced driver-assistance systems (ADAS) and autonomous driving, the accurate classification of driver emotions, behaviors and contextual environments is critical for enhancing vehicle safety and user experience. This study investigates the performance of various neural network architectures across four distinct classification tasks: Emotion Recognition, Driver Behavior Recognition, Scene-Centric Context Recognition and Vehicle-Based Context Recognition, all of which incorporate visual information captured through cameras. By utilizing camera-based data, we aim to evaluate how different neural architectures handle visual inputs in these diverse contexts, thereby exploring the robustness and generalization of each model to different real-world scenarios. We compare the performance of several state-of-the-art models and introduce a novel contribution that significantly improve classification accuracies in all areas. Our results demonstrate that the proposed Liquid Dino architecture achieves an overall average accuracy of 83.79%, outperforming other models in recognizing driver emotions, behaviors and contextual scenarios. These enhancements underscore the potential of our proposed methods in contributing to the development of more reliable and responsive ADAS.In the realm of advanced driver-assistance systems (ADAS) and autonomous driving, the accurate classification of driver emotions, behaviors and contextual environments is critical for enhancing vehicle safety and user experience. This study investigates the performance of various neural network architectures across four distinct classification tasks: Emotion Recognition, Driver Behavior Recognition, Scene-Centric Context Recognition and Vehicle-Based Context Recognition, all of which incorporate visual information captured through cameras. By utilizing camera-based data, we aim to evaluate how different neural architectures handle visual inputs in these diverse contexts, thereby exploring the robustness and generalization of each model to different real-world scenarios. We compare the performance of several state-of-the-art models and introduce a novel contribution that significantly improve classification accuracies in all areas. Our results demonstrate that the proposed Liquid Dino architecture achieves an overall average accuracy of 83.79%, outperforming other models in recognizing driver emotions, behaviors and contextual scenarios. These enhancements underscore the potential of our proposed methods in contributing to the development of more reliable and responsive ADAS.

3575Privacy-Aware Lifelong Learning

[openreview] [pdf]

Abstract Lifelong learning algorithms enable models to incrementally acquire new knowledge without forgetting previously learned information. Contrarily, the field of machine unlearning focuses on explicitly forgetting certain previous knowledge from pretrained models when requested, in order to comply with data privacy regulations on the right-to-be-forgotten. Enabling efficient lifelong learning with the capability to selectively unlearn sensitive information from models presents a critical and largely unaddressed challenge with contradicting objectives. We address this problem from the perspective of simultaneously preventing catastrophic forgetting and allowing forward knowledge transfer during task-incremental learning, while ensuring exact task unlearning and minimizing memory requirements, based on a single neural network model to be adapted. Our proposed solution, privacy-aware lifelong learning (PALL), involves optimization of task-specific sparse subnetworks with parameter sharing within a single architecture. We additionally utilize an episodic memory rehearsal mechanism to facilitate exact unlearning without performance degradations. We empirically demonstrate the scalability of PALL across various architectures in image classification, and provide a state-of-the-art solution that uniquely integrates lifelong learning and privacy-aware unlearning mechanisms for responsible AI applications.

3576Symbiotic Tuning: A Simple Approach for Enhancing Task Performance of Side-Tuning

[openreview] [pdf]

Abstract The reduction of the computational and memory overhead associated with fine-tuning large language models remains a significant challenge for current research in natural language processing. Achieving an optimal balance between task performance, adaptability, and low VRAM requirement often presents a complex trade-off. Parameter-efficient fine-tuning (PEFT) methods, such as LoRA, have gained attention for their ability to reduce the number of trainable parameters while preserving task performance. However, they have not yet achieved a notable reduction in VRAM usage, which is still predominantly consumed by model weights and activations during backpropagation. In contrast, Ladder Side-Tuning (LST) has been proposed as an alternative that effectively reduces VRAM usage by freezing the backbone language model (BLM) and training only lightweight side networks. Nevertheless, this reduction in memory usage often results in a decline in performance, as LST typically exhibits inferior performance compared to PEFT methods on the same BLM. To address these limitations, we propose Symbiotic Tuning (SymTune), a novel approach that extracts intermediate outputs from the BLM and integrates symbiotic modules to enhance feature processing capabilities. This method avoids a direct trade-off between performance and VRAM efficiency, offering two key advantages: 1) robust performance across a wide range of natural language tasks, and 2) reduced VRAM consumption through an improved side-tuning architecture. The experimental results demonstrate that SymTune provides a scalable and memory-efficient solution for fine-tuning language models.

3577A Framework of Distilling Multimodal Large Language Models

[openreview] [pdf]

Abstract The success of Large Language Models (LLM) has led researchers to explore Multimodal Large Language Models (MLLM) for unified visual and linguistic understanding. However, the increasing model size and computational complexity of MLLM limit their use in resource-constrained environments. Small-scale MLLM (ss-MLLM) aims to retain the capabilities of the large-scale model (ll-MLLM) while reducing computational demands, but resulting in a significant decline in performance. To address the aforementioned issues, we propose a novel LLaVA-KD framework to transfer knowledge from ll-MLLM to ss-MLLM. Specifically, we introduce Multimodal Distillation (MDist) to minimize the divergence between the visual-textual output distributions of ll-MLLM and ss-MLLM, and Relation Distillation (RDist) to transfer ll-MLLM’s ability to model correlations between visual features. Additionally, we propose a three-stage training scheme to fully exploit the potential of ss-MLLM: 1) Distilled Pre-Training to align visual-textual representations, 2) Supervised Fine-Tuning to equip the model with multimodal understanding, and 3) Distilled Fine-Tuning to further transfer ll-MLLM capabilities. Our approach significantly improves performance without altering the small model’s architecture. Extensive experiments and ablation studies validate the effectiveness of each proposed component. Code will be available.

3578Losing dimensions: Geometric memorization in generative diffusion

[openreview] [pdf]

Abstract Generative diffusion processes are state-of-the-art machine learning models deeply connected with fundamental concepts in statistical physics. Depending on the dataset size and the capacity of the network, their behavior is known to transition from an associative memory regime to a generalization phase in a phenomenon that has been described as a glassy phase transition. Here, using statistical physics techniques, we extend the theory of memorization in generative diffusion to manifold-supported data. Our theoretical and experimental findings indicate that different tangent subspaces are lost due to memorization effects at different critical times and dataset sizes, which depend on the local variance of the data along their directions. Perhaps counterintuitively, we find that, under some conditions, subspaces of higher variance are lost first due to memorization effects. This leads to a selective loss of dimensionality where some prominent features of the data are memorized without a full collapse on any individual training point. We validate our theory with a comprehensive set of experiments on networks trained both in image datasets and on linear manifolds, which result in a remarkable qualitative agreement with the theoretical predictions.

3579Robust Heterogeneous Graph Neural Network Explainer with Graph Information Bottleneck

[openreview] [pdf]

Abstract Explaining the prediction process of Graph Neural Network (GNN) is crucial for enhancing network transparency. However, real-world networks are predominantly heterogeneous and often beset with noise. The presence of intricate relationships in heterogeneous graphs necessitates a consideration of semantics during the explanation process, while mitigating the impact of noise remains unexplored. For GNN explainers heavily reliant on graph structure and raw features, erroneous predictions may lead to misguided explanations under the influence of noise. To address these challenges, we propose a Robust Heterogeneous Graph Neural Network Explainer with Graph Information Bottleneck, named RHGIB. We theoretically analyze the power of different heterogeneous GNN architectures on the propagation of noise information and exploit denoising variational inference. Specifically, we infer the latent distributions of both graph structure and features to alleviate the influence of noise. Subsequently, we incorporate heterogeneous edge types into the generation process of explanatory subgraph and utilize Graph Information Bottleneck framework for optimization, allowing the Explainer to learn heterogeneous semantics while enhancing robustness. Extensive experiments on multiple real-world heterogeneous graph datasets demonstrate the superior performance of RHGIB compared to state-of-the-art baselines.

3580KnowHalu: Multi-Form Knowledge Enhanced Hallucination Detection

[openreview] [pdf]

Abstract As large language models (LLMs) become increasingly integral to a wide array of applications, ensuring the factual accuracy of their outputs and mitigating hallucinations is paramount. Current approaches, which primarily rely on self-consistency checks or post-hoc fact-checking, often fall short by disregarding the nuanced structure of queries and the diverse forms of contextual knowledge required for accurate response generation. To address these shortcomings, we introduce KnowHalu (pronounced “No Halu”), the first multi-form knowledge-based hallucination detection framework. We also introduce a new category of hallucinations, off-target hallucinations, which occur when responses are factually accurate but irrelevant or nonspecific to the query (e.g., answering “What’s the primary language in Barcelona?” with “European language”). In particular, KnowHalu employs a rigorous two-phase process to detect hallucinations. In the first phase, it isolates off-target hallucinations by analyzing the semantic alignment between the response and the query. In the second phase, it conducts a novel multi-form knowledge-based fact-checking through a comprehensive pipeline of reasoning and query decomposition, knowledge retrieval, knowledge form optimization, judgment generation, and judgment aggregation. Extensive evaluations demonstrate that KnowHalu significantly surpasses state-of-the-art (SOTA) baselines across diverse tasks, achieving over 15% improvement in question answering (QA) and 6% in summarization tasks when applied to the same underlying LLM. These results underscore the effectiveness and versatility of KnowHalu, setting a new benchmark for hallucination detection and paving the way for safer and more reliable LLM applications.

3581Question-Aware Knowledge Graph Prompting for Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated significant advancements in various natural language processing tasks, yet they often struggle with tasks that require external domain-specific knowledge, such as Multiple Choice Question Answering (MCQA). Integrating Knowledge Graphs (KGs) with LLMs has been explored as a solution to enhance LLMs’ reasoning capabilities, while existing methods either involve computationally expensive finetuning processes or rely on the noisy retrieval of KG information. Recent efforts have focused on leveraging Graph Neural Networks (GNNs) to generate KG-based soft prompts for LLMs, which face challenges of lacking question-relevance assessment in GNN and utilization of relations among options. In this paper, we propose a novel approach, QAP, to address these challenges by optimizing the utilization of KG in MCQA tasks. Our method introduces question embeddings into the GNN aggregation process, enabling the model to assess the relevance of KG information based on the question context. Additionally, QAP facilitates inter-option interactions by employing an attention module that explicitly models relationships between answer options. Specifically, we use multiple attention heads for the GNN output, allowing the model to capture and compare features across different options, thereby enhancing cross-option reasoning. Our approach not only enhances the connection between GNNs and LLMs but also enables the model to better utilize the relationships between answer options. Experimental results demonstrate that QAP outperforms state-of-the-art models on multiple public MCQA datasets, validating its effectiveness and scalability.

3582Adversarially Robust Anomaly Detection through Spurious Negative Pair Mitigation

[openreview] [pdf]

Abstract Despite significant progress in Anomaly Detection (AD), the robustness of existing detection methods against adversarial attacks remains a challenge, compromising their reliability in critical real-world applications such as autonomous driving. This issue primarily arises from the AD setup, which assumes that training data is limited to a group of unlabeled normal samples, making the detectors vulnerable to adversarial anomaly samples during testing. Additionally, implementing adversarial training as a safeguard encounters difficulties, such as formulating an effective objective function without access to labels. An ideal objective function for adversarial training in AD should promote strong perturbations both within and between the normal and anomaly groups to maximize margin between normal and anomaly distribution. To address these issues, we first propose crafting a pseudo-anomaly group derived from normal group samples. Then, we demonstrate that adversarial training with contrastive loss could serve as an ideal objective function, as it creates both inter- and intra-group perturbations. However, we notice that spurious negative pairs compromise the conventional contrastive loss for achieving robust AD. Spurious negative pairs are those that should be mapped closely but are erroneously separated. These pairs introduce noise and misguide the direction of inter-group adversarial perturbations. To overcome the effect of spurious negative pairs, we define opposite pairs and adversarially pull them apart to strengthen inter-group perturbations. Experimental results demonstrate our superior performance in both clean and adversarial scenarios, with a 26.1% improvement in robust detection across various challenging benchmark datasets.

3583Aya in Action: An Investigation of its Abilities in Aspect-Based Sentiment Analysis, Hate Speech Detection, Irony Detection, and Question-Answering

[openreview] [pdf]

Abstract While resource-rich languages such as English and Mandarin drive considerable advancements, low-resource languages face challenges due to the scarcity of substantial digital and annotated linguistic resources. Within this context, in 2024, Aya was introduced, a multilingual generative language model supporting 101 languages, over half of which are lower-resourced. This study aims to assess Aya’s performance in tasks such as Aspect-Based Sentiment Analysis, Hate Speech Detection, Irony Detection, and Question-Answering, using a few-shot methodology in Brazilian Portuguese. The objective is to evaluate Aya’s effectiveness in these tasks without fine-tuning the pre-trained model, thereby exploring its potential to improve the quality and accuracy of outputs in various natural language understanding tasks. Results indicate that while Aya performs well in certain tasks like Question-Answering, where it surpassed Portuguese-specific models with an Exact Match score of 58.79%, it struggles in others. For the Hate Speech Detection task, Aya’s F1-score of 0.64 was significantly lower than the 0.94 achieved by the Sabiá-7B model. Additionally, the model’s performance on the Aspect-Based Sentiment Analysis task improved considerably when neutral examples were excluded, but its handling of complex slang and context-dependent features in other tasks remained challenging. These results suggest that multilingual models like Aya can perform competitively in some contexts but may require further tuning to match the effectiveness of models specifically trained for Portuguese.

3584Sketched Adaptive Federated Deep Learning: A Sharp Convergence Analysis

[openreview] [pdf]

Abstract Combining gradient sketching methods (e.g., CountSketch, quantization) and adaptive optimizers (e.g., Adam, AMSGrad) is a desirable goal in federated learning (FL), with potential benefits on both fewer communication rounds and smaller per-round communication. In spite of the preliminary empirical success of sketched adaptive methods, existing convergence analyses show the communication cost to have a linear dependence on the ambient dimension, i.e., number of parameters, which is prohibitively high for modern deep learning models.In this work, we introduce specific sketched adaptive federated learning (SAFL) algorithms and, as our main contribution, provide theoretical convergence analyses in different FL settings with guarantees on communication cost depending only logarithmically (instead of linearly) on the ambient dimension. Unlike existing analyses, we show that the entry-wise sketching noise existent in the preconditioners and the first moments of SAFL can be implicitly addressed by leveraging the recently-popularized anisotropic curvatures in deep learning losses, e.g., fast decaying loss Hessian eigen-values. In the i.i.d. client setting of FL, we show that SAFL achieves O(1/T)O(1/\sqrt{T}) convergence, and O(1/T)O(1/T) convergence near initialization. In the non-i.i.d. client setting, where non-adaptive methods lack convergence guarantees, we show that SACFL (SAFL with clipping) algorithms can provably converge in spite of the additional heavy-tailed noise. Our theoretical claims are supported by empirical studies on vision and language tasks, and in both fine-tuning and training-from-scratch regimes. Surprisingly, as a by-product of our analysis, the proposed SAFL methods are competitive with the state-of-the-art communication-efficient federated learning algorithms based on error feedback.

3585Individualized Private Graph Neural Network via Node Influence-based Noise Adaptation

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) with Differential Privacy (DP) guarantees have been proposed to preserve privacy when nodes contain sensitive information that needs to be kept private but is critical for training. Existing methods deploy a fixed uniform noise generation mechanism that lacks the flexibility to adjust between nodes, leading to increasing the risk of graph information leakage and decreasing the model’s overall performance. To address the above challenges, we propose NIP-GNN, a Node-level Individual Private GNN with DP guarantee based on the adaptive perturbation over sensitive components to safeguard node information. First, we propose a Topology-based Node Influence Estimation (TNIE) method to infer unknown node influence with neighborhood and centrality awareness. Second, given the obtained node influence rank, an adaptive private aggregation method is proposed to perturb neighborhood embeddings directed by node-wise influence. Third, we propose to privately train the graph learning algorithm over perturbed aggregations in adaptive residual connection mode over multi-layer convolution for node-wise tasks. Theoretically, analysis ensures that NIP-GNN satisfies DP guarantee. Empirical experiments over real-world graph datasets show that NIP-GNN presents a better resistance over node inference attacks and achieves a better trade-off between privacy and accuracy.

3586Theoretical Convergence Analysis for Hilbert Space MCMC with Score-based Priors for Nonlinear Bayesian Inverse Problems

[openreview] [pdf]

Abstract In recent years, several works have explored the use of score-based generative models as expressive priors in Markov chain Monte Carlo (MCMC) algorithms for provable posterior sampling, even in the challenging case of nonlinear Bayesian inverse problems. However, these approaches have been mostly limited to finite-dimensional approximations, while the original problems are typically defined in function spaces of infinite dimension. It is well known that algorithms designed for finite-dimensional settings can encounter theoretical and practical issues when applied to infinite-dimensional objects, such as an inconsistent behavior across different discretizations. In this work, we address this limitation by leveraging the recently developed framework for score-based generative models in Hilbert spaces to learn an infinite-dimensional score, which we use as a prior in a function-space Langevin-type MCMC algorithm, providing theoretical guarantees for convergence in the context of nonlinear Bayesian inverse problems. Crucially, we prove that controlling the approximation error of the score is not only essential for ensuring convergence but also that modifying the standard score-based Langevin MCMC through the selection of an appropriate preconditioner is necessary. Our analysis shows how the control over the score approximation error influences the design of the preconditioner---an aspect unique to the infinite-dimensional setting.

3587Linear Mode Connectivity in Differentiable Tree Ensembles

[openreview] [pdf]

Abstract Linear Mode Connectivity (LMC) refers to the phenomenon that performance remains consistent for linearly interpolated models in the parameter space. For independently optimized model pairs from different random initializations, achieving LMC is considered crucial for understanding the stable success of the non-convex optimization in modern machine learning models and for facilitating practical parameter-based operations such as model merging. While LMC has been achieved for neural networks by considering the permutation invariance of neurons in each hidden layer, its attainment for other models remains an open question. In this paper, we first achieve LMC for soft tree ensembles, which are tree-based differentiable models extensively used in practice. We show the necessity of incorporating two invariances: subtree flip invariance and splitting order invariance, which do not exist in neural networks but are inherent to tree architectures, in addition to permutation invariance of trees. Moreover, we demonstrate that it is even possible to exclude such additional invariances while keeping LMC by designing decision list-based tree architectures, where such invariances do not exist by definition. Our findings indicate the significance of accounting for architecture-specific invariances in achieving LMC.

3588Stochastically Capturing Partial Relationship among Features for Multivariate Forecasting

[openreview] [pdf]

Abstract When tackling forecasting problems that involve multiple time-series features, existing methods for capturing inter-feature information typically fall into three categories: complete-multivariate, partial-multivariate, and univariate. Complete-multivariate methods compute relationships among the entire set of features, whereas univariate cases ignore inter-feature information altogether. In contrast to these two, partial-multivariate methods group features into clusters and capture inter-feature relationships within each cluster. However, existing partial-multivariate methods deal only with specific cases where there is a single way of grouping so once the grouping way is selected, it remains unchanged. Therefore, we introduce a generalized version of partial-multivariate methods where grouping ways are sampled stochastically (called stochastic partial-multivariate methods), which can incorporate the deterministic cases using Dirac delta distributions. We propose SPMformer, a Transformer-based stochastic partial-multivariate model, with its training algorithm. We demonstrate that SPMformer outperforms various complete-multivariate, deterministic partial-multivariate, and univariate models in various forecasting tasks (long-term, short-term, and probabilistic forecasting), providing a theoretical rationale and empirical analysis for its superiority. Additionally, by proposing an inference method leveraging the inherent stochasticity in SPMformer, the forecasting accuracy is further enhanced. Finally, we highlight other advantages of SPMformer: efficiency and robustness under missing features.

3589IMPROVING FLOW FIELD PREDICTION OF COMPLEX GEOMETRIES USING SIMPLE GEOMETRIES

[openreview] [pdf]

Abstract In this study, we address the challenge of computationally expensive simulations of complex geometries, which are crucial for modern engineering design processes. While neural network-based flow field predictions have been suggested, prior studies generally exclude complex geometries. Our objective is to enhance flow predictions around complex geometries, which may often be deconstructed into multiple single, simple bodies, by leveraging existing data on these simple geometry flow fields. Using a case study of tandem-airfoils, we introduce a method employing the directional integrated distance representation for multiple objects, a residual pre-training scheme based on the freestream condition as a physical prior, and a residual training scheme utilising smooth combinations of single airfoil flow fields, also capitalising on the freestream condition. To optimise memory usage during training in large domains and improve prediction performance, we decom- pose simulation domains into smaller sub-domains, each processed by a different network. Extensive experiments on four new tandem-airfoil datasets, comprising over 2000 fluid simulations, demonstrate that our proposed method and techniques effectively enhance tandem-airfoil prediction accuracy by up to 96%.

3590RECAST: Reparameterized, Compact weight Adaptation for Sequential Tasks

[openreview] [pdf]

Abstract Incremental learning aims to adapt to new sets of categories over time with minimal computational overhead. Prior work often addresses this task by training efficient task-specific adaptors that modify frozen layer weights or features to capture relevant information without affecting predictions on any previously learned categories. While these adaptors are generally more efficient than finetuning the entire network, they still can require tens to hundreds of thousands task-specific trainable parameters even for relatively small networks, making it challenging to operate on resource-constrained environments with high communication costs like edge devices or mobile phones. Thus, we propose Reparameterized, Compact weight Adaptation for Sequential Tasks (RECAST), a novel method that dramatically reduces the number of task-specific trainable parameters to fewer than 50 – several orders of magnitude less than competing methods like LoRA. RECAST accomplishes this efficiency by learning to decompose layer weights into a soft parameter-sharing framework consisting of a set of shared weight templates and very few module-specific scaling factors or coefficients. This soft parameter-sharing framework allows for effective task-wise reparameterization by tuning only these coefficients while keeping templates frozen. A key innovation of RECAST is the novel weight reconstruction pipeline called Neural Mimicry, which eliminates the need for pretraining from scratch. This allows for high-fidelity emulation of existing pretrained weights within our framework and provides quick adaptability to any model scale and architecture. Extensive experiments across six diverse datasets demonstrate RECAST outperforms the state-of-the-art by up to 3% across various scales, architectures, and parameter spaces. Moreover, we show that RECAST’s architecture-agnostic nature allows for seamless integration with existing methods, further boosting performance.

3591Bi-perspective Splitting Defense: Achieving Clean-Data-Free Backdoor Security

[openreview] [pdf]

Abstract Backdoor attacks have seriously threatened deep neural networks (DNNs) by embedding concealed vulnerabilities through data poisoning. To counteract these attacks, training benign models from poisoned data garnered considerable interest from researchers. High-performing defenses often rely on additional clean subsets, which is untenable due to increasing privacy concerns and data scarcity. In the absence of clean subsets, defenders resort to complex feature extraction and analysis, resulting in excessive overhead and compromised performance. In the face of these challenges, we identify the key lies in sufficient utilization of the easier-to-obtain target labels and excavation of clean hard samples. In this work, we propose a Bi-perspective Splitting Defense (BSD). BSD splits the dataset using both semantic and loss statistics characteristics through open set recognition-based splitting (OSS) and altruistic model-based data splitting (ALS) respectively, achieving good clean pool initialization. BSD further introduces class completion and selective dropping strategies in the subsequent pool updates to avoid potential class underfitting and backdoor overfitting caused by loss-guided split. Through extensive experiments on 3 benchmark datasets and against 7 representative attacks, we empirically demonstrate that our BSD is robust across various attack settings. Specifically, BSD has an average improvement in Defense Effectiveness Rating (DER) by 16.29% compared to 5 state-of-the-art defenses, achieving clean-data-free backdoor security with minimal compromise in both Clean Accuracy (CA) and Attack Success Rate (ASR).

3592Action-Constrained Imitation Learning

[openreview] [pdf]

Abstract Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment\textit{trajectory alignment} and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency.

3593Learning Latent Graph Structures and their Uncertainty

[openreview] [pdf]

Abstract Graph neural networks use relational information as an inductive bias to enhance prediction performance. Not rarely, task-relevant relations are unknown and graph structure learning approaches have been proposed to learn them from data. Given their latent nature, no graph observations are available to provide a direct training signal to the learnable relations. Therefore, graph topologies are typically learned on the prediction task alongside the other graph neural network parameters. In this paper, we demonstrate that minimizing point-prediction losses does not guarantee proper learning of the latent relational information and its associated uncertainty. Conversely, we prove that suitable loss functions on the stochastic model outputs simultaneously grant solving two tasks: (i) learning the unknown distribution of the latent graph and (ii) achieving optimal predictions of the model output. Finally, we propose a sampling-based method that solves this joint learning task. Empirical results validate our theoretical claims and demonstrate the effectiveness of the proposed approach.

3594Practical and Rigorous Extremal Bounds for Gaussian Process Regression via Chaining

[openreview] [pdf]

Abstract Gaussian Process Regression (GPR) is a popular nonparametric regression method based on Bayesian principles that, unlike most machine learning techniques, provides uncertainty estimates for its predictions. Recent research has focused on robustness to model misspecification but has neglected improvements to the underlying methods for computing bounds. Inspired by the chaining method, we applied it to the prediction intervals of GPR. This work addresses the limitations of current GPR methods, which rely heavily on scaling posterior standard deviations and assume well-specified models, limiting their adaptability and accuracy. We propose a novel chain-based approach that decomposes the problem into smaller, refined stages, enabling better error control and enhanced robustness, particularly in challenging scenarios. Additionally, we innovate by providing tighter, practical and theoretically sound bounds for commonly used kernels, including RBF and Matérn, improving both their theoretical understanding and practical utility. Numerical experiments validate our theoretical findings, demonstrating that our method outperforms existing approaches on synthetic and real-world datasets.

3595FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs

[openreview] [pdf]

Abstract Language models (LMs) hallucinate. We inquire: Can we detect and mitigate hallucinations before they happen? This work answers this research question in the positive, by showing that the internal representations of LMs provide rich signals that can be used for this purpose. We introduce FactCheckMate, which preemptively detects hallucinations by learning a classifier that predicts whether the LM will hallucinate, based on the model’s hidden states produced over the inputs, before decoding begins. If a hallucination is detected, FactCheckMate then intervenes, by adjusting the LM’s hidden states such that the model will produce more factual outputs. FactCheckMate provides fresh insights that the inner workings of LMs can be revealed by their hidden states. Practically, both the detection and mitigation models in FactCheckMate are lightweight, adding little inference overhead; FactCheckMate proves a more efficient approach for mitigating hallucinations compared to many post-hoc alternatives. We evaluate FactCheckMate over LMs of different scales and model families (including Llama, Mistral, and Gemma), across a variety of QA datasets from different domains. Our results demonstrate the effectiveness of leveraging internal representations for early hallucination detection and mitigation, achieving over 70% preemptive detection accuracy. On average, outputs generated by LMs with intervention are 34.4% more factual compared to those without intervention. The average overhead difference in the inference time introduced by FactCheckMate is around 3.16 seconds.

3596Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller

[openreview] [pdf]

Abstract We propose SelfControlSelfControl, an inference-time model control method utilizing gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a desired behavior expressed in a natural language suffix string concatenated to the input prompt, SelfControlSelfControl computes gradients of the LLM’s self-evaluation of the suffix with respect to its latent representations. The gradients are used to directly control the auto-regressive generation process towards desired behaviors, which eliminates human supervision, achieves precise and transparent control, and offers on-the-fly adaptability. To further enhance efficiency, we introduce SelfControlprefixSelfControl_{prefix}, a compact module that encapsulates the learned representations from gradients into a \pc, facilitating efficient inference-time control with no latency compared to the original model and allowing control for multiple behaviors simultaneously. Our experiments demonstrate SelfControlSelfControl’s efficacy across multiple domains, where it improves over SOTA for8.3%in detoxification,3.1%in truthfulness enhancement,4%\textasciitilde10%in controlling on emotion tones, and48.2%in privacy protection, i.e., completely remove privacy leakage issue.

3597Dynamic Interference Modeling For Estimating Treatment Effects From Dynamic Graphs

[openreview] [pdf]

Abstract Estimating treatment effects can assist decision-making in various areas, such as commerce and medicine. One application of the treatment effect estimation is to predict the effect of an advertisement on the purchase result of a customer, known as individual treatment effect (ITE). In online websites, the outcome of an individual can be affected by treatments of other individuals, as people often propagate information with their friends, a phenomenon referred to as interference. Prior studies have attempted to model interference for accurate ITE estimation under a static network among individuals. However, the network usually changes over time in real-world applications due to complex social activities among individuals. For instance, an individual can follow another individual on one day and unfollow this individual afterward on an online social website. In this case, the outcomes of individuals can be interfered with not only by treatments for current neighbors but also by past information and treatments for past neighbors, which we refer to as \emph{dynamic interference}. In this work, we model dynamic interference for the first time by developing an architecture to aggregate both the past information of individuals and their neighbors. Specifically, our proposed method contains a mechanism that summarizes historical information of individuals from previous time stamps, graph neural networks that propagate information about individuals within every time stamp, and a weighting mechanism that estimates the importance of different time stamps. Moreover, the model parameters should gradually change rather than drastically because information of every individual gradually changes over time. To take it into account, we also propose a variant of our method to evolve the model parameters over time with long short-term memory. In our experiments on multiple datasets with dynamic interference, our methods outperform existing methods for ITE estimation because they are unable to capture dynamic interference. This result corroborates the importance of dynamic interference modeling.

3598TSI-Bench: Benchmarking Time Series Imputation

[openreview] [pdf]

Abstract Effective imputation is a crucial preprocessing step for time series analysis. Despite the development of numerous deep learning algorithms for time series imputation, the community lacks standardized and comprehensive benchmark platforms to effectively evaluate imputation performance across different settings. Moreover, although many deep learning forecasting algorithms have demonstrated excellent performance, whether their modeling achievements can be transferred to time series imputation tasks remains unexplored. To bridge these gaps, we develop TSI-Bench, the first (to our knowledge) comprehensive benchmark suite for time series imputation utilizing deep learning techniques. The TSI-Bench pipeline standardizes experimental settings to enable fair evaluation of imputation algorithms and identification of meaningful insights into the influence of domain-appropriate missing rates and patterns on model performance. Furthermore, TSI-Bench innovatively provides a systematic paradigm to tailor time series forecasting algorithms for imputation purposes. Our extensive study across 34,804 experiments, 28 algorithms, and 8 datasets with diverse missingness scenarios demonstrates TSI-Bench’s effectiveness in diverse downstream tasks and potential to unlock future directions in time series imputation research and analysis. All source code and experiment logs are released.

3599Exact Community Recovery under Side Information: Optimality of Spectral Algorithms

[openreview] [pdf]

Abstract We study the problem of exact community recovery in general, two-community block models, in the presence of node-attributedside information. We allow for a very general side information channel for node attributes, and for pairwise (edge) observations, consider both Bernoulli and Gaussian matrix models, capturing the Stochastic Block Model, Submatrix Localization, and Z2\mathbb{Z}_2-Synchronization as special cases. A recent work of Dreveton et al. 2024 characterized the information-theoretic limit of a very general exact recovery problem with side information. In this paper, we show algorithmic achievability in the above important cases by designing a simple but optimal spectral algorithm that incorporates side information (when present) along with the eigenvectors of the pairwise observation matrix. Using the powerful tool of entrywise eigenvector analysis [Abbe et al. 2020], we show that our spectral algorithm can mimic the so calledgenie-aided estimators, where the ithi^{\mathrm{th}} genie-aided estimator optimally computes the estimate of the ithi^{\mathrm{th}} label, when all remaining labels are revealed by a genie. This perspective provides a unified understanding of the optimality of spectral algorithms for various exact recovery problems in a recent line of work.

3600RAC-LoRA: A Theoretical Optimization Framework for Low-Rank Adaptation

[openreview] [pdf]

Abstract Fine-tuning has become a popular approach to adapting large foundational models to specific tasks. As the size of models and datasets grows, parameter-efficient fine-tuning techniques are increasingly important. One of the most widely used methods is Low-Rank Adaptation (LoRA), with adaptation update expressed as the product of two low-rank matrices. While LoRA was shown to possess strong performance in fine-tuning, it often underperforms when compared to full-parameter fine-tuning (FPFT). Although many variants of LoRA have been extensively studied empirically, their theoretical optimization analysis is heavily under-explored. The starting point of our work is a demonstration that LoRA and its two extensions, Asymmetric LoRA and Chain of LoRA, indeed encounter convergence issues. To address these issues, we propose a general optimization framework that rigorously analyzes the convergence rates of LoRA-based methods. Our approach inherits the empirical benefits of LoRA-style heuristics, but introduces several small but important algorithmic modifications which turn it into a provably convergent method. Our framework serves as a bridge between FPFT and low-rank adaptation. We provide provable guarantees of convergence to the same solution as FPFT, along with the rate of convergence. Additionally, we present a convergence analysis for smooth, non-convex loss functions, covering gradient descent, stochastic gradient descent, and federated learning settings. Our theoretical findings are supported by experimental results.

3601MisAttributionLLM: Integrating Error Attribution Capability into LLM Evaluation

[openreview] [pdf]

Abstract With the widespread application of Large Language Models (LLMs) in various tasks, evaluating the performance of LLMs becomes an essential research topic. However, existing judge models lack the specific capability required for error attribution (i.e., identify the types of error made in responses). In this work, we first establish a comprehensive Misattribution Framework with 9 primary and 19 secondary categories, which are intended to facilitate in-depth analysis and enhance the performance of LLMs. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattributions, along with the corresponding scores and feedback. We also propose MisAttributionLLM, a fine-tuned model on AttriData, which is the first open-source, general-purpose judge model with error attribution capability which provides valuable insights into the model’s weaknesses and enables targeted improvements. Experimental results show that MisAttributionLLM achieves the highest Pearson correlation with human evaluators among 8 open-source and closed-source LLMs. Furthermore, MisAttributionLLM also obtains the highest accuracy and micro-F1 in the performance of error attribution. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.

3602Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

[openreview] [pdf]

Abstract Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for an actor and for a critic. We focus our study on understanding whether an actor and a critic will benefit from a decoupled, rather than shared, representation. Our primary finding is that when decoupled, the representations for the actor and critic systematically specialise in extracting different types of information from the environment---the actor’s representation tends to focus on action-relevant information, while the critic’s representation specialises in encoding value and dynamics information. Finally, we demonstrate how these insights help select representation learning objectives that play into the actor’s and critic’s respective knowledge specialisations, and improve performance in terms of agent returns.

3603Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics

[openreview] [pdf]

Abstract We study computationally and statistically efficient Reinforcement Learning algorithms for thelinear Bellman Completesetting, a setting that uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDP) and Linear Quadratic Regulators (LQR). While it is known from the prior works that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards but relies on the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least square regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to only act in the null space of the training data to ensure optimism while circumventing a subtle error amplification issue.

3604Iterative Dual-RL: An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning

[openreview] [pdf]

Abstract We introduce Iterative Dual Reinforcement Learning (IDRL), a new method that takes an optimal discriminator-weighted imitation view of solving RL. Our method is motivated by a simple experiment in which we find training a discriminator using the offline dataset plus an additional expert dataset and then performing discriminator-weighted behavior cloning gives strong results on various types of datasets. That optimal discriminator weight is quite similar to the learned visitation distribution ratio in Dual-RL, however, we find that current Dual-RL methods do not correctly estimate that ratio. In IDRL, we propose a correction method to iteratively approach the optimal visitation distribution ratio in the offline dataset given no addtional expert dataset. During each iteration, IDRL removes zero-weight suboptimal transitions using the learned ratio from the previous iteration and runs Dual-RL on the remaining subdataset. This can be seen as replacing the behavior visitation distribution with the optimized visitation distribution from the previous iteration, which theoretically gives a curriculum of improved visitation distribution ratios that are closer to the optimal discriminator weight. We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in terms of both performance and stability, on all datasets.

3605MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More

[openreview] [pdf]

Abstract Mixture-of-Experts large language models (MoE-LLMs) marks a significant step forward of language models, however, they encounter two critical challenges in practice: 1) expert parameters lead to considerable memory consumption and loading latency; and 2) the current activated experts are redundant, as many tokens may only require a single expert. Motivated by these issues, we investigate the MoE-LLMs and make two key observations: a) different experts exhibit varying behaviors on activation reconstruction error, routing scores, and activated frequencies, highlighting their differing importance, and b) not all tokens are equally important-- only a small subset is critical. Building on these insights, we propose MC-MoE, a training-free Mixture-Compressor for MoE-LLMs, which leverages the significance of both experts and tokens to achieve an extreme compression. First, to mitigate storage and loading overheads, we introduce Pre-Loading Mixed-Precision Quantization (PMQ), which formulates the adaptive bit-width allocation as a Linear Programming (LP) problem, where the objective function balances multi-factors reflecting the importance of each expert. Additionally, we develop Online Dynamic Pruning (ODP), which identifies important tokens to retain and dynamically select activated experts for other tokens during inference to optimize efficiency while maintaining performance. Our MC-MoE integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss, ensuring an optimal trade-off between performance and efficiency Extensive experiments confirm the effectiveness of our approach. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss. During dynamic inference, we further reduce activated parameters by 15%, with a performance drop of less than 0.6%. Remarkably, MC-MoE even surpasses floating-point 13b dense LLMs with significantly smaller parameter sizes, suggesting that mixture compression in MoE-LLMs has the potential to outperform both comparable and larger dense LLMs.

3606Emergent Orientation Maps —— Mechanisms, Coding Efficiency and Robustness

[openreview] [pdf]

Abstract Extensive experimental studies have established that in less visually advanced animals, the neuronal preference for input orientation in the primary visual cortex (V1) is organized in a disordered fashion (known as salt-and-pepper organizations). In contrast, in visually advanced animals, the orientation preference varies continuously across V1, forming pinwheel-like structures. However, the mechanisms underlying the emergence of these two seemingly distinctive structures are not fully understood, and their differential influences on visual encoding remain largely unexplored. To address these questions, we introduce a self-evolving spiking neural network model with plasticity that can reproduce those emergent structures in several representative animals by incorporating data on retinotopy, neuronal morphology, and connectivity for each species. We show that the salt-and-pepper organizations and pinwheel structures actually sit at the two ends of the same spectrum using a metric that involves the overlap of receptive fields and neuronal density. We also find the same mechanisms account for the formation of both structures through local recurrent connections guided by Hebbian-like learning rules. Next, we show functionally that pinwheel structures exhibit lower wiring costs and higher encoding efficiency than salt-and-pepper organizations. Finally, pinwheel structures exhibit sparse coding and greater robustness against noise in natural stimuli. These functional advantages may inspire the deep learning community to revisit the possibility of recurrent connectivity within each layer for higher coding efficiency and robustness.

3607UnoLoRA: Single Low-Rank Adaptation for Efficient Multitask Fine-tuning

[openreview] [pdf]

Abstract Recent research has demonstrated the efficacy of Low-Rank Adaptation (LoRA) as an effective implicit regularizer for large language models. Building on these findings, we investigate whether LoRA can be leveraged for efficient multi-task learning. This study presents experimental observations on utilizing a single LoRA module for multiple tasks in the fine-tuning of large language models. We introduce UnoLoRA*, a novel method for multi-task finetuning, which significantly reduces trainable parameters to just 0.05% per task. Our approach not only uncovers insights into low-rank representations and multitask generalization but also explores LoRA’s capacity to capture task-agnostic knowledge. Our findings affirm that sharing a single LoRA adapter effectively boosts parameter efficiency while ensuring that it learns a more general representation, even as it yields a competitive performance.

3608Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting

[openreview] [pdf]

Abstract Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context. This contrasts with in-weights learning (IWL), where memorized information is encoded in model parameters from iterated observations of the data (e.g., common sayings). An ideal model should be able to maintain both of these abilities. Despite their apparent ability to learn in-context, language models are known to struggle when faced with unseen or rarely seen tokens (Land & Bartolo, 2024). Hence, we study structural in-context learning\textbf{structural in-context learning}, which we define as the ability of a model to execute in-context learning on arbitrary novel tokens – so called because the model must generalize on the basis of e.g. sentence structure or task structure, rather than content encoded in token embeddings. We study structural in-context algorithms on both synthetic and natural tasks using both toy models and MultiBERT models (Sellam et al., 2021). We find that structural ICL appears before quickly disappearing early in LM pretraining. While it has been shown that ICL can diminish during training (Singh et al., 2023), we find that prior work does not account for structural ICL. Building on the Chen et al. (2024) 's active forgetting method used to help models learn new languages, we introduce a pretraining method that can modulate the preference for true structural ICL and IWL. Importantly, this allows us to induce a dual process strategy\textit{dual process strategy} where in-context and in-weights solutions coexist within a single model.

3609Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency

[openreview] [pdf]

Abstract Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximation guarantees and the rigid requirement for a fully connected communication graph. To address these challenges, we firstly present a MA-OSMA\textbf{MA-OSMA} algorithm, which employs the multi-linear extension to transfer the discrete submodular maximization problem into a continuous optimization, thereby allowing us to reduce the strict dependence on a complete graph through consensus techniques. Moreover, MA-OSMA\textbf{MA-OSMA} leverages a novel surrogate gradient to avoid sub-optimal stationary points. To eliminate the computationally intensive projection operations in MA-OSMA\textbf{MA-OSMA}, we also introduce a projection-free MA-OSEA\textbf{MA-OSEA} algorithm skillfully harnessing the KL divergence by mixing a uniform distribution. Theoretically, we confirm that both algorithms achieve a regret bound of O~(CTT1β)\widetilde{O}(\sqrt{\frac{C_{T}T}{1-\beta}}) against a  (1ecc)(\frac{1-e^{-c}}{c})-approximation to the best comparator in hindsight, where CTC_{T} is the deviation of maximizer sequence, β\beta is the spectral gap of the network and cc is the joint curvature of submodular objectives. This result significantly improves the (11+c)(\frac{1}{1+c})-approximation provided by the state-of-the-art OSG algorithm. Finally, we demonstrate the effectiveness of our proposed algorithms through simulation-based multi-target tracking.

3610Adversarial Masked Autoencoder Purifier with Defense Transferability

[openreview] [pdf]

Abstract The study of adversarial defense still struggles to combat with advanced adversarial attacks. In contrast to most prior studies that rely on the diffusion model for test-time defense to remarkably increase the inference time, we propose Masked AutoEncoder Purifier (MAEP), which integrates Masked AutoEncoder (MAE) into an adversarial purifier framework for test-time purification. While MAEP achieves promising adversarial robustness, it particularly features model defense transferability without relying on using additional data that is different from the training dataset. To our knowledge, MAEP is the first study of adversarial purifier based on masked autoencoder. Extensive experiments validate the proposed method. Notably, MAEP trained on CIFAR10 achieves state-of-the-art performance even when tested directly on ImageNet, outperforming existing diffusion-based models trained specifically on ImageNet.

3611Sample Efficient Robust Offline Self-Play for Model-based Reinforcement Learning

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL), as a thriving field, explores how multiple agents independently make decisions in a shared dynamic environment. Due to environmental uncertainties and fluctuations, policies in MARL must remain robust to tackle the sim-to-real gap. Although robust RL has been extensively explored in single-agent settings, it has seldom received attention in self-play, where strategic interactions heighten uncertainties. We focus on robust two-player zero-sum Markov games (TZMGs) in offline RL, specifically on tabular robust TZMGs (RTZMGs) with a given uncertainty set. To address sample scarcity, we introduce a model-based algorithm (RTZ-VI-LCB) for RTZMGs, which integrates robust value iteration considering uncertainty level, applying a data-driven penalty term to the robust value estimates. We establish the finite-sample complexity of RTZ-VI-LCB by accounting for distribution shifts in the historical dataset, without requiring for full state-action space coverage. To the best of our knowledge, we provide the upper bound in RTZMGs, which first achieves optimal sample complexity on the dependency of action spaces. Our algorithm is capable of learning under partial coverage and environmental uncertainty. An information-theoretic lower bound is developed to show that learning RTZMGs is at least as difficult as standard TZMGs when the uncertainty level is sufficiently small. This result confirms the tightness of our upper bound, which is near-optimal for the big uncertainty level, except for the horizon length.

3612Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding

[openreview] [pdf]

Abstract Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps. While prompt-based methods like Chain-of-Thought (CoT) can improve LLM reasoning at inference time, optimizing reasoning capabilities during training remains challenging. We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution and optimizes it via variational approaches. LaTRO enables LLMs to concurrently improve both their reasoning process and ability to evaluate reasoning quality, without requiring external feedback or reward models. We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures. On GSM8K, LaTRO improves zero-shot accuracy by an average of 12.5% over base models and 9.6% over supervised fine-tuning across Phi-3.5-mini, Mistral-7B, and Llama-3.1-8B. Our findings suggest that pre-trained LLMs possess latent reasoning capabilities that can be unlocked and enhanced through our proposed optimization approach in a self-improvement manner.

3613ZeroTS: Zero-shot time series prediction via multi-party data-model interaction

[openreview] [pdf]

Abstract Time series forecasting (TS) is a fundamental task in artificial intelligence, with applications ranging from weather prediction, stock market analysis to electricity demand forecasting. While existing models, particularly large language models (LLMs) tailored for TS, primarily focus on improving accuracy and generalization through pre-training and fine-tuning, zero-shot prediction without task-specific fine-tuning, still remains underexplored. This limitation arises from the restricted scalability and flexibility of current LLMs for TS, which struggle to fully capture the interactions between data and model. In this work, we introduce ZeroTS, a novel approach that bridges open-world knowledge with inherent data regularities by constructing multi-party interactions between data and models. On the data side, we propose a TS-RAG (Retrieval-Augmented Generation for Time Series), which efficiently retrieves both meta and series information, enabling diverse domain-specific time series to be used as prompts. On the model side, we develop a reinforcement learning framework that treats ground-truth as environments, providing error feedback to optimize a smaller model and harnessing the capabilities of LLMs. This allows ZeroTS to incrementally approach inherent data regularities while iteratively refining its outputs. We validate ZeroTS via extensive experiments on zero-shot and long-horizon forecasting. ZeroTS achieves best or second best results with comparative parameters, 1/4 memory and 1/7 inference speed, demonstrating its efficiency and effectiveness. Our results highlight the potential of Data-LLM interactions for zero-shot learning with acceptable parameters, opening new avenues on research of this underexplored area.

3614Instant Transformer Adaption via HyperLoRA

[openreview] [pdf]

Abstract While Foundation Models provide a general tool for rapid content creation, they regularly require task-specific adaptation. Traditionally, this exercise involves careful curation of datasets and repeated fine-tuning of the underlying model. Fine-tuning techniques enable practitioners to adapt foundation models for many new applications but require expensive and lengthy training while being notably sensitive to hyper-parameter choices. To overcome these limitations, we introduce HyperLoRA, a model capable of adapting Large Language Models on the fly---solely based on a natural language description of the target task. HyperLoRA is a hypernetwork trained to construct LoRAs in a single inexpensive forward pass. After training HyperLoRA on a suite of 9 pre-trained LoRA adapters (GSM8K, Arc, etc.), we show that the ad-hoc reconstructed LoRA instances match the performance of task-specific adapters across the corresponding test sets. Furthermore, HyperLoRA can compress hundreds of LoRA instances and zero-shot generalize to entirely unseen tasks. This approach provides a significant step towards democratizing the specialization of foundation models and enables language-based adaptation with minimal compute requirements. Our code and pre-trained checkpoints will be available throughhttps://github.com/AnonymousAuthor/hyperloraandhttps://huggingface.co/upon publication.

3615Lost-in-Distance: Impact of Contextual Proximity on LLM Performance in Graph Tasks

[openreview] [pdf]

Abstract Despite significant advancements, Large Language Models (LLMs) exhibit blind spots that impair their ability to retrieve and process relevant contextual data effectively. We demonstrate that LLM performance in graph tasks with complexities beyond the “needle-in-a-haystack” scenario—where solving the problem requires cross-referencing and reasoning across multiple subproblems jointly—is influenced by the proximity of relevant information within the context, a phenomenon we term “lost-in-distance”. We examine two fundamental graph tasks: identifying common connections between two nodes and assessing similarity among three nodes, and show that the model’s performance in these tasks significantly depends on the relative positioning of common edges. We evaluate three publicly available LLMs—Llama-3-8B, Llama-3-70B, and GPT-4—using various graph encoding techniques that represent graph structures for LLM input. We propose a formulation for the lost-in-distance phenomenon and demonstrate that lost-in-distance and lost-in-the middle phenomenas occur independently. Results indicate that model accuracy can decline by up to 6x as the distance between node connections increases, independent of graph encoding and model size.

3616Simple, Accurate, and Efficient Axis-Aligned Decision Tree Learning

[openreview] [pdf]

Abstract Decision Trees (DTs) are widely used in various domains for their simplicity and interpretability. However, traditional DTs often suffer from low accuracy and reduced robustness because they rely on fixed splits and a greedy approach to decision-making. While recent approaches combining decision trees with optimization seek to balance accuracy, computational efficiency, and interpretability, they still fall short. In this paper, we introduce a novel Probabilistic univariate Decision Tree (ProuDT), a non-greedy, axis-aligned tree that aims to address these challenges and achieve significant improvements. By assigning a single deterministic feature to each decision node, ProuDT ensures univariate splits while preserving the differentiability of soft decision trees for gradient-based optimization. This tree enhances interpretability through transparent feature utilization in decision-making. Additionally, ProuDT simplifies the optimization process and reduces computational cost by avoiding complex parameters. Extensive experiments on tabular datasets demonstrate ProuDT’s superior performance and scalability in binary and multi-class classification tasks.

3617Improving Deep Regression with Tightness

[openreview] [pdf]

Abstract For deep regression, preserving the ordinality of the targets with respect to the feature representation improves performance across various tasks. However, a theoretical explanation for the benefits of ordinality is still lacking. This work reveals that preserving ordinality reduces the conditional entropy H(ZY)H(Z|Y) of representation ZZ conditional on the target YY. However, our findings reveal that typical regression losses do little to reduce H(ZY)H(Z|Y), even though it is vital for generalization performance.With this motivation, we introduce an optimal transport-based regularizer to preserve the similarity relationships of targets in the feature space to reduce H(ZY)H(Z|Y). Additionally, we introduce a simple yet efficient strategy of duplicating the regressor targets, also with the aim of reducing H(ZY)H(Z|Y). Experiments on three real-world regression tasks verify the effectiveness of our strategies to improve deep regression. Code will be released upon paper acceptance.

3618Q-Adapter: Customizing Pre-trained LLMs to New Preferences with Forgetting Mitigation

[openreview] [pdf]

Abstract Large Language Models (LLMs), trained on a large amount of corpus, have demonstrated remarkable abilities. However, it may not be sufficient to directly apply open-source LLMs like Llama to certain real-world scenarios, since most of them are trained for \emph{general} purposes. Thus, the demands for customizing publicly available LLMs emerge, but are currently under-studied. In this work, we consider customizing pre-trained LLMs with new human preferences. Specifically, the LLM should not only meet the new preference but also preserve its original capabilities after customization. Drawing inspiration from the observation that human preference can be expressed as a reward model, we propose to cast LLM customization as optimizing the sum of two reward functions, one of which (denoted as r1r_1) was used to pre-train the LLM while the other (denoted as r2r_2) characterizes the new human preference. The obstacle here is that both reward functions are unknown, making the application of modern reinforcement learning methods infeasible. Thanks to the residual Q-learning framework, we can restore the customized LLM with the pre-trained LLM and the \emph{residual Q-function} without the reward function r1r_1. Moreover, we find that for a fixed pre-trained LLM, the reward function r2r_2 can be derived from the residual Q-function, enabling us to directly learn the residual Q-function from the new human preference data upon the Bradley-Terry model. We name our method Q-Adapter as it introduces an adapter module to approximate the residual Q-function for customizing the pre-trained LLM towards the new preference. Experiments based on the Llama-3.1 model on the DSP dataset and HH-RLHF dataset illustrate the superior effectiveness of Q-Adapter on both retaining existing knowledge and learning new preferences.

3619Forking Paths in Neural Text Generation

[openreview] [pdf]

Abstract Estimating uncertainty in Large Language Models (LLMs) is important for properly evaluating LLMs, and ensuring safety for users. However, prior approaches to uncertainty estimation focus on the final answer in generated text, ignoring intermediate steps that might dramatically impact the outcome. We hypothesize that there exist key forking tokens, such that re-sampling the system at those specific tokens, but not others, leads to very different outcomes. To test this empirically, we develop a novel approach to representing uncertainty dynamics across individual tokens of text generation, and applying statistical models to test our hypothesis. Our approach is highly flexible: it can be applied to any dataset and any LLM, without fine tuning or accessing model weights. We use our method to analyze LLM responses on 7 different tasks across 4 domains, spanning a wide range of typical use cases. We find many examples of forking tokens, including surprising ones such as a space character instead of a colon, suggesting that LLMs are often just a single token away from saying something very different.

3620Active Task Disambiguation with LLMs

[openreview] [pdf]

Abstract Despite the impressive performance of large language models (LLMs) across various benchmarks, their ability to address ambiguously specified problems—frequent in real-world interactions—remains underexplored. To address this gap, we introduce a formal definition of task ambiguity and frame the problem of task disambiguation through the lens of Bayesian Experimental Design. By posing clarifying questions, LLM agents can acquire additional task specifications, progressively narrowing the space of viable solutions and reducing the risk of generating unsatisfactory outputs. Yet, generating effective clarifying questions requires LLM agents to engage in a form of meta-cognitive reasoning, an ability LLMs may presently lack. Our proposed approach of active task disambiguation enables LLM agents to generate targeted questions maximizing the information gain. Effectively, this approach shifts the load from implicit to explicit reasoning about the space of viable solutions. Empirical results demonstrate that this form of question selection leads to more effective task disambiguation in comparison to approaches relying on reasoning solely within the space of questions.

3621LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters

[openreview] [pdf]

Abstract The rapid expansion of large language models (LLMs) has underscored the need for parameter-efficient fine-tuning methods, with LoRA (Low-Rank Adaptation) emerging as a popular solution. Although LoRA reduces the number of trainable parameters, serving multiple (task or user-specific) LoRA modules on top of a base model still creates significant storage challenges. To address this, using theoretical derivation, we introduce LoRA-XS (Low-Rank Adaptation with eXtremely Small number of parameters), a novel low-rank adaptation method that considerably reduces the trainable parameters while showing superior or competitive performance. LoRA-XS achieves this by inserting a small, trainable r×rr \times r weight matrix between frozen low-rank matrices, which are constructed by Singular Value Decomposition (SVD) of the original weight matrix. This lightweight matrix enables fine-tuning with drastically reduced storage requirements, making it feasible to deploy millions of personalized models while minimizing memory overhead. For instance, LoRA-XS achieves a remarkable reduction of trainable parameters by over 100x in 7B models compared to LoRA. Our evaluations across various benchmarks (including GLUE, GSM8K, MATH, and eight commonsense reasoning datasets) demonstrate that LoRA-XS performs competitively or better than LoRA and other recent methods like VeRA while being significantly more parameter efficient. We also provide an extensive ablation study on the importance of singular vectors in transformer weights, shedding light on the underlying mechanisms driving LoRA-XS’s enhanced efficiency. These findings suggest that LoRA-XS is not only a storage-efficient alternative, but also a powerful tool for scaling and personalizing LLMs at unprecedented scales.

3622Identifying latent state transitions in non-linear dynamical systems

[openreview] [pdf]

Abstract This work improves generalization and interpretability of dynamical systems by recovering the underlying lower-dimensional latent states and their time evolution. Previous work on disentangled representation learning within the realm of dynamical systems focused on the latent states, possibly with linear transition approximations. As such, they cannot identify nonlinear transition dynamics, and hence fail to reliably predict complex future behavior. Inspired by the advances in nonlinear ICA, we propose a state-space modeling framework in which we can identify not just the latent states but also the unknown transition function that maps the past states to the present. Our identifiability theory relies on two key assumptions: (i) sufficient variability in the latent noise, and (ii) the bijectivity of the augmented transition function. Drawing from this theory, we introduce a practical algorithm based on variational auto-encoders. We empirically demonstrate that it can (i) recover latent state dynamics with high accuracy, (ii) correspondingly achieve high future prediction accuracy, and (iii) adapt fast to new environments. Additionally, for complex real-world dynamics, (iv) it produces state-of-the-art future prediction results for long horizons, highlighting its usefulness for practical scenarios.

3623Flow Graph Neural Networks

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have become essential for learning from graph-structured data. However, existing GNNs do not consider the conservation law inherent in graphs associated with a flow of physical resources, such as electrical current in power grids or traffic in transportation networks. To address this limitation and enhance the performance on tasks where accurate modeling of resource flows is crucial, we propose Flow Graph Neural Networks (FlowGNNs). This novel GNN framework adapts existing graph attention mechanisms to reflect the conservation of resources by distributing a node’s message among its outgoing edges instead of allowing arbitrary duplication of the node’s information. We further extend this framework to directed acyclic graphs (DAGs), enabling discrimination between non-isomorphic flow graphs that would otherwise be indistinguishable for standard GNNs tailored to DAGs. We validate our approach through extensive experiments on two different flow graph domains—electronic circuits and power grids—and demonstrate that the proposed framework enhances the performance of traditional GNN architectures on both graph-level classification and regression tasks.

3624Distinguishing Ignorance from Error in LLM Hallucinations

[openreview] [pdf]

Abstract Large language models (LLMs) are susceptible to hallucinations---outputs that are ungrounded, factually incorrect, or inconsistent with prior generations. We focus on close-book Question Answering (CBQA), where previous work has not fully addressed the distinction between two possible kinds of hallucinations, namely, whether the model (1) does not hold the correct answer in its parameters or (2) answers incorrectly despite having the required knowledge. We argue that distinguishing these cases is crucial for detecting and mitigating hallucinations. Specifically, case (2) may be mitigated by intervening in the model’s internal computation, as the knowledge resides within the model’s parameters. In contrast, in case (1) there is no parametric knowledge to leverage for mitigation, so it should be addressed by resorting to an external knowledge source or abstaining. To help distinguish between the two cases, we introduce Wrong Answer despite having Correct Knowledge (WACK), an approach for constructing model-specific datasets for the second hallucination type. Our probing experiments indicate that the two kinds of hallucinations are represented differently in the model’s inner states. Next, we show that datasets constructed using WACK exhibit variations across models, demonstrating that even when models share knowledge of certain facts, they still vary in the specific examples that lead to hallucinations. Finally, we show that training a probe on our WACK datasets leads to better hallucination detection of case (2) hallucinations than using the common generic one-size-fits-all datasets.

3625Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

[openreview] [pdf]

Abstract In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduceCavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To our best knowledge, Cavia is the first of its kind that allows the user to precisely specify camera motion while obtaining object motion. To the best of our knowledge, Cavia is the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality.

3626Retrieval Augmented Thought Process for Private Data Handling in Healthcare

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated the strong potential to assist both clinicians and the general public with their extensive medical knowledge. However, their application in healthcare is constrained due to concerns about the privacy of data used in training, which prevents the integration of private and personal information because of security and ethical issues. Moreover, if their capabilities can be enhanced with information retrieval to access up-to-date knowledge, the current integration of LLMs with Information retrieval lacks robustness to imperfect retrieval, which can hinder their effectiveness and even reduce overall performance. In this work, we address this challenge by introducing the Retrieval-Augmented Thought Process (RATP). Given access to external knowledge, RATP formulates the thought generation of LLMs as a multiple-step decision process. To optimise such a thought process, RATP leverages Monte-Carlo Tree Search and learns a proxy reward function that permits cost-efficient inference. On a private dataset of electronic medical records, deliberately excluded from any LLM training set, RATP achieves 35% additional accuracy compared to in-context retrieval-augmented generation for the question-answering task.

3627Remove Symmetries to Control Model Expressivity

[openreview] [pdf]

Abstract When symmetry is present in the loss function, the model is likely to be trapped in a low-capacity state that is sometimes known as a ``collapse." Being trapped in these low-capacity states can be a major obstacle to training across many scenarios where deep learning technology is applied. We first prove two concrete mechanisms through which symmetries lead to reduced capacities and ignored features during training. We then propose a simple and theoretically justified algorithm, \textit{syre}, to remove almost all symmetry-induced low-capacity states in neural networks. The proposed method is shown to improve the training of neural networks in scenarios when this type of entrapment is especially a concern. A remarkable merit of the proposed method is that it is model-agnostic and does not require any knowledge of the symmetry.

3628Selective Unlearning via Representation Erasure Using Adversarial Training

[openreview] [pdf]

Abstract When deploying machine learning models in the real world, we often face the challenge of “unlearning” specific data points or subsets after training. Inspired by Domain-Adversarial Training of Neural Networks (DANN), we propose a novel algorithm,SURE, for targeted unlearning.SURE treats the process as a domain adaptation problem, where the “forget set” (data to be removed) and a validation set from the same distribution form two distinct domains. We train a domain clas-sifier to discriminate between representations from the forget and validation sets.Using a gradient reversal strategy similar to DANN, we perform gradient updates to the representations to “fool” the domain classifier and thus obfuscate representations belonging to the forget set. Simultaneously, gradient descent is applied to the retain set (original training data minus the forget set) to preserve its classification performance. Unlike other unlearning approaches whose training objectives are built based on model outputs,SURE directly manipulates there presentations.This is key to ensure robustness against a set of more powerful attacks than currently considered in the literature, that aim to detect which examples were unlearned through access to learned embeddings. Our thorough experiments reveal that SURE has a better unlearning quality to utility trade-off compared to other standard unlearning techniques for deep neural networks.

3629Self-Augmented Preference Optimization: Off-Policy Paradigms for Language Model Alignment

[openreview] [pdf]

Abstract Traditional language model alignment methods, such as Direct Preference Optimization (DPO), are limited by their dependence on static, pre-collected paired preference data, which restricts their adaptability and practical applicability. To address this limitation, we introduce Self-Augmented Preference Optimization (SAPO), an effective and scalable training paradigm without the need of existing paired data. Built upon the self-play concept that autonomously generate negative responses, we further involve the off-policy learning pipeline to improve the data exploration and exploitation. Specifically, we employ an Exponential Moving Average (EMA) model along with a replay buffer to enable dynamic updates of response segments, effectively integrating real-time feedback with historical data insights. Our comprehensive evaluations of the LLaMA3-8B and Mistral-7B models across benchmarks—including the Open LLM Leaderboard, IFEval, AlpacaEval 2.0, and MT-Bench—demonstrate that SAPO matches or surpasses established offline contrastive baselines, such as DPO and Odds Ratio Preference Optimization (ORPO), and outperforms offline self-play methods like SPIN.

3630Online Preference Alignment for Language Models via Count-based Exploration

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has shown great potential in fine-tuning Large Language Models (LLMs) to align with human preferences. Existing methods perform preference alignment from a fixed dataset, which can be limited in data coverage and the resulting reward model is hard to generalize in out-of-distribution responses. Thus, online RLHF is more desirable to empower the LLM to explore outside the support of the initial dataset by iteratively collecting the prompt-response pairs. In this paper, we study the fundamental problem in online RLHF, i.e., how to explore for LLM. We give a theoretical motivation in linear reward assumption to show that an optimistic reward with an upper confidence bound (UCB) term leads to a provably efficient RLHF policy. Then, we reformulate our objective to direct preference optimization with an exploration term, where the UCB-term can be converted to a count-based exploration bonus. We further propose a practical algorithm, named Count-based Online Preference Optimization (COPO), which leverages a simple coin-flip counting module to estimate the pseudo-count of a prompt-response pair in previously collected data. COPO encourages LLMs to balance exploration and preference optimization in an iterative manner, which enlarges the exploration space and the entire data coverage of iterative LLM policies. We conduct online RLHF experiments on Zephyr and Llama-3 models. The results on instruction-following and standard academic benchmarks show that COPO significantly increases performance.

3631Enhancing Integrated Gradients Using Emphasis Factors and Attention for Effective Explainability of Large Language Models

[openreview] [pdf]

Abstract Understanding the decision-making processes of large language models (LLMs) is critical for ensuring transparency and trustworthiness. While Integrated Gradients (IG) is a popular method for model explainability, it faces limitations when applied to autoregressive models due to issues like exploding gradients and the neglect of the attention mechanisms. In this paper, we propose an enhanced explainability framework that augments IG with emphasis factors and attention mechanisms. By incorporating attention, we capture contextual dependencies between words, and the introduction of emphasis factors mitigates gradient issues encountered during attribution calculations. Our method provides more precise and interpretable explanations for autoregressive LLMs, effectively highlighting word-level contributions in text generation tasks. Experimental results demonstrate that our approach outperforms standard IG and baseline models in explaining word-level attributions, advancing the interpretability of LLMs.

3632Tight Stability, Convergence, and Robustness Bounds for Predictive Coding Networks

[openreview] [pdf]

Abstract Energy-based learning algorithms, such as predictive coding (PC), have garnered significant attention in the machine learning community due to their theoretical properties, such as local operations and biologically plausible mechanisms for error correction. In this work, we rigorously analyze the stability, robustness, and convergence of PC through the lens of dynamical systems theory. We show that, first, PC is Lyapunov stable under mild assumptions on its loss and residual energy functions, which implies intrinsic robustness to small random perturbations due to its well-defined energy-minimizing dynamics. Second, we formally establish that the PC updates approximate quasi-Newton methods by incorporating higher-order curvature information, which makes them more stable and able to converge with fewer iterations compared to models trained via backpropagation (BP). Furthermore, using this dynamical framework, we provide new theoretical bounds on the similarity between PC and other algorithms, i.e., BP and target propagation (TP), by precisely characterizing the role of higher-order derivatives. These bounds, derived through detailed analysis of the Hessian structures, show that PC is significantly closer to quasi-Newton updates than TP, providing a deeper understanding of the stability and efficiency of PC compared to conventional learning methods.

3633An Engorgio Prompt Makes Large Language Model Babble on

[openreview] [pdf]

Abstract Auto-regressive large language models (LLMs) have yielded impressive performance in many real-world tasks. However, the new paradigm of these LLMs also exposes novel threats. In this paper, we explore their vulnerability to inference cost attacks, where a malicious user crafts Engorgio prompts to intentionally increase the computation cost and latency of the inference process. We design Engorgio, a novel methodology, to efficiently generate adversarial Engorgio prompts to affect the target LLM’s service availability. Engorgio has the following two technical contributions. (1) We employ a parameterized distribution to track LLMs’ prediction trajectory. (2) Targeting the auto-regressive nature of LLMs’ inference process, we propose novel loss functions to stably suppress the appearance of the token, whose occurrence will interrupt the LLM’s generation process. We conduct extensive experiments on 13 open-sourced LLMs with parameters ranging from 125M to 30B. The results show that Engorgio prompts can successfully induce LLMs to generate abnormally long outputs (i.e., roughly 2-13×\times longer to reach 90%+ of the output length limit) in a white-box scenario and our real-world experiment demonstrates Engergio’s threat to LLM service with limited computing resources. The code is accessible in \url{https://anonymous.4open.science/r/Engorgio}.

3634Understanding Model Ensemble in Transferable Adversarial Attack

[openreview] [pdf]

Abstract Model ensemble adversarial attack has become a powerful method for generating transferable adversarial examples that can target even unknown models, but its theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attack.We first define transferability error to measure the error in adversarial transferability, alongside concepts of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant, which rigidly explains the origin of transferability error in model ensemble attack: the vulnerability of an adversarial example to ensemble components, and the diversity of ensemble components.Furthermore, we apply the latest mathematical tools in information theory to bound the transferability error using complexity and generalization terms, contributing to three practical guidelines for reducing transferability error: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step forward in understanding transferable model ensemble adversarial attacks.

3635EVA: An Embodied World Model for Future Video Anticipation

[openreview] [pdf]

Abstract World models integrate raw data from various modalities—such as images and language to simulate comprehensive interactions in the world, thereby displaying crucial roles in fields like mixed reality and robotics. Yet, applying the world model for accurate video prediction is quite challenging due to the complex and dynamic intentions of the various scenes in practice. In this paper, inspired by the human rethinking process, we decompose the complex video prediction into four meta-tasks that enable the world model to handle this issue in a more fine-grained manner. Alongside these tasks, we introduce a new benchmark named Embodied Video Anticipation Benchmark (EVA-Bench) to provide a well-rounded evaluation. EVA-Bench focused on evaluating the video prediction ability of human and robot actions, presenting significant challenges for both the language model and the generation model. Targeting embodied video prediction, we propose the Embodied Video Anticipator (EVA), a unified framework aiming at video understanding and generation. EVA integrates a video generation model with a visual language model, effectively combining reasoning capabilities with high-quality generation. Moreover, to enhance the generalization of our framework, we tailor-designed a multi-stage pretraining paradigm that adaptatively ensembles LoRA to produce high-fidelity results.Extensive experiments on EVA-Bench highlight the potential of EVA to significantly improve performance in embodied scenes, paving the way for large-scale pre-trained models in real-world prediction tasks. The video demo and benchmark information will be available at \hyperlink{https://sites.google.com/view/iclr25-eva}{https://sites.google.com/view/iclr25-eva}.

3636Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models

[openreview] [pdf]

Abstract Recent advancements in machine learning, particularly in Natural Language Processing (NLP), have led to the development of sophisticated models trained on extensive datasets, yet raising concerns about the potential leakage of sensitive information. In response, regulatory measures such as the European Union’s General Data Protection Regulation (GDPR) have driven increasing interest in Machine Unlearning techniques, which enable models to selectively forget specific data entries. Early approaches primarily relied on pre-processing methods, while more recent research has shifted towards training-based unlearning techniques. Despite their effectiveness, most existing methods require access to the original training data, which is often inaccessible. Additionally, directly applying unlearning techniques bear the cost of undermining the model’s expressive capabilities. To address these challenges, we introduce theIterativeContrastiveUnlearning (ICU) framework, which consists of three core components: A Knowledge Unlearning Induction module designed to remove specific knowledge through an unlearning loss; A Contrastive Learning Enhancement module to preserve the model’s expressive capabilities against the pure unlearning goal; And an Iterative Unlearning Refinement module that dynamically assess the unlearning extent on specific data pieces and make iterative update. Experimental results demonstrate the efficacy of our ICU method in unlearning sensitive information while maintaining the model’s overall performance, offering a promising solution for privacy-conscious machine learning applications.

3637Non-negative Tensor Mixture Learning for Discrete Density Estimation

[openreview] [pdf]

Abstract We present an expectation-maximization (EM) based unified framework for non-negative tensor decomposition that optimizes the Kullback-Leibler divergence. To avoid iterations in each M-step and learning rate tuning, we establish a general relationship between low-rank decomposition and many-body approximation. Using this connection, we exploit that the closed-form solution of the many-body approximation can be used to update all parameters simultaneously in the M-step. Our framework offers not only a unified methodology for a variety of low-rank structures, including CP, Tucker, and Train decompositions but also their combinations forming mixtures of tensors. The weights of each low-rank tensor in the mixture can be learned from the data, which eliminates the need to carefully choose a single low-rank structure in advance. We empirically demonstrate that our framework provides superior generalization for discrete density estimation compared to conventional tensor-based approaches.

3638Feature Responsiveness Scores: Model-Agnostic Explanations for Recourse

[openreview] [pdf]

Abstract Machine learning models are often used to automate or support decisions in applications such as lending and hiring. In such tasks, consumer protection rules mandate that we provide a list of ``principal reasons” to consumers who receive adverse decisions. In practice, lenders and employers identify principal reasons by returning the top-scoring features from a feature attribution method. In this work, we study how this approach aligns with one of the underlying goals of consumer protection: recourse, i.e., educating individuals on how they can attain a desired outcome. We show that standard attribution methods can mislead individuals by highlighting features that cannot be changed to achieve recourse – i.e., by providing them with reasons without recourse. We propose to address these issues by scoring features on the basis of responsiveness – i.e., the probability that an individual can attain a desired outcome by changing a specific feature. We develop efficient methods to compute feature responsiveness scores for any model and any dataset under complex actionability constraints. We present an extensive empirical study on the responsiveness of explanations in consumer finance, and demonstrate that responsiveness scores can flag instances with fixed predictions and identify features that lead to recourse.

3639Divide And Conquer: Efficiently Decoupling Consensus And Divergence For Federated Large Language Model Fine-Tuning

[openreview] [pdf]

Abstract Federated Learning provides an efficient framework for fine-tuning Large Language Models (LLMs) on diverse private datasets, addressing the growing scarcity of publicly available training data while maintaining data privacy. However, in practice, client data typically spans multiple domains, posing significant challenges for the global model’s generalization capabilities. To address this issue, we introduce a novel framework,FederatedConsensus-DivergenceDecoupling for LLM Fine-Tuning (FedCDD), designed to enhance global model performance in such heterogeneous environments. Our framework introduces a mechanism for consensus aggregation and divergence alignment, decoupling client updates into “consensus” and “divergence” parts. This allows the LLM to maintain a unified consensus while accommodating domain-specific divergences. Additionally, we employ a Gaussian-Noise Mask to regulate local model uploads, preventing the LLM from overfitting to domain-specific knowledge. Experimental results on heterogeneous datasets demonstrate the superiority of our approach over existing methods. The code is anonymously available athttps://anonymous.4open.science/r/FedCDD-5DA6.

3640A Decade’s Battle on Dataset Bias: Are We There Yet?

[openreview] [pdf]

Abstract We revisit the ``dataset classification’’ experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained by memorization. We hope our discovery will inspire the community to rethink issues involving dataset bias.

3641Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

[openreview] [pdf]

Abstract Learning conditional distributions π(x)\pi^*(\cdot|x) is a central problem in machine learning, which is typically approached via supervised methods with paired data (x,y)π(x,y) \sim \pi^*. However, acquiring paired data samples is often challenging, especially in problems such as domain translation. This necessitates the development ofsemi-supervisedmodels that utilize both limited paired data and additional unpaired i.i.d. samples xπxx \sim \pi^*_x and yπyy \sim \pi^*_y from the marginal distributions. The usage of such combined data is complex and often relies on heuristic approaches. To tackle this issue, we propose a new learning paradigm that integrates both paired and unpaired dataseamlesslythrough the data likelihood maximization techniques. We demonstrate that our approach also connects intriguingly with inverse entropic optimal transport (OT). This finding allows us to apply recent advances in computational OT to establish alightlearning algorithm to get π(x)\pi^*(\cdot|x). Furthermore, we demonstrate through empirical tests that our method effectively learns conditional distributions using paired and unpaired data simultaneously.

3642AgentSquare: Automatic LLM Agent Search in Modular Design Space

[openreview] [pdf]

Abstract Recent advancements in Large Language Models (LLMs) have led to a rapid growth of agentic systems capable of handling a wide range of complex tasks. However, current research largely relies on manual, task-specific design, limiting their adaptability to novel tasks. In this paper, we introduce a new research problem: Modularized LLM Agent Search (MoLAS). We propose a modular design space that abstracts existing LLM agent designs into four fundamental modules with uniform IO interface: Planning, Reasoning, Tool Use, and Memory. Building on this design space, we present a novel LLM agent search framework called AgentSquare, which introduces two core mechanisms, i.e., module evolution and recombination, to efficiently search for optimized LLM agents. To further accelerate the process, we design a performance predictor that uses in-context surrogate models to skip unpromising agent designs. Extensive experiments across six benchmarks, covering the diverse scenarios of web, embodied, tool use and game applications, show that AgentSquare substantially outperforms hand-crafted agents, achieving an average performance gain of 17.2% against best-known human designs. Moreover, AgentSquare can generate interpretable design insights, enabling a deeper understanding of agentic architecture and its impact on task performance. We believe that the modular design space and AgentSquare search framework offer a platform for fully exploiting the potential of prior successful designs and consolidate the collective efforts of research community. Code repo is available athttps://github.com/ICLR-10021/AgentSquare.

3643CityBench: Evaluating the Capabilities of Large Language Models for Urban Tasks

[openreview] [pdf]

Abstract Recently, large language models (LLMs) with extensive general knowledge and powerful reasoning abilities have seen rapid development and widespread application. A systematic and reliable evaluation of LLMs or visual language model (VLMs) is a crucial step in applying and developing them for various fields. There have been some early explorations about the usability of LLMs for limited urban tasks, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for urban research lies in the diversity of urban data, the complexity of application scenarios and the highly dynamic nature of the urban environment. In this paper, we design CityBench, an interactive simulator based evaluation platform, as the first systematic benchmark for evaluating the capabilities of LLMs for diverse tasks in urban research. First, we build CityData to integrate the diverse urban data and CitySim to simulate fine-grained urban dynamics. Based on CityData and CitySim, we design 8 representative urban tasks in 2 categories of perception-understanding and decision-making as the CityBench. With extensive results from 30 well-known LLMs and VLMs in 13 cities around the world, we find that advanced LLMs and VLMs can achieve competitive performance in diverse urban tasks requiring commonsense and semantic understanding abilities, e.g., understanding the human dynamics and semantic inference of urban images. Meanwhile, they fail to solve the challenging urban tasks requiring professional knowledge and high-level reasoning abilities, e.g., geospatial prediction and traffic control task. These observations provide valuable perspectives for utilizing and developing LLMs in the future. The dataset, benchmark and source codes are openly accessible to the research community viahttps://github.com/CityBench24/CityBench.

3644SubgoalXL: Subgoal-based Expert Learning for Theorem Proving

[openreview] [pdf]

Abstract Formal theorem proving, a field at the intersection of mathematics and computer science, has seen renewed interest with advancements in large language models (LLMs). This paper introduces SubgoalXL, a novel approach that synergizes subgoal-based proofs with expert learning to enhance LLMs’ capabilities in formal theorem proving within the Isabelle environment. SubgoalXL addresses two critical challenges: the scarcity of specialized mathematics and theorem-proving data, and the need for improved multi-step reasoning abilities in LLMs. By optimizing data efficiency and employing subgoal-level supervision, SubgoalXL extracts richer information from limited human-generated proofs. The framework integrates subgoal-oriented proof strategies with an expert learning system, iteratively refining formal statement, proof, and subgoal generators. Leveraging the Isabelle environment’s advantages in subgoal-based proofs, SubgoalXL achieves a new state-of-the-art performance of 56.1% in Isabelle on the standard miniF2F dataset, marking an absolute improvement of 4.9%. Notably, SubgoalXL successfully solves 41 AMC12, 9 AIME, and 3 IMO problems from miniF2F. These results underscore the effectiveness of maximizing limited data utility and employing targeted guidance for complex reasoning in formal theorem proving, contributing to the ongoing advancement of AI reasoning capabilities.

3645STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly being applied to economic tasks like stock picking and financial analysis. Existing LLM benchmarks tend to focus on specific applications and often fail to describe a rich variety of economic tasks. Raman et al. (2024) offer a blueprint for comprehensively benchmarking strategic decision-making. However, their work failed to address the non-strategic settings prevalent in micro-economics. We address this gap by taxonomizing micro-economic reasoning into 57 distinct elements, each grounded in up to 10 distinct domains, 5 perspectives, and 2 types. The generation of benchmark data across this combinatorial space is powered by a novel LLM-assisted data generation protocol that we dub auto-STEER which generates a set of questions by adapting handwritten templates to target new domains and perspectives. By generating fresh questions for each element, auto-STEER helps reduce the risk of data contamination, ensuring that \model evaluations remain valuable over time. We leveraged our benchmark to evaluate 15 LLMs over each of the instantiated elements, examined their ability to reason through and solve microeconomic problems and compared LLM performance across a suite of adaptations and metrics. Our work provides insights into the current capabilities and limitations of LLMs in non-strategic economic decision-making and a tool for fine-tuning these models to improve performance.

3646Mitigating the Influence of Distractor Tasks in LMs with Prior-Aware Decoding

[openreview] [pdf]

Abstract The broad capabilities of Language Models (LMs) can be limited by their sensitivity to distractor tasks: LMs can infer secondary tasks from the prompt in addition to the intended one, leading to unwanted outputs. For example, prompt injection attacks can cause models to deviate from explicit directives. In some ‘inverse scaling’ cases, this unwanted behaviour actually worsens as models scale up to at least 540B parameters. We present a theoretical framework that interprets LMs as a product of experts that combine multiple data generation processes. Based on this framework, we introduce prior-aware decoding (PAD) -- a simple contrastive inference method to reduce the influence of distractor tasks. We apply PAD to eleven models, across four datasets, and find improvements in 41 out of 44 task-model combinations, with a median increase in task completion proportion of 40%. The results suggest a promising direction for further development towards more reliable language models.

3647Update larger, train faster: Stable Test-time adaptation utilizing noisy-pseudo labels

[openreview] [pdf]

Abstract We investigate the role of pseudo-labels in the test-time adaptation (TTA) problem. When working with unlabeled samples in TTA, pseudo-labels have become a natural approach to updating the target model. However, pseudo-label learning also presents some challenges: it suffers from a memorization effect (the model learns from clean labels first, then memorizes the noisy ones) and confirmation bias (errors from noisy labels increase over time and disrupt model performance when they become significant). Our work first identifies two underlying mechanisms leading to these obstacles. On the one hand, existing methods follow a “slow” adaptation to the target domain, allowing sufficient time for the model to memorize noisy labels (memorization effect) and accumulate errors (confirmation bias). Furthermore, training with noisy labels blurs the decision boundary with nearby classes. To address the first issue, we propose a novel loss function, namely sparse cross logit (sparse-CL), that operates in the logit space and allows the model to take larger learning steps in a stable training manner. This helps the target model reach a better solution faster under the same number of updating steps. To address the second issue, we introduce a regularization that penalizes negative pseudo-labels while encouraging positive ones, which can increase the boundary between nearby classes. We demonstrate that our methods outperform state-of-the-art methods in a diverse set of TTA experiments.

3648Monty Hall and Optimized Conformal Prediction to Improve Decision-Making with LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) are empowering decision-making in open-world agents in several applications, including tool or API usage and answering multiple choice questions (MCQs). However, they often make overconfident, incorrect predictions, which can be risky in high-stakes settings like healthcare and finance. To mitigate these risks, recent works have used conformal prediction (CP), a model-agnostic framework for distribution-free uncertainty quantification. CP transforms a \emph{score function} into prediction sets that contain the true answer with high probability. While CP provides this coverage guarantee for arbitrary scores, the score quality significantly impacts prediction set sizes. Prior works have relied on LLM logits or other heuristic scores, lacking quality guarantees. We address this limitation by introducing CP-OPT, an optimization framework to learn scores that minimize set sizes while maintaining coverage. Furthermore, inspired by the Monty Hall problem, we extend CP’s utility beyond uncertainty quantification to improve accuracy. We propose a method called \emph{conformal revision of questions} (CROQ) to revise the problem by narrowing down the available choices to those in the prediction set. The coverage guarantee of CP ensures that the correct choice is in the revised question prompt with high probability, while the smaller number of choices increases the LLM’s chances of answering it correctly. Experiments on the MMLU, ToolAlpaca, and TruthfulQA datasets with Llama-3 and Phi-3 models show that optimized CP scores reduce set sizes while maintaining coverage guarantee, and CROQ shows significant improvement in accuracy over the standard inference procedure.

3649Sample-Efficient Co-Optimization of Agent Morphology and Policy with Self-Imitation Learning

[openreview] [pdf]

Abstract The task of co-optimizing the body and behaviour of agents has been a longstanding problem in the fields of evolutionary robotics and embodied AI. Previous work has largely focused on the development of learning methods exploiting massive parallelization of agent evaluations with large population sizes, a paradigm which is applicable to simulated agents but cannot be transferred to the real world due to the assoicated costs with the production of embodiments and robots. Furthermore, recent data-efficient approaches utilizing reinforcement learning can suffer from distributional shifts in transition dynamics as well as in state and action spaces when experiencing new body morphologies. In this work, we propose a new co-adaptation method combining reinforcement learning and State-Aligned SelfImitation Learning to co-optimize embodiment and behavioural policies withing a handful of design iterations. We show that the integration of a self-imitation signal improves the data-efficiency of the co-adaptation process as well as the behavioural recovery when adapting morphological parameters.

3650Leveraging Imitation Learning and LLMs for Efficient Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract In this paper, we introduce an innovative framework that combines Hierarchical Reinforcement Learning (HRL) with Large Language Models (LLMs) to tackle the challenges of complex, sparse-reward environments. A key contribution of our approach is the emphasis on imitation learning during the early training stages, where the LLM plays a crucial role in guiding the agent by providing high-level decision-making strategies. This early-stage imitation learning significantly accelerates the agent’s understanding of task structure, reducing the time needed to adapt to new environments. By leveraging the LLM’s ability to generate abstract representations of the environment, the agent can efficiently explore potential strategies, even in tasks with high-dimensional state spaces and delayed rewards. Our method introduces a dynamic annealing strategy in action sampling, balancing the agent’s reliance on the LLM’s guidance with its own learned policy as training progresses. Additionally, we implement a novel value function which incorporates the LLM’s predictions to guide decision-making while optimizing token efficiency. This approach reduces computational costs and enhances the agent’s learning process. Experimental results across three environments—MiniGrid, NetHack, and Crafter—demonstrate that our method significantly outperforms baseline HRL algorithms in terms of training speed and success rates. The imitation learning phase proves critical in enabling the agent to adapt quickly and perform efficiently, highlighting the potential of integrating LLMs into HRL for complex tasks.

3651xLSTM-Mixer: Multivariate Time Series Forecasting by Mixing via Scalar Memories

[openreview] [pdf]

Abstract Time series data is prevalent across numerous fields, necessitating the development of robust and accurate forecasting models. Capturing patterns both within and between temporal and multivariate components is crucial for reliable predictions. We introduce xLSTM-Mixer, a model designed to effectively integrate temporal sequences, joint time-variate information, and multiple perspectives for robust forecasting. Our approach begins with a linear forecast shared across variates, which is then refined by xLSTM blocks. They serve as key elements for modeling the complex dynamics of challenging time series data. xLSTM-Mixer ultimately reconciles two distinct views to produce the final forecast. Our extensive evaluations demonstrate its superior long-term forecasting performance compared to recent state-of-the-art methods. A thorough model analysis provides further insights into its key components and confirms its robustness and effectiveness. This work contributes to the resurgence of recurrent models in time series forecasting.

3652Rational Decision-Making Agent with Learning Internal Utility Judgment

[openreview] [pdf]

Abstract With remarkable advancements, large language models (LLMs) have attracted significant efforts to develop LLM-based agents capable of executing intricate multi-step decision-making tasks. Existing approaches predominantly build upon the external performance measure to guide the decision-making process but the reliance on the external performance measure as prior is problematic in real-world scenarios, where such prior may be unavailable, flawed, or even erroneous. For genuine autonomous decision-making for LLM-based agents, it is imperative to develop rationality from their posterior experiences to judge the utility of each decision independently. In this work, we propose RaDAgent (Rational Decision-Making Agent), which fosters the development of its rationality through an iterative framework involving Experience Exploration and Utility Learning. Within this framework, Elo-based Utility Learning is devised to assign Elo scores to individual decision steps to judge their utilities via pairwise comparisons. Consequently, these Elo scores guide the decision-making process to derive optimal outcomes. Experimental results on the Game of 24, WebShop, ToolBench and RestBench datasets demonstrate RaDAgent’s superiority over baselines, achieving about 7.8% improvement on average. Besides, RaDAgent also can reduce costs (ChatGPT API calls), highlighting its effectiveness and efficiency.

3653Towards Generalisable Time Series Understanding Across Domains

[openreview] [pdf]

Abstract In natural language processing and computer vision, self-supervised pre-training on large datasets unlocks foundational model capabilities across domains and tasks. However, this potential has not yet been realised in time series analysis, where existing methods disregard the heterogeneous nature of time series characteristics. Time series are prevalent in many domains, including medicine, engineering, natural sciences, and finance, but their characteristics vary significantly in terms of variate count, inter-variate relationships, temporal dynamics, and sampling frequency. This inherent heterogeneity across domains prevents effective pre-training on large time series corpora. To address this issue, we introduce OTiS, an open model for general time series analysis, that has been specifically designed to handle multi-domain heterogeneity. We propose a novel pre-training paradigm including a tokeniser with learnable domain-specific signatures, a dual masking strategy to capture temporal causality, and a normalised cross-correlation loss to model long-range dependencies. Our model is pre-trained on a large corpus of 640,187 samples and 11 billion time points spanning 8 distinct domains, enabling it to analyse time series from any (unseen) domain. In comprehensive experiments across 15 diverse applications - including classification, regression, and forecasting - OTiS showcases its ability to accurately capture domain-specific data characteristics and demonstrates its competitiveness against state-of-the-art baselines. Our code and pre-trained weights are publicly available at \url{https://github.com/OTiS-official/OTiS}.

3654Breaking the Curse of Multiagency in Robust Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Standard multi-agent reinforcement learning (MARL) algorithms are vulnerable to sim-to-real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst-case performance when game dynamics shift within a prescribed uncertainty set. Solving RMGs remains under-explored, from problem formulation to the development of sample-efficient algorithms. A notorious yet open challenge is if RMGs can escape the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs where the uncertainty set of each agent is shaped by both the environment and other agents’ strategies in a best-response manner. We first establish the well-posedness of these RMGs by proving the existence of game-theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample-efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs.

3655Schur’s Positive-Definite Network: Deep Learning in the SPD cone with structure

[openreview] [pdf]

Abstract Estimating matrices in the symmetric positive-definite (SPD) cone is of interest for many applications ranging from computer vision to graph learning. While there exist various convex optimization-based estimators, they remain limited in expressivity due to their model-based approach. The success of deep learning motivates the use of learning-based approaches to estimate SPD matrices with neural networks in a data-driven fashion. However, designing effective neural architectures for SPD learning is challenging, particularly when the task requires additional structural constraints, such as element-wise sparsity. Current approaches either do not ensure that the output meets all desired properties or lack expressivity. In this paper, we introduce SpodNet, a novel and generic learning module that guarantees SPD outputs and supports additional structural constraints. Notably, it solves the challenging task of learning jointly SPD and sparse matrices. Our experiments illustrate the versatility and relevance of SpodNet layers for such applications.

3656Finding Second-order Stationary Points for Generalized-Smooth Nonconvex Minimax Optimization via Gradient-based Algorithm

[openreview] [pdf]

Abstract Nonconvex minimax problems have received intense interest in many machine learning applications such as generative adversarial network, robust optimization and adversarial Recently, a variety of minimax optimization algorithms based on Lipschitz smoothness for finding first-order or second-order stationary points have been proposed. However, the standard Lipschitz continuous gradient or Hessian assumption could fail to hold even in some classic minimax problems, rendering conventional minimax optimization algorithms fail to converge in practice. To address this challenge, we demonstrate a new gradient-based method for nonconvex-strongly-concave minimax optimization under a generalized smoothness assumption. Motivated by the important application of escaping saddle points, we propose a generalized Hessian smoothness condition, under which our gradient-based method can achieve the complexity of O(ϵ1.75logn)\mathcal{O}(\epsilon^{-1.75}\log n) to find a second-order stationary point with only gradient calls involved, which improves the state-of-the-art complexity results for the nonconvex minimax optimization even under standard Lipschitz smoothness condition. To the best of our knowledge, this is the first work to show convergence for finding second-order stationary points on nonconvex minimax optimization with generalized smoothness. The experimental results on the application of domain adaptation confirm the superiority of our algorithm compared with existing methods.

3657Uncertainty-Based Extensible Codebook for Discrete Federated Learning in Heterogeneous Data Silos

[openreview] [pdf]

Abstract Federated learning (FL), aimed at leveraging vast distributed datasets, confronts a crucial challenge: the heterogeneity of data across different silos. While previous studies have explored discrete representations to enhance model generalization across minor distributional shifts, these approaches often struggle to adapt to new data silos with significantly divergent distributions. In response, we have identified that models derived from FL exhibit markedly increased uncertainty when applied to data silos with unfamiliar distributions. Consequently, we propose an innovative yet straightforward iterative framework, termed Uncertainty-Based Extensible-Codebook Federated Learning (UEFL). This framework dynamically maps latent features to trainable discrete vectors, assesses the uncertainty, and specifically extends the discretization dictionary or codebook for silos exhibiting high uncertainty. Our approach aims to simultaneously enhance accuracy and reduce uncertainty by explicitly addressing the diversity of data distributions, all while maintaining minimal computational overhead in environments characterized by heterogeneous data silos. Through experiments conducted on six datasets, our method has demonstrated its superiority, achieving significant improvements in accuracy (by 3%--22.1%) and uncertainty reduction (by 38.83%--96.24%), thereby outperforming contemporary state-of-the-art methods.

3658Generalized Resource-Aware Distributed Minimax Optimization

[openreview] [pdf]

Abstract Traditional distributed minimax optimization algorithms cannot be applied in resource-limited clients dealing with large-scale models. In this work, we presentSubDisMO, a generalized resource-aware distributed minimax optimization algorithm.SubDisMOprunes the global large-scale model into adaptive-sized submodels to accommodate varying resources during each communication round. However, the randomly pruned submodels are susceptible toarbitrary submodel sharpness, which can hinder generalization and lead to slow convergence. To address this issue,SubDisMOtrains the arbitrarily pruned submodels with perturbations by optimizing the minimax objectives, enhancing thegeneralizationperformance of the aggregated full model. We theoretically analyze our proposed resource-awareSubDisMOalgorithm, demonstrating that it achieves an asymptotically optimal convergence rate of O(1/QTC)O(1/\sqrt{QT\mathcal{C}^*}), which is dominated by the minimum covering number C\mathcal{C}^*. We also show the generalization bound ofSubDisMOcorresponding to the remaining rate in each layer. Extensive experiments onCIFAR-10andCIFAR-100datasets demonstrate thatSubDisMOachieves superior generalization and effectiveness compared to state-of-the-art baselines.

3659Diffusing States and Matching Scores: A New Framework for Imitation Learning

[openreview] [pdf]

Abstract Adversarial Imitation Learning is traditionally framed as a two-player zero-sum game between a learner and an adversarially chosen cost function, and can therefore be thought of as the sequential generalization of a Generative Adversarial Network (GAN). A prominent example of this framework is Generative Adversarial Imitation Learning (GAIL). However, in recent years, diffusion models have emerged as a non-adversarial alternative to GANs that merely require training a score function via regression, yet produce generations of a higher quality. In response, we investigate how to lift insights from diffusion modeling to the sequential setting. We propose diffusing states and performing \textit{score-matching} along diffused states to measure the discrepancy between the expert’s and learner’s states. Thus, our approach only requires training score functions to predict noises via standard regression, making it significantly easier and more stable to train than adversarial methods. Theoretically, we prove first- and second-order instance-dependent bounds with linear scaling in the horizon, proving that our approach avoids the compounding errors that stymie offline approaches to imitation learning. Empirically, we show our approach outperforms GAN-style imitation learning baselines across various continuous control problems, including complex tasks like controlling humanoids to walk, sit, and crawl.

3660Battle of the Wordsmiths: Comparing ChatGPT, GPT-4, Claude, and Bard

[openreview] [pdf]

Abstract Although informal evaluations of modern LLMs can be found on social media, blogs, and news outlets, a formal and comprehensive comparison among them has yet to be conducted. In response to this gap, we have undertaken an extensive benchmark evaluation of LLMs and conversational bots. Our evaluation involved the collection of 1002 questions encompassing 27 categories, which we refer to as the “Wordsmiths dataset.” These categories include reasoning, logic, facts, coding, bias, language, humor, and more. Each question in the dataset is accompanied by an accurate and verified answer. We meticulously assessed four leading chatbots: ChatGPT, GPT-4, Bard, and Claude, using this dataset. The results of our evaluation revealed the following key findings: a) GPT-4 emerged as the top-performing chatbot across almost all categories, achieving a success rate of 84.1%. On the other hand, Bard faced challenges and achieved a success rate of 62.4%. b) Among the four models evaluated, one of them responded correctly approximately 93% of the time. However, all models were correct only about 44%. c) Bard is less correlated with other models while ChatGPT and GPT-4 are highly correlated in terms of their responses. d) Chatbots demonstrated proficiency in language understanding, facts, and self-awareness. However, they encountered difficulties in areas such as math, coding, IQ, and reasoning. e) In terms of bias, discrimination, and ethics categories, models generally performed well, suggesting they are relatively safe to utilize. To make future model evaluations on our dataset easier, we also provide a multiple-choice version of it (called WordsmithsMCQ).

3661NextBestPath: Efficient 3D Mapping of Unseen Environments

[openreview] [pdf]

Abstract This work addresses the problem of active 3D mapping, where an agent must find an efficient trajectory to exhaustively reconstruct a new scene. Previous approaches mainly predict the next best view near the agent’s location, which is prone to getting stuck in local areas. Additionally, existing indoor datasets are insufficient due to limited geometric complexity and inaccurate ground truth meshes. To overcome these limitations, we introduce a novel dataset AiMDoom with a map generator for the Doom video game, enabling to better benchmark active 3D mapping in diverse indoor environments. Moreover, we propose a new method we call next-best-path (NBP), which predicts long-term goals rather than focusing solely on short-sighted views. The model jointly predicts accumulated surface coverage gains for long-term goals and obstacle maps, allowing it to efficiently plan optimal paths with a unified model. By leveraging online data collection, data augmentation and curriculum learning, NBP significantly outperforms state-of-the-art methods on both the existing MP3D dataset and our AiMDoom dataset, achieving more efficient mapping in indoor environments of varying complexity.

3662GraphRouter: A Graph-based Router for LLM Selections

[openreview] [pdf]

Abstract The rapidly growing number and variety of Large Language Models (LLMs) present significant challenges in efficiently selecting the appropriate LLM for a given query, especially considering the trade-offs between performance and computational cost. Current LLM selection methods often struggle to generalize across new LLMs and different tasks because of their limited ability to leverage contextual interactions among tasks, queries, and LLMs, as well as their depen- dence on a transductive learning framework. To address these shortcomings, we introduce a novel inductive graph framework, named as GraphRouter, which fully utilizes the contextual information among tasks, queries, and LLMs to en- hance the LLM selection process. GraphRouter constructs a heterogeneous graph comprising task, query, and LLM nodes, with interactions represented as edges, which efficiently captures the contextual information between the query’s requirements and the LLM’s capabilities. Through an innovative edge prediction mechanism, GraphRouter is able to predict attributes (the effect and cost of LLM response) of potential edges, allowing for optimized recommendations that adapt to both existing and newly introduced LLMs without requiring retraining. Comprehensive experiments across three distinct effect-cost weight scenarios have shown that GraphRouter substantially surpasses existing routers, delivering a minimum performance improvement of 12.3%. In addition, it achieves enhanced generalization across new LLMs settings and supports diverse tasks with at least a 9.5% boost in effect and a significant reduction in computational demands. This work endeavors to apply a graph-based approach for the contextual and adaptive selection of LLMs, offering insights for real-world applications.

3663Physical Backdoor Attack can Jeopardize Driving with Vision-Large-Language Models

[openreview] [pdf]

Abstract Vision-Large-Language-models (VLMs) have great application prospects in autonomous driving. Despite the ability of VLMs to comprehend and make decisions in complex scenarios, their integration into safety-critical autonomous driving systems poses serious security risks. In this paper, we propose \texttt{BadVLMDriver}, the first backdoor attack against VLMs for autonomous driving that can be launched in practice using \textit{physical} objects. Unlike existing backdoor attacks against VLMs that rely on digital modifications, \texttt{BadVLMDriver} uses common physical items, such as a red balloon, to induce unsafe actions like sudden acceleration, highlighting a significant real-world threat to autonomous vehicle safety. To execute \texttt{BadVLMDriver}, we develop an automated pipeline utilizing natural language instructions to generate backdoor training samples with embedded malicious behaviors. This approach allows for flexible trigger and behavior selection, enhancing the stealth and practicality of the attack in diverse scenarios. We conduct extensive experiments to evaluate \texttt{BadVLMDriver} for two representative VLMs, five different trigger objects, and two types of malicious backdoor behaviors. \texttt{BadVLMDriver} achieves a 92% attack success rate in inducing a sudden acceleration when coming across a pedestrian holding a red balloon. Thus, \texttt{BadVLMDriver} not only demonstrates a critical security risk but also emphasizes the urgent need for developing robust defense mechanisms to protect against such vulnerabilities in autonomous driving technologies.

3664Global-to-Local Support Spectrums for Model Explainability

[openreview] [pdf]

Abstract Existing sample-based methods, like influence functions and representer points, measure the importance of a training point by approximating the effect of its removal from training. As such, they are skewed towards outliers and points that are very close to the decision boundaries. The explanations provided by these methods are often static and not specific enough for different test points. In this paper, we propose a method to generate an explanation in the form of support spectrums which are based on two main ideas: the support sets and a global-to-local importance measure. The support set is the set of training points, in the predicted class, that ``lie in between’’ the test point and training points in the other classes. They indicate how well the test point can be distinguished from the points not in the predicted class. The global-to-local importance measure is obtained by decoupling existing methods into the global and local components which are then used to select the points in the support set. Using this method, we are able to generate explanations that are tailored to specific test points. In the experiments, we show the effectiveness of the method in image classification and text generation tasks.

3665Soft Preference Optimization: Aligning Language Models to Expert Distributions

[openreview] [pdf]

Abstract We propose Soft Preference Optimization (SPO), a method for aligning generative models, such as Large Language Models (LLMs), with human preferences, without the need for a reward model. SPO optimizes model outputs directly over a preference dataset through a natural loss function that integrates preference loss with a regularization term across the model’s entire output distribution rather than limiting it to the preference dataset. Although SPO does not require the assumption of an existing underlying reward model, we demonstrate that, under the Bradley-Terry (BT) model assumption, it converges to a softmax of scaled rewards, with the distribution’s ``softness" adjustable via the softmax exponent, an algorithm parameter. We showcase SPO’s methodology, its theoretical foundation, and its comparative advantages in simplicity and alignment precision.

3666kNN Attention Demystified: A Theoretical Exploration for Scalable Transformers

[openreview] [pdf]

Abstract Despite their power, Transformers \citep{vaswani2017attention} face challenges with long sequences due to the quadratic complexity of self-attention. To address this limitation, methods like k-Nearest-Neighbor (kkNN) attention have been introduced \citep{roy2021efficient}, enabling each token to attend to only its kk closest tokens. While kkNN attention has shown empirical success in making Transformers more efficient, its exact approximation guarantees have not been theoretically analyzed. In this work, we establish a theoretical framework for kkNN attention, reformulating self-attention as expectations over softmax distributions and leveraging lazy Gumbel sampling \citep{mussmann2017fast} with kkNN indices for efficient approximation. Building on this framework, we also propose novel sub-quadratic algorithms that approximate self-attention gradients by leveraging efficient sampling techniques, such as Markov Chain-based estimation. Finally, we demonstrate the practical effectiveness of these algorithms through empirical experiments, showcasing their benefits in both training and inference.

3667Rethinking Light Decoder-based Solvers for Vehicle Routing Problems

[openreview] [pdf]

Abstract Light decoder-based solvers have gained popularity for solving vehicle routing problems (VRPs) due to their efficiency and ease of integration with reinforcement learning algorithms. However, they often struggle with generalization to larger problem instances or different VRP variants. This paper revisits light decoder-based approaches, analyzing the implications of their reliance on static embeddings and the inherent challenges that arise. Specifically, we demonstrate that in the light decoder paradigm, the encoder is implicitly tasked with capturing information for all potential decision scenarios during solution construction within a single set of embeddings, resulting in high information density. Furthermore, our empirical analysis reveals that the overly simplistic decoder struggles to effectively utilize this dense information, particularly as task complexity increases, which limits generalization to out-of-distribution (OOD) settings. Building on these insights, we show that enhancing the decoder capacity, with a simple addition of identity mapping and a feed-forward layer, can considerably alleviate the generalization issue. Experimentally, our method significantly enhances the OOD generalization of light decoder-based approaches on large-scale instances and complex VRP variants, narrowing the gap with the heavy decoder paradigm.

3668Uncertainty-Guided Optimization on Large Language Model Search Trees

[openreview] [pdf]

Abstract Tree search algorithms such as greedy and beam search are the standard when it comes to finding sequences of maximum likelihood in the decoding processes of large language models (LLMs). However, they are myopic since they do not take the complete root-to-leaf path into account. Moreover, they are agnostic to prior knowledge available about the process: For example, it does not consider that the objective being maximized is a probability and thereby has specific properties like being bound in the unit interval. Taking a probabilistic approach, we define prior beliefs over LLMs’ transition probabilities and obtain posterior beliefs over the most promising paths in each iteration. These beliefs are useful for defining a sample-based, non-myopic acquisition function that allows for a more data-efficient exploration scheme than standard search algorithms on LLMs. Crucially, unlike expensive simulation-based non-myopic methods like the Monte Carlo tree search, our method only requires samples from the beliefs. Our formulation thus views LLM decoding as Bayesian optimization on trees. We discuss how to select the prior and the acquisition function, and demonstrate in experiments with various LLMs that our method achieves higher efficiency than recent baselines: Our method achieves the same or a higher likelihood while expanding fewer nodes.

3669Reinforcement Learning for Finite Space Mean-Field Type Games

[openreview] [pdf]

Abstract Mean field type games (MFTGs) describe Nash equilibria between large coalitions: each coalition consists of a continuum of cooperative agents who maximize the average reward of their coalition while interacting non-cooperatively with a finite number of other coalitions. Although the theory has been extensively developed, we are still lacking efficient and scalable computational methods. Here, we develop reinforcement learning methods for such games in a finite space setting with general dynamics and reward functions. We start by proving that MFTG solution yields approximate Nash equilibria in finite-size coalition games. We then propose two algorithms. The first is based on quantization of mean-field spaces and Nash Q-learning. We provide convergence and stability analysis. We then propose a deep reinforcement learning algorithm, which can scale to larger spaces. Numerical experiments in 5 environments with mean-field distributions of dimension up to 200 show the scalability and efficiency of the proposed method.

3670Federated Residual Low-Rank Adaption of Large Language Models

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) presents an effective solution for federated fine-tuning of Large Language Models (LLMs), as it substantially reduces communication overhead. However, a straightforward combination of FedAvg and LoRA results in suboptimal performance, especially under data heterogeneity. We noted this stems from both intrinsic (i.e., constrained parameter space) and extrinsic (i.e., client drift) limitations, which hinder it effectively learn global knowledge. In this work, we proposed a novel Federated Residual Low-Rank Adaption method, namely FRLoRA, to tackle above two limitations. It directly sums the weight of the global model parameters with a residual low-rank matrix product (\ie, weight change) during the global update step, and synchronizes this update for all local models. By this, FRLoRA performs global updates in a higher-rank parameter space, enabling a better representation of complex knowledge structure. Furthermore, FRLoRA reinitializes the local low-rank matrices with the principal singular values and vectors of the pre-trained weights in each round, to calibrate their inconsistent convergence, thereby mitigating client drift. Our extensive experiments demonstrate that FRLoRA consistently outperforms various state-of-the-art FL methods across nine different benchmarks in natural language understanding and generation under different FL scenarios.

3671Spurious Forgetting in Continual Learning of Language Models

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) reveal a perplexing phenomenon in continual learning: despite extensive training, models experience significant performance declines, raising questions about task alignment and underlying knowledge retention. This study first explores the concept of “spurious forgetting”, proposing that such performance drops often reflect a decline in task alignment rather than true knowledge loss. Through controlled experiments with a synthesized dataset, we investigate the dynamics of model performance during the initial training phases of new tasks, discovering that early optimization steps can disrupt previously established task alignments. Our theoretical analysis connects these shifts to orthogonal updates in model weights, providing a robust framework for understanding this behavior. Ultimately, we introduce a Freezing strategy that fix the bottom layers of the model, leading to substantial improvements in four continual learning scenarios. Our findings underscore the critical distinction between task alignment and knowledge retention, paving the way for more effective strategies in continual learning.

3672Adversarial Multi-Agent Evaluation of Large Language Models through Iterative Debate

[openreview] [pdf]

Abstract We propose a novel framework for evaluating large language model (LLM) outputs using LLMs themselves as interacting agents in an adversarial debate system. Our approach casts LLMs as advocates, judges, and juries within a structured courtroom-inspired setting. Advocate LLMs engage in iterative argumentation to refine and critique responses, while judge and jury LLMs moderate and assess the debate. We introduce a probabilistic model using Beta-Binomial distribution to analyze error reduction dynamics in this iterative process. Comparative studies of ranking versus scoring methods for LLM jurors reveal advantages of fine-grained scoring in capturing nuanced quality assessments. Experiments across diverse language tasks demonstrate our framework’s superior performance in agreement with human judgments and provision of interpretable feedback compared to traditional evaluation methods. This work contributes a theoretically grounded, scalable approach to LLM evaluation that addresses limitations of existing techniques and adapts to rapid advancements in language AI technologies.

3673Language Model Alignment in Multilingual Trolley Problems

[openreview] [pdf]

Abstract We evaluate the moral alignment of large language models (LLMs) with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs’ decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide.

3674SePer: Measure Retrieval Utility Through The Lens Of Semantic Perplexity Reduction

[openreview] [pdf]

Abstract Large language models (LLM) can leverage retrieved external knowledge to improve the generation performance, a paradigm called retrieval-augmented generation (RAG). Current automatic evaluation of RAG either treats retrieval and generation as a whole or independently evaluates retrievers with traditional retrieval metrics regardless of generator, bringing a gap to understand retrieval utility with a systematic view. In this work, we propose an automatic method to measure the retrieval quality from the lens of information gains of the RAG system. Specifically, we introduce semantic perplexity SePer to measure LLM’s uncertainty about the ground-truth, and quantify retrieval utilities as the uncertainty reduction of semantic perplexity after retrieval. With experiments from various RAG scenarios, we demonstrate that SePer not only aligns well with human preferences but also provides more fine-grained evaluation of context quality and retrieval utility efficiently and robustly.

3675Rehearsal-Free Continual Federated Learning with Synergistic Regularization

[openreview] [pdf]

Abstract Continual Federated Learning (CFL) allows distributed devices to collaboratively learn novel concepts from continuously shifting training data while avoiding \textit{knowledge forgetting} of previously seen tasks. To tackle this challenge, most current CFL approaches rely on extensive rehearsal of previous data. Despite effectiveness, rehearsal comes at a cost to memory, and it may also violate data privacy. Considering these, we seek to apply regularization techniques to CFL by considering their cost-efficient properties that do not require sample caching or rehearsal. Specifically, we first apply traditional regularization techniques to CFL and observe that existing regularization techniques, especially synaptic intelligence, can achieve promising results under homogeneous data distribution but fail when the data is heterogeneous. Based on this observation, we propose a simple yet effective regularization algorithm for CFL named \textbf{FedSSI}, which tailors the synaptic intelligence for the CFL with heterogeneous data settings. FedSSI can not only reduce computational overhead without rehearsal but also address the data heterogeneity issue. Extensive experiments show that FedSSI achieves superior performance compared to state-of-the-art methods.

3676A Competitive-Cooperative Actor-critic Framework for Reinforcement Learning

[openreview] [pdf]

Abstract In the field of Deep reinforcement learning (DRL), enhancing exploration capabilities and improving the accuracy of Q-value estimation remain two major challenges. Recently, double-actor DRL methods have emerged as a promising class of DRL approaches, achieving substantial advancements in both exploration and Q-value estimation. However, existing double-actor DRL methods feature actors that operate independently in exploring the environment, lacking mutual learning and collaboration, which leads to suboptimal policies. To address this challenge, this work proposes a generic solution that can be seamlessly integrated into existing double-actor DRL methods by promoting mutual learning among the actors to develop improved policies. Specifically, we calculate the difference in actions output by the actors and minimize this difference as a loss during training to facilitate mutual imitation among the actors. Simultaneously, we also minimize the differences in Q-values output by the various critics as part of the loss, thereby avoiding significant discrepancies in value estimation for the imitated actions. We present two specific implementations of our method and extend these implementations beyond double-actor DRL methods to other DRL approaches to encourage broader adoption. Experimental results demonstrate that our method effectively enhances four state-of-the-art (SOTA) double-actor DRL methods and five other types of SOTA DRL methods across four MuJoCo tasks, as measured by return.

3677DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient Self-Driving

[openreview] [pdf]

Abstract Current end-to-end autonomous driving methods resort to unifying modular designs for various tasks (e.g. perception, prediction and planning). Although optimized in a planning-oriented spirit with a fully differentiable framework, existing end-to-end driving systems without ego-centric designs still suffer from unsatisfactory performance and inferior efficiency, owing to the rasterized scene representation learning and redundant information transmission. In this paper, we revisit the human driving behavior and propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction and iterative motion planner. The sparse perception module performs detection, tracking and online mapping based on sparse representation of the driving scene. The hierarchical interaction module aims to select the Closest In-Path Vehicle / Stationary (CIPV / CIPS) from coarse to fine, benefiting from an additional geometric prior. As for the iterative motion planner, both selected interactive agents and ego-vehicle are considered for joint motion prediction, where the output multi-modal ego-trajectories are optimized in an iterative fashion. Besides, both position-level motion diffusion and trajectory-level planning denoising are introduced for uncertainty modeling, thus facilitating the training stability and convergence of the whole framework. Extensive experiments conducted on nuScenes dataset demonstrate the superior planning performance and great efficiency of DiFSD, which significantly reduces the average L2 error by 66% and collision rate by 77% than UniAD while achieves 8.2x faster running efficiency.

3678On Designing Effective RL Reward at Training Time for LLM Reasoning

[openreview] [pdf]

Abstract Reward models have been increasingly critical for improving the reasoning capability of LLMs. Existing research has shown that a well-trained reward model can substantially improve model performances atinference timevia search or best-of-N votes. However, the potential of reward models duringRL training timestill remains largely under-explored. It is currently unclear whether these reward models can provide additional training signals to enhance the reasoning capabilities of LLMs in RL training that uses sparse success rewards, which verify the correctness of solutions. In this work, we evaluate popular reward models for RL training, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), and train a collection of LLMs for math problems using RL by combining these learned rewards with success rewards. Surprisingly, even though these learned reward models have strong inference-time performances, they may NOT help or even hurt RLtraining, producing worse performances than LLMs trained with the success reward only. Our analysis reveals that an LLM can receive high rewards from some of these reward models by repeating correct but unnecessary reasoning steps, leading to a severe reward hacking issue for RL training. Therefore, we introduce two novel reward refinement techniques, includingClippingandDelta. The key idea is to ensure the accumulative reward of any reasoning trajectory is upper-bounded to keep a learned reward model effective without being exploited. We evaluate our techniques with multiple reward models over a set of 1.5B and 7B LLMs on MATH and GSM8K benchmarks, where bothClippingandDeltaconsistently stabilize RL training. Finally, we also demonstrate that with a carefully designed reward function, pure RL training without any additional supervised tuning can further improve all the evaluated LLMs, including the state-of-the-art 7B LLM Qwen2.5-Math-7B-Instruct on MATH and GSM8K benchmarks.

3679AdCorDA: Classifier Refinement via Adversarial Correction and Domain Adaptation

[openreview] [pdf]

Abstract This paper describes a simple yet effective technique for refining a pretrained classifier network. The proposed AdCorDA method consists of two stages - adversarial correction followed by domain adaptation. Adversarial correction uses adversarial attacks to correct misclassified training-set classifications. The incorrectly classified samples of the training set are removed and replaced with the adversarially corrected samples to form a new training set, and then, in the second stage, domain adaptation is performed back to the original training set. Extensive experimental validations show significant accuracy boosts of over 5% on the CIFAR-100 dataset and 1% on the CINIC-10 dataset. The technique can be straightforwardly applied to the refinement of weight-quantized neural networks, where experiments show substantial enhancement in performance over the baseline. The adversarial correction technique also results in enhanced robustness to adversarial attacks.

3680Understanding the Role of Spectral Signal in Unsupervised Graph Domain Adaptation

[openreview] [pdf]

Abstract Unsupervised graph domain adaptation (GDA) addresses the challenge of transferring knowledge from labeled source graphs to unlabeled target graphs. However, existing methods primarily implement spatial message-passing operators, which are limited by the neglect of the unique roles of spectral signals in unsupervised GDA. In this paper, we initially investigate an experimental study and find that the low-frequency topology signals signify the shared cross-domain features, while the high-frequency information indicates domain-specific knowledge. However, how to effectively leverage the above findings persists as a perplexing conundrum. To tackle the above issue, we propose an effective framework named Synergy Low-High Frequency Cross-Domain Network (SnLH) for unsupervised GDA. Specifically, we decouple the low- and high-frequency components in the original graph, extracting global structures and local details to capture richer semantic information and enhance the graph-level semantics. For the low-frequency components, we design an optimization objective to maximize the mutual information among low-frequency features, promoting the model to learn more generalized low-frequency information. To further mitigate domain discrepancy, we introduce high-frequency information cross-domain contrastive learning to impose constraints on the domains. By effectively leveraging both low and high-frequency information, the learned features turn out to be both discriminative and domain-invariant, thereby attaining effective cross-domain knowledge transfer. Extensive experiments demonstrate the superiority and effectiveness of the proposed framework across various state-of-the-art unsupervised GDA baselines.

3681VoxDialogue: Can Spoken Dialogue Systems Understand Information Beyond Words?

[openreview] [pdf]

Abstract With the rapid advancement of large models, voice assistants are gradually acquiring the ability to engage in open-ended daily conversations with humans. However, current spoken dialogue systems often overlook multi-modal information in audio beyond text, such as speech rate, volume, emphasis, and background sounds. Relying solely on Automatic Speech Recognition (ASR) can lead to the loss of valuable auditory cues, thereby weakening the system’s ability to generate contextually appropriate responses. To address this limitation, we propose \textbf{VoxDialogue}, a comprehensive benchmark for evaluating the ability of spoken dialogue systems to understand multi-modal information beyond text. Specifically, we have identified 12 attributes highly correlated with acoustic information beyond words and have meticulously designed corresponding spoken dialogue test sets for each attribute, encompassing a total of 4.5K multi-turn spoken dialogue samples. Finally, we evaluated several existing spoken dialogue models, analyzing their performance on the 12 attribute subsets of VoxDialogue. Experiments have shown that in spoken dialogue scenarios, many acoustic cues cannot be conveyed through textual information and must be directly interpreted from the audio input. In contrast, while direct spoken dialogue systems excel at processing acoustic signals, they still face limitations in handling complex dialogue tasks due to their restricted context understanding capabilities. All data and code will be open source at \url{https://voxdialogue.github.io/}.

3682Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting

[openreview] [pdf]

Abstract Retrieval augmented generation (RAG) combines the generative abilities of large language models (LLMs) with external knowledge sources to provide more accurate and up-to-date responses. Recent RAG advancements focus on improving retrieval outcomes through iterative LLM refinement or self-critique capabilities acquired through additional instruction tuning of LLMs. In this work, we introduce Speculative RAG - a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Each draft is generated from a distinct subset of retrieved documents, offering diverse perspectives on the evidence while reducing input token counts per draft. This approach enhances comprehension of each subset and mitigates potential position bias over long context. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. Extensive experiments demonstrate that Speculative RAG achieves state-of-the-art performance with reduced latency on TriviaQA, MuSiQue, PopQA, PubHealth, and ARC-Challenge benchmarks. It notably enhances accuracy by up to 12.97% while reducing latency by 50.83% compared to conventional RAG systems on PubHealth.

3683Auto-Demo Prompting: Leveraging Generated Outputs as Demonstrations for Enhanced Batch Prompting

[openreview] [pdf]

Abstract Batch prompting is a common technique in large language models (LLMs) used to process multiple inputs simultaneously, aiming to improve computational efficiency. However, as batch sizes increase, performance degradation often occurs due to the model’s difficulty in handling lengthy context inputs. Existing methods that attempt to mitigate these issues rely solely on batch data arrangement and majority voting rather than improving the design of the batch prompt itself. In this paper, we address these limitations by proposing “Auto-Demo Prompting,” a novel approach that leverages the question-output pairs from earlier questions within a batch as demonstrations for subsequent answer inference. We provide a formal theoretical analysis of how Auto-Demo Prompting functions within the autoregressive generation process of LLMs, illustrating how it utilizes prior outputs to optimize the model’s internal representations. Our method effectively bridges the gap between batch prompting and few-shot prompting, enhancing performance with only a slight compromise in token usage. Experimental results across five NLP tasks demonstrate its effectiveness in mitigating performance degradation and occasionally outperforming single prompts. Furthermore, it opens new avenues for applying few-shot learning techniques, such as demonstration selection, within batch prompting, making it a robust solution for real-world applications.

3684AnomalyTCN: Dual-branch Convolution with Contrastive Representation for Efficient Time Series Anomaly Detection

[openreview] [pdf]

Abstract This paper focuses on the rising contrastive-based method for time series anomaly detection, which works on the idea of contrastive discrepancy learning and breaks through the performance bottleneck of previous reconstruction-based methods. But we also find that, existing contrastive-based methods only work with the complicated attention mechanisms, which brings heavier computational costs. To address this efficiency issue, we propose AnomalyTCN as a more efficient and effective contrastive-based solution. In detail, we design a dual-branch convolution structure to produce different representations of the same input under two different views for contrastive learning. Then we adopt the representation discrepancy between these two branches as a more distinguishable criterion to detect the anomalies, leading to better detection performance. Meanwhile, since we adopt a simple and light-weight pure convolution structure to avoid the complicated attention computation, our method can enjoy much more advantages in efficiency. Experimentally, our AnomalyTCN achieves the consistent state-of-the-art performance on various time series anomaly detection tasks while saving 83.6% running time and 20.1% memory usage. These results validate that our AnomalyTCN is a novel solution for time series anomaly detection with a better balance of performance and efficiency.

3685RED – ROBUST ENVIRONMENTAL DESIGN

[openreview] [pdf]

Abstract The classification of road signs by autonomous systems, especially those reliant on visual inputs, is highly susceptible to adversarial attacks. Traditional approaches to mitigating such vulnerabilities have focused on enhancing the robustness of classi- fication models. In contrast, this paper adopts a fundamentally different strategy aimed at increasing robustness through the redesign of road signs themselves. We propose an attacker-agnostic learning scheme to automatically design road signs that are robust to a wide array of patch-based attacks. Empirical tests conducted in both digital and physical environments demonstrate that our approach significantly reduces vulnerability to patch attacks, outperforming existing techniques.

3686Large Language Models can be Strong Self-Detoxifiers

[openreview] [pdf]

Abstract Reducing the likelihood of generating harmful and toxic output is an essential task when aligning large language models (LLMs). Existing methods mainly rely on training an external reward model (i.e., another language model) or fine-tuning the LLM using self-generated data to influence the outcome. In this paper, we show that LLMs have the capability of self-detoxification without the use of an additional reward model or re-training. We propose \textit{Self-disciplined Autoregressive Sampling (SASA)}, a lightweight controlled decoding algorithm for toxicity reduction of LLMs. SASA leverages the contextual representations from an LLM to learn linear subspaces characterizing toxic v.s. non-toxic output in analytical forms. When auto-completing a response token-by-token, SASA dynamically tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy. Evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks, SASA markedly enhances the quality of the generated sentences relative to the original models and attains comparable performance to state-of-the-art detoxification techniques, significantly reducing the toxicity level by only using the LLM’s internal representations.

3687A new perspective on applying mesoscience to explore the model generalizability

[openreview] [pdf]

Abstract The black-box nature is one of bottlenecks constraining machine learning (ML) models, especially, neural networks, from playing a more important role in the field of engineering. The decision-making process of the model often lacks transparency and is difficult to interpret, which limits its use in the high-risk domain. Thus, explaining the generalizability of ML models is a crucial topic in the field of AI. However, there has been no unified understanding of this issue. This work attempts to introduce the concept of compromise in competition (CIC) in mesoscience into the explanation of the generalizability of ML models. In this work, a scale decomposition method is proposed from the perspective of training samples, and the CIC between memorizing and forgetting, refined as dominant mechanisms, is studied. Empirical studies on computer vision (CV) and natural language processing (NLP) datasets demonstrate that the CIC between memorizing and forgetting significantly influences model generalizability. Additionally, dropout, L2 regularization, etc., aimed at mitigating overfitting, can be uniformly reinterpreted through the CIC between memorizing and forgetting. Collectively, this work proposes a new perspective to explain the generalizability of ML models, in order to provide inherent support for further applications of ML in the field of engineering.

3688Solving Video Inverse Problems Using Image Diffusion Models

[openreview] [pdf]

Abstract Recently, diffusion model-based inverse problem solvers (DIS) have emerged as state-of-the-art approaches for addressing inverse problems, including image super-resolution, deblurring, inpainting, etc. However, their application to video inverse problems arising from spatio-temporal degradation remains largely unexplored due to the challenges in training video diffusion models. To address this issue, here we introduce an innovative video inverse solver that leverages only image diffusion models. Specifically, by drawing inspiration from the success of the recent decomposed diffusion sampler (DDS), our method treats the time dimension of a video as the batch dimension of image diffusion models and solves spatio-temporal optimization problems within denoised spatio-temporal batches derived from each image diffusion model. Moreover, we introduce a batch-consistent diffusion sampling strategy that encourages consistency across batches by synchronizing the stochastic noise components in image diffusion models. Our approach synergistically combines batch-consistent sampling with simultaneous optimization of denoised spatio-temporal batches at each reverse diffusion step, resulting in a novel and efficient diffusion sampling strategy for video inverse problems. Experimental results demonstrate that our method effectively addresses various spatio-temporal degradations in video inverse problems, achieving state-of-the-art reconstructions. Project page:https://solving-video-inverse.github.io/main

3689Memorization in In-Context Learning

[openreview] [pdf]

Abstract In-context learning (ICL) has proven to be an effective strategy for improving the performance of large language models (LLMs) with no additional training. However, the exact mechanism behind this performance improvement remains unclear. This study is the first to show how ICL surfaces memorized training data and to explore the correlation between this memorization and performance on downstream tasks across various ICL regimes: zero-shot, few-shot, and many-shot. Our most notable findings include: (1) ICL significantly surfaces memorization compared to zero-shot learning in most cases; (2) demonstrations, without their labels, are the most effective element in surfacing memorization; (3) ICL improves performance when the surfaced memorization in few-shot regimes reaches a high level (about 40%); and (4) there is a very strong correlation between performance and memorization in ICL when it outperforms zero-shot learning. Overall, our study uncovers memorization as a new factor impacting ICL, raising an important question: to what extent do LLMs truly generalize from demonstrations in ICL, and how much of their success is due to memorization?

3690Rethinking the Roles of Time and Frequency Domains Before Tackling Time Series UDA

[openreview] [pdf]

Abstract In time-series unsupervised domain adaptation (UDA), the adaptation of both temporal and frequency domain features has been relatively underexplored. To address this gap, we conduct a comprehensive series of experiments to revisit the roles of these domains in UDA. Our findings reveal that the temporal domain contains more diverse features, offering higher discriminability, while the frequency domain is more domain-invariant, providing better transferability. Combining the strengths of both domains, we propose TidalFlow, a UDA framework that synergistically integrates temporal and frequency domain features. TidalFlow enhances feature extraction and captures subtle, class-specific features without relying on traditional alignment strategies. By utilizing simple hyperparameter adjustments and using frequency embeddings from the source domain as reference points for domain adaptation, TidalFlow achieves nearly a 10% improvement across five benchmark datasets in time-series UDA. This research highlights the unique strengths of both domains and marks a paradigm shift in UDA methods, showcasing TidalFlow’s robust performance in real-world applications.

3691A Simple but Strong Baseline for Sounding Video Generation: Effective Adaptation of Audio and Video Diffusion Models for Joint Generation

[openreview] [pdf]

Abstract In this work, we build a simple but strong baseline for sounding video generation. Given base diffusion models for audio and video, we integrate them with additional modules into a single model and train it to make the model jointly generate audio and video. To enhance alignment between audio-video pairs, we introduce two novel mechanisms in our model. The first one is timestep adjustment, which provides different timestep information to each base model. It is designed to align how samples are generated along with timesteps across modalities. The second one is a new design of the additional modules, termed Cross-Modal Conditioning as Positional Encoding (CMC-PE). In CMC-PE, cross-modal information is embedded as if it represents temporal position information, and the embeddings are fed into the model like positional encoding. Compared with the popular cross-attention mechanism, CMC-PE provides a better inductive bias for temporal alignment in the generated data. Experimental results validate the effectiveness of the two newly introduced mechanisms and also demonstrate that our method outperforms existing methods. The source code will be released upon acceptance.

3692Geometric Graph Neural Network based track finding

[openreview] [pdf]

Abstract An essential component of event reconstruction in particle physics experiments is identifying the trajectory of charged particles in the detector. Traditional methods for track finding are often complex, and tailored to specific detectors and input geometries, limiting their adaptability to new detector designs and optimization processes. To overcome these limitations, we present a novel, end-to-end track finding algorithm that is detector-agnostic and can take into account multiple input geometric types. To achieve this, our approach unifies inputs from multiple sub-detectors and detector types into a single geometric algebra representation, simplifying data handling compared to traditional methods. Then, we leverage an equivariant graph neural network, GATr, to perform track finding across all data from an event simultaneously. We validate the effectiveness of our pipeline on various detector concepts with different technologies for the FCC-ee at CERN, specifically the IDEA and CLD detectors. This work generalizes track finding across diverse types of input geometric data and tracking technologies, facilitating the development of innovative detector concepts, accelerating detector development cycles, and enabling comprehensive detector optimization.

3693Minixax Optimal Two-Stage Algorithm For Moment Estimation Under Covariate Shift

[openreview] [pdf]

Abstract Covariate shift occurs when the distribution of input features differs between the training and testing phases. In covariate shift, estimating an unknown function’s moment is a classical problem that remains under-explored, despite its common occurrence in real-world scenarios. In this paper, we investigate the minimax lower bound of the problem when the source and target distributions are known. To achieve the minimax optimal bound (up to a logarithmic factor), we propose a two-stage algorithm. Specifically, it first trains an optimal estimator for the function under the source distribution, and then uses a likelihood ratio reweighting procedure to calibrate the moment estimator. In practice, the source and target distributions are typically unknown, and estimating the likelihood ratio may be unstable. To solve this problem, we propose a truncated version of the estimator that ensures double robustness and provide the corresponding upper bound. Extensive numerical studies on synthetic examples confirm our theoretical findings and further illustrate the effectiveness of our proposed method.

3694Targeted Unlearning via Single Layer Unlearning Gradient

[openreview] [pdf]

Abstract The unauthorized generation of privacy-related and copyright-infringing content using generative-AI is becoming a significant concern for society, raising ethical, legal, and privacy issues that demand urgent attention. Recently, machine unlearning techniques have arisen that attempt to eliminate the influence of sensitive content used during model training, but they often require extensive updates in the model, reduce the utility of the models for unrelated content, and/or incur substantial computational costs. In this work, we propose a novel and efficient method called Single Layer Unlearning Gradient (SLUG), that can unlearn targeted information by updating a single targeted layer of a model using a one-time gradient computation. We introduce two metrics: layer importance and gradient alignment, to identify the appropriate layers for unlearning targeted information. Our method is highly modular and enables selective removal of multiple concepts from the generated outputs of widely used foundation models (e.g., CLIP), generative models (e.g., Stable Diffusion) and Vision-Language models. Our method shows effectiveness on a broad spectrum of concepts ranging from concrete (e.g., celebrity name, intellectual property figure, and object) to abstract (e.g., novel concept and artistic style). Our method also exhibits state-of-the-art efficiency with effective unlearning and retention on the comprehensive benchmark UnlearnCanvas. Our code is available athttps://anonymous.4open.science/r/SLUG-6CDF

3695The Challenging Growth: Evaluating the Scalability of Causal Models

[openreview] [pdf]

Abstract One of the pillars of causality is the study of causal models and understanding under which hypotheses we can guarantee their ability to grasp causal information and to leverage it for making inferences. Real causal phenomena, however, may involve drastically different settings such as high dimensionality, causal insufficiency, and nonlinearities, which can be in stark contrast with the initial assumptions made by most models. Additionally, providing fair benchmarks under such conditions presents challenges due to the lack of realistic data where the true data generating process is known. Consequently, most analyses converge towards either small and synthetic toy examples or theoretical analyses, while empirical evidence is limited. In this work, we present in-depth experimental results on two large datasets modeling a real manufacturing scenario, which we release. We show the nontrivial behavior of a well-known manufacturing process, simulated using a physics-based simulator built and validated by domain experts. We demonstrate the inadequacy of many state-of-the-art models and analyze the wide differences in their performance and tractability, both in terms of runtime and memory complexity. We observe that a wide range of causal models are computationally prohibitive for certain tasks, whereas others do not suffer from those burdens by design, but require to pay a price in terms of expressiveness. Upon publication, all artefacts will be released to serve as reference for future research on real world applications of causality, including a general web-page and a leaderboard for benchmarking.

3696I-Lora: Iterative Merging of Routing-Tuned Low-Rank Adapters for Multi-task Learning

[openreview] [pdf]

Abstract The advancement of vision-language models has significantly boosted the performance of embodied and game AI, endowing them with more robust general visual understanding capabilities and logical abilities for action planning. However, the substantial computational cost of model training and the performance degradation during fine-tuning limit the models’ ability to learn emerging new tasks continually. Creating a versatile and dynamically updatable vision-language model is an essential area of research. To this end, we propose a Low-Rank Adapter-based fine-tuning approach called I-LoRA, which enables iterative and independent learning of new tasks while preserving the logical capabilities of the previously trained model. Specifically, we first design the routing-tuning method to minimize the impact of original capabilities from the new task by minimizing activation values of LoRA matrices as low as possible in the general task. Secondly, we propose a novel approach to iteratively merge new adapters, allowing for continuous integration of adapters trained on new tasks without being influenced by task order, thereby reducing interference between them. Finally, we conducted extensive experiments on public datasets with significant behavioral and logical differences between tasks. The results demonstrate that our approach achieves excellent single-task performance, strong multi-task compatibility, and flexible scalability without increasing the number of model parameters.

3697From General to Expert: Custom Pruning LLMs Across Language, Domain, and Task

[openreview] [pdf]

Abstract Large Language Models (LLMs) have transformed natural language processing, yet their substantial model sizes often demand significant computational resources. To conserve computing resources and increase inference speed, it is crucial to prune redundant parameters, especially for general users who often need expert models tailored to specific downstream scenarios. However, current pruning methods primarily focus on maintaining models’ general capabilities, either requiring extensive post-training or performing poorly due to coarse-grained pruning. In this work, we design a Cus\underline{Cus}tom Prun\underline{Prun}ing method (Cus-Prun\texttt{Cus-Prun}) to prune a large general model into a smaller expert model for specific scenarios. Cus-Prun\texttt{Cus-Prun} positions an expert model along the “language”, “domain” and “task” dimensions. By identifying and pruning irrelevant neurons, it creates expert models without any post-training. Our experiments demonstrate that Cus-Prun\texttt{Cus-Prun} consistently outperforms other methods, achieving minimal loss in both expert and general capabilities across various models from different model families and sizes.

3698Loss in the Crowd: Hidden Breakthroughs in Language Model Training

[openreview] [pdf]

Abstract The training loss curves of a neural network are typically smooth. Any visible discontinuities draw attention as discrete conceptual breakthroughs, while the rest of training is less carefully studied. In this work we hypothesize that similar breakthroughs actually occur frequently throughout training, though their presence is obscured when monitoring the aggregate train loss. To find these hidden transitions, we introduce POLCA, a method for decomposing changes in loss along an arbitrary basis of the low rank training subspace. We use our method to identify clusters of samples that exhibit similar changes in loss through training, disaggregating the overall loss into that of smaller groups of conceptually similar datapoints. We validate our method on synthetic arithmetic and natural language, showing that POLCA recovers clusters which represent easily interpretable breakthroughs in the model’s capabilities whose existence would otherwise be lost in the crowd.

3699Holistic Unlearning Benchmark: A Multi-Faceted Evaluation for Text-to-Image Diffusion Model Unlearning

[openreview] [pdf]

Abstract As text-to-image diffusion models become advanced enough for commercial applications, there is also increasing concern about their potential for malicious and harmful use. Model unlearning has been proposed to mitigate the concerns by removing undesired and potentially harmful information from the pre-trained model. So far, the success of unlearning is mainly measured by whether the unlearned model can generate a target concept while maintaining image quality. However, unlearning is typically tested under limited scenarios, and the side effects of unlearning have barely been studied in the current literature. In this work, we thoroughly analyze unlearning under various scenarios with five key aspects. Our investigation reveals that every method has side effects or limitations, especially in more complex and realistic situations. By releasing our comprehensive evaluation framework with the source codes and artifacts, we hope to inspire further research in this area, leading to more reliable and effective unlearning methods.

3700Formulating AutoML as a Variable-Length Optimization Problem: A Tree of Thought Approach with LLM-Driven Code Generation

[openreview] [pdf]

Abstract Recent advancements in machine learning have created a demand for automated systems that enable efficient development and deployment of machine learning applications. Traditional Automated Machine Learning (AutoML) approaches often rely on fixed pipeline structures, which limit adaptability to diverse task complexities. In this paper, we introduce a novel formulation of AutoML as a variable-length optimization problem, allowing for dynamic adjustment of model architectures based on task requirements. To effectively navigate the expanded search space of variable-length models, we employ the Tree of Thoughts (ToT) method combined with Large Language Models (LLMs). This framework utilizes a sequential decision-making process, allowing models to be incrementally constructed by evaluating prior outcomes. Additionally, LLMs automatically generate the code corresponding to each decision, transforming model configurations into executable pipelines and reducing manual intervention. Our approach enhances efficiency by focusing on promising pathways and improves transparency by explicitly showcasing how each decision contributes to the overall optimization. Experiments conducted on diverse datasets, including OpenML and clinical tasks, demonstrate that our method outperforms traditional AutoML systems, delivering superior model performance and better adaptability across different task complexities.

3701Formulating AutoML as a Variable-Length Optimization Problem: A Tree of Thought Approach with LLM-Driven Code Generation

[openreview] [pdf]

Abstract No absctract

3702F-Fidelity: A Robust Framework for Faithfulness Evaluation of Explainable AI

[openreview] [pdf]

Abstract With the rapid development of eXplainable AI, various methods are proposed to explain attributions, such as Integral Gradient, and SmoothGrad. However, how to measure the faithfulness of explanations maintains an open question. The most popular removal strategy suffers from the Out-of-Distribution(OOD) problem. The RemOve And Retrain and Importance Measure try to solve the OOD problem by retraining while suffering information leakage and convergence problems. To address these problems and provide a robust evaluation method for faithfulness measurement, we propose a new method, Fine-tuned Fidelity(F-Fidelity). It alleviates the OOD problem by using consistent augmentation operations in fine-tuning and evaluation stages to reduce the gap between the training set and evaluation inputs. To verify the effectiveness of F-Fidelity, we proposed a fair comparison strategy employing various degraded explanations. We have conducted experiments on Image and Natural Language Processing classification tasks with two datasets and two architectures for each task. The results demonstrate the generality, and robustness of F-Fidelity.

3703Learning General Representations Across Graph Combinatorial Optimization Problems

[openreview] [pdf]

Abstract Combinatorial optimization (CO) problems are classical and crucial in many fields, with many NP-complete (NPC) examples being reducible to one another, revealing an underlying connection between them. Existing methods, however, primarily focus on task-specific models trained on individual datasets, limiting the quality of learned representations and the transferability to other CO problems. Given the reducibility among these problems, a natural idea is to abstract a higher-level representation that captures the essence shared across different problems, enabling knowledge transfer and mutual enhancement. In this paper, we propose a novel paradigm CORAL that treats each CO problem type as a distinct modality and unifies them by transforming all instances into representations of the fundamental Boolean satisfiability (SAT) problem. Our approach aims to capture the underlying commonalities across multiple problem types via cross-modal contrastive learning with supervision, thereby enhancing representation learning. Extensive experiments on seven graph decision problems (GDPs) demonstrate the effectiveness of CORAL, showing that our approach significantly improves the quality and generalizability of the learned representations. Furthermore, we showcase the utility of the pre-trained unified SAT representations on related tasks, including satisfying assignment prediction and unsat core variable prediction, highlighting the potential of CORAL as a unified pre-training paradigm for CO problems.

3704Learning Visual Prompts for Guiding the Attention of Vision Transformers

[openreview] [pdf]

Abstract to be completed laterVisual prompting infuses visual information into the input image to adapt models toward specific predictions and tasks. Recently, manually crafted markers such as red circles are shown to guide the model to attend to a target region on the image. However, these markers only work on models trained with data containing those markers. Moreover, finding these prompts requires guesswork or prior knowledge of the domain on which the model is trained. This work circumvents manual design constraints by proposing to learn the visual prompts for guiding the attention of vision transformers. The learned visual prompt, added to any input image would redirect the attention of the pre-trained vision transformer to its spatial location on the image. Specifically, the prompt is learned in a self-supervised manner without requiring annotations and without fine-tuning the vision transformer. Our experiments demonstrate the effectiveness of the proposed optimization-based visual prompting strategy across various pre-trained vision encoders.

3705Solving the Fuzzy Job Shop Scheduling Problem via Learning Approaches

[openreview] [pdf]

Abstract The fuzzy job shop scheduling problem (FJSSP) emerges as an innovative extension to the conventional job shop scheduling problem (JSSP), incorporating a layer of uncertainty that aligns the model more closely with the complexities of real-world manufacturing environments. This enhancement, while enhancing its applicability, concurrently escalates the computational complexity of deriving solutions. In the domain of traditional scheduling, neural combinatorial optimization (NCO) has recently demonstrated remarkable efficacy. However, its application to the realm of fuzzy scheduling has been relatively unexplored. This paper aims to bridge this gap by investigating the feasibility of employing neural networks to assimilate and process fuzzy information for the resolution of FJSSP, thereby leveraging the advancements in NCO to enhance fuzzy scheduling methodologies. To this end, we present a self-supervised algorithm for the FJSSP (SS-FJSSP). This algorithm employs an iterative mechanism to refine pseudo-labels, progressively transitioning from suboptimal to optimal solutions. This innovative approach adeptly circumvents the significant challenge of procuring true labels, a common challenge in NCO frameworks. Experiments demonstrate that our SS-FJSSP algorithm yields results on a par with the state-of-the-art methods while achieving a remarkable reduction in computational time, specifically being two orders of magnitude faster.

3706CBMA: Improving Conformal Prediction through Bayesian Model Averaging

[openreview] [pdf]

Abstract Conformal prediction has emerged as a popular technique for facilitating valid predictive inference across a spectrum of machine learning models, under minimal assumption of exchangeability. Recently, Hoff (2023) showed that full conformal Bayes provides the most efficient prediction sets (smallest by expected volume) among all prediction sets that are valid at the (1α)(1 - \alpha) level if the model is correctly specified. However, a critical issue arises when the Bayesian model itself may be mis-specified, resulting in prediction interval that might be suboptimal, even though it still enjoys the frequentist coverage guarantee. To address this limitation, we propose an innovative solution that combines Bayesian model averaging (BMA) with conformal prediction. This hybrid not only leverages the strengths of Bayesian conformal prediction but also introduces a layer of robustness through model averaging. Theoretically, we prove that the resulting prediction interval will converge to the optimal level of efficiency, if the true model is included among the candidate models. This assurance of optimality, even under potential model uncertainty, provides a significant improvement over existing methods, ensuring more reliable and precise uncertainty quantification.

3707Intrinsic User-Centric Interpretability through Global Mixture of Experts

[openreview] [pdf]

Abstract In human-centric settings like education or healthcare, model accuracy and model explainability are key factors for user adoption. Towards these two goals, intrinsically interpretable deep learning models have gained popularity, focusing on accurate predictions alongside faithful explanations. However, there exists a gap in the human-centeredness of these approaches, which often produce nuanced and complex explanations that are not easily actionable for downstream users. We present InterpretCC (interpretable conditional computation), a family of intrinsically interpretable neural networks at a unique point in the design space that optimizes for ease of human understanding and explanation faithfulness, while maintaining comparable performance to state-of-the-art models. InterpretCC achieves this through adaptive sparse activation of features before prediction, allowing the model to use a different, minimal set of features for each instance. We extend this idea into an interpretable, global mixture-of-experts (MoE) model that allows users to specify topics of interest, discretely separates the feature space for each data point into topical subnetworks, and adaptively and sparsely activates these topical subnetworks for prediction. We apply InterpretCC for text, time series and tabular data across several real-world datasets, demonstrating comparable performance with non-interpretable baselines and outperforming intrinsically interpretable baselines. Through a user study involving 56 teachers, InterpretCC explanations are found to have higher actionability and usefulness over other intrinsically interpretable approaches.

3708Enhancing Large Language Models’ Situated Faithfulness to External Contexts

[openreview] [pdf]

Abstract Large Language Models (LLMs) are often augmented with external information as contexts, but this external information can sometimes be inaccurate or even intentionally misleading. We argue that robust LLMs should demonstrate situated faithfulness, dynamically calibrating their trust in external information based on their confidence in the internal knowledge and the external context. To benchmark this capability, we evaluate LLMs across several QA datasets, including a newly created dataset featuring in-the-wild incorrect contexts sourced from Reddit posts. We show that when provided with both correct and incorrect contexts, both open-source and proprietary models tend to overly rely on external information, regardless of its factual accuracy. To enhance situated faithfulness, we propose two approaches: Self-Guided Confidence Reasoning (SCR) and Rule-Based Confidence Reasoning (RCR). SCR enables models to self-access the confidence of external information relative to their own internal knowledge to produce the most accurate answer. RCR, in contrast, extracts explicit confidence signals from the LLM and determines the final answer using predefined rules. Our results show that for LLMs with strong reasoning capabilities, such as GPT-4o and GPT-4o mini, SCR outperforms RCR, achieving improvements of up to 24.2% over a direct input augmentation baseline. Conversely, for a smaller model like Llama-3-8B, RCR outperforms SCR. Fine-tuning SCR with our proposed Confidence Reasoning Direct Preference Optimization (CR-DPO) method improves performance on both seen and unseen datasets, yielding an average improvement of 8.9% on Llama-3-8B. In addition to quantitative results, we offer insights into the relative strengths of SCR and RCR. Our findings highlight promising avenues for improving situated faithfulness in LLMs.

3709AdaFisher: Adaptive Second Order Optimization via Fisher Information

[openreview] [pdf]

Abstract First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by employing the diagonal matrix preconditioning of the stochastic gradient during the training. Despite their widespread, second-order optimization algorithms exhibit superior convergence properties compared to their first-order counterparts e.g. Adam and SGD. However, their practicality in training DNNs are still limited due to increased per-iteration computations and suboptimal accuracy compared to the first order methods. We present AdaFisher--an adaptive second-order optimizer that leverages a block-diagonal approximation to the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence capabilities and computational efficiency in second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification, language modelling and stand out for its stability and robustness in hyperparameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed.

3710No Access, No Safety: Free Lunch Adversarial Attacks on Black-box NLP Models

[openreview] [pdf]

Abstract Textual adversarial attacks confuse Natural Language Processing (NLP) models, such as Large Language Models (LLMs), by finely modifying the text, resulting in incorrect decisions. Although existing adversarial attacks are effective, they typically rely on knowing the victim model, using extensive queries, or grasping training data, which limits their real-world applications. In situations where there is neither knowledge of nor access to the victim model, we introduce the Free Lunch Adversarial Attack (FLA), demonstrating that attackers can successfully execute attacks armed only with victim texts. To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models that approximate the victim’s decision boundaries to enhance ASR. Concurrently, we use diverse adversarial example generation, employing various attack methods to reduce the frequency of model training, balancing effectiveness with efficiency. Experiments with the Emotion and SST5 datasets show that the FLA outperforms existing state-of-the-art methods while lowering the attack cost to zero. More importantly, we discover that FLA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks.

3711IN the known, OUT of the ordinary: Probing OOD detection methods with Synthetic datasets.

[openreview] [pdf]

Abstract Out-of-distribution (OOD) detection is crucial for ensuring the reliability of machine learning models, especially in visual tasks. Most existing benchmarks focus on isolating distribution shifts and creating varying levels of detection difficulty, often relying on manual curation or classifier-based scoring with human annotations. Additionally, large-scale benchmarks are typically derivatives of ImageNet-21k classes or combinations of ImageNet with other datasets. However, no existing work offers a setup where only one attribute such as color or class changes in a controlled manner, while other attributes of the object remain constant. This limits our ability to precisely study the impact of individual attributes on OOD detection performance. We aim to address this by proposing two novel synthetic datasets, SHAPES and CHARS, designed to explore OOD detection under controlled and fine-grained distribution shifts. SHAPES consist of 2D and 3D geometric shapes with variations in color, size, position, and rotation, while CHARS consists of alphanumeric characters with similar variations. Each dataset presents three scenarios: (1) known classes with unseen attributes, (2) unseen classes with known attributes, and (3) entirely novel classes and attributes. We train 10 architectures and assess 13 OOD detection methods across the three scenarios, concentrating on the impact of attribute shifts on OOD scores, while also conducting additional analysis on how image corruption influences OOD scores. By systematically examining how specific attribute shifts affect OOD scores and the affects of noisy test samples, we aim to bring greater transparency to where these methods succeed or fail, helping to identify their limitations under various conditions.

3712LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits

[openreview] [pdf]

Abstract Reward Models (RMs) play a crucial role in aligning large language models (LLMs) with human preferences, enhancing their performance by ranking outputs during inference or iterative training. However, the degree to which an RM generalizes to new tasks is often not knowna priori. For instance, some RMs may excel at scoring creative writing, while others specialize in evaluating math reasoning. Therefore, using only one fixed RM while training LLMs can besuboptimal. Moreover, optimizing LLMs with multiple RMs simultaneously can be prohibitively computationally-intensive and challenging due to conflicting signals from different RMs, potentially degrading performance. To address these challenges, we introduce LASeR (Learning toAdaptivelySelectRewards), which iteratively trains LLMs using multiple RMs, selecting and utilizing the most well-suited RM for each instance to rank outputs and generate preference data, framed as a multi-armed bandit problem. Our empirical results on commonsense and math reasoning tasks demonstrate that LASeR can boost iterative LLM optimization by optimizing for multiple RMs, improving the absolute average accuracy of Llama-3-8B over three datasets by 2.67% over training with ensemble RM scores while also showing superior training efficiency (e.g., a 2x speedup). Moreover, on WildChat, a benchmark of instruction-following prompts in open-form generation, we find that using Llama-3-8B LASeR leads to a 71.45% AlpacaEval win rate over sequentially optimizing multiple RMs. Extending to long-context generation tasks, we find that on Llama-3-8B, LASeR achieves an average improvement of 2.64 F1 points on single-document QA tasks and 2.42 F1 points on multi-document QA over random RM selection when used with best-of-n sampling. Our analysis shows that LASeR is robust to noisy rewards and generalizes to multiple settings. Finally, we demonstrate that LASeR’s RM selection changes depending on the underlying task or instance and we verify the presence of conflicting preferences from multiple RMs that can be mitigated using LASeR.

3713R2: A LLM Based Novel-to-Screenplay Generation Framework with Causal Plot Graphs

[openreview] [pdf]

Abstract Automatically adapting novels into screenplays is important for the TV, film, or opera industries to promote products with low costs. The strong performances of large language models (LLMs) in long-text generation call us to propose a LLM based framework Reader-Rewriter (R2^2) for this task. However, there are two fundamental challenges here. First, the LLM hallucinations may cause inconsistent plot extraction and screenplay generation. Second, the causality-embedded plot lines should be effectively extracted for coherent rewriting. Therefore, two corresponding tactics are proposed: 1) A hallucination-aware refinement method (HAR) to iteratively discover and eliminate the affections of hallucinations; and 2) a causal plot-graph construction method (CPC) based on a greedy cycle-breaking algorithm to efficiently construct plot lines with event causalities. Recruiting those efficient techniques, R2^2 utilizes two modules to mimic the human screenplay rewriting process: The Reader module adopts a sliding window and CPC to build the causal plot graphs, while the Rewriter module generates first the scene outlines based on the graphs and then the screenplays. HAR is integrated into both modules for accurate inferences of LLMs. Experimental results demonstrate the superiority of R2^2, which substantially outperforms three existing approaches (51.3%, 22.6%, and 57.1% absolute increases) in pairwise comparison at the overall win rate for GPT-4o.

3714Quantifying Generalization Complexity for Large Language Models

[openreview] [pdf]

Abstract While large language models (LLMs) have shown exceptional capabilities in understanding complex queries and performing sophisticated tasks, their generalization abilities are often deeply entangled with memorization, necessitating more precise evaluation. To address this challenge, we introduce Scylla, a dynamic evaluation framework that quantitatively measures the generalization abilities of LLMs. Scylla disentangles generalization from memorization via assessing model performance on both in-distribution (ID) and out-of-distribution (OOD) data through 20 tasks across 5 levels of complexity. Through extensive experiments, we uncover a non-monotonic relationship between task complexity and the performance gap between ID and OOD data, which we term the generalization valley. Specifically, this phenomenon reveals a critical threshold---referred to as critical complexity---where reliance on non-generalizable behavior peaks, indicating the upper bound of LLMs’ generalization capabilities. As model size increases, the critical complexity shifts toward higher levels of task complexity, suggesting that larger models can handle more complex reasoning tasks before over-relying on memorization. Leveraging Scylla and the concept of critical complexity, we benchmark 28LLMs including both open-sourced models such as LLaMA and Qwen families, and close-sourced models like Claude and GPT, providing a more robust evaluation and establishing a clearer understanding of LLMs’ generalization capabilities.

3715SimpleStrat: Diversifying Language Model Generation with Stratification

[openreview] [pdf]

Abstract Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show not only does this approach produce lower quality individual generations as temperature increases, but it depends on model’s next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample drawn from within the strata. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring KL Divergence between the sampling distribution and uniform distribution over valid ground truth answers. As computing a posterior probability for proprietary models is infeasible, we measure recall on ground truth solutions. Our evaluation show using SimpleStrat achieves higher recall by 0.05 compared to GPT-4o and 0.36 average reduction in KL Divergence compared to Llama 3.

3716InfCycle: Learning to Use Tools via Inference Compute and Cycle Consistency

[openreview] [pdf]

Abstract The scaling of inference-time computation in large language models (LLMs) has emerged as a promising approach for enhancing reasoning capabilities by trading off inference-time and pre-training compute. The practice of how to enable LLMs to utilize additional computation at test time to improve response accuracy is crucial for both academia and industry. \textit{Proposer-Verifier}, as a typical paradigm of inference scaling, often fails to generalize to various scenarios. Specifically, in tool use tasks, LLMs face the risk of lacking effective verifiers, leading to error accumulation in multiple reasoning steps. In this work, we address these challenges by introducing \textbf{InfCycle}, a multi-stage data synthesis strategy that employs LLMs as data synthesis and employs cycle consistency verification to ensure high-quality trajectory generation. This approach utilizes step-wise cycle consistency among synthesized trajectories for a given tool, providing effective process supervision that has advantages over outcome supervision. Extensive experiments on multiple tool-use and reasoning tasks demonstrate that InfCycle efficiently enables self-improvement. It outperforms state-of-the-art baselines on StableToolBench, achieving a 75.4% pass rate and a 79.6% win rate using small size models (7B), without relying on external supervision or expert trajectories for warm-up.

3717Stated Causal Language Modeling: Off-the-Shelf Enhancement of Context Memorization

[openreview] [pdf]

Abstract We propose stated causal language modeling (stated-CLM), a novel method to enhance the memory capacity of large language models (LLMs) without modifying their architecture or parameters. Unlike existing context segmentation and sliding methods that discard low-weight tokens, stated-CLM compresses adjacent tokens, significantly reducing context information loss. We utilize the classic network pruning techniques with second-order derivatives to optimize the compressed token in the differentiable key-value space. Experiments on LLaMA, Mistral, and Gemma demonstrate that stated-CLM outperforms baselines on the LongBench benchmark by an average of 6.12% (LLaMA3.1-8B) and 5.97% (Mistral-v0.3-7B). On TopicRet, stated-CLM achieves accuracy levels comparable to full context models, while the baselines’ accuracy is close to zero.

3718Learning Transformer-based World Models with Contrastive Predictive Coding

[openreview] [pdf]

Abstract The DreamerV3 algorithm recently obtained remarkable performance across diverse environment domains by learning an accurate world model based on Recurrent Neural Networks (RNNs). Following the success of model-based reinforcement learning algorithms and the rapid adoption of the Transformer architecture for its superior training efficiency and favorable scaling properties, recent works such as STORM have proposed replacing RNN-based world models with Transformer-based world models using masked self-attention. However, despite the improved training efficiency of these methods, their impact on performance remains limited compared to the Dreamer algorithm, struggling to learn competitive Transformer-based world models. In this work, we show that the next state prediction objective adopted in previous approaches is insufficient to fully exploit the representation capabilities of Transformers. We propose to extend world model predictions to longer time horizons by introducing TWISTER (Transformer-based World model wIth contraSTivE Representations), a world model using action-conditioned Contrastive Predictive Coding to learn high-level temporal feature representations and improve the agent performance. TWISTER achieves a human-normalized mean score of 162% on the Atari 100k benchmark, setting a new record among state-of-the-art methods that do not employ look-ahead search.

3719SceneLock: Reversible Adversarial Learning for Camera-Based Autonomous Driving Protection

[openreview] [pdf]

Abstract The advancement of autonomous driving technology hinges on large-scale data collection to train camera-based deep neural network 3D object detectors. However, these valuable datasets are at risk of unauthorized access and misuse by malicious actors, jeopardizing intellectual property, remote deployment, and the privacy of sensitive information captured during data collection. We propose a novel reversible adversarial learning framework, referred to as SceneLock, aimed at protecting autonomous driving data from unauthorized use. Our method conducts adversarial perturbations through a carefully designed Noise Serialization Encoding module (NSE), which significantly degrades image quality and renders the data ineffective for unauthorized artificial intelligence models and manual annotation. To ensure legitimate access remains unaffected, we integrate advanced image steganography to embed perturbation values within the images. Furthermore, authorized users can extract these values using appropriate decryption tools through the Noise Serialization Decoding module (NSD) to restore the original high-quality images. Experimental results demonstrate that our approach effectively safeguards data integrity against unauthorized use while maintaining availability for legitimate purposes. This dual-layer protection highlights the potential of our method to enhance data security in the autonomous driving domain.

3720Flare Removal with Visual Prompt

[openreview] [pdf]

Abstract Flare removal methods remove the streak, shimmer, and reflective flare in flare-corrupted images while preserving the light source. Recent deep learning methods focus on flare extraction and achieve promising results. They accomplish the task by either viewing the flare equals to the residual information between the flare-corrupted image and the flare-free image and generating the flare-free image through subtracting the extracted flare image or generating the flare-free image and the flare image simultaneously. However, due to the gap between the flare image and the residual information and handling flare extraction and clear image generation process simultaneously will give the network too much pressure and cannot fully utilize the extracted flare, these methods tend to generate images with severe artifacts. To alleviate such a phenomenon, we propose a model-agnostic pipeline named Prompt Inpainting Pipeline (PIP). Specifically, instead of viewing the gap between the flare-free and flare corrupted image as the flare or generating the flare-free image and flare image simultaneously, our prompt inpainting pipeline provides a novel perspective. We borrow the idea from inpainting methods and remove the flare by masking the polluted area and rewriting image details within. Unlike inpainting methods, we first extract multi-scale features of flare-corrupted images as a visual prompt and rewrite missing textures with the visual prompt since we find out that directly writing the missing details based on the remaining area hardly generates promising image details with sufficient semantic and high-frequency information. To verify the function of our pipeline, we conduct comprehensive experiments and demonstrate its superiority.

3721Agent-E: From Autonomous Web Navigation to Foundational Design Principles in Agentic Systems

[openreview] [pdf]

Abstract Web agents that can automate complex and monotonous tasks are becoming essential in streamlining workflows. Due to the difficulty of long-horizon planning, abundant state spaces in websites, and their cryptic observation space (i.e. DOMs), current web agents are still far from human-level performance. In this paper, we present a novel web agent, Agent-E \footnote. This agentic system introduces several architectural improvements over prior state-of-the-art web agents, such as hierarchical architecture, self-refinement, flexible DOM distillation and denoising method, and \textit{change observation} to guide the agent towards more accurate performance. Our Agent-E system without self-refinement achieves SOTA results on the WebVoyager benchmark, beating prior text-only benchmarks by over 20.5% and multimodal agents by over 16%. Our results indicate that adding a self-refinement mechanism can provide an additional 5.9% improvement on the Agent-E system without self-refinement. We then synthesize our learnings into general design principles for developing agentic systems. These include the use of domain-specific primitive skills, the importance of distillation and de-noising of complex environmental observations, and the advantages of a hierarchical architecture.

3722A Feature-Aware Federated Learning Framework for Unsupervised Anomaly Detection in 5G Networks

[openreview] [pdf]

Abstract The expansion of 5G networks has led to remarkable data volume and complexity, introducing significant security challenges that require the implementation of robust and scalable anomaly detection mechanisms. Traditional centralized approaches pose privacy risks and scalability challenges due to the distributed nature of 5G infrastructures. Federated Learning (FL) offers a decentralized solution but often overlooks the importance of feature relevance and privacy preservation during model aggregation. This paper introduces a novel Feature-Aware Federated framework that integrates feature importance into the aggregation process while ensuring differential privacy. We employ integrated gradients to compute feature importance for each client, aggregate them globally with differential privacy noise, and use these insights to weight model parameters during aggregation. Additionally, we propose Dynamic Feature Importance Adaptation (DFIA) to update feature importance occasionally, enhancing the model’s adaptability to evolving data distributions. Experimental results demonstrate that our framework outperforms traditional federated approaches like FedAvg and FedProx in unsupervised anomaly detection tasks within 5G networks, achieving higher accuracy and robustness while preserving data privacy.

3723Explaining Hypergraph Neural Networks: From Local Explanations to Global Concepts

[openreview] [pdf]

Abstract Hypergraph neural networks are a class of powerful models that leverage the message passing paradigm to learn over hypergraphs, a generalization of graphs well-suited to describing relational data with higher-order interactions. However, such models are not naturally interpretable, and their explainability has received very limited attention. We introduce SHypX, the first model-agnostic post-hoc explainer for hypergraph neural networks that provides both local and global explanations. At the instance-level, it performs input attribution by discretely sampling explanation subhypergraphs optimized to be faithful and concise. At the model-level, it produces global explanation subhypergraphs using unsupervised concept extraction. Extensive experiments across four real-world and four novel, synthetic hypergraph datasets demonstrate that our method finds high-quality explanations which can target a user-specified balance between faithfulness and concision, improving over baselines by 25 percent points in fidelity on average.

3724Debiased Deep Evidential Regression for Video Temporal Grounding

[openreview] [pdf]

Abstract Existing Video Temporal Grounding (VTG) models perform well in accuracy but often fail to address open-world challenges posed by open-vocabulary queries and out-of-distribution (OOD) videos, which can lead to unreliable predictions. To address uncertainty, particularly with OOD data, we build a VTG baseline using Deep Evidential Regression (DER), which excels in capturing both aleatoric and epistemic uncertainty. Despite promising results, our baseline faces two key biases in multimodal tasks: (1) Modality imbalance, where uncertainty estimation is more sensitive to the visual modality than the text modality; (2) Counterintuitive uncertainty, resulting from excessive evidence suppression in regularization and uneven sample error distribution in conventional DER. To address these, we propose an RFF block for progressive modality alignment and a query reconstruction task to enhance sensitivity to text queries. Additionally, we introduce a Geom-regularizer to debias and calibrate uncertainty estimation. This marks the first extension of DER in VTG tasks. Extensive experiments demonstrate the effectiveness and robustness of our approach. Our code will be released soon.

3725LeMoLE: LLM-enhanced Mixture of Linear Experts for Time Series Forecasting

[openreview] [pdf]

Abstract Recent research has shown that large language models (LLMs) can be effectively used for real-world time series forecasting due to their strong natural language understanding capabilities. However, aligning time series into semantic spaces of LLMs comes with high computational costs and inference complexity, particularly for long-range time series generation. Building on recent advancements in using linear models for time series, this paper introduces an LLM-enhanced mixture of linear experts for precise and efficient time series forecasting. This approach involves developing a mixture of linear experts with multiple lookback lengths and a new multimodal fusion mechanism. The use of a mixture of linear experts is efficient due to its simplicity, while the multimodal fusion mechanism adaptively combines multiple linear experts based on the learned features of the text modality from pre-trained large language models. In experiments, we rethink the need to align time series to LLMs by existing time-series large language models and further discuss their efficiency and effectiveness in time series forecasting. Our experimental results show that the proposed LeMoLE model presents lower prediction errors and higher computational efficiency than existing LLM models.

3726OpenCarbonEval: How muchCO2will your large model exhale in training process?

[openreview] [pdf]

Abstract Data, model and hardware are crucial components in the development of large scale machine learning models. The training of such models necessitates substantial computational resources, energy consumption, and raw materials, resulting in significant environmental implications. However, the environmental impact of these models has been largely overlooked due to a lack of assessment and analysis of their carbon footprint. In this paper, we present OpenCarbonEval, a carbon emission estimation framework to quantify the environmental implications of large scale machine learning models given their total training computations and hardware configurations. In OpenCarbonEval, we conducted a comprehensive dynamic analysis of the interrelationships among data, models, and hardware throughout the model training process, aiming to forecast the carbon emission of large scale models more accurately. We validated our approach on real-world dataset, and experimental results demonstrate that OpenCarbonEval can predict energy costs and carbon emissions more accurately than previous methods. Furthermore, it can be seamlessly applied to various machine learning tasks without a precision decline. By quantifying the environmental impact of large-scale models, OpenCarbonEval promotes sustainable AI development and deployment, contributing to a more environmentally responsible future for the AI community.

3727Coarsening to Conceal: Enabling Privacy-Preserving Federated Learning for Graph Data

[openreview] [pdf]

Abstract With the escalating demand for privacy-preserving machine learning, federated learning (FL) stands out by enabling collaboration among decentralized entities. Utilizing graph representations of data enhances learning for graph-level tasks, crucial for FL with data distributed across local repositories. Despite its benefits, stringent privacy regulations often compromise FL’s performance. Previous methods aimed at ensuring privacy introduce performance degradation and computational overhead. In response to these challenges, we propose using graph coarsening—a simple yet effective method—to enhance the security and privacy of FL on graph data. Our approach posits that graph coarsening alone can suffice for privacy guarantees, as model parameters obtained from training on the coarsened graph effectively conceal sensitive information susceptible to privacy attacks. Through comprehensive application and analysis, we demonstrate the efficacy of graph coarsening within an FL setup, taking both the graph matrix and node features as input, and jointly learning the coarsened graph matrix and feature matrix while ensuring desired properties. The resultant coarsened graph representations are then utilized to train model parameters, subsequently communicated within an FL framework for downstream tasks such as classification. Extensive experimentation across various datasets confirms that graph coarsening ensures privacy while enhancing performance with minimal trade-offs compared to traditional differential privacy (DP) methods without adding extra complexity overhead.

3728Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradient norms across layers. This allows all parts of the network—both shallow and deep layers—to contribute effectively to training. Extensive experiments with various model sizes demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN during supervised fine-tuning, highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is submitted.

3729Solving Urban Network Security Games: Learning Platform, Benchmark, and Challenge for AI Research

[openreview] [pdf]

Abstract After the great achievement of solving two-player zero-sum games, more and more AI researchers focus on solving multiplayer games. To facilitate the development of designing efficient learning algorithms for solving multiplayer games, we propose a multiplayer game platform for solving Urban Network Security Games (UNSG) that model real-world scenarios. That is,preventing criminal activity is a highly significant responsibility assigned to police officers in cities, and police officers have to allocate their limited security resources to interdict the escaping criminal when a crime takes place in a city. This interaction between multiple police officers and the escaping criminal can be modeled as a UNSG. The variants of UNSGs can model different real-world settings, e.g., whether real-time information is available or not, whether police officers can communicate or not. The main challenges of solving this game include the large size of the game and the co-existence of cooperation and competition. While previous efforts have been made to tackle UNSGs, they have been hampered by performance and scalability issues. Therefore, we propose an open-source UNSG platform (GraphChase) for designing efficient learning algorithms for solving UNSGs. Specifically, GraphChase offers a unified and flexible game environment for modeling various variants of UNSGs, supporting the development, testing, and benchmarking of algorithms. We believe that GraphChase not only facilitates the development of efficient algorithms for solving real-world problems but also paves the way for significant advancements in algorithmic development for solving general multiplayer games.

3730Revisiting a Design Choice in Gradient Temporal Difference Learning

[openreview] [pdf]

Abstract Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed and is one of the most important ideas in RL. It, however, can lead to instability when combined with function approximation and bootstrapping, two arguably indispensable ingredients for large-scale reinforcement learning. This is the notorious deadly triad. The seminal work Sutton et al. (2008) pioneers Gradient Temporal Difference learning (GTD) as the first solution to the deadly triad, which has enjoyed massive success thereafter. During the derivation of GTD, some intermediate algorithm, called AA^\topTD, was invented but soon deemed inferior. In this paper, we revisit this AA^\topTD and prove that a variant of AA^\topTD, called AtA_t^\topTD, is also an effective solution to the deadly triad. Furthermore, this AtA_t^\topTD only needs one set of parameters and one learning rate. By contrast, GTD has two sets of parameters and two learning rates, making it hard to tune in practice. We provide both asymptotic and finite sample analysis for AtA^\top_tTD, where the convergence rate is on par with the canonical on-policy temporal difference learning. Key to our analysis is a novel refined discretization of limiting ODEs.

3731Convergence Analysis of Gradient Descent under Coordinate-wise Gradient Dominance

[openreview] [pdf]

Abstract We consider the optimization problem of finding Nash Equilibrium (NE) for a nonconvex function f(x)=f(x1,...,xn)f(x)=f(x_1,...,x_n), where xiRdix_i\in\mathbb{R}^{d_i} denotes the ii-th block of the variables. Our focus is on investigating first-order gradient-based algorithms and their variations such as the block coordinate descent (BCD) algorithm for tackling this problem. We introduce a set of conditions, termed the nn-sided PL condition, which extends the well-established gradient dominance condition a.k.a Polyak-{\L}ojasiewicz (PL) condition and the concept of multi-convexity. This condition, satisfied by various classes of non-convex functions, allows us to analyze the convergence of various gradient descent (GD) algorithms. Moreover, our study delves into scenarios where the objective function only has strict saddle points, and normal gradient descent methods fail to converge to NE. In such cases, we propose adapted variants of GD that converge towards NE and analyze their convergence rates.

3732Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning

[openreview] [pdf]

Abstract As large-scale models evolve, language instructions are increasingly utilized in multi-modal tasks. Due to human language habits, these instructions often contain ambiguities in real-world scenarios, necessitating the integration of visual context or common sense for accurate interpretation. However, even highly intelligent large models exhibit significant performance limitations on ambiguous instructions, where weak reasoning abilities of disambiguation can lead to catastrophic errors. To address this issue, this paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework. It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models or empirical experience for generally intelligent models to understand ambiguous instructions. Unlike traditional methods that require models to possess high intelligence to understand long texts or perform lengthy complex reasoning, our framework does not significantly increase computational overhead and is more general and effective, even for generally intelligent models. Experiments show that our method not only significantly enhances the performance of models of different intelligence levels on ambiguous instructions but also improves their performance on general datasets. Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity. We will release our data and code.

3733Dataset Distillation via Knowledge Distillation: Towards Efficient Self-Supervised Pre-training of Deep Networks

[openreview] [pdf]

Abstract Dataset distillation (DD) generates small synthetic datasets that can efficiently train deep networks with a limited amount of memory and compute. Despite the success of DD methods for supervised learning, DD for self-supervised pre-training of deep models has remained unaddressed. Pre-training on unlabeled data is crucial for efficiently generalizing to downstream tasks with limited labeled data. In this work, we propose the first effective DD method for SSL pre-training. First, we show, theoretically and empirically, that naiive application of supervised DD methods to SSL fails, due to the high variance of the SSL gradient. Then, we address this issue by relying on insights from knowledge distillation (KD) literature. Specifically, we train a small student model to match the representations of a larger teacher model trained with SSL. Then, we generate a small synthetic dataset by matching the training trajectories of the student models. As the KD objective has considerably lower variance than SSL, our approach can generate synthetic datasets that can successfully pre-train high-quality encoders. Through extensive experiments, we show that our distilled sets lead to up to 13% higher accuracy than prior work, on a variety of downstream tasks, in the presence of limited labeled data.

3734Multi-Scale Image Diffusion Transformers: Explainability Leads to Faster Training

[openreview] [pdf]

Abstract Diffusion models have significantly advanced image synthesis but often face high computational demands and slow convergence rates during training. To tackle these challenges, we propose the Multi-Scale Diffusion Transformer (MDiT), which incorporates heterogeneous, asymmetric, scale-specific transformer blocks to reintroduce explicit inductive structural biases into diffusion transformers (DiTs). Using explainable AI techniques, we demonstrate that DiTs inherently learn these biases, exhibiting distinct encode-decode behaviors, effectively functioning as semantic autoencoders. Our optimized MDiT architecture leverages this understanding to achieve a 3×\ge 3\times increase in convergence speed on FFHQ-256x256 and ImageNet-256x256, culminating in a 7×7\times training speedup on ImageNet compared with state-of-the-art models. This acceleration significantly reduces the computational requirements for training, measured in FLOPs, enabling more efficient resource use and enhancing performance on smaller datasets. Additionally, we develop a variance matching regularization technique to correct sample variance discrepancies which can occur in latent diffusion models, enhancing image contrast and vibrancy, and further accelerating convergence.

3735Direct Imitation Learning: RLHF Secretly Performs Imitation Learning

[openreview] [pdf]

Abstract This work studies the alignment of large language models with preference data. We address this problem from a novel imitation learning (IL) perspective. We establish a close connection between alignment and imitation learning, which shows that existing alignment objectives implicitly align model and preference data distributions. Built upon this connection, we develop a principled method DIL to directly optimize the imitation learning objective. DIL derives a surrogate objective for imitation learning with direct density ratio estimates, allowing effective use of preference data. DIL eliminates the need for complex adversarial training required by current IL methods, and optimizes the IL objective through simple density ratio estimation losses, achieving lightweight and efficient fine-tuning for large language models. This paper provides a unified imitation learning perspective on alignment, encompassing existing algorithms as special cases while naturally introducing new variants. Bridging IL and RLHF, DIL opens up new opportunities to improve alignment by leveraging tools from imitation learning. Extensive experiments demonstrate that DIL consistently and significantly outperforms off-the-shelf methods on various challenging benchmarks, including Open LLM Leadboard and AlpacaEval 2.0. Code for DIL is available athttps://github.com/Code-DIL/DIL.

3736Features are fate: a theory of transfer learning in high-dimensional regression

[openreview] [pdf]

Abstract With the emergence of large-scale pre-trained neural networks, methods to adapt such “foundation” models to data-limited downstream tasks have become a necessity. Fine-tuning, preference optimization, and transfer learning have all been successfully employed for these purposes when the target task closely resembles the source task, but a precise theoretical understanding of ``task similarity’’ is still lacking. While conventional wisdom suggests that simple measures of similarity between source and target distributions, such as ϕ\phi-divergences or integral probability metrics, can directly predict the success of transfer, we prove the surprising fact that, in general, this is not the case. We adopt, instead, a \emph{feature-centric} viewpoint on transfer learning and establish a number of theoretical results that demonstrate that when the target task is well represented by the feature space of the pre-trained model, transfer learning outperforms training from scratch. We study deep linear networks as a minimal model of transfer learning in which we can analytically characterize the transferability phase diagram as a function of the target dataset size and the feature space overlap. For this model, we establish rigorously that when the feature space overlap between the source and target tasks is sufficiently strong, both linear transfer and fine-tuning improve performance, especially in the low data limit. These results build on an emerging understanding of feature learning dynamics in deep linear networks, and we demonstrate numerically that the rigorous results we derive for the linear case also apply to nonlinear networks.

3737Convergence Towards Stable Intrinsic Self-correction of Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) are able to improve their responses when instructed to do so, a capability known as self-correction. When instructions provide only the task’s goal without specific details about potential issues in the response, LLMs must rely on their internal knowledge to improve response quality, a process referred to as intrinsic self-correction. The empirical success of intrinsic self-correction is evident in various applications, but how and why it is effective remains unknown. In this paper, we unveil that intrinsic self-correction can be progressively improved, allowing it to approach a converged state. Our findings are verified in: (1) the scenario of multi-round question answering, by comprehensively demonstrating that intrinsic self-correction can progressively introduce performance gains through iterative interactions, ultimately converging to stable performance; and (2) the context of intrinsic self-correction for enhanced morality, in which we provide empirical evidence that iteratively applying instructions reduces model uncertainty towards convergence, which then leads to convergence of both the calibration error and self-correction performance, ultimately resulting in a stable state of intrinsic self-correction. Furthermore, we introduce a mathematical formulation and a simulation task indicating that the latent concepts activated by self-correction instructions drive the reduction of model uncertainty. Based on our experimental results and analysis of the convergence of intrinsic self-correction, we reveal its underlying mechanism: consistent injected instructions reduce model uncertainty which yields converged, improved performance.

3738Injecting Universal Jailbreak Backdoors into LLMs in Minutes

[openreview] [pdf]

Abstract Jailbreak backdoor attacks on LLMs have garnered attention for their effectiveness and stealth. However, existing methods rely on the crafting of poisoned datasets and the time-consuming process of fine-tuning. In this work, we propose JailbreakEdit, a novel jailbreak backdoor injection method that exploits model editing techniques to inject a universal jailbreak backdoor into safety-aligned LLMs with minimal interventionin minutes. JailbreakEdit integrates a multi-node target estimation to estimate the jailbreak space, thus creating shortcuts from the backdoor to this estimated jailbreak space that induce jailbreak actions. Our attack effectively shifts the models’ attention by attaching strong semantics to the backdoor, enabling it to bypass internal safety mechanisms. Experimental results show that JailbreakEdit achieves a high jailbreak success rate on jailbreak prompts while preserving generation quality, and safe performance on normal queries. Our findings underscore the effectiveness, stealthiness, and explainability of JailbreakEdit, emphasizing the need for more advanced defense mechanisms in LLMs.

3739Compositional Scene Modeling with An Object-Centric Diffusion Transformer

[openreview] [pdf]

Abstract Early object-centric learning methods adopt simple pixel mixture decoders to reconstruct images, which struggle with complex synthetic and real-world datasets. Recent object-centric learning methods focus on decoding object representations with complex decoders, such as autoregressive Transformers or diffusion models, to solve this problem. However, these methods feed all object representations together into the decoder to directly reconstruct the latent representation of the entire scene. Contrary to human intuition, this approach ultimately leads to weak interpretability. This paper combines the recent powerful diffusion model and composition module to propose a novel object-centric learning method called Compositional Scene Modeling with an Object-centric Diffusion Transformer (CODiT). By adopting a proposed compositional denoising decoder that can generate the mask of single objects and construct images compositionally, CODiT has stronger interpretability while still retaining the ability to handle complex scenes. We also illustrate the Classifier-Free Guidance explanation of CODiT. Experiments show how compositional structure helps control the generation process, allowing the model to generate images via single object representations and edit objects. In addition, we present CODiT performs strongly in various tasks including segmentation and reconstruction on both complex synthetic datasets and real-world datasets compared with similar methods.

3740LiFT: Learning to Fine-Tune via Bayesian Parameter Efficient Meta Fine-Tuning

[openreview] [pdf]

Abstract We tackle the problem of parameter-efficient fine-tuning (PEFT) of a pre-trained large deep model on many different but related tasks. Instead of the simple but strong baseline strategy of task-wise independent fine-tuning, we aim to meta-learn the core shared information that can be used for unseen test tasks to improve the prediction performance further. That is, we propose a method for {\em learning-to-fine-tune} (LiFT). LiFT introduces a novel hierarchical Bayesian model that can be superior to both existing general meta learning algorithms like MAML and recent LoRA zoo mixing approaches such as LoRA-Retriever and model-based clustering. In our Bayesian model, the parameters of the task-specific LoRA modules are regarded as random variables where these task-wise LoRA modules are governed/regularized by higher-level latent random variables, which represents the prior of the LoRA modules that capture the shared information across all training tasks. To make the posterior inference feasible, we propose a novel SGLD-Gibbs sampling algorithm that is computationally efficient. To represent the posterior samples from the SGLD-Gibbs, we propose an online EM algorithm that maintains a Gaussian mixture representation for the posterior in an online manner in the course of iterative posterior sampling. We demonstrate the effectiveness of LiFT on NLP and vision multi-task meta learning benchmarks.

3741UMVMap: Improving Vectorized Map Construction via Multi-vehicle Perspectives

[openreview] [pdf]

Abstract Prevalent vectorized map construction pipelines predominantly follow an end-to-end DETR-based paradigm. While these methods have achieved significant advancements, they are limited by their reliance on data from a single ego vehicle, which restricts their effectiveness and can lead to perceptual uncertainty in handling complex environmental scenarios. To address this limitation, we introduce a novel framework: Uncertainty-aware Multi-Vehicle Vectorized Map Construction (UMVMap). This framework effectively mitigates uncertainties by leveraging relevant non-ego information. UMVMap comprises two essential components: the Uncertainty-aware Multi-Vehicle Vectorized Map Construction Network (UMVMap-Net), which optimally integrates data from multiple vehicles, and the Uncertainty-aware Non-ego Vehicle Selection (UNVS) strategy, which identifies and incorporates the most informative non-ego data to minimize uncertainty. Comprehensive evaluations on the nuScenes dataset demonstrate that UMVMap significantly outperforms the single-vehicle MapTRv2 baseline by a margin of 9.1% and 9.9% respectively on the full and partial validation sets, with each of its components proving to be both effective and robust.

3742EVOSCHEMA: TOWARDS TEXT-TO-SQL ROBUSTNESS AGAINST SCHEMA EVOLUTION

[openreview] [pdf]

Abstract Neural text-to-SQL models, which translate natural language questions (NLQs) into SQL queries given a database schema, have achieved remarkable performance. However, database schemas frequently evolve to meet new requirements. Such schema evolution often leads to performance degradation for models trained on static schemas. Existing work either mainly focuses on simply paraphrasing some syntactic or semantic mappings among NLQ, DB and SQL or lacks a comprehensive and controllable way to investigate the model robustness issue under the schema evolution. In this work, we approach this crucial problem by introducing a novel framework, EvoSchema, to systematically simulate diverse schema changes that occur in real-world scenarios. EvoSchema builds on our newly defined schema evolution taxonomy, which encompasses a comprehensive set of eight perturbation types, covering both column-level and table-level modifications. We utilize this framework to build an evaluation benchmark to assess the models’ robustness against different schema evolution types. Meanwhile, we propose a new training paradigm, which augments existing training data with diverse schema designs and forces the model to distinguish the schema difference for the same questions to avoid learning spurious patterns. Our experiments demonstrate that the existing models are more easily affected by table-level perturbations than column-level perturbations. In addition, the models trained under our paradigm exhibit significantly improved robustness, achieving up to 33 points improvement on the evaluation benchmark compared to models trained on unperturbed data. This work represents a significant step towards building more resilient text-to-SQL systems capable of handling the dynamic nature of database schemas.

3743Dynamic Contrastive Skill Learning with State-Transition Based Skill Clustering and Dynamic Length Adjustment

[openreview] [pdf]

Abstract Reinforcement learning (RL) has made significant progress in various domains, but scaling it to long-horizon tasks with complex decision-making remains challenging. Skill learning attempts to address this by abstracting actions into higher-level behaviors. However, current approaches often fail to recognize semantically similar behaviors as the same skill and use fixed skill lengths, limiting flexibility and generalization. To address this, we propose Dynamic Contrastive Skill Learning (DCSL), a novel framework that redefines skill representation and learning. DCSL introduces three key ideas: state-transition based skill definition, skill similarity function learning, and dynamic skill length adjustment. By focusing on state transitions and leveraging contrastive learning, DCSL effectively captures the semantic context of behaviors and adapts skill lengths to match the appropriate temporal extent of behaviors. Our approach enables more flexible and adaptive skill extraction, particularly in complex or noisy datasets, and demonstrates competitive performance compared to existing methods in task completion and efficiency.

3744Enhancing Virtual Try-On with Synthetic Pairs and Error-Aware Noise Scheduling

[openreview] [pdf]

Abstract Given an isolated garment image in a canonical product view and a separate image of a person, the virtual try-on task aims to generate a new image of the person wearing the target garment. Prior virtual try-on works face two major challenges in achieving this goal: a) the paired (human, garment) training data has limited availability; b) generating textures on the human that perfectly match that of the prompted garment is difficult, often resulting in distorted text and faded textures. Our work addresses these issues through a dual approach. First, we introduce a garment extraction model that generates (human, synthetic garment) pairs from a single image of a clothed individual. The synthetic pairs can then be used to augment the training of virtual try-on. Second, we propose an Error-Aware Refinement-based Schr"odinger Bridge (EARSB) that surgically targets localized generation errors for correcting the output of a virtual try-on model. To identify likely errors, we propose a weakly-supervised error classifier that localizes regions for refinement, subsequently augmenting the Schr"odinger Bridge’s noise schedule with its confidence heatmap. Experiments on VITON-HD and DressCode-Upper demonstrate that our synthetic data augmentation enhances the performance of prior work, while EARSB improves the overall image quality. In user studies, our model is preferred by the users in an average of 59% of cases.

3745Exploring Prosocial Irrationality for LLM Agents: A Social Cognition View

[openreview] [pdf]

Abstract Large language models (LLMs) have been shown to face hallucination issues due to the data they trained on often containing human bias; whether this is reflected in the decision-making process of LLM agents remains under-explored. As LLM Agents are increasingly employed in intricate social environments, a pressing and natural question emerges: Can we utilize LLM Agents’ systematic hallucinations to mirror human cognitive biases, thus exhibiting irrational social intelligence? In this paper, we probe the irrational behavior among contemporary LLM agents by melding practical social science experiments with theoretical insights. Specifically, we propose CogMir, an open-ended Multi-LLM Agents framework that utilizes hallucination properties to assess and enhance LLM Agents’ social intelligence through cognitive biases. Experimental results on CogMir subsets show that LLM Agents and humans exhibit high consistency in irrational and prosocial decision-making under uncertain conditions, underscoring the prosociality of LLM Agents as social entities and highlighting the significance of hallucination properties. Additionally, CogMir framework demonstrates its potential as a valuable platform for encouraging more research into the social intelligence of LLM Agents.

3746The Contraction Property of Pooling Layer

[openreview] [pdf]

Abstract Although the theory of deep neural networks has been studied for years, the mechanism of pooling layers is still elusive. In this paper, we report the angle contraction behavior of pooling strategies (the average pooling and max pooling) at initialization. Compared to the relu-activated fully connected layer or convolutional layer, the pooling layer stands as the main source of contraction of the angle between hidden features. Moreover, we show that the cosine similarity between average pooling features in convolutional neural network is more data-dependent than fully connected network, while the max pooling is not sensitive to the data distribution in both architectures. Our results may complement the understanding of the representation learning.

3747WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average

[openreview] [pdf]

Abstract The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at an increased cost at inference. Weight averaging methods aim to balance the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.

3748How language models extrapolate outside the training data: A Case study in Textualized Gridworld

[openreview] [pdf]

Abstract Language models’ ability to extrapolate learned behaviors to novel, more complex environments beyond their training scope is highly unknown. This study introduces a path planning task in a textualized Gridworld to probe language models’ extrapolation capabilities. We show that conventional approaches, including next-token prediction and Chain of Thought (CoT) fine-tuning, fail to generalize in larger, unseen environments. Inspired by human cognition and dual-process theory, we propose language models should construct cognitive maps before interaction. Our research demonstrates that autoregressive generation of cognitive maps and planning sequences enhances planning capabilities in extrapolated environments. Unlike CoT, we find that cognitive maps cannot be obtained through simple prompting, necessitating additional training schemes for integration. Our findings in Gridworld offer insights into training language models with improved reasoning and adaptability, potentially advancing more human-like cognition and opening avenues for enhancing model generalization across diverse, complex tasks.

3749Dice-GAN: Generative Adversarial Network with Diversity Injection and Consistency Enhancement

[openreview] [pdf]

Abstract In the field of natural language description tasks, one challenge for text-to-image modeling is to generate images that are both of high quality and diversity and maintain a high degree of semantic consistency with the textual description. Although significant progress has been made in existing research, there is still potential for improving image quality and diversity. In this study, we propose an efficient attention-based text-to-image synthesis model based on generative adversarial networks named Dice-GAN. To enhance the diversity of image generation, we design a diversity injection module, which injects noise multiple times during image generation and incorporates a self-attention mechanism to assist the generator in maintaining global structural consistency while enhancing the diversity of images. To improve the semantic consistency, we designed a consistency enhancement module, which enhances the semantic consistency of image generation by combining word vectors and a hybrid attention mechanism to achieve dynamic weight adjustment for different image regions. We conducted experiments on two widely accepted benchmark datasets, CUB and COCO. Dice-GAN demonstrated significant superiority in improving the fidelity and diversity of image generation compared to the existing approaches.

3750Learning to Plan Before Answering: Self-Teaching LLMs to Learn Abstract Plans for Problem Solving

[openreview] [pdf]

Abstract In the field of large language model (LLM) post-training, the effectiveness of utilizing synthetic data generated by the LLM itself has been well-presented. However, a key question remains unaddressed: what essential information should such self-generated data encapsulate? Existing approaches only produce step-by-step problem solutions, and fail to capture the abstract meta-knowledge necessary for generalization across similar problems. Drawing insights from cognitive science, where humans employ high-level abstraction to simplify complex problems before delving into specifics, we introduce a novel self-training algorithm: LEarning to Plan before Answering (LEPA). LEPA trains the LLM to formulate anticipatory plans, which serve as abstract meta-knowledge for problem-solving, before engaging with the intricacies of problems. This approach not only outlines the solution generation path but also shields the LLM from the distraction of irrelevant details. During data generation, LEPA first crafts an anticipatory plan based on the problem, and then generates a solution that aligns with both the plan and the problem. LEPA refines the plan through self-reflection, aiming to acquire plans that are instrumental in yielding correct solutions. During model optimization, the LLM is trained to predict both the refined plans and the corresponding solutions. By efficiently extracting and utilizing the anticipatory plans, LEPA demonstrates remarkable superiority over conventional algorithms on various challenging natural language reasoning benchmarks.

3751Memory-Driven Multimodal Chain of Thought for Embodied Long-Horizon Task Planning

[openreview] [pdf]

Abstract Existing methods excel in short-horizon tasks but struggle with complex, long-horizon planning in dynamic environments. To address these limitations, we propose the Memory-Driven Multimodal Chain of Thought (MCoT-Memory), a framework designed to enhance task planning through two key innovations: 1) Evolving Scene Graph-Driven Chain of Thought with CoT Memory Retrieval, which enables the agent to continuously update a scene graph with visual information captured along its trajectory, providing a structured and dynamic representation of the environment that informs real-time decision-making, and uniquely incorporates CoT memory retrieval to allow the agent to leverage past experiences in its reasoning process; 2) Stepwise Confidence-Driven Memory Retention, which employs an expert model to evaluate reasoning across multiple dimensions of accuracy, ensuring that only high-confidence experiences are retained in memory for future retrieval, thus enabling the agent to build on valuable insights and improve performance in long-horizon tasks. To advance long-horizon task planning, we present ExtendaBench, a comprehensive benchmark encompassing 1,198 tasks across two simulators, VirtualHome and Habitat 2.0. The tasks are categorized into ultra-short, short, median, and long tasks. Extensive experiments demonstrate that prior methods struggle with long-horizon tasks, while MCoT-Memory significantly improves performance, marking it as a promising approach for embodied task planning.

3752RL3: Boosting Meta Reinforcement Learning via RL inside RL2

[openreview] [pdf]

Abstract Meta reinforcement learning (meta-RL) methods such as \rlsquare have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, they show poor asymptotic performance and struggle with out-of-distribution tasks because they rely on sequence models, such as recurrent neural networks or transformers, to process experiences rather than summarize them using general-purpose RL components such as value functions. In contrast, traditional RL algorithms are data-inefficient as they do not use domain knowledge, but do converge to an optimal policy in the limit. We propose RL3^3, a principled hybrid approach that incorporates action-values, learned per task via traditional RL, in the inputs to meta-RL. We show that RL3^3 earns greater cumulative reward in the long term compared to RL2^2 while drastically reducing meta-training time and generalizes better to out-of-distribution tasks. Experiments are conducted on both custom and benchmark discrete domains from the meta-RL literature that exhibit a range of short-term, long-term, and complex dependencies.

3753Fantastic Experts and How to Find Them: A Multi-Dimensional Study for Experts-Level Sparsification in Mixture-of-Experts

[openreview] [pdf]

Abstract Sparsely activated Mixture-of-Experts (SMoE) has shown promise in scaling up the learning capacity of neural networks. However, vanilla SMoEs have issues such as expert redundancy and heavy memory requirements, making them inefficient and non-scalable, especially for resource-constrained scenarios. Expert-level sparsification of SMoEs involves pruning the least important experts to address these limitations. In this work, we aim to address three questions: (1) What is the best recipe across multiple plausible recipes to identify the least knowledgeable subset of experts that can be dropped to achieve a desired sparsity level? (2) How should we perform expert dropping (one-shot or iterative), and what correction measures can we undertake to minimize its drastic impact on SMoE subnetwork capabilities? (3) What capabilities of full-SMoEs are severely impacted by the removal of the least dominant experts, and how can we recover them? Firstly, we propose MoE Experts Compression Suite (MC-Suite), which is a collection of some previously explored and multiple novel recipes to provide a comprehensive benchmark for estimating expert importance from diverse perspectives, as well as unveil numerous valuable insights for SMoE experts. Secondly, unlike prior works with a one-shot expert pruning approach, we explore the benefits of iterative pruning with the re-estimation of the MC-Suite criterion. Moreover, we introduce the benefits of task-agnostic fine-tuning as a correction mechanism during progressive expert dropping, which we term MoE Lottery Subnetworks. Lastly, we present an experimentally validated conjecture that, during expert dropping, SMoEs’ instruction-following capabilities are predominantly hurt, which can be restored to a robust level subject to external augmentation of instruction-following capabilities using k-shot examples and supervised fine-tuning.

3754Preventing Unintended Memorization by Covering with Over-Memorization

[openreview] [pdf]

Abstract From the advances of deep learning, the privacy concerns of deep neural networks are in the limelight. A particular concern is privacy of the training data, which is often compromised by the model’s inherent memorization capabilities. Suppressing such memorization can enhance privacy but introduces two main challenges: 1) removing a memorized instance from the training dataset will result in the model to memorize another instance instead, and 2) the memorization is essential for improving the generalization error. To address these challenges, we propose an over-memorization method that involves training the model with both the standard training set and a set of redundant, non-sensitive instances. Our method leverages the model’s limited memorization capacity to focus on irrelevant data, thereby preventing it from memorizing the training data. Our empirical results demonstrate that this method not only enhances protection against membership inference attacks but also minimizes the loss of utility by effectively redirecting the model’s generalization efforts towards non-sensitive instances.

3755Emergence of meta-stable clustering in mean-field transformer models

[openreview] [pdf]

Abstract We model the evolution of tokens within a deep stack of Transformer layers as a continuous-time flow on the unit sphere, governed by a mean-field interacting particle system, building on the framework introduced in Geshkovski et al. (2023). Studying the corresponding mean-field Partial Differential Equation (PDE), which can be interpreted as a Wasserstein gradient flow, in this paper we provide a mathematical investigation of the long-term behavior of this system, with a particular focus on the emergence and persistence of meta-stable phases and clustering phenomena, key elements in applications like next-token prediction. More specifically, we perform a perturbative analysis of the mean-field PDE around the iid uniform initialization and prove that, in the limit of large number of tokens, the model remains close to a meta-stable manifold of solutions with a given structure (e.g., periodicity). Further, the structure characterizing the meta-stable manifold is explicitly identified, as a function of the inverse temperature parameter of the model, by the index maximizing a certain rescaling of Gegenbauer polynomials.

3756One-for-All Few-Shot Anomaly Detection via Instance-Induced Prompt Learning

[openreview] [pdf]

Abstract Anomaly detection methods under the ‘one-for-all’ paradigm aim to develop a unified model capable of detecting anomalies across multiple classes. However, these approaches typically require a large number of normal samples for model training, which may not always be feasible in practice. Few-shot anomaly detection methods can address scenarios with limited data but often require a tailored model for each class, struggling within the ‘one-for-one’ paradigm. In this paper, we first proposed the one-for-all few-shot anomaly detection method with the assistance of vision-language model. Different from previous CLIP-based methods learning fix prompts for each class, our method learn a class-shared prompt generator to adaptively generate suitable prompt for each instance. The prompt generator is trained by aligning the prompts with the visual space and utilizing guidance from general textual descriptions of normality and abnormality. Furthermore, we address the mismatch problem of the memory bank within one-for-all paradigm. Extensive experimental results on MVTec and VisA demonstrate the superiority of our method in few-shot anomaly detection task under the one-for-all paradigm.

3757RaSA: Rank-Sharing Low-Rank Adaptation

[openreview] [pdf]

Abstract Low-rank adaptation (LoRA) has been prominently employed for parameter-efficient fine-tuning of large language models (LLMs). However, the limited expressive capacity of LoRA, stemming from the low-rank constraint, has been recognized as a bottleneck, particularly in rigorous tasks like code generation and mathematical reasoning. To address this limitation, we introduce Rank-Sharing Low-Rank Adaptation (RaSA), an innovative extension that enhances the expressive capacity of LoRA by leveraging partial rank sharing across layers. By forming a shared rank pool and applying layer-specific weighting, RaSA effectively increases the number of ranks without augmenting parameter overhead. Our theoretically grounded and empirically validated approach demonstrates that RaSA not only maintains the core advantages of LoRA but also significantly boosts performance in challenging code and math tasks. Code, data and scripts are available at:https://anonymous.4open.science/r/RaSA-ICLR-0E25.

3758Mixture-of-Instructions: Aligning Large Language Models via Mixture Prompting

[openreview] [pdf]

Abstract With the proliferation of large language models (LLMs), the comprehensive alignment of such models across multiple tasks has emerged as a critical area of research. Existing alignment methodologies primarily address single task, such as multi-turn dialogue, coding, mathematical problem-solving, and tool usage. However, AI-driven products that leverage language models usually necessitate a fusion of these abilities to function effectively in real-world scenarios. Moreover, the considerable computational resources required for proper alignment of LLMs underscore the need for a more robust, efficient, and encompassing approach to multi-task alignment, ensuring improved generative performance. In response to these challenges, we introduce a novel technique termed Mixture-of-Instructions (MoI), which employs a strategy of instruction packing combined with diverse system prompts to boost the alignment efficiency of language models. We have also compiled a diverse set of seven benchmark datasets to rigorously evaluate the alignment efficacy of the MoI-enhanced language model. Our methodology was applied to the open-source Qwen-7B-chat model, culminating in the development of Qwen-SFT-MoI. This enhanced model demonstrates significant advancements in generative capabilities across coding, mathematics, and tool use tasks.

3759Robust LLM safeguarding via refusal feature adversarial training

[openreview] [pdf]

Abstract Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.

3760Are Probabilistic Robust Accuracy Bounded

[openreview] [pdf]

Abstract Adversarial samples pose a security threat to many critical systems built on neural networks. It has recently been proven that achieving deterministic robustness (i.e., complete elimination of adversarial samples) always comes at an unbearable cost to accuracy. As a result, probabilistic robustness (where the probability of retaining the same label within a vicinity is at least 1κ1 - \kappa) has been proposed as a promising compromise. However, existing training methods for probabilistic robustness still experience non-trivial accuracy loss. It remains an open question whether an upper limit on accuracy exists when optimizing for probabilistic robustness, and whether there is a specific relationship between κ\kappa and this potential bound. This work studies these problems from a Bayes error perspective. We find that while Bayes uncertainty does affect probabilistic robustness, its impact is smaller than that on deterministic robustness. This reduced Bayes uncertainty allows a higher upper bound on probabilistic robust accuracy than that on deterministic robust accuracy. Further, we show that voting within the vicinity always improves probabilistic robust accuracy and the upper bound of probabilistic robust accuracy monotonically increases as κ\kappa grows. Our empirical findings also align with our results. This study thus presents a theoretical argument supporting probabilistic robustness as the appropriate target for achieving neural network robustness.

3761Common Pitfalls of Margin-based Preference Optimization in Language Model Alignment

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has become the predominant approach for aligning language models (LMs) to be more helpful and less harmful. At its core, RLHF uses a margin-based loss for preference optimization, which specifies the ideal LM behavior only in terms of the difference between preferred and dispreferred responses. This under-specification of ideal behavior for each response individually leads to two unintended consequences as the margin increases: (1) The probability of dispreferred (e.g., unsafe) responses may increase, resulting in potential safety alignment failures. (2) When the probability of dispreferred responses is reduced, this often coincides with a decrease in the probability of preferred responses, even when these responses are ideal. In this paper, we identify the fundamental issue: margin-based preference optimization loss under-specifies ideal LM behaviors. We derive key conditions under which the probabilities of both preferred and dispreferred responses increase or decrease together. These conditions occur when the inner products between the gradients of the log-probabilities of preferred and dispreferred responses are large. We theoretically analyze when such inner products are large and empirically validate our findings. Our framework also reveals important differences in the training dynamics of various preference optimization algorithms and suggests new directions for developing better algorithms for language model alignment.

3762Neuralized Markov Random Field for Interaction-Aware Stochastic Human Trajectory Prediction

[openreview] [pdf]

Abstract Interactive human motions and the continuously changing nature of intentions pose significant challenges for human trajectory prediction. In this paper, we present a neuralized Markov random field (MRF)-based motion evolution method for probabilistic interaction-aware human trajectory prediction. We use MRF to model each agent’s motion and the resulting crowd interactions over time, hence is robust against noisy observations and enables group reasoning. We approximate the modeled distribution using two conditional variational autoencoders (CVAEs) for efficient learning and inference. Our proposed method achieves state-of-the-art performance on ADE/FDE metrics across two dataset categories: overhead datasets ETH/UCY, SDD, and NBA, and ego-centric JRDB. Furthermore, our approach allows for real-time stochastic inference in bustling environments, making it well-suited for a 30FPS video setting. We will open-source our codes upon paper acceptance.

3763A2Perf: Real-World Autonomous Agents Benchmark

[openreview] [pdf]

Abstract Autonomous agents and systems cover a number of application areas, from robotics and digital assistants to combinatorial optimization, all sharing common, unresolved research challenges. It is not sufficient for agents to merely solve a given task; they must generalize to out-of-distribution tasks, perform reliably, and use hardware resources efficiently during training and on-device deployment, among other requirements. Several major classes of methods, such as reinforcement learning and imitation learning, are commonly used to tackle these problems, each with different trade-offs. However, there is currently no benchmarking suite that defines the environments, datasets, and metrics which can be used to develop reference implementations and seed leaderboards with baselines, providing a meaningful way for the community to compare progress. We introduce A2Perf---a benchmarking suite including three environments that closely resemble real-world domains: computer chip floorplanning, web navigation, and quadruped locomotion. A2Perf provides metrics that track task performance, generalization, system resource efficiency, and reliability, which are all critical to real-world applications. In addition, we propose a data cost metric to account for the cost incurred acquiring offline data for imitation learning, reinforcement learning, and hybrid algorithms, which allows us to better compare these approaches. A2Perf also contains baseline implementations of standard algorithms, enabling apples-to-apples comparisons across methods and facilitating progress in real-world autonomy. As an open-source and extendable benchmark, A2Perf is designed to remain accessible, documented, up-to-date, and useful to the research community over the long term.

3764Actions Speak Louder Than States: Going Beyond Bayesian Inference in In-Context Reinforcement Learning

[openreview] [pdf]

Abstract In this paper, we investigate in-context learning (ICL) for reinforcement learning (RL), particularly extending beyond Bayesian inference to more advanced and richer learning paradigms in transformers. Transformers have shown promise for few-shot and zero-shot learning, but their capabilities for ICL in RL environments are not well explored. Our work studies the role of task diversity in RL environments on the downstream ICL capabilities of transformers. To do so, we introduce a novel RL benchmark, developed to provide a rich variety of tasks, essential for this exploration. Through this environment, we not only demonstrate the critical role of task diversity in facilitating advanced learning algorithms like transformers but also investigate the effects of model architecture, regularization, and other factors on the learning process. This study marks a pivotal advance in understanding the dynamics of ICL in RL, showcasing how diverse tasks can drive transformer models to surpass traditional learning methods.

3765Large Language Models Engineer Too Many Simple Features for Tabular Data

[openreview] [pdf]

Abstract Tabular machine learning problems often require time-consuming and labor-intensive feature engineering. Recent efforts have focused on using large language models (LLMs) to capitalize on their potential domain knowledge. At the same time, researchers have observed ethically concerning negative biases in other LLM-related use cases, such as text generation. These developments motivated us to investigate whether LLMs exhibit a bias that negatively impacts the performance of feature engineering. While not ethically concerning, such a bias could hinder practitioners from fully utilizing LLMs for automated data science. Therefore, we propose a method to detect potential biases by detecting anomalies in the frequency of operators (e.g., adding two features) suggested by LLMs when engineering new features. Our experiments evaluate the bias of four LLMs, two big frontier and two small open-source models, across 27 tabular datasets. Our results indicate that LLMs are biased toward simple operators, such as addition, and can fail to utilize more complex operators, such as grouping followed by aggregations. Furthermore, the bias can negatively impact the predictive performance when using LLM-generated features. Our results call for mitigating bias when using LLMs for feature engineering.

3766PTAD: Prototype-Oriented Tabular Anomaly Detection via Mask Modeling

[openreview] [pdf]

Abstract Tabular anomaly detection, which aims at identifying deviant samples, has been crucial in a variety of real-world applications, such as medical disease identification, financial fraud detection, intrusion monitoring, etc. Although recent deep learning-based methods have achieved competitive performances, these methods suffer from representation entanglement and the lack of global correlation modeling, which leads to the ‘abnormal leakage’ issue and hinders anomaly detection performance. To tackle the problem, we incorporate mask modeling and prototype learning into tabular anomaly detection. The core idea is to design learnable masks by disentangled representation learning within a projection space and extracting nominal dependencies as explicit global prototypes. Specifically, the overall model involves two parts: (i) During encoding, we perform mask modeling in both the data space and projection space with orthogonal basis vectors for masking out the suspicious abnormal locations; (ii) During decoding, we decode multiple masked representations in parallel for reconstruction and learn association prototypes to extract nominal characteristic correlations. Our proposal derives from a distribution-matching perspective, where both projection space learning and association prototype learning are formulated as optimal transport problems, and the calibration distances are utilized to refine the anomaly scores. By conducting both quantitative and qualitative experiments on 20 tabular benchmarks, our model surpasses other competitors and possesses good interpretability.

3767Towards Better Understanding of In-Context Learning Ability from In-Context Uncertainty Quantification

[openreview] [pdf]

Abstract Predicting simple function classes has been widely used as a testbed for developing theory and understanding of the trained Transformer’s in-context learning (ICL) ability. In this paper, we revisit the training of Transformers on linear regression tasks, and different from the existing literature, we consider a bi-objective prediction task of predicting both the conditional expectation E[YX]\mathbb{E}[Y|X] and the conditional variance Var(YX)(Y|X). This additional uncertainty quantification objective provides a handle to (i) better design out-of-distribution experiments to distinguish ICL from in-weight learning (IWL) and (ii) make a better separation between the algorithms with and without using the prior information of the training distribution. Theoretically, we show that the trained Transformer reaches near Bayes optimum, suggesting the usage of the information of the training distribution. Our method can be extended to other cases. Specifically, with the Transformer’s context window SS, we prove a new generalization bound of O~(minS,T/(nT))\tilde{\mathcal{O}}(\sqrt{\min{S, T}/(n T)}) on nn tasks with sequences of length TT, providing sharper analysis compared to previous results of O~(1/n)\tilde{\mathcal{O}}(\sqrt{1/n}). Empirically, we illustrate that while the trained Transformer behaves as the Bayes-optimal solution as a natural consequence of supervised training in distribution, it does not necessarily perform a Bayesian inference when facing task shifts, in contrast to the \textit{equivalence} between these two proposed in many existing literature. We also demonstrate the trained Transformer’s ICL ability over covariates shift and prompt-length shift and interpret them as a generalization over a meta distribution.

3768LLMs Boost the Performance of Decision Trees on Tabular Data across Sample Sizes

[openreview] [pdf]

Abstract Large language models (LLMs) perform remarkably well on tabular datasets in zero- and few-shot settings, since they can extract meaning from natural language column headers that describe features and labels. In contrast to LLMs, gradientboosted decision trees (GBDTs) must learn the relationships among columns from scratch, increasing their data requirements. Meanwhile, LLMs are not competitive with GBDTs on medium or large datasets, and their scalability is capped by their limited context lengths. In this paper, we propose LLM-Boost, a simple and lightweight approach for fusing large language models with gradientboosted decision trees, which enables larger datasets to benefit from the natural language capabilities of LLMs than was previously shown. LLM-Boost outperforms both LLMs and GBDTs on a wide range of dataset sizes. We demonstrate state-of-the-art performance against numerous baselines and ensembling approaches, and we also show how to fuse GBDTs with TabPFN, a recent nonLLM model for in-context learning on tabular data. We find that this combination achieves the best performance on large datasets.

3769Semantic-aligned Query Synthesis for Active Learning

[openreview] [pdf]

Abstract Active learning (AL) reduces data annotation costs by querying labels from human annotators for the most informative unlabeled data points during model training. Existing AL methods generally assume the availability of a large amount of unlabeled samples for query selection. However, collecting raw data in practice can be expensive, even without considering the cost of labeling. Membership query synthesis circumvents the need for an unlabeled data pool by directly generating informative queries from the input space. Nevertheless, existing approaches often generate instances lacking semantic meaning, thereby increasing the difficulty of labeling. In this paper, we propose the Generative Membership Query Descriptor (GenMQD) method for AL to mitigate the risk of generating unrecognizable instances. The key idea is to generate textual descriptions of the desired data, instead of the data samples themselves. Then a pre-trained multi-modal alignment model (e.g., CLIP) can be leveraged to transform these features into natural language texts for data gathering purposes. Extensive experiments on image classification benchmark datasets against query synthesis state-of-the-art methods demonstrate that, on average, GenMQD can improve model accuracy by 2.43% when gathering and labeling 500 examples. A large-scale user study verifies that human oracles prefer GenMQD generated queries over generated image-based queries.

3770Large Language Models Can Be More Robust Multiple Choice Selectors Through Attention Intervention

[openreview] [pdf]

Abstract Multiple-choice question (MCQ) is a common task for evaluating large language models (LLMs). LLMs’ performance on MCQ is often affected by various biases. Previous research has extensively examined the impact of inherent option bias on MCQ predictions, where this bias refers to a preference for a specific option ID token introduced during the model’s training. However, in an in-context learning scenario, few-shot prompting can also introduce a form of bias, known as context option bias. This occurs, for instance, in extreme cases where all demonstration answers are consistently option A, in which case LLMs may predict A for the given question whatever the question is. Context option bias can significantly degrade LLMs’ performance. To better observe the LLMs’ behavior when affected by the context option bias, we deliberately use demonstrations with obvious context option bias for MCQ to amplify the effect. The results indicate that certain attention heads in LLMs are particularly sensitive to context option bias. Motivated by this observation, we propose our approach, CoLo, to address this issue. First, using samples with ordinary and biased demonstrations as input, CoLo compares the outputs of two types of inputs and localizes attention heads sensitive to context option bias through sequential interventions. Then, we propose an attention scaling-based method to intervene in the identified attention heads during the inference stage, thereby mitigating the impact of context option bias on the LLMs’ predictions. Experimental results demonstrate that CoLo effectively alleviates the impact of context option bias and improves the LLM’s robustness on MCQ tasks.

3771Common Causes for Sudden Shifts: Linking Phase Transitions in Sinusoidal Networks

[openreview] [pdf]

Abstract Different phases of learning dynamics exist when training deep neural networks. These can be characterised by statistics called order parameters. In this work we identify a shared, underlying mechanism connecting three seemingly distinct phase transitions in the training of a class of deep regression models, specificially Implicit Neural Representations (INRs) of image data. These transitions include: the emergence of wave patterns in residuals (a novel observation), the transition from fast to slow learning, and Neural Tangent Kernel (NTK) alignment. We relate the order parameters for each phenomenon to a common set of variables derived from a local approximation of the structure of the NTK. Furthermore, we present experimental evidence demonstrating these transitions coincide. Our results enable new insights on the inductive biases of sinusoidal INRs.

3772Scaling Laws for Predicting Downstream Performance in LLMs

[openreview] [pdf]

Abstract Precise estimation of downstream performance in large language models (LLMs) prior to training is essential for guiding their development process. Scaling laws analysis utilizes the statistics of a series of significantly smaller sampling language models (LMs) to predict the performance of the target LLM. For downstream performance prediction, the critical challenge lies in the emergent abilities in LLMs that occur beyond task-specific computational thresholds. In this work, we focus on the pre-training loss as a more computation-efficient metric for performance estimation. Our two-stage approach consists of first estimating a function that maps computational resources (e.g.,FLOPs) to the pre-trainingLoss using a series of sampling models, followed by mapping the pre-training loss to downstream taskPerformance after the critical “emergent phase”. In preliminary experiments, thisFLPsolution accurately predicts the performance of LLMs with 7B and 13B parameters using a series of sampling LMs up to 3B, achieving error margins of 5% and 10%, respectively, and significantly outperforming the FLOPs-to-Performance approach. This motivatesFLP-M, a fundamental approach for performance prediction that addresses the practical need to integrate datasets from multiple sources during pre-training, specifically blending general corpora with code data to accurately represent the common necessity. FLP-M extends the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources, and employs a two-layer neural network to model the non-linear relationship between multiple domain-specific loss and downstream performance. By utilizing a 3B LLM trained on a specific ratio and a series of smaller sampling LMs, FLP-M can effectively forecast the performance of 3B and 7B LLMs across various data mixtures for most benchmarks within 10% error margins.

3773Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations

[openreview] [pdf]

Abstract Recent advancements in Large Language Models (LLMs) have transformed motion generation models in embodied applications such as autonomous driving and robotic manipulation. While LLM-type motion models benefit from scalability and efficient formulation, there remains a discrepancy between their token-prediction imitation objectives and human preferences. This often results in behaviors that deviate from human-preferred demonstrations, making post-training behavior alignment crucial for generating human-preferred motions. Post-training alignment requires a large number of preference rankings over model generations, which are costly and time-consuming to annotate in multi-agent motion generation settings. Recently, there has been growing interest in using expert demonstrations to scalably build preference data for alignment. However, these methods often adopt a worst-case scenario assumption, treating all generated samples from the reference model as unpreferred and relying on expert demonstrations to directly or indirectly construct preferred generations. This approach overlooks the rich signal provided by preference rankings among the model’s own generations. In this work, instead of treating all generated samples as equally unpreferred, we propose a principled approach leveraging the implicit preferences encoded in expert demonstrations to construct preference rankings among the generations produced by the reference model, offering more nuanced guidance at low-cost. We present the first investigation of direct preference alignment for multi-agent motion token-prediction models using implicit preference feedback from demonstrations. We apply our approach to large-scale traffic simulation and demonstrate its effectiveness in improving the realism of generated behaviors involving up to 128 agents, making a 1M token-prediction model comparable to state-of-the-art large models by relying solely on implicit feedback from demonstrations, without requiring additional human annotations or high computational costs. Furthermore, we provide an in-depth analysis of preference data scaling laws and their effects on over-optimization, offering valuable insights for future investigations.

3774Data Center Cooling System Optimization Using Offline Reinforcement Learning

[openreview] [pdf]

Abstract The recent advances in information technology and artificial intelligence have fueled a rapid expansion of the data center (DC) industry worldwide, accompanied by an immense appetite for electricity to power the DCs. In a typical DC, around 30-40% of the energy is spent on the cooling system rather than on computer servers, posing a pressing need for developing new energy-saving optimization technologies for DC cooling systems. However, optimizing such real-world industrial systems faces numerous challenges, including but not limited to a lack of reliable simulation environments, limited historical data, and stringent safety and control robustness requirements. In this work, we present a novel physics-informed offline reinforcement learning (RL) framework for energy efficiency optimization of DC cooling systems. The proposed framework models the complex dynamical patterns and physical dependencies inside a server room using a purposely designed graph neural network architecture that is compliant with the fundamental time-reversal symmetry. Because of its well-behaved and generalizable state-action representations, the model enables sample-efficient and robust latent space offline policy learning using limited real-world operational data. Our framework has been successfully deployed and verified in a large-scale production DC for closed-loop control of its air-cooling units (ACUs). We conducted a total of 1300 hours of short and long-term experiments in the production DC environment. The results show that our method achieves 14-18% energy savings in the DC cooling system, without any violation of the safety or operational constraints. We have also conducted a comprehensive evaluation of our approach in a real-world DC testbed environment. Our results have demonstrated the significant potential of offline RL in solving a broad range of data-limited, safety-critical real-world industrial control problems.

3775Transformers are Efficient Compilers, Provably

[openreview] [pdf]

Abstract Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks, including programming language understanding and generation. In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective. To this end, we introduce a representative programming language,mini-husky, which encapsulates key features of modern C-like languages. We show that if the input code sequence has a bounded depth in both the Abstract Syntax Tree (AST) and type inference (reasonable assumptions based on the clean code principle), then the number of parameters required by transformers depends only on the logarithm of the input sequence length to handle compilation tasks, such as AST construction, symbol resolution, and type analysis. A significant technical challenge stems from the fact that transformers operate at a low level, where each layer processes the input sequence as raw vectors without explicitly associating them with predefined structure or meaning. In contrast, high-level compiler tasks necessitate managing intricate relationships and structured program information. Our primary technical contribution is the development of a domain-specific language,Cybertron, which generates formal proofs of the transformer’s expressive power, scaling to address compiler tasks. We further establish that recurrent neural networks (RNNs) require at least a linear number of parameters relative to the input sequence, leading to an exponential separation between transformers and RNNs. Finally, we empirically validate our theoretical results by comparing transformers and RNNs on compiler tasks withinmini-husky.

3776Closed-Form Merging of Parameter-Efficient Modules for Federated Continual Learning

[openreview] [pdf]

Abstract Model merging has emerged as a crucial technique in Deep Learning, enabling the integration of multiple models into a unified system while preserving performance and scalability. In this respect, the compositional properties of low-rank adaptation techniques (e.g., LoRA) have proven beneficial, as simple averaging LoRA modules yields a single model that mostly integrates the capabilities of all individual modules. Building on LoRA, we take a step further by imposing that the merged model matches the responses of all learned modules. Solving this ob- jective in closed form yields an indeterminate system with A and B as unknown variables, indicating the existence of infinitely many closed-form solutions. To address this challenge, we introduce LoRM, an alternating optimization strategy that trains one LoRA matrix at a time. This allows solving for each unknown variable individually, thus finding a unique solution. We apply our proposed methodology to Federated Class-Incremental Learning (FCIL), ensuring alignment of model responses both between clients and across tasks. Our method demonstrates state-of-the-art performance across a range of FCIL scenarios.

3777Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models

[openreview] [pdf]

Abstract Vision-language alignment in Large Vision-Language Models (LVLMs) successfully enables LLMs to understand visual input. However, we find that existing vision-language alignment methods fail to transfer the existing safety mechanism for text in LLMs to vision, which leads to vulnerabilities in toxic image. To explore the cause of this problem, we give the insightful explanation of where and how the safety mechanism of LVLMs operates and conduct comparative analysis between text and vision. We find that the hidden states at the specific transformer layers play a crucial role in the successful activation of safety mechanism, while the vision-language alignment at hidden states level in current methods is insufficient. This results in a semantic shift for input images compared to text in hidden states, therefore misleads the safety mechanism. To address this, we propose a novel Text-Guided vision-language Alignment method (TGA) for LVLMs. TGA retrieves the texts related to input vision and uses them to guide the projection of vision into the hidden states space in LLMs. Experiments show that \textbf{TGA} not only successfully transfers the safety mechanism for text in basic LLMs to vision in vision-language alignment for LVLMs without any safety fine-tuning on the visual modality but also maintains the general performance on various vision tasks (Safe and Good). Code is in supplemental material and will be released on GitHub after acceptance.

3778Model-based Offline Reinforcement Learning with Lower Expectile Q-Learning

[openreview] [pdf]

Abstract Model-based offline reinforcement learning (RL) is a compelling approach that addresses the challenge of learning from limited, static data by generating imaginary trajectories using learned models. However, these approaches often struggle with inaccurate value estimation from model rollouts. In this paper, we introduce a novel model-based offline RL method, Lower Expectile Q-learning (LEQ), which provides a low-bias model-based value estimation via lower expectile regression of λ\lambda-returns. Our empirical results show that LEQ significantly outperforms previous model-based offline RL methods on long-horizon tasks, such as the D4RL AntMaze tasks, matching or surpassing the performance of model-free approaches and sequence modeling approaches. Furthermore, LEQ matches the performance of state-of-the-art model-based and model-free methods in dense-reward environments across both state-based tasks (NeoRL and D4RL) and pixel-based tasks (V-D4RL), showing that LEQ works robustly across diverse domains. Our ablation studies demonstrate that lower expectile regression, λ\lambda-returns, and critic training on offline data are all crucial for LEQ.

3779Toward Trustworthy: A Method for Detecting Fine-Tuning Origins in LLMs

[openreview] [pdf]

Abstract As large language models (LLMs) continue to advance, their deployment often involves fine-tuning to enhance performance on specific downstream tasks. However, this customization is sometimes accompanied by misleading claims about the origins, raising significant concerns about transparency and trust within the open-source community. Existing model verification techniques typically assess functional, representational, and weight similarities. However, these approaches often struggle against obfuscation techniques, such as permutations and scaling transformations, that obscure a model’s lineage. To address this limitation, we propose a novel detection method that rigorously determines whether a model has been fine-tuned from a specified base model. This method includes the ability to extract the LoRA rank utilized during the fine-tuning process, providing a more robust verification framework. This framework is the first to provide a formalized approach specifically aimed at pinpointing the sources of model fine-tuning. We empirically validated our method on twenty-nine diverse open-source models under conditions that simulate real-world obfuscation scenarios. We empirically analyze the effectiveness of our framework and finally, discuss its limitations. The results demonstrate the effectiveness of our approach and indicate its potential to establish new benchmarks for model verification.

[openreview] [pdf]

Abstract Dramatic increases in the capabilities of neural network models in recent years are driven by scaling model size, training data, and corresponding computational resources. To develop the exceedingly large networks required in modern applications, such as large language models (LLMs), model training is distributed across tens of thousands of hardware accelerators (e.g. GPUs), requiring orchestration of computation and communication across large computing clusters. In this work, we demonstrate that careful consideration of hardware configuration and parallelization strategy is critical for effective (i.e. compute- and cost-efficient) scaling of model size, training data, and total computation. We conduct an extensive empirical study of the performance of large-scale LLM training workloads across model size, hardware configurations, and distributed parallelization strategies. We demonstrate that: (1) beyond certain scales, overhead incurred from certain distributed communication strategies leads parallelization strategies previously thought to be sub-optimal in fact become preferable; and (2) scaling the total number of accelerators for large model training quickly yields diminishing returns even when hardware and parallelization strategies are properly optimized, implying poor marginal performance per additional unit of power or GPU-hour.

3781Unlocking Global Optimality in Bilevel Optimization: A Pilot Study

[openreview] [pdf]

Abstract Bilevel optimization has witnessed a resurgence of interest, driven by its critical role in advanced machine learning applications such as hyperparameter optimization, meta-learning, and reinforcement learning. Recent research has focused on proposing efficient methods with provable convergence guarantees. However, while many prior works have established convergence to stationary points or local minima, obtaining the global optimum of bilevel optimization remains an important yet open problem. Arguably, attaining the global optimum is indispensable for ensuring reliability, safety, and cost-effectiveness, particularly in high-stakes engineering applications that rely on bilevel optimization. In this paper, we first explore the challenges of establishing a global convergence theory for generic bilevel optimization, and present two sufficient conditions for global convergence, inspired by contemporary machine learning applications. We provide algorithm-specific proofs to rigorously substantiate these sufficient conditions along the optimization trajectory, focusing on two specific bilevel learning scenarios: representation learning and data hypercleaning (a.k.a. reweighting). Numerical results corroborate the theoretical findings, demonstrating convergence to global minimum in both cases.

3782Towards Mitigating Factual Hallucination in LLMs through Self-Alignment with Memory

[openreview] [pdf]

Abstract Despite the impressive performance of Large Language Models (LLMs) across numerous tasks and widespread application in real-world scenarios, LLMs still struggle to guarantee their responses to be accurate and aligned with objective facts. This leads to factual hallucination of LLMs, which can be difficult to detect and mislead users lacking relevant knowledge. Post-training techniques have been employed to mitigate this issue, yet they are usually followed by a trade-off between honesty and helpfulness, along with a lack of generalized improvements. In this paper, we propose to address it by augmenting LLM’s fundamental capacity of leveraging its internal memory, that is, the knowledge derived from pre-training data. We introduce FactualBench, a comprehensive and precise factual QA dataset consisting of nearly 200k Chinese generative QA data spanning 21 domains for both evaluation and training purposes. Furthermore, we propose self-alignment with memory, i.e., fine-tuning the model via preference learning on self-generated pairwise data from FactualBench. Extensive experiments show that our method significantly enhances LLM’s performance on FactualBench, with consistent improvements across various benchmarks concerning factuality, helpfulness and multiple skills. Additionally, different post-training techniques and tuning data sources are discussed to further understand their effectiveness.

3783Explicit-Constrained Single Agent for Enhanced Task-Solving in LLMs

[openreview] [pdf]

Abstract In this study, we introduce the Explicitly Constrained Agent (EC-Agent), a novel approach designed to enhance the task-solving capabilities of Large Language Models (LLMs). Unlike existing multi-agent systems that depend on agents evaluating tasks from different perspectives, EC-Agent explicitly imposes task-oriented constraints for LLMs. Our observations are two-fold: first, assigning agents to sub-tasks with defined responsibilities implicitly sets constraints; second, these multi-agent systems often struggle with accurately assigning agents to sub-tasks, leading to overlapping duties and potential misguidance. In contrast, our single-agent system, driven by explicit methods and constraints, provides LLMs with detailed prompts, resulting in more precise responses. EC-Agent consists of two stages: a Reasoning Stage and a Summary Stage. 1) In the Reasoning Stage, three modules are proposed: Explicit Method, Explicit Constraint, and Execution. Specifically, LLMs utilize the Explicit Method and Constraint modules to analyze the task type and specific rules, generating multiple suitable methods and constraints. Subsequently, the Execution module combines these methods and constraints to produce and output possible solutions. 2) In the Summary Stage, LLMs evaluate the multiple reasoning processes and results from the previous step. They rectify any inconsistencies, summarize the information, and output the final result. Experimental results demonstrate that EC-Agent outperforms previous methods across a variety of tasks.

3784Does RLHF Scale? Exploring the Effects of Data, Model, and Method

[openreview] [pdf]

Abstract This study explores the scaling properties of Reinforcement Learning from Human Feedback (RLHF) in Large Language Models (LLMs). Although RLHF is considered an important step in the post-training of LLMs, its scaling potential is still largely unknown. We systematically analyze key components in the RLHF framework—model size, data composition, and inference budget—and their impacts on performance. Our findings show that increasing data diversity and volume improves reward model performance, helping process-supervision models scale better. For policy training, more response samples per prompt boost performance initially but quickly plateau. And larger reward models offer modest gains in policy training. In addition, larger policy models benefit less from RLHF with a fixed reward model. Overall, RLHF scales less efficiently than pretraining, with diminishing returns from additional computational resources. Based on these observations, we propose strategies to optimize RLHF performance within computational limits.

3785Adjoint Matching: Fine-tuning Flow and Diffusion Generative Models with Memoryless Stochastic Optimal Control

[openreview] [pdf]

Abstract Dynamical generative models that produce samples through an iterative process, such as Flow Matching and denoising diffusion models, have seen widespread use, but there have not been many theoretically-sound methods for improving these models with reward fine-tuning. In this work, we cast reward fine-tuning as stochastic optimal control (SOC). Critically, we prove that a very specificmemorylessnoise schedule must be enforced during fine-tuning, in order to account for the dependency between the noise variable and the generated samples. We also propose a new algorithm namedAdjoint Matchingwhich outperforms existing SOC algorithms, by casting SOC problems as a regression problem. We find that our approach significantly improves over existing methods for reward fine-tuning, achieving better consistency, realism, and generalization to unseen human preference reward models, while retaining sample diversity.

3786Local vs distributed representations: What is the right basis for interpretability?

[openreview] [pdf]

Abstract Much of the research on the interpretability of deep neural networks has focused on studying the visual features that maximally activate individual neurons. However, recent work has cast doubts on the usefulness of such local representations for understanding the behavior of deep neural networks because individual neurons tend to respond to multiple unrelated visual patterns, a phenomenon referred to as “superposition”. A promising alternative to disentangle these complex patterns is learning sparsely distributed vector representations from entire network layers, as the resulting basis vectors seemingly encode single identifiable visual patterns consistently. Thus, one would expect the resulting code to align better with human-perceivable visual patterns, but supporting evidence remains, at best, anecdotal. To fill this gap, we conducted three large-scale psychophysics experiments collected from a pool of 560 participants. Our findings provide (i) strong evidence that features obtained from sparse distributed representations are easier to interpret by human observers and (ii) that this effect is more pronounced in the deepest layers of a neural network. Complementary analyses also reveal that (iii) features derived from sparse distributed representations contribute more to the model´s decision.Overall, our results highlight that distributed representations constitute a superior basis for interpretability, underscoring a need for the field to move beyond the interpretation of local neural codes in favor of sparsely distributed ones.

3787Federated Class-Incremental Learning: A Hybrid Approach Using Latent Exemplars and Data-Free Techniques to Address Local and Global Forgetting

[openreview] [pdf]

Abstract Federated Class-Incremental Learning (FCIL) refers to a scenario where a dynamically changing number of clients collaboratively learn an ever-increasing number of incoming tasks. FCIL is known to suffer from local forgetting due to class imbalance at each client and global forgetting due to class imbalance across clients. We develop a mathematical framework for FCIL that formulates local and global forgetting. Then, we propose an approach called Hybrid Rehearsal (HR), which utilizes latent exemplars and data-free techniques to address local and global forgetting, respectively. HR employs a customized autoencoder designed for both data classification and the generation of synthetic data. To determine the embeddings of new tasks for all clients in the latent space of the encoder, the server uses the Lennard-Jones Potential formulations. Meanwhile, at the clients, the decoder decodes the stored low-dimensional latent space exemplars back to the high-dimensional input space, used to address local forgetting. To overcome global forgetting, the decoder generates synthetic data. Furthermore, our mathematical framework proves that our proposed approach HR can, in principle, tackle the two local and global forgetting challenges. In practice, extensive experiments demonstrate that while preserving privacy, our proposed approach outperforms the state-of-the-art baselines on multiple FCIL benchmarks with low compute and memory footprints.

3788Discovering Factor Level Preferences to Improve Human-Model Alignment

[openreview] [pdf]

Abstract Despite advancements in Large Language Model (LLM) alignment, understanding the reasons behind LLM preferences remains crucial for bridging the gap between desired and actual behavior. LLMs often exhibit biases or tendencies that diverge from human preferences, such as favoring certain writing styles or producing overly verbose outputs. However, current methods for evaluating preference alignment often lack explainability, relying on coarse-grained comparisons. To address this, we introduce PROFILE (PRObing Factors of InfLuence for Explainability), a novel framework that uncovers and quantifies the influence of specific factors driving preferences. PROFILE’s factor level analysis explains the “why” behind human-model alignment and misalignment, offering insights into the direction of model improvement. We apply PROFILE to analyze human and LLM preferences across three tasks: summarization, helpful response generation, and document-based question-answering. Our factor level analysis reveals a substantial discrepancy between human and LLM preferences in generation tasks, whereas LLMs show strong alignment with human preferences in evaluation tasks. We demonstrate how leveraging factor level insights, including addressing misaligned factors or exploiting the generation-evaluation gap, can improve alignment with human preferences. This work underscores the importance of explainable preference analysis and highlights PROFILE’s potential to provide valuable training signals, driving further improvements in human-LLM alignment.

3789Position-Query-Based Autoencoders for View Decoupled Cross Point Cloud Reconstruction and a Self-Supervised Learning Framework

[openreview] [pdf]

Abstract Point cloud learning, especially in a self-supervised way without manual labels, has received emerging attention in both vision and learning communities, with its potential utility in wide areas. Most existing generative approaches for point cloud self-supervised learning focus on recovering masked points from visible ones within a single view. Recognizing that a two-view pre-training paradigm inherently introduces greater diversity and variance, it could thus enable more challenging and informative pre-training. Inspired by this, we explore the potential of two-view learning in this domain. In this paper, we propose Point-PQAE, a cross-reconstruction generative paradigm that first generates two decoupled point clouds/views and then reconstructs one from the other. To achieve this goal, we develop a crop mechanism for point cloud view generation for the first time and further propose a novel positional encoding to represent the 3D relative position between the two decoupled views. The cross-reconstruction significantly increases the difficulty of pre-training compared to self-reconstruction, which enables our method to achieve new state-of-the-art results and surpasses previous single-modal self-reconstruction methods in 3D self-supervised learning by a large margin. Specifically, it outperforms self-reconstruction baseline (Point-MAE) 6.5%, 7.0%, 6.7% in three variants of ScanObjectNN with Mlp-Linear evaluation protocol. Source code will be released.

3790Provably Efficient and Practical Self-Play for Better LLM Alignment

[openreview] [pdf]

Abstract Reinforcement Learning with Human Feedback (RLHF) has gained significant attention for aligning AI behavior with human preferences. Self-play style RLHF has shown strong advantages, as highlighted by many studies. However, current self-play style RLHF approaches face several limitations, including the lack of provable sample efficiency, absence of active exploration, and limited diversity in training data. To address these challenges, we propose a novel RLHF framework that balances exploration and exploitation while providing theoretical guarantees. We introduce Two-Agent Nash Policy Optimization (TANPO) as an equivalent and easy-to-implement two-agent algorithm building on this framework. In TANPO, the two players are trained using different loss functions to ensure more diverse and informative data collection. We also propose Single-Agent Diversity-driven Optimization (SADPO), a single-agent approximation of TANPO, supported by both theoretical analysis and empirical evidence. Our theoretical analysis shows that our theoretical algorithm framework enjoys sublinear regret under general function approximation and mild structural conditions, with a detailed analysis provided for the linear case. Empirically, we implement TANPO and SADPO using Zephyr-7B-SFT as our base model, outperforming several baselines across multiple evaluation benchmarks, such as AlpacaEval 2.0, MT-Bench and various standard academic benchmarks. Our experiments also show that TANPO improves performance on AlpacaEval 2.0 over extended training epochs, demonstrating its ability to consistently improve and reduce overfitting.

3791Error Slice Discovery via Manifold Compactness

[openreview] [pdf]

Abstract Despite the great performance of deep learning models in many areas, they still make mistakes and underperform on certain subsets of data, i.e. error slices. Given a trained model, it is important to identify its semantically coherent error slices that are easy to interpret, which is referred to as the error slice discovery problem. However, there is no proper metric of slice coherence without relying on extra information like predefined slice labels. The current evaluation of slice coherence requires access to predefined slices formulated by metadata like attributes or subclasses. Its validity heavily relies on the quality and abundance of metadata, where some possible patterns could be ignored. Besides, current algorithms cannot directly incorporate the constraint of coherence into their optimization objective due to the absence of an explicit coherence metric, which could potentially hinder their effectiveness. In this paper, we propose manifold compactness, a coherence metric without reliance on extra information by incorporating the data geometry property into its design, and experiments on typical datasets empirically validate the rationality of the metric. Then we develop Manifold Compactness based error Slice Discovery (MCSD), a novel algorithm that directly treats risk and coherence as the optimization objective, and is flexible to be applied to models of various tasks. Extensive experiments on the current benchmark and case studies on other typical datasets demonstrate the effectiveness of our algorithm.

3792Item Language Model

[openreview] [pdf]

Abstract Embeddings are extensively used in many domains to represent information about domain entities in a compressed manner. In recommendation systems, these embeddings are trained to extract meaningful information about an item/user from collaborative filtering data consisting users ratings or implicit feedback on items. These behavioral embeddings are usually not trained on data from language domain, but they encode very useful behavioral information which cannot be described using language. In contrast, in large language models (LLM) this collaborative data and behavioral entities(users/items) are not well represented as they are not textual and are specific to the recommendation system/product. Bridging this gap between behavioral understanding and language understanding can enable new item and language interleaved tasks. In our work we show how we can efficiently adapt rich behavioral embeddings as an additional behavioral input representation in pre-trained LLMs. To achieve this we adapt Querying Transformer technique with a new item contrastive loss and show improved item-text joint understanding in PALM2. Finally, we also demonstrate improved capabilities in recommendation domain over using the behavioral embeddings directly as input to PALM2.

3793Generative Location Modeling for Spatially Aware Object Insertion

[openreview] [pdf]

Abstract Generative models have become a powerful tool for image editing tasks, including object insertion. However, these methods often lack spatial awareness, generating objects with unrealistic locations and scales, or unintentionally altering the scene background. A key challenge lies in maintaining visual coherence, which requires both a geometrically suitable object location and a high-quality image edit. In this paper, we focus on the former, creating alocation modeldedicated to identifying realistic object locations. Specifically, we train an autoregressive model that generates bounding box coordinates, conditioned on the background image and the desired object class. This formulation allows to effectively handle sparse placement annotations and to incorporate implausible locations into a preference dataset by performing direct preference optimization. Our extensive experiments demonstrate that our generative location model, when paired with an inpainting method, substantially outperforms state-of-the-art instruction-tuned models and location modeling baselines in object insertion tasks, delivering accurate and visually coherent results.

3794Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable emergent capabilities, reshaping the landscape of functional tasks by leveraging external tools to tackle complex problems, such as those requiring real-time data or specialized input/output processing. Existing research primarily focuses on equipping LLMs with a broader array of diverse external tools (e.g., program interpreters, search engines, weather/map applications) but overlooks the necessity of tool usage, invoking external tools indiscriminately without assessing their actual need. This naive strategy leads to two significant issues: 1) increased latency due to prolonged processing times, and 2) potential errors arising from communication between LLMs and external tools, resulting in faulty outputs. In this paper, we introduce a concept we term meta-cognition as a proxy for LLM self-capability, and we propose an adaptive decision-making strategy for invoking external tools, referred to as MeCo. Specifically, MeCo focuses on representation space to capture emergent representations of high-level cognitive phenomena that quantify the LLM’s meta-cognitive scores, thereby guiding decisions on when to use external tools. Notably, MeCo is fine-tuning-free, incurring minimal cost, and our experiments demonstrate that MeCo accurately detects the model’s internal cognitive signals. More importantly, our approach significantly enhances decision-making accuracy in tool use for multiple base models across various benchmarks.

3795Stabilized Neural Prediction of Potential Outcomes in Continuous Time

[openreview] [pdf]

Abstract Patient trajectories from electronic health records are widely used to predict potential outcomes of treatments over time, which then allows for personalizing care. Yet, existing neural methods for this purpose have a key limitation: while some adjust for time-varying confounders, these methods assume that the time series are recorded in discrete time. In other words, they are constrained to settings where measurements and treatments are conducted at fixed time steps, even though this is unrealistic in medical practice. In this work, we aim to predict potential outcomes in continuous time. The latter is of direct practical relevance because it allows for modeling patient trajectories where measurements and treatments take place at arbitrary, irregular timestamps. We thus propose a new method called stabilized continuous time inverse propensity network (SCIP-Net), for which we derive stabilized inverse propensity weights for robust prediction of the potential outcomes. To the best of our knowledge, our SCIP-Net is the first the first neural method that performs proper adjustments for time-varying confounders in continuous time.

3796Approximating Optima of Nonconvex Functions

[openreview] [pdf]

Abstract We study the computability of approximating optima of non-convex functions. We give a simple proof to show that the problem of finding the optimal value (and optimal point) or its approximation is not even computable in the oracle setting. We also give a property a function has to satisfy if its global optima can be approximated. Next we give an example of such a global property we call basin of attraction. Then we give a simple algorithm which converges to the global optima when this is known. Finally, we give some numerical results.

3797Correlations Are Ruining Your Gradient Descent

[openreview] [pdf]

Abstract Herein the topics of (natural) gradient descent, data decorrelation, and approximate methods for backpropagation are brought into a common discussion. Natural gradient descent illuminates how gradient vectors, pointing at directions of steepest descent, can be improved by considering the local curvature of loss landscapes. We extend this perspective and show that to fully solve the problem illuminated by natural gradients in neural networks, one must recognise that correlations in the data at any linear transformation, including node responses at every layer of a neural network, cause a non-orthonormal relationship between the model’s parameters. To solve this requires a method for decorrelating inputs at each individual layer of a neural network. We describe a range of methods which have been proposed for decorrelation and whitening of node output, and expand on these to provide a novel method specifically useful for distributed computing and computational neuroscience. Implementing decorrelation within multi-layer neural networks, we can show that not only is training via backpropagation sped up significantly but also existing approximations of backpropagation, which have failed catastrophically in the past, benefit significantly in their accuracy and convergence speed. This has the potential to provide a route forward for approximate gradient descent methods which have previously been discarded, training approaches for analogue and neuromorphic hardware, and potentially insights as to the efficacy and utility of decorrelation processes in the brain.

3798MixEval-X: Any-to-any Evaluations from Real-world Data Mixture

[openreview] [pdf]

Abstract Various input and output capabilities are essential for artificial intelligence (AI) models to effectively learn from and engage with diverse real-world signals. Thus, reliable evaluations are desired to guide their development. We identify two key issues in related evaluations: (1) they have inconsistent standards, often designed by different communities with varying levels of maturity, protocols, and principles; (2) they often show strong query, grading, and generalization bias. To address these, we introduce MixEval-X, the first any-to-any real-world benchmark that optimizes and standardizes evaluation across different input and output modalities. We propose multi-modal benchmark mixture and adaptation-rectification pipelines to reconstruct real-world task distributions, ensuring that evaluation tasks generalize more effectively to real-world use cases. Extensive meta-evaluations demonstrate that our reconstruction strategy accurately aligns benchmark samples with real-world task distributions and achieves strong correlations with crowd-sourced real-world evaluation results (up to 0.98). We also provide comprehensive model evaluation results to rerank the models and organizations in the field. We present detailed insights to enhance the community’s understanding of multi-modal evaluations and inform future research directions.

3799CLoSD: Closing the Loop between Simulation and Diffusion for multi-task character control

[openreview] [pdf]

Abstract Motion diffusion models and Reinforcement Learning (RL) based control for physics-based simulations have complementary strengths for human motion generation. The former is capable of generating a wide variety of motions, adhering to intuitive control such as text, while the latter offers physically plausible motion and direct interaction with the environment. In this work, we present a method that combines their respective strengths. CLoSD is a text-driven RL physics-based controller, guided by diffusion generation for various tasks. Our key insight is that motion diffusion can serve as an on-the-fly universal planner for a robust RL controller. To this end, CLoSD maintains a closed-loop interaction between two modules — a Diffusion Planner (DiP), and a tracking controller. DiP is a fast-responding autoregressive diffusion model, controlled by textual prompts and target locations, and the controller is a simple and robust motion imitator that continuously receives motion plans from DiP and provides feedback from the environment. CLoSD is capable of seamlessly performing a sequence of different tasks, including navigation to a goal location, striking an object with a hand or foot as specified in a text prompt, sitting down, and getting up.

3800DimOL: Dimensional Awareness as a New ‘Dimension’ in Operator Learning

[openreview] [pdf]

Abstract In the realm of computational physics, an enduring topic is the numerical solutions to partial differential equations (PDEs). Recently, the attention of researchers has shifted towards Neural Operator methods, renowned for their capability to approximate "operators’’ --- mappings from functions to functions. Despite the universal approximation theorem within neural operators, ensuring error bounds often requires employing numerous Fourier layers. However, what about lightweight models? In response to this question, we introduce DimOL (Dimension-aware Operator Learning), drawing insights from dimensional analysis. To implement DimOL, we propose the ProdLayer, which can be seamlessly integrated into FNO-based and Transformer-based PDE solvers, enhancing their ability to handle sum-of-products structures inherent in many physical systems. Empirically, DimOL models achieve up to 48% performance gain within the PDE datasets. Furthermore, by analyzing Fourier components’ weights, we can symbolically discern the physical significance of each term. This sheds light on the opaque nature of neural networks, unveiling underlying physical principles.

3801JANET: Joint Adaptive predictioN-region Estimation for Time-series

[openreview] [pdf]

Abstract Conformal prediction provides machine learning models with prediction sets that offer theoretical guarantees, but the underlying assumption of exchangeability limits its applicability to time series data. Furthermore, existing approaches struggle to handle multi-step ahead prediction tasks, where uncertainty estimates across multiple future time points are crucial. We propose JANET (Joint Adaptive predictioN-region Estimation for Time-series), a novel framework for constructing conformal prediction regions that are valid for both univariate and multivariate time series. JANET generalises the inductive conformal framework and efficiently produces joint prediction regions with controlled K-familywise error rates, enabling flexible adaptation to specific application needs. Our empirical evaluation demonstrates JANET’s superior performance in multi-step prediction tasks across diverse time series datasets, highlighting its potential for reliable and interpretable uncertainty quantification in sequential data.

3802Egocentric Vision Language Planning

[openreview] [pdf]

Abstract We explore leveraging large multi-modal models (LMMs) and Text2image models to build a more general embodied agent. LMMs excel in planning long-horizon tasks over symbolic abstractions but struggle with grounding in the physical world, often failing to accurately identify object positions in images. A bridge is needed to connect LMMs to the physical world. The paper proposes a novel approach, egocentric vision language planning (EgoPlan), to handle long-horizon tasks from an egocentric perspective in varying household scenarios. This pipeline leverages a diffusion model to simulate the fundamental dynamics between states and actions, discusses how to integrate computer vision related techniques like style transfer and optical flow to enhance ability of modeling spatial states and generalization across different environmental dynamics. The LMM serves as a planner, breaking down instructions into sub-goals and selecting actions based on their alignment with these sub-goals, thus enabling more generalized and effective decision-making. By using LMM, we can output text actions, using a series of mechanisms such as reflection to perform high-level task decomposition and low-level action output end-to-end. Experiments show that EgoPlan improves long-horizon task success rates from the egocentric view compared to baselines across household scenarios.

3803Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs

[openreview] [pdf]

Abstract Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights--causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over \emph{lottery tickets} (or \emph{sparse task vectors}), LoTA also enables model merging over highly dissimilar tasks.

3804DiffLM: Controllable Synthetic Data Generation via Diffusion Language Models

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis. However, synthetic data generation via prompting LLMs remains challenging due to LLMs’ limited understanding of target data distributions and the complexity of prompt engineering, especially for structured formatted data. To address these issues, we introduce DiffLM, a controllable data synthesis framework based on variational autoencoder (VAE), which further (1) leverages diffusion models to reserve more information of original distribution and format structure in the learned latent distribution and (2) decouples the learning of target distribution knowledge from the LLM’s generative objectives via a plug-and-play latent feature injection module. As we observed significant discrepancies between the VAE’s latent representations and the real data distribution, the latent diffusion module is introduced into our framework to learn a fully expressive latent distribution. Evaluations on seven real-world datasets with structured formatted data (i.e., Tabular, Code and Tool data) demonstrate that DiffLM generates high-quality data, with performance on downstream tasks surpassing that of real data by 2%–7% in certain cases. Data and code will be released upon acceptance.

3805LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities through pretraining and alignment. However, superior short-context LLMs may underperform in long-context scenarios due to insufficient long-context alignment. This alignment process remains challenging due to the impracticality of human annotation for extended contexts and the difficulty in balancing short- and long-context performance. To address these challenges, we introduce LongPO, that enables short-context LLMs to self-evolve to excel on long-context tasks by internally transfer short-context capabilities. LongPO harnesses LLMs to learn from self-generated short-to-long preference data, comprising paired responses generated for identical instructions with long-context inputs and their compressed short-context counterparts, respectively. This preference reveals capabilities and potentials of LLMs cultivated during short-context alignment that may be diminished in under-aligned long-context scenarios. Additionally, LongPO incorporates a short-to-long KL constraint to mitigate short-context performance decline during long-context alignment. When applied to Mistral-7B-Instruct-v0.2 from 128K to 256K context length, LongPO fully retaining short-context performance and largely outperforms naive SFT and DPO in both long- and short-context tasks. Specifically, \ourMethod-trained models can achieve results on long-context benchmarks comparable to, or even surpassing, those of superior LLMs (e.g., GPT-4-128K) that involve extensive long-context annotation and larger parameter scales.

3806A Unified Theory of Stochastic Proximal Point Methods without Smoothness

[openreview] [pdf]

Abstract This paper presents a comprehensive analysis of a broad range of variations of the stochastic proximal point method (SPPM). Proximal point methods have attracted considerable interest owing to their numerical stability and robustness against imperfect tuning, a trait not shared by the dominant stochastic gradient descent (SGD) algorithm. A framework of assumptions that we introduce encompasses methods employing techniques such as variance reduction and arbitrary sampling. A cornerstone of our general theoretical approach is a parametric assumption on the iterates, correction and control vectors. We establish a single theorem that ensures linear convergence under this assumption and μ\mu-strong convexity of the loss function, and without the need to invoke smoothness. This integral theorem reinstates best known complexity and convergence guarantees for several existing methods, which demonstrates the robustness of our approach. We expand our study by developing three new variants of SPPM, and through numerical experiments elucidate various properties inherent to them.

3807Automated Filtering of Human Feedback Data for Aligning Text-to-Image Diffusion Models

[openreview] [pdf]

Abstract Fine-tuning text-to-image diffusion models with human feedback is an effective method for aligning model behavior with human intentions. However, this alignment process often suffers from slow convergence due to the large size and noise present in human feedback datasets. In this work, we propose FiFA, a novel automated data filtering algorithm designed to enhance the fine-tuning of diffusion models using human feedback datasets with direct preference optimization (DPO). Specifically, our approach selects data by solving an optimization problem to maximize three components: preference margin, text quality, and text diversity. The concept of preference margin is used to identify samples that contain high informational value to address the noisy nature of feedback dataset, which is calculated using a proxy reward model. Additionally, we incorporate text quality, assessed by large language models to prevent harmful contents, and consider text diversity through a k-nearest neighbor entropy estimator to improve generalization. Finally, we integrate all these components into an optimization process, with approximating the solution by assigning importance score to each data pair and selecting the most important ones. As a result, our method efficiently filters data automatically, without the need for manual intervention, and can be applied to any large-scale dataset. Experimental results show that FiFA significantly enhances training stability and achieves better performance, being preferred by humans 17% more, while using less than 0.5% of the full data and thus 1% of the GPU hours compared to utilizing full human feedback datasets.

3808General Compression Framework for Efficient Transformer Object Tracking

[openreview] [pdf]

Abstract Transformer-based trackers have established a dominant role in the field of visual object tracking. While these trackers exhibit promising performance, their deployment on resource-constrained devices remains challenging due to inefficiencies. To improve the inference efficiency and reduce the computation cost, prior approaches have aimed to either design lightweight trackers or distill knowledge from larger teacher models into more compact student trackers. However, these solutions often sacrifice accuracy for speed. Thus, we propose a general model compression framework for efficient transformer object tracking, named CompressTracker, to reduce the size of a pre-trained tracking model into a lightweight tracker with minimal performance degradation. Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages, enabling the student model to emulate each corresponding teacher stage more effectively. Additionally, we also design a unique replacement training technique that involves randomly substituting specific stages in the student model with those from the teacher model, as opposed to training the student model in isolation. Replacement training enhances the student model’s ability to replicate the teacher model’s behavior. To further forcing student model to emulate teacher model, we incorporate prediction guidance and stage-wise feature mimicking to provide additional supervision during the teacher model’s compression process. Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture. We conduct a series of experiment to verify the effectiveness and generalizability of CompressTracker. Our CompressTracker-4 with 4 transformer layers, which is compressed from OSTrack, retains about \mathbf{96%} performance on LaSOT (\mathbf{66.1%} AUC) while achieves 2.17×\mathbf{2.17\times} speed up.

3809ConLUX: Concept-Based Local Unified Explanations

[openreview] [pdf]

Abstract With the rapid advancements of various machine learning models, there is a significant demand for model-agnostic explanation techniques, which can explain these models across different architectures. Mainstream model-agnostic explanation techniques generate local explanations based on basic features (e.g., words for text models and (super-)pixels for image models). However, these explanations often do not align with the decision-making processes of the target models and end-users, resulting in explanations that are unfaithful and difficult for users to understand. On the other hand, concept-based techniques provide explanations based on high-level features (e.g., topics for text models and objects for image models), but most are model-specific or require additional pre-defined external concept knowledge. To address this limitation, we propose ConLUX, a general framework to provide concept-based local explanations for any machine learning models. Our key insight is that we can automatically extract high-level concepts from large pre-trained models, and uniformly extend existing local model-agnostic techniques to provide unified concept-based explanations. We have instantiated ConLUX on four different types of explanation techniques: LIME, Kernel SHAP, Anchor, and LORE, and applied these techniques to text and image models. Our evaluation results demonstrate that 1) compared to the vanilla versions, ConLUX offers more faithful explanations and makes them more understandable to users, and 2) by offering multiple forms of explanations, ConLUX outperforms state-of-the-art concept-based explanation techniques specifically designed for text and image models, respectively.

3810Seeking Flat Minima with Mean Teacher on Semi- and Weakly-Supervised Domain Generalization for Object Detection

[openreview] [pdf]

Abstract Object detectors do not work well when domains largely differ between training and testing data. To overcome this domain gap in object detection without requiring expensive annotations, we consider two problem settings: semi-supervised domain generalizable object detection (SS-DGOD) and weakly-supervised DGOD (WS-DGOD). In contrast to the conventional domain generalization for object detection that requires labeled data from multiple domains, SS-DGOD and WS-DGOD require labeled data only from one domain and unlabeled or weakly-labeled data from multiple domains for training. In this paper, we show that object detectors can be effectively trained on the two settings with the same Mean Teacher learning framework, where a student network is trained with pseudo-labels output from a teacher on the unlabeled or weakly-labeled data. We provide novel interpretations of why the Mean Teacher learning framework works well on the two settings in terms of the relationships between the generalization gap and flat minima in parameter space. On the basis of the interpretations, we also show that incorporating a simple regularization method into the Mean Teacher learning framework leads to flatter minima. The experimental results demonstrate that the regularization leads to flatter minima and boosts the performance of the detectors trained with the Mean Teacher learning framework on the two settings.

3811Dynamic Token Modulation and Expansion for Multi-Task Learning

[openreview] [pdf]

Abstract Multi-Task Learning (MTL) aims to minimize negative transfer within a shared network. Common strategies involve separating task-generic and task-specific representations and coordinating them to work together effectively within MTL frameworks. However, the absence of a clear rule for determining task-specific network components challenges the design of efficient MTL architectures. Our method tackles negative transfer by employing token-based network expansion and modulation without directly modifying predefined architectures, making it adaptable to any transformer-based MTL architectures. To evaluate negative transfer, we treat tokens as parameters, assessing gradient conflicts during backpropagation. Conflicts between tasks are analyzed by examining the token’s range space and null space. Based on conflict types, we expand the network following rules. If task-specific gradients clash in the tokens’ range space, we modulate existing tokens to align their task gradients. Conversely, if the gradients conflict in the null space of tokens, we add new task-specific tokens, spanning a new feature space. Our approach effectively boosts multi-task performance across various datasets by being integrated into previous state-of-the-art multi-task architectures.

3812On Re-Encoding Short-Term Memory of Large Language Models in Conversations

[openreview] [pdf]

Abstract Large language models (LLMs), such as GPT-4, are adept at generating coherent and fluent responses within conversational contexts. However, there has been a paucity of comprehensive research exploring LLMs to dynamically update their knowledge in response to corrections of misinformation provided by users during dialogue sessions. From the cognitive psychology perspective, such an adaptive process is akin to memory re-encoding (MRE), which entails the modification of previously stored information in human memory, typically for rectifying inaccuracies.In this paper, we present a novel framework termed Knowledge Editing In Conversation (KEIC), along with an accompanying dataset, devised to assess the efficacy of LLMs in emulating the MRE process in an in-context setting. Through in-depth investigations, we observe that the contemporary LLMs exhibit a modicum of proficiency in this task. To enhance their in-context MRE abilities, we propose a structured strategy to handle the information update for LLMs in a multi-turn conversation. We demonstrate that our approach is effective and suggest insights for research communities in this emerging and essential issue.

3813Rapid Grassmannian Averaging with Chebyshev Polynomials

[openreview] [pdf]

Abstract We propose new algorithms to efficiently average a collection of points on a Grassmannian manifold in both the centralized and decentralized settings. Grassmannian points are used ubiquitously in machine learning, computer vision, and signal processing to represent data through (often low-dimensional) subspaces. While averaging these points is crucial to many tasks (especially in the decentralized setting), existing methods unfortunately remain computationally expensive due to the non-Euclidean geometry of the manifold. Our proposed algorithms, Rapid Grassmannian Averaging (RGrAv) and Decentralized Rapid Grassmannian Averaging (DRGrAv), overcome this challenge by leveraging the spectral structure of the problem to rapidly compute an average using only small matrix multiplications and QR factorizations. We provide a theoretical guarantee of optimality and present numerical experiments which demonstrate that our algorithms outperform state-of-the-art methods in providing high accuracy solutions in minimal time. Additional experiments showcase the versatility of our algorithms to tasks such as KK-means clustering on video motion data, establishing RGrAv and DRGrAv as powerful tools for generic Grassmannian averaging.

3814SCAR: Efficient Instruction-Tuning for Large Language Models via Style Consistency-Aware Response Ranking

[openreview] [pdf]

Abstract Recent studies have shown that maintaining a consistent response style by human experts and enhancing data quality in training sets can significantly improve the performance of fine-tuned Large Language Models (LLMs) while reducing the number of training examples needed. However, the precise definition of style and the relationship between style, data quality, and LLM performance remains unclear. This research identifies two key stylistic elements in responses: linguistic form and semantic surprisal. We find that, among training data of comparable quality, higher consistency in these response elements leads to better LLM performance. Inspired by this, we introduce Style Consistency-Aware Response Ranking (SCAR), which automatically prioritizes instruction-response pairs in the training set based on their response stylistic consistency. By selecting the most style-consistent examples, sometimes as few as 0.7% of the full dataset, the fine-tuned LLMs can match or even surpass the performance of models trained on the entire dataset in coding and open-ended question-answering benchmarks. Code and data are available athttps://anonymous.4open.science/r/SCAR-0233/.

3815Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

[openreview] [pdf]

Abstract Current trends in pre-training Large Language Models (LLMs) primarily focus on the scaling of model and dataset size. While the \textit{quality} of pre-training data is considered an important factor for training powerful LLMs, it remains a nebulous concept that has not been rigorously characterized. To this end, we propose a formalization of one key aspect of data quality -- measuring the \textit{variability} of natural language data -- specifically via a measure we call the diversity coefficient. Our empirical analysis shows that the proposed diversity coefficient aligns with the intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then, we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high compared to theoretical lower and upper bounds. Finally, we conduct a comprehensive set of controlled \textit{interventional} experiments with GPT-2 and LLaMAv2 that demonstrate the diversity coefficient of pre-training data characterizes useful aspects of downstream model evaluation performance---totaling 44 models of various sizes (51M to 7B parameters). We conclude that our formal notion of diversity is an important aspect of data quality that captures variability and causally leads to improved evaluation performance.

3816Retrieval Or Holistic Understanding? Dolce: Differentiate Our Long Context Evaluation Tasks

[openreview] [pdf]

Abstract We argue that there are two major distinct capabilities in long context understanding: retrieval and holistic understanding. Understanding and further improving LLMs’ long context capabilities would not be possible without knowing the tasks’ focus categories. We aim to automatically identify retrieval focused and holistic understanding focused problems from suites of benchmarks and quantitatively measure the difficulty within each focus. In this paper, we present the Dolce framework, which parameterizes each problem by λ\lambda (complexity) and kk (redundancy) and assigns to one of five predefined focus categories. We propose to sample short contexts from the full context and estimate the probability an LLM solves the problem using the sampled spans. To find the λ\lambda and kk for each problem, we further propose a mixture model of a non-parametric background noise component and a parametric/non-parametric hybrid oracle component, where we derive the probability functions parameterized by λ and k for both the correct-or-wrong (COW) scenario and the partial-point-in-grading (PIG) scenario. Our proposed methods can identify 0% to 67% of the problems are retrieval focused and 0% to 90% of the problems are holistic understanding focused across 44 existing long context evaluation tasks.

3817Channel Independence Improves Out-of-Distribution Generalisation in Multivariate Time Series Classification

[openreview] [pdf]

Abstract Robustness to distribution shift is a necessary property of machine learning models for their safe and effective deployment. However, deep learning models are susceptible to learning spurious features of the in-distribution (ID) training data that fail to generalise to out-of-distribution (OOD) data. Domain generalisation algorithms aim to tackle this problem, but recent studies have demonstrated that their improvement over standard empirical risk minimisation is marginal. We address this problem for multivariate time series classification (TSC), where it is standard practise to use feature extractor architectures that learn with channel dependence (CD), enabling cross-channel patterns to be learned. Inspired by recent success in time series forecasting, we investigate how channel independence (CI) impacts OOD generalisation in TSC. Our experiments on six time series datasets reveal that ID and OOD features exhibit significantly greater distributional divergence when learned with CD compared to CI. As a consequence, models that learn with CI are more robust to distribution shift, evidenced by smaller generalisation gaps (the difference between ID and OOD performance) across datasets. On datasets that have a stronger shift, OOD accuracy is substantially higher for CI than CD.

3818Revolutionizing AI Companion in FPS Games

[openreview] [pdf]

Abstract Traditionally, players in first-person shooter (FPS) games have been limited to communicating with AI companions using simple commands like “attack,” “defend,” or “retreat” due to the constraints of existing input methods such as hotkeys and command wheels. One major limitation of these simple commands is the lack of target specificity, as the numerous targets in a 3D virtual environment are difficult to specify using existing input methods. This limitation hinders players’ ability to issue complex tactical instructions such as “clear the second floor,” “take cover behind that tree,” or “retreat to the river.” To overcome this limitation, this paper introduces the A\textbf{A}I C\textbf{C}ompanion with V\textbf{V}oice I\textbf{I}nteraction (ACVI)(\textbf{ACVI}), the first-ever AI system that allows players to interact with FPS AI companions through natural language. Deployed in the popular FPS game Arena Breakout: Infinite\textit{Arena Breakout: Infinite}, this revolutionary feature creates the most immersive experience for players, enabling them to work with human-like AI. ACVI is not confined to executing limited commands through simple rule-based systems. Instead, it allows players to engage in real-time voice interactions with AI teammates. By integrating various natural language processing techniques within a confidence-based selection framework, it achieves rapid and accurate decomposition of complex commands and intent reasoning. Moreover, ACVI employs a multi-modal dynamic entity retrieval method for environmental perception, aligning human intentions with decision-making elements. It can accurately comprehend complex voice commands and delivers real-time behavioral responses and vocal feedback to provide close tactical collaboration to players. Additionally, it can identify more than 17,000 objects in the game, including buildings, vehicles, grasslands, and collectible items, and has the ability to accurately distinguish different colors and materials.

3819Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects

[openreview] [pdf]

Abstract The Abstraction and Reasoning Corpus (ARC) is a popular benchmark focused onvisual reasoningin the evaluation of Artificial Intelligence systems. In its original framing, an ARC task requires solving a program synthesis problem over small 2D images using a few input-output training pairs. In this work, we adopt the recently populardata-drivenapproach to the ARC and ask whether a Vision Transformer (ViT) can learn the implicit mapping, from input image to output image, that underlies the task. We show that a ViT—otherwise a state-of-the-art model for images—fails dramatically on most ARC tasks even when trained on one million examples per task. This points to an inherent representational deficiency of the ViT architecture that makes it incapable of uncovering the simple structured mappings underlying the ARC tasks. Building on these insights, we propose ViTARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC. Specifically, we use a pixel-level input representation, design a spatially-aware tokenization scheme, and introduce a novel object-based positional encoding that leverages automatic segmentation, among other enhancements. Our task-specific ViTARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks strictly through supervised learning from input-output grids. This calls attention to the importance of imbuing the powerful (Vision) Transformer with the correct inductive biases for abstract visual reasoning that are critical even when the training data is plentiful and the mapping is noise-free. Hence, ViTARC provides a strong foundation for future research in visual reasoning using transformer-based architectures.

3820One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

[openreview] [pdf]

Abstract Attention computation takes both the time complexity of O(n2)O(n^2) and the space complexity of O(n2)O(n^2) simultaneously, which makes deploying Large Language Models (LLMs) in streaming applications that involve long contexts requiring substantial computational resources. In recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that is able to support a 128K-long document, in our paper, we focus on the memory-efficient issue when context length nn is much greater than 128K (n2dn \gg 2^d). Considering a single-layer self-attention with Query, Key, and Value matrices Q,K,VRn×dQ, K, V \in \mathbb{R}^{n \times d}, the polynomial method approximates the attention output TRn×dT \in \mathbb{R}^{n \times d}. It accomplishes this by constructing U1,U2Rn×tU_1, U_2 \in \mathbb{R}^{n \times t} to expedite attention Attn(Q,K,V){\sf Attn}(Q, K, V) computation within n1+o(1)n^{1+o(1)} time executions. Despite this, computing the approximated attention matrix U1U2Rn×nU_1U_2^\top \in \mathbb{R}^{n \times n} still necessitates O(n2)O(n^2) space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that only reads one pass of the data in a streaming fashion. This method employs sublinear space o(n)o(n) to store three sketch matrices, alleviating the need for exact K,VK, V storage. Notably, our algorithm exhibits exceptional memory-efficient performance with super-long tokens. As the token length nn increases, our error guarantee diminishes while the memory usage remains nearly constant. This unique attribute underscores the potential of our technique in efficiently handling LLMs in streaming applications.

3821POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obstacle avoidance, that have been conventionally approached with the classical non-learnable methods (e.g., heuristic search) is currently suggested to be solved by the learning-based or hybrid methods. Still, in this domain, it is hard, not to say impossible, to conduct a fair comparison between classical, learning-based, and hybrid approaches due to the lack of a unified framework that supports both learning and evaluation. To this end, we introduce POGEMA, a set of comprehensive tools that includes a fast environment for learning, a generator of problem instances, the collection of pre-defined ones, a visualization toolkit, and a benchmarking tool that allows automated evaluation. We introduce and specify an evaluation protocol defining a range of domain-related metrics computed on the basics of the primary evaluation indicators (such as success rate and path length), allowing a fair multi-fold comparison. The results of such a comparison, which involves a variety of state-of-the-art MARL, search-based, and hybrid methods, are presented.

3822A Prototype-oriented Fast Refinement Model for Few-shot Industrial Anomaly Detection

[openreview] [pdf]

Abstract Industrial Anomaly Detection (IAD) in low data regime is crucial for automating industrial inspections in practice. Previous methods have primarily focused on obtaining robust prototypes using only a few normal images per product. However, these methods seldom account for transferring the characteristics of online query images to enhance the representativeness of the original prototypes in a systematic way. To address the pivot issue, we propose a fast prototype-oriented refinement model for few-shot IAD. Given online query images, we formulate prototype refinement as a nested optimization problem between transport probability for anomaly suppression and transform matrix for characteristic transfer. Then we present an Expectation Maximization (EM)-based algorithm to iteratively compute the transport probability and transform matrix. In the E-step, we use entropy-based optimal transport, known as the Sinkhorn algorithm, to learn the transport probability. In the M-step, the transform matrix is updated via gradient descent. Finally, we integrate our model with two popular and recently proposed few-shot IAD methods, PatchCore and WinCLIP. Comprehensive experiments on three widely used datasets including MVTec, ViSA, and MPDD verify the effectiveness and efficiency of our proposed model in few-shot IAD applications.

3823Nonparametric Expert DAG Learning with Accurate Edge Strengths and Realistic Knowledge Incorporation

[openreview] [pdf]

Abstract Directed Acyclic Graphs (DAGs) are crucial for modeling causal structures and complex dependencies in domains such as biology, healthcare, and finance. Effective structure learning must not only align with domain expert knowledge but also produce interpretable model decisions. Though continuous structure learning methods like NOTEARS are gaining popularity, an underexplored feature is their ability to open up the black box of decisions made by traditional combinatorial search by quantifying edge strengths in weighted adjacency matrices. Yet challenges persist in systematically integrating expert knowledge and ensuring learned weights accurately reflect true edge relationships. We present Non-parametric Expert DAG (NEDAG), a novel method that formulates accurate weight matrices using Gaussian Processes (GPs) and incorporates realistic domain knowledge into the continuous structure learning framework. Experiments on both synthetic and real-world datasets demonstrate that NEDAG not only surpasses existing methods in structure accuracy but also produces more accurate edge strengths. NEDAG thus provides a robust and interpretable solution for structure discovery in real-world applications.

3824LLM-Exp: Exploring the Policy in Reinforcement Learning with Large Language Models

[openreview] [pdf]

Abstract Policy exploration is critical in training reinforcement learning (RL) agents, where existing approaches include the ϵ\epsilon-greedy method in deep Q-learning, the Gaussian process in DDPG, etc. However, all these approaches are designed based on prefixed stochastic processes and are indiscriminately applied in all kinds of RL tasks without considering any environment-specific features that influence the policy exploration. Moreover, during the training process, the evolution of such stochastic process is rigid, which typically only incorporates a decay of the variance. This makes the policy exploration unable to adjust flexibly according to the agent’s real-time learning status, limiting the performance. Inspired by the analyzing and reasoning capability of LLM that reaches success in a wide range of domains, we design LLM-Exp\textbf{LLM-Exp}, which improves policy exploration in RL training with large language models (LLMs). During the RL training in a given environment, we sample a recent action-reward trajectory of the agent and prompt the LLM to analyze the agent’s current policy learning status and then generate a probability distribution for future policy exploration. We update the probability distribution periodically and derive a stochastic process that is specialized for the particular environment, which can be dynamically adjusted to adapt to the learning process. Our approach is a simple plug-in design, which is compatible with DQN and any of its variants or improvements. Through extensive experiments on the Atari benchmark, we demonstrate the capability of LLM-Exp to enhance the performance of RL. Our code is open-source athttps://anonymous.4open.science/r/LLM-Exp-4658for reproducibility.

3825Overcoming Lower-Level Constraints in Bilevel Optimization: A Novel Approach with Regularized Gap Functions

[openreview] [pdf]

Abstract Constrained bilevel optimization tackles nested structures present in constrained learning tasks like constrained meta-learning, adversarial learning, and distributed bilevel optimization. However, existing bilevel optimization methods mostly are typically restricted to specific constraint settings, such as linear lower-level constraints. In this work, we overcome this limitation and develop a new single-loop, Hessian-free constrained bilevel algorithm capable of handling more general lower-level constraints. We achieve this by employing a doubly regularized gap function tailored to the constrained lower-level problem, transforming constrained bilevel optimization into an equivalent single-level optimization problem with a single smooth constraint. We rigorously establish the non-asymptotic convergence analysis of the proposed algorithm under the convexity of lower-level problem, avoiding the need for strong convexity assumptions on the lower-level objective or coupling convexity assumptions on lower-level constraints found in existing literature. Additionally, the generality of our method allows for its extension to bilevel optimization with minimax lower-level problem. We evaluate the effectiveness and efficiency of our algorithm on various synthetic problems, typical hyperparameter learning tasks, and generative adversarial network.

3826T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback

[openreview] [pdf]

Abstract Text-to-audio (T2A) generation has achieved remarkable progress in generating a variety of audio outputs from language prompts. However, current state-of-the-art T2A models still struggle to satisfy human preferences for prompt-following and acoustic quality when generating complex multi-event audio. To improve the performance of the model in these high-level applications, we propose to enhance the basic capabilities of the model with AI feedback learning. First, we introduce fine-grained AI audio scoring pipelines to: 1) verify whether each event in the text prompt is present in the audio (Event Occurrence Score), 2) detect deviations in event sequences from the language description (Event Sequence Score), and 3) assess the overall acoustic and harmonic quality of the generated audio (Acoustic & Harmonic Quality). We evaluate these three automatic scoring pipelines and find that they correlate significantly better with human preferences than other evaluation metrics. This highlights their value as both feedback signals and evaluation metrics. Utilizing our robust scoring pipelines, we construct a large audio preference dataset, T2A-FeedBack, which contains 41k prompts and 249k audios, each accompanied by detailed scores. Moreover, we introduce T2A-EpicBench, a benchmark that focuses on long captions, multi-events, and story-telling scenarios, aiming to evaluate the advanced capabilities of T2A models. Finally, we demonstrate how T2A-FeedBack can enhance current state-of-the-art audio model. With simple preference tuning, the audio generation model exhibits significant improvements in both simple (AudioCaps test set) and complex (T2A-EpicBench) scenarios. The project page is available at \url{https://T2Afeedback.github.io}

3827Range-limited Augmentation for Few-shot Learning in Tabular Data

[openreview] [pdf]

Abstract Few-shot learning is essential in many applications, particularly in tabular domains where the high cost of labeling often limits the availability of annotated data. To address this challenge, we propose range-limited augmentation for contrastive learning in tabular domains. Our augmentation method shuffles or samples values within predefined feature-specific ranges, preserving semantic consistency during contrastive learning to enhance few-shot classification performance. To evaluate the effectiveness of our approach, we introduce FeSTa (Few-Shot Tabular classification benchmark), a benchmark consisting of 42 tabular datasets and 31 algorithms. On this benchmark, contrastive learning with our augmentation method effectively preserves task-relevant information and significantly outperforms existing approaches, including supervised, unsupervised, self-supervised, semi-supervised, and foundation models. In particular, our method achieves an average rank of 2.3 out of 31 algorithms in the 1-shot learning scenario, demonstrating its robustness and effectiveness when labeled data is highly limited. The benchmark code is available in the supplementary material.

3828Unveiling AI’s Blind Spots: An Oracle for In-Domain, Out-of-Domain, and Adversarial Errors

[openreview] [pdf]

Abstract AI models make mistakes when recognizing images—whether in-domain, out-of-domain, or adversarial. Predicting these errors is critical for improving system reliability, reducing costly mistakes, and enabling proactive corrections in real-world applications such as healthcare, finance, and autonomous systems. However, understanding what mistakes AI models make, why they occur, and how to predict them remains an open challenge. Here, we conduct comprehensive empirical evaluations using a “mentor” model —a deep neural network designed to predict another model’s errors. Our findings show that the mentor model excels at learning from a mentee’s mistakes on adversarial images with small perturbations and generalizes effectively to predict in-domain and out-of-domain errors of the mentee. Additionally, transformer-based mentor models excel at predicting errors across various mentee architectures. Subsequently, we draw insights from these observations and develop an “oracle” mentor model, dubbed SuperMentor, that achieves 78% accuracy in predicting errors across different error types. Our error prediction framework paves the way for future research on anticipating and correcting AI model behaviours, ultimately increasing trust in AI systems. All code, models, and data will be made publicly available.

3829Large Language Model-driven Large Neighborhood Search for Large-Scale MILP Problems

[openreview] [pdf]

Abstract Large Neighborhood Search (LNS) is a widely used method for solving large-scale Mixed Integer Linear Programming (MILP) problems. The effectiveness of LNS crucially depends on the choice of the search neighborhood. However, existing strategies either rely on expert knowledge or computationally expensive Machine Learning (ML) approaches, both of which struggle to scale effectively for large problems. To address this, we propose LLM-LNS, a novel Large Language Model (LLM)-driven LNS framework for large-scale MILP problems. Our approach introduces a dual-layer self-evolutionary LLM agent to automate neighborhood selection, discovering effective strategies with scant small-scale training data that generalize well to large-scale MILPs. The inner layer evolves heuristic strategies to ensure convergence, while the outer layer evolves evolutionary prompt strategies to maintain diversity. Experimental results demonstrate that the proposed dual-layer agent outperforms state-of-the-art agents such as FunSearch and EOH. Furthermore, the full LLM-LNS framework surpasses manually designed LNS algorithms like ACP, ML-based LNS methods like CL-LNS, and large-scale solvers such as Gurobi and SCIP. It also achieves superior performance compared to advanced ML-based MILP optimization frameworks like GNN&GBDT and Light-MILPopt, further validating the effectiveness of our approach.

3830Safeguard User Privacy in LLM Cloud Services

[openreview] [pdf]

Abstract Large language models (LLMs) have witnessed substantial growth in recent years. To leverage convenient LLM cloud services, users are inevitable to upload their prompts. Further, for tasks such as translation, reading comprehension, and summarization, related files or contexts are inherently required to be uploaded, whether they contain user privacy or not. Despite the rapid advancement of LLM capability, there has been a scarcity of research focusing on preserving user privacy during inference. To this end, this paper conducts a comprehensive study in this domain. Firstly, we demonstrate that (1) the embedding space of tokens is remarkably sparse, and (2) LLMs primarily function in the orthogonal subspace of embedding space, these two factors making privacy extremely vulnerable. Then, we analyze the structural characteristics of LLMs and design a distributed privacy-preserving inference paradigm which can effectively resist privacy attacks. Finally, we conduct a comprehensive evaluation of the defended models on mainstream tasks and find that low-bit quantization techniques can be well combined with our inference paradigm, achieving a balance between privacy, utility, and runtime memory efficiency.

3831EgoSim: Egocentric Exploration in Virtual Worlds with Multi-modal Conditioning

[openreview] [pdf]

Abstract Recent advancements in video diffusion models have established a strong foundation for developing world models with practical applications. The next challenge lies in exploring how an agent can leverage these foundation models to understand, interact with, and plan within observed environments. This requires adding more controllability to the model, transforming it into a versatile game engine capable of dynamic manipulation and control. To address this, we investigated three key conditioning factors: camera, context frame, and text, identifying limitations in current model designs. Specifically, the fusion of camera embeddings with video features leads to camera control being influenced by those features. Additionally, while textual information compensates for necessary spatiotemporal structures, it often intrudes into already observed parts of the scene. To tackle these issues, we designed the Spacetime Epipolar Attention Layer, which ensures that egomotion generated by the model strictly aligns with the camera’s movement through rigid constraints. Moreover, we propose the CI2V-adapter, which uses camera information to better determine whether to prioritize textual or visual embeddings, thereby alleviating the issue of textual intrusion into observed areas. Through extensive experiments, we demonstrate that our new model EgoSim achieves excellent results on both the RealEstate and newly repurposed Epic-Field datasets. For more results, please refer tohttps://egosim.github.io/EgoSim/.

3832When predict can also explain: few-shot prediction to select better neural latents

[openreview] [pdf]

Abstract Latent variable models serve as powerful tools to infer underlying dynamics from observed neural activity. Ideally, one would like the inferred dynamics to equal the true ones. However, due to the absence of ground truth data, prediction benchmarks are often employed as proxies. One widely-used method isco-smoothing, which involves jointly estimating latent variables and predicting observations along held-out channels to assess model performance. In this study, we reveal the limitations of the co-smoothing prediction framework and propose a remedy. Utilizing a student-teacher setup with Hidden Markov Models, we demonstrate that the high co-smoothing model space can encompass models with arbitrary extraneous dynamics within their latent representations. To address this, we introduce a secondary metric—few-shot co-smoothing. This involves performing regression from the latent variables to held-out channels in the data using fewer trials. Our results indicate that among models with near-optimal co-smoothing, those with extraneous dynamics underperform in the few-shot co-smoothing compared to ‘minimal’ models devoid of such dynamics. We also provide analytical insights into the origin of this phenomenon. We further validate our findings on real neural data using two state-of-the-art methods: LFADS and STNDT. In the absence of ground truth, we suggest a novel measure to validate our approach. By cross-decoding the latent variables of all model pairs with high co-smoothing, we identify models with minimal extraneous dynamics. We find a correlation between few-shot co-smoothing performance and this new measure. In summary, we present a novel prediction metric designed to yield latent variables that more accurately reflect the ground truth, offering a significant improvement for latent dynamics inference.

3833EOP: Unlocking Superior Problem Solving in Small LLMs

[openreview] [pdf]

Abstract Small language models, referred to as LLMs with fewer than 10 billion parameters in this work, face critical challenges in problem-solving tasks, often achieving less than 10% accuracy, highlighting the urgent need for effective solutions. While much of the existing research has focused on enhancing the performance of larger models like GPT, an important question remains: Can techniques developed for large models be adapted effectively for smaller ones? Moreover, is it possible to improve these smaller models to the point where they rival, or even outperform, larger models such as GPT-4 in problem-solving tasks?In this paper, we introduce Evaluation-Oriented Problem-Solving (EOP), a novel framework aimed at enhancing the problem-solving capabilities of small LLMs. Our approach significantly boosts the performance of these models, achieving a 2% higher accuracy on Python Puzzles compared to standard GPT-4 and a 27% improvement over state-of-the-art prompting methods using GPT-4 in the Game of 24. Beyond these results, EOP also demonstrates notable accuracy improvements on other tasks. These findings suggest that, with the appropriate strategies, small LLMs can achieve substantial performance gains in problem-solving, challenging the prevailing notion that scaling model size is the primary path to improvement.

3834Privately Learning from Graphs with Applications in Fine-tuning Large Pretrained Models

[openreview] [pdf]

Abstract Graphs offer unique insights into relationships and interactions between entities, complementing data modalities like text, images, and videos. By incorporating relational information from graph data, AI models can extend their capabilities beyond traditional tasks. However, relational data in sensitive domains such as finance and healthcare often contain private information, making privacy preservation crucial. Existing privacy-preserving methods, such as DP-SGD, which rely on gradient decoupling assumptions, are not well-suited for relational learning due to the inherent dependencies between coupled training samples. To address this challenge, we propose a privacy-preserving relational learning pipeline that decouples dependencies in sampled relations during training, ensuring differential privacy through a tailored application of DP-SGD. We apply this method to fine-tune large language models (LLMs) on sensitive graph data, and tackle the associated computational complexities. Our approach is evaluated on LLMs of varying sizes (e.g., BERT, Llama2) using real-world relational data from four text-attributed graphs. The results demonstrate significant improvements in relational learning tasks, all while maintaining robust privacy guarantees during training. Additionally, we explore the trade-offs between privacy, utility, and computational efficiency, offering insights into the practical deployment of our approach.

3835LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering

[openreview] [pdf]

Abstract Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education. The high cost of building MCQA datasets makes few-shot learning pivotal in this domain. While Large Language Models (LLMs) can enable few-shot learning, their direct application in real-world scenarios is often hindered by their high computational cost. To address this challenge, we propose a simple yet effective approach that uses LLMs for data generation and scoring. Our approach utilizes LLMs to create MCQA data which contains questions and choices, and to assign probability scores to the generated choices. We then use the generated data and LLM-assigned scores to finetune a smaller and more efficient encoder-only model, DeBERTa-v3-base by leveraging distillation loss. Extensive experiments on the Massive Multitask Language Understanding (MMLU) benchmark demonstrate that our method improves accuracy from 28.9% to 39.3%, representing a gain of over 10% compared to a baseline finetuned directly on 5-shot examples. This shows the effectiveness of LLM-driven data generation and knowledge distillation for few-shot MCQA.

3836Intervention-based Causal Discrimination Discovery and Removal

[openreview] [pdf]

Abstract Causal inference is a recent and widely adopted paradigm to deal with algorithmic discrimination. Building on Pearl’s structure causal model, several causality-based fairness notions have been developed, which estimates the unfair causal effects from the sensitive attribute to the outcomes by incorporating the intervention or counterfactual operators. Among them, interventional fairness (i.e., KK-Fair) stands out as the most fundamental and broadly applicable concept that is computable from observantional data. However, existing interventional fairness notions fail to accurately evaluate causal fairness, due to their following inherent limitations: (i) the causal effects evaluated by interventional fairness cannot be uniquely computed; (ii) the violation of interventional fairness being zero is not a sufficient condition for a causally fair model. To address these issues, we firstly propose a novel causality-based fairness notion called post-Intervention Cumulative Ratio Disparity (ICRD) to assess causal fairness of the decision models. Subsequently, we present a fairness framework (ICCFL) based on the proposed ICRD metric. ICCFL firstly generates interventional samples, and then computes the differentiable approximation of the ICRD to train a causally fair model. Both theoretical and empirical results demonstrate that the proposed ICRD effectively assesses causal fairness, and ICCFL can better balance accuracy and fairness.

3837MOSLIM:Align with diverse preferences in prompts through reward classification

[openreview] [pdf]

Abstract The multi-objective alignment of Large Language Models (LLMs) is essential for ensuring foundational models conform to diverse human preferences. Current research in this field typically involves either multiple policies or multiple reward models customized for various preferences, or the need to train a preference-specific supervised fine-tuning (SFT) model. In this work, we introduce a novel multi-objective alignment method, MOSLIM, which utilizes a single reward model and policy model to address diverse objectives. MOSLIM provides a flexible way to control these objectives through prompting and does not require preference training during SFT phase, allowing thousands of off-the-shelf models to be directly utilized within this training framework. MOSLIM leverages a multi-head reward model that classifies question-answer pairs instead of scoring them and then optimize policy model with a scalar reward derived from a mapping function that converts classification results from reward model into reward scores. We demonstrate the efficacy of our proposed method across several multi-objective benchmarks and conduct ablation studies on various reward model sizes and policy optimization methods. The MOSLIM method outperforms current multi-objective approaches in most results while requiring significantly fewer GPU computing resources compared with existing policy optimization methods.

3838Investigating Grokking phenomena below the Critical Data Regime

[openreview] [pdf]

Abstract In this paper, we explore the practical utility of grokking, a phenomenon where models generalize long after overfitting the training data. This offers a promising avenue for training on changing distributions, especially in data-scarce environ- ments. We investigate a scenario where a model grokked on a distribution p1 is utilized to grok another model on a different distribution p2, particularly in a data crunch situation on the p2 distribution. We further explore distilling multiple small models grokked on different distributions to generalize a larger model. This ap- proach is crucial where data is scarcely available for these different distributions, thus saving computational resources. Finally, we present a setup for continually pretraining a grokked model from distribution p1 to p2. Our experiments reveal that distilling from a grokked model provides quick generalization over the cur- rent task while simultaneously alleviating the forgetting of previous knowledge. We analyze these scenarios over various algorithmic tasks such as addition, sub- traction, and multiplication. Our results provide a framework for efficient model training in dynamic and data-limited scenarios, enabling the development of more robust, adaptable systems.

3839BiCompFL: Stochastic Federated Learning with Bi-Directional Compression

[openreview] [pdf]

Abstract Communication is a prominent bottleneck in federated learning (FL). State-of-the-art accuracy performance under limited uplink communications from the clients to the federator is achieved by stochastic FL approaches. It has been recently shown that leveraging side information in the form of a prior distribution at the federator can drastically reduce the uplink communication cost in stochastic FL. Here, the latest global model distribution serves as a natural prior since it can be shared with the clients under ideal downlink communication from the federator to the clients. Nevertheless, downlink communication is often limited in practical settings, and bi-directional compression must be considered to reduce the overall communication cost. The extension of existing stochastic FL solutions to bi-directional compression is non-trivial due to the lack of a globally shared common prior distribution at each iteration. In this paper, we propose BiCompFL, which employs importance sampling to send samples from the updated local models in the uplink, and the aggregated global model in the downlink by carefully choosing common prior distributions as side-information. We theoretically study the communication cost by a new analysis of importance sampling that refines known results, and exposes the interplay between uplink and downlink communication costs. We also show through numerical experiments that BiCompFL enables multi-fold savings in communication cost compared to the state-of-the-art.

3840CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

[openreview] [pdf]

Abstract Virtual try-on methods based on diffusion models achieve realistic effects but often require additional encoding modules, a large number of training parameters, and complex preprocessing, which increases the burden on training and inference. In this work, we re-evaluate the necessity of additional modules and analyze how to improve training efficiency and reduce redundant steps in the inference process. Based on these insights, we propose CatVTON, a simple and efficient virtual try-on diffusion model that transfers in-shop or worn garments of arbitrary categories to target individuals by concatenating them along spatial dimensions as inputs. The efficiency of CatVTON is reflected in three aspects: (1) Lightweight network. CatVTON consists only of a VAE and a simplified denoising UNet, removing redundant image and text encoders as well as cross-attentions, and includes just 899.06M parameters. (2) Parameter-efficient training. Through experimental analysis, we identify self-attention modules as crucial for adapting pre-trained diffusion models to the virtual try-on task, enabling high-quality results with only 49.57M training parameters. (3) Simplified inference. CatVTON eliminates unnecessary preprocessing, such as pose estimation, human parsing, and captioning, requiring only person image and garment reference to guide the virtual try-on process, reducing over 49% memory usage compared to other diffusion-based methods. Extensive experiments demonstrate that CatVTON achieves superior qualitative and quantitative results compared to baseline methods and demonstrates strong generalization performance in real-world scenarios, despite being trained solely on a public dataset of 73K samples.

3841Test-Time Learning of Causal Structure from Interventional Data

[openreview] [pdf]

Abstract Inferring causal structures from interventional data remains a challenging task, especially when the interventional targets are unknown. Supervised Causal Learning (SCL) has demonstrated strong empirical performance in predicting causal structures by training on datasets with known causal relations, and then applying the learned models to unseen test data. However, existing SCL methods often struggle with distribution shifts between training and test data.In this work, we propose a novel approach,TICL(Test-time Interventional Causal Learning), which addresses this challenge by introducing \textit{test-time training} for causal discovery from interventional data. Our method employs a self-augmentation technique that generates training data at test time, specifically tailored to the characteristics of the test data, allowing the model to adapt to biases inherent in the test distribution.TICLintegrates the JCI (Joint Causal Inference) framework with SCL by modifying the rule-based logic of the standard PC algorithm into a learning-based approach, enabling SCL methods to operate within the JCI+SCL framework and effectively utilize the self-augmented training data. Extensive experiments on real-world benchmarks demonstrate the superiority ofTICLacross multiple aspects of causal discovery and interventional target detection.

3842ROSE: A Reward-Oriented Data Selection Framework for LLM Task-Specific Instruction Tuning

[openreview] [pdf]

Abstract Instruction tuning has underscored the significant potential of large language models (LLMs) in producing more human-controllable and effective outputs in various domains. In this work, we focus on the data selection problem for task-specific instruction tuning of LLMs. Prevailing methods primarily rely on the crafted similarity metrics to select training data that aligns with the test data distribution. The goal is to minimize instruction tuning loss on the test data, ultimately improving performance on the target task. However, it has been widely observed that instruction tuning loss (i.e., cross-entropy loss for next token prediction) in LLMs often fails to exhibit a monotonic relationship with actual task performance. This misalignment undermines the effectiveness of current data selection methods for task-specific instruction tuning. To address this issue, we introduce ROSE, a novel Reward-Oriented inStruction data sElection method which leverages pairwise preference loss as a reward signal to optimize data selection for task-specific instruction tuning. Specifically, ROSE adapts an influence formulation to approximate the influence of training data points relative to a few-shot preference validation set to select the most task-related training data points. Experimental results show that by selecting just 5% of the training data using ROSE, our approach can achieve competitive results compared to fine-tuning with the full training dataset, and it surpasses other state-of-the-art data selection methods for task-specific instruction tuning. Our qualitative analysis further confirms the robust generalizability of our method across multiple benchmark datasets and diverse model architectures.

3843Rethinking Lipschitzness Data-free Backdoor Defense

[openreview] [pdf]

Abstract Deep Neural Networks (DNNs) have demonstrated remarkable success across various applications, yet some studies reveal their vulnerability to backdoor attacks, where attackers manipulate models under specific conditions using triggers. It significantly compromise the model integrity. Addressing this critical security issue requires robust defence mechanisms to ensure the reliability of DNN models. However, most existing defence mechanisms heavily rely on specialized defence datasets, which are often difficult to obtain due to data privacy and security concerns. This highlights the urgent need for effective data-free defence strategies. In this work, we propose Lipschitzness Precise Pruning (LPP), a novel data-free backdoor defence algorithm that leverages the properties of Lipschitz function to detect and mitigate backdoor vulnerabilities by pruning neurons with strong backdoor correlations while fine-tuning unaffected neurons. Our approach optimizes the computation of the Lipschitz constant using dot product properties, allowing for efficient and precise identification of compromised neurons without the need of clean defence data. This method addresses the limitations of existing data-free defences and extends the scope of backdoor mitigation to include fully connected layers, ensuring comprehensive protection of DNN models. As our approach does not require data exchange, it can be implemented efficiently and effectively in diverse environments. Extensive experiments demonstrate that LPP outperforms state-of-the-art defence approaches without the need for additional defence datasets. We release our code at:https://anonymous.4open.science/r/LPP-CD3C.

3844Vietnamese Text-to-SQL with Large Language Models: A Comprehensive Approach

[openreview] [pdf]

Abstract In the current era of Artificial Intelligence (AI), the realm of database querying is experiencing a profound evolution. With the recent emergence of Large Language Models (LLMs), with a particular emphasis on Vietnamese in this study, a promising opportunity arises to bridge the gap between human language and database interactions. In this paper, we embark on realizing this vision through a three-pronged approach. Firstly, we introduce a few-shot learning method designed to enhance the database schema comprehension of Vietnamese LLMs. Secondly, we employ a chain-of-thought technique to systematically guide LLMs in capturing complex natural language expressions for SQL generation. Thirdly, we introduce a novel method to streamline the input schema by removing redundant parts and retaining only the parts that are truly relevant to enhance the efficiency and accuracy of the SQL generation process. Finally, we experimented with a combination of few-shot, chain-of-thought learning, and schema-enhancing methods. Through experimentation with augmented datasets, we observe encouraging initial results. Our approach outperforms the current state-of-the-art model by 23% in exact matching on the Vietnamese ViText2SQL dataset. We achieved this result with a single pretraining step and one epoch of retraining, compared to the SoTA model’s 10 epochs. These findings demonstrate the effectiveness of our method and its potential for Vietnamese text-to-SQL applications.

3845Conditional Density Ratio Score for Post Hoc Deep Outlier Detection

[openreview] [pdf]

Abstract The ability to accurately identify out-of-distribution (OOD) samples is essential not only as a stand-alone machine learning task but also for maintaining the reliability and safety of machine learning systems. Within this domain, post hoc density estimators like the energy score are popular ways for detecting OOD samples. However, most of the existing post hoc density estimation have mainly focused on marginalizing the conditional distributions over all possible classes. In this paper, we introduce the Conditional Density Ratio (CDR) score, a principled post hoc density estimator that leverages both a class-conditional generative model in the latent space and a discriminative classifier model, allowing us to estimate the marginal densities of the latent representation without marginalization. We demonstrate that a key component to the success of the CDR score lies in correctly calibrating the two models and propose a simple yet effective method to automatically tune the temperature parameter without the need for out-of-distribution samples. We illustrate the general compatibility of the proposed method with two popular density estimators, the kernel density estimator and the Mahalanobis estimator. Through experiments on a wide range of OOD benchmark tasks, we verify the effectiveness of the proposed method and advocate it as an easy-to-implement baseline that can achieve competitive performance in most tested scenarios.

3846Pseudo-Labels are All You Need for Out-Of-Distribution Detection

[openreview] [pdf]

Abstract Detecting out-of-distribution (OOD) samples is a significant challenge in real-world deep-learning applications, such as medical imaging and autonomous driving. Traditional machine learning models, primarily trained on in-distribution (ID) data, often struggle when encountering OOD instances, resulting in unreliable predictions. While supervised OOD detection methods generally outperform unsupervised approaches due to the availability of labeled data, our research uncovers a crucial insight: their success is not necessarily due to recognizing the actual object categories in the images; instead, these methods rely on a specific classification strategy that may not correspond to real-world understanding. Essentially, supervised methods detect OOD samples by identifying the difficulties in classifying unfamiliar data. This challenge is similar to what unsupervised OOD detection methods face, as they also depend on the failure to reconstruct OOD data due to the lack of prior exposure. In this study, we bridge the gap between supervised and unsupervised OOD detection by introducing a novel approach that trains models to classify data into pseudo-categories. We employ self-supervised learning (SSL) to convert raw data into representations, which are then clustered to generate pseudo-labels. These pseudo-labels are subsequently used to train a classifier, enabling its OOD detection capabilities. Experimental results show that our approach surpasses state-of-the-art techniques. Furthermore, by training models on different sets of pseudo-labels derived from the dataset, we enhance the robustness and reliability of our OOD detection method.

3847Adversarial Training Can Provably Improve Robustness: Theoretical Analysis of Feature Learning Process Under Structured Data

[openreview] [pdf]

Abstract Adversarial training is a widely-applied approach to training deep neural networks to be robust against adversarial perturbation. However, although adversarial training has achieved empirical success in practice, it still remains unclear why adversarial examples exist and how adversarial training methods improve model robustness. In this paper, we provide a theoretical understanding of adversarial examples and adversarial training algorithms from the perspective of feature learning theory. Specifically, we focus on a multiple classification setting, where the structured data can be composed of two types of features: the robust features, which are resistant to perturbation but sparse, and the non-robust features, which are susceptible to perturbation but dense. We train a two-layer smoothed ReLU convolutional neural network to learn our structured data. First, we prove that by using standard training (gradient descent over the empirical risk), the network learner primarily learns the non-robust feature rather than the robust feature, which thereby leads to the adversarial examples that are generated by perturbations aligned with negative non-robust feature directions. Then, we consider the gradient-based adversarial training algorithm, which runs gradient ascent to find adversarial examples and runs gradient descent over the empirical risk at adversarial examples to update models. We show that the adversarial training method can provably strengthen the robust feature learning and suppress the non-robust feature learning to improve the network robustness. Finally, we also empirically validate our theoretical findings with experiments on real-image datasets, including MNIST, CIFAR10 and SVHN.

3848ImDy: Human Inverse Dynamics from Imitated Observations

[openreview] [pdf]

Abstract Inverse dynamics (ID), which aims at reproducing the driven torques from human kinematic observations, has been a critical tool for gait analysis. However, it is hindered from wider application to general motion due to its limited scalability. Conventional optimization-based ID requires expensive laboratory setups, restricting its availability. To alleviate this problem, we propose to exploit the recently progressive human motion imitation algorithms to learn human inverse dynamics in a data-driven manner. The key insight is that the human ID knowledge is implicitly possessed by motion imitators, though not directly applicable. In light of this, we devise an efficient data collection pipeline with state-of-the-art motion imitation algorithms and physics simulators, resulting in a large-scale human inverse dynamics benchmark as Imitated Dynamics (ImDy). ImDy contains over 150 hours of motion with joint torque and full-body ground reaction force data. With ImDy, we train a data-driven human inverse dynamics solver ImDyS(olver) in a fully supervised manner, which conducts ID and ground reaction force estimation simultaneously. Experiments on ImDy and real-world data demonstrate the impressive competency of ImDyS in human inverse dynamics and ground reaction force estimation. Moreover, the potential of ImDy(-S) as a fundamental motion analysis tool is exhibited with downstream applications. Our data and code would be made publicly available.

[openreview] [pdf]

Abstract Predict-and-search is increasingly becoming the predominant framework for solving Mixed-Integer Linear Programming (MILP) problems through the application of ML algorithms. Traditionally, MILP problems are represented as bipartite graphs, wherein nodes and edges encapsulate critical information pertaining to the objectives and constraints. However, existing ML approaches have primarily concentrated on extracting features from nodes while largely ignoring those associated with edges. To bridge this gap, we propose a novel framework named \model{} which leverages a graph neural network SKEGAT that integrates both node and edge features. Furthermore, we design an adaptive Regret-Greedy algorithm to break the barriers of the problem scale and hand-crafted tuning. Experiments across a variety of combinatorial optimization problems show that \model{} surpasses current SOTA algorithms, delivering notable enhancements in both solution accuracy and computational efficiency.

3850Tree Search for Language Model Agents

[openreview] [pdf]

Abstract Autonomous agents powered by language models (LMs) have demonstrated promise in their ability to perform decision-making tasks such as web automation. However, a key limitation remains: LMs, primarily optimized for natural language understanding and generation, struggle with multi-step reasoning, planning, and using environmental feedback when attempting to solve realistic computer tasks. Towards addressing this, we propose an inference-time search algorithm for LM agents to explicitly perform exploration and multi-step planning in interactive web environments. Our approach is a form of best-first tree search that operates within the actual environment space, and is complementary with most existing state-of-the-art agents. It is the first tree search algorithm for LM agents that shows effectiveness on realistic web tasks. On the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%. Our experiments highlight the effectiveness of search for web agents, and we demonstrate that performance scales with increased test-time compute. We conduct a thorough analysis of our results to highlight improvements from search, limitations, and promising directions for future work.

3851Consistency-based Black-box Uncertainty Quantification for Text-to-SQL by Similarity Aggregation

[openreview] [pdf]

Abstract When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides an estimate of the confidence in an LLM’s generated output and is therefore increasingly recognized as a crucial component of trusted AI systems. UQ is particularly important for complex generative tasks such as \emph{text-to-SQL}, where an LLM helps users gain insights about data stored in noisy and large databases by translating their natural language queries to structured query language (SQL). \emph{Black-box} UQ methods do not require access to internal model information from the generating LLM, and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM (including those with commercialized APIs), reduced costs, and substantial computational tractability. In this paper, we investigate the effectiveness of black-box UQ techniques for text-to-SQL, where the consistency between a generated output and other sampled generations is used as a proxy for estimating its confidence. We propose a high-level non-verbalized \emph{similarity aggregation} approach that is suitable for complex generative tasks, including specific techniques that train confidence estimation models using small training sets. Through an extensive empirical study over various text-to-SQL datasets and models, we provide recommendations for the choice of sampling technique and similarity metric. The experiments demonstrate that our proposed similarity aggregation techniques result in better calibrated confidence estimates as compared to the closest baselines, but also highlight how there is room for improvement on downstream tasks such as selective generation.

3852Large Language Models for Explainability in Machine Learning

[openreview] [pdf]

Abstract We investigate the potential of large language models (LLMs) in explainable artificial intelligence (XAI) by examining their ability to generate understandable explanations for machine learning (ML) models. While recent studies suggest that LLMs could effectively address the limitations of traditional explanation methods through their conversational capabilities, there has been a lack of systematic evaluation of the quality of these LLM-generated explanations. To fill this gap, this study evaluates whether LLMs can produce explanations for ML models that meet the fundamental properties of XAI using conventional ML models and explanation methods as benchmarks. The findings offer important insights into the strengths and limitations of LLMs as tools for explainable AI, provide recommendations for their appropriate use, and identify promising directions for future research.

3853Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data

[openreview] [pdf]

Abstract Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data has limited their potential for further scaling. In this situation, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO). DNPO employs a dynamic sample labeling mechanism to construct preference pairs for training and introduces controlled, trainable noise into the preference optimization process. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6% across multiple benchmarks. Additionally, DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations. This highlights its effectiveness in enhancing model performance through iterative refinement.

3854Quantifying Emergence in Neural Networks: Insights from Pruning and Training Dynamics

[openreview] [pdf]

Abstract Emergence, where complex behaviors develop from the interactions of simpler components within a network, plays a crucial role in enhancing neural network capabilities. We introduce a quantitative framework to measure emergence as structural nonlinearity, study the dynamics of this measure during the training process, and examine its impact on network performance, particularly in relation to pruning and training dynamics. Our hypothesis posits that the degree of emergence—evaluated from the distribution and connectivity of active nodes—can predict the development of emergent behaviors in the network. We demonstrate that higher emergence correlates with improved trianing performance. We further explore the relationship between network complexity and the loss landscape, suggesting that higher emergence indicates a greater concentration of local minima and a more rugged loss landscape. We show that this framework can be applied to explain the impact of pruning on the training dynamics. These findings provide new insights into the interplay between emergence, complexity, and performance in neural networks, offering implications for designing and optimizing architectures.

3855Preference Data Annotation with Guided Density Ratios

[openreview] [pdf]

Abstract Preference tuning has become a standard step of modern LLM post-training. Usually, it requires paired human feedback data or preference classifiers trained on such data, where the data collection is costly in time and resources. This paper proposes a data annotation technique that takes the prompt-guided density ratio between off-the-shelf LLMs to serve as proxy of human preference with no training needed. We show that by adding descriptions of preference and domain specific few-shot examples before the user query (e.g. a detailed definition of safety plus an example), we can significantly improve density ratio rewards’ annotation accuracy. Our final method reaches a score of 82.6 on RewardBench, where prompt injection improves the Safety domain from 82 to 91 and the Reasoning domain from 74 to 90. We then perform preference tuning using data annotated by density-ratio reward from a 7B model, aligning a Llama 3 8B instruct model to achieve an 37% WinRate on ArenaHard, 41% Length Controlled win-rate on AlpacaEval 2.0, and 8.0 on MT-Bench.

3856Unsupervised Model Tree Heritage Recovery

[openreview] [pdf]

Abstract The number of models shared online has recently skyrocketed, with over one million public models available on Hugging Face. Sharing models allows other users to build on existing models, using them as initialization for fine-tuning, improving accuracy and saving compute and energy. However, it also raises important intellectual property issues, as fine-tuning may violate the license terms of the original model or that of its training data. A Model Tree, i.e., a tree data structure rooted at a foundation model and having directed edges between a parent model and other models directly fine-tuned from it (children), would settle such disputes by making the model heritage explicit. Unfortunately, current models are not well documented, with most model metadata (e.g., “model cards”) not providing accurate information about heritage. In this paper, we introduce the task of Unsupervised Model Tree Heritage Recovery (Unsupervised MoTHer Recovery) for collections of neural networks. For each pair of models, this task requires: i) determining if they are directly related, and ii) establishing the direction of the relationship. Our hypothesis is that model weights encode this information, the challenge is to decode the underlying tree structure given the weights. We discover several properties of model weights that allow us to perform this task. By using these properties, we formulate the MoTHer Recovery task as finding a directed minimal spanning tree. In extensive experiments we demonstrate that our method successfully reconstructs complex Model Trees.

3857ToolDial: Multi-turn Dialogue Generation Method for Tool-Augmented Language Models

[openreview] [pdf]

Abstract Tool-Augmented Language Models (TALMs) leverage external APIs to answer user queries across various domains. However, existing benchmark datasets for TALM research often feature simplistic dialogues that do not reflect real-world scenarios, such as the need for models to ask clarifying questions or proactively call additional APIs when essential information is missing. To address these limitations, we construct and release ToolDial, a dataset comprising 11,111 multi-turn dialogues, with an average of 8.95 turns per dialogue, based on APIs from RapidAPI. ToolDial has two key characteristics. First, the dialogues incorporate 16 user and system actions (e.g., request, clarify, fail inform) to capture the rich dynamics of real-world interactions. Second, we simulate dialogues where the system requests necessary information from the user based on API documentation and seeks additional APIs if the user fails to provide the required information. To facilitate this process, we introduce a method for generating an API graph that represents input and output compatibility between APIs. Using ToolDial, we evaluate a suite of language models on their ability to predict correct actions and extract input parameter values for API calls from the dialogue history. Modern language models achieve accuracy scores below 70%, indicating substantial room for improvement. We provide a detailed analysis of the areas where these models fall short.

3858How To Evaluate Your Medical Time Series Classification?

[openreview] [pdf]

Abstract Medical time series (MedTS) play a critical role in many healthcare applications, such as vital sign monitoring and the diagnosis of brain and heart diseases. However, the existence of subject-specific features poses unique challenges in MedTS evaluation. Inappropriate evaluation setups that either exploit or overlook these features can lead to artificially inflated classification performance (by up to 50% in accuracy; ADFTD): this concern has received little attention in current research. Here, we categorize the existing evaluation setups into two primary categories: subject-dependent and subject-independent. We show the subject-independent setup is more appropriate for different datasets and tasks. Our theoretical analysis explores the feature components of MedTS, examining how different evaluation setups influence the features that a model learns. Through experiments on six datasets (spanning EEG, ECG, and fNIRS modalities) using four different methods, we demonstrate step-by-step how subject-dependent utilizes subject-specific features as a shortcut for classification and leads to a deceptive high performance, suggesting that the subject-independent setup is more precise and practicable evaluation setup in real-world. This comprehensive analysis aims to establish clearer guidelines for evaluating MedTS models in different healthcare applications. Code to reproduce this work inhttps://anonymous.4open.science/r/MedTS_Evaluation-733F.

3859Phase-Driven Domain Generalizable Learning For Nonstationary Time Series Classification

[openreview] [pdf]

Abstract Monitoring and recognizing patterns in continuous sensing data is crucial for many practical applications. These real-world time-series data are often nonstationary, characterized by varying statistical and spectral properties over time. This poses a significant challenge in developing learning models that can effectively generalize across different distributions. In this work, based on our observation that nonstationary statistics for time-series classification tasks are intrinsically linked to the phase information, we propose a time-series domain generalization framework, PhASER. It consists of three novel elements: 1) Hilbert transform-based phase augmentation that diversifies non-stationarity while preserving discriminatory semantics, 2) separate magnitude-phase encoding by viewing time-varying magnitude and phase as independent modalities, and 3) phase-residual feature broadcasting by incorporating phase with a novel residual connection for inherent regularization to enhance distribution invariant learning. Extensive evaluation on 5 datasets from sleep-stage classification, human activity recognition, and gesture recognition against 12 state-of-the-art baseline methods demonstrate that PhASER consistently outperforms the best baselines by an average of 5% and up to 13% in some cases. Moreover, PhASER’s principles can also be applied broadly to boost the generalizability of existing time-series classification models.

3860Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA

[openreview] [pdf]

Abstract Large language models (LLMs) are expensive to deploy. Parameter sharing offers a possible path towards reducing their size and cost, but its effectiveness in modern LLMs remains fairly limited. In this work, we revisit “layer tying” as form of parameter sharing in Transformers, and introduce novel methods for converting existing LLMs into smaller “Recursive Transformers” that share parameters across layers, with minimal loss of performance. Here, our Recursive Transformers are efficiently initialized from standard pretrained Transformers, but only use a single block of unique layers that is then repeated multiple times in a loop. We further improve performance by introducing Relaxed Recursive Transformers that add flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules, yet still preserve the compactness of the overall model. We show that our recursive models (e.g., recursive Gemma 1B) outperform both similar-sized vanilla pretrained models (such as TinyLlama 1.1B and Pythia 1B) and knowledge distillation baselines---and can even recover most of the performance of the original “full-size” model (e.g., Gemma 2B with no shared parameters). Finally, we propose Continuous Depth-wise Batching, a promising new inference paradigm enabled by the Recursive Transformer when paired with early exiting, which we show to theoretically lead to significant (2-3×\times) throughput gains.

3861AutoAdvExBench: Benchmarking Autonomous Exploitation of Adversarial Example Defenses

[openreview] [pdf]

Abstract We introduce AutoAdvExBench, a benchmark to evaluate if large language models (LLMs) can autonomously exploit defenses to adversarial examples. We believe our benchmark will be valuable to several distinct audiences. First, it measures if models can match the abilities of expert adversarial machine learning researchers. Second, it serves as a challenging evaluation for reasoning capabilities that can measure LLMs’ ability to understand and interact with sophisticated codebases. And third, since many adversarial examples defenses have been broken in the past, this benchmark allows for evaluating the ability of LLMs to reproduce prior research results automatically. We then benchmark the ability of current LLMs to solve this benchmark, and find most are unable to succeed. Our strongest agent, with a human-guided prompt, is only able to successfully generate adversarial examples on 6 of the 51 defenses in our benchmark. This benchmark is publicly accessible at redacted for review.

3862MTSTRec: Multimodal Time-Aligned Shared Token Recommender

[openreview] [pdf]

Abstract Sequential recommendation in e-commerce leverages users’ anonymous browsing histories to offer personalized product suggestions without relying on personal information. While item ID-based sequential recommendations are commonly used, they often fail to fully capture the diverse factors influencing user preferences, such as textual descriptions, visual content, and pricing. These factors represent distinct modalities in recommender systems. Existing multimodal sequential recommendation models typically employ either early or late fusion of different modalities, overlooking the alignment of corresponding positions in time of product sequences that represent users’ browsing preferences. To address these limitations, this paper proposes a unified framework for multimodal fusion in recommender systems, introducing the Multimodal Time-aligned Shared Token Recommender (MTSTRec). MTSTRec leverages a transformer-based architecture that incorporates a single time-aligned shared token for each product, allowing for efficient cross-modality fusion that also aligns in time. This approach not only preserves the distinct contributions of each modality but also aligns them to better capture user preferences. Additionally, the model extracts rich features from text, images, and other product data, offering a more comprehensive representation of user decision-making in e-commerce. Extensive experiments demonstrate that MTSTRec achieves state-of-the-art performance across multiple sequential recommendation benchmarks, significantly improving upon existing multimodal fusion strategies.

3863Revisiting PCA for Time Series Reduction in Temporal Dimension

[openreview] [pdf]

Abstract Deep learning has significantly advanced time series analysis (TSA), enabling the extraction of complex patterns for tasks like classification, forecasting, and regression. While dimensionality reduction has traditionally focused on the variable space—achieving notable success in minimizing data redundancy and computational complexity—less attention has been paid to reducing the temporal dimension. In this study, we revisit Principal Component Analysis (PCA), a classical dimensionality reduction technique, to explore its utility in temporal dimension reduction for time series data. It is generally thought that applying PCA to the temporal dimension would disrupt temporal dependencies, leading to limited exploration in this area. However, our theoretical analysis and extensive experiments demonstrate that applying PCA to sliding series windows not only maintains model performance but also enhances computational efficiency. In auto-regressive forecasting, the temporal structure is partially preserved through windowing, and PCA is applied within these windows to denoise the time series while retaining their statistical information. By preprocessing time series data with PCA, we reduce the temporal dimensionality before feeding it into TSA models such as Linear, Transformer, CNN, and RNN architectures. This approach accelerates training and inference and reduces resource consumption. Notably, PCA improves Informer training and inference speed by up to 40% and decreases GPU memory usage of TimesNet by 30%, without sacrificing model accuracy. Comparative analysis against other reduction methods further highlights the effectiveness of PCA in enhancing the efficiency of TSA models. Code is provided in the supplementary materials.

3864PEDVLM: PEDESTRIAN VISION LANGUAGE MODEL FOR INTENTIONS PREDICTION

[openreview] [pdf]

Abstract Effective modeling of human behavior is crucial for the safe and reliable coexistence of humans and autonomous vehicles. Traditional deep learning methods have limitations in capturing the complexities of pedestrian behavior, often relying on simplistic representations or indirect inference from visual cues, which hinders their explainability. To address this gap, we introduce PedVLM\textbf{PedVLM}, a vision-language model that leverages multiple modalities (RGB images, optical flow, and text) to predict pedestrian intentions and also provide explainability for pedestrian behavior. PedVLM comprises a CLIP-based vision encoder and a text-to-text transfer transformer (T5) language model, which together extract and combine visual and text embeddings to predict pedestrian actions and enhance explainability. Furthermore, to complement our PedVLM model and further facilitate research, we also publicly release the corresponding dataset, PedPrompt, which includes the prompts in the Question-Answer (QA) template for pedestrian intention prediction. PedVLM is evaluated on PedPrompt, JAAD, and PIE datasets demonstrates its efficacy compared to state-of-the-art methods. The dataset and code will be made available at {https://github.com/abc/ped_VLM}.

3865LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence

[openreview] [pdf]

Abstract Due to the need of interacting with the world, embodied agents are required to possess comprehensive task-relevant knowledge, long-horizon planning capability, and a swift response speed. Large language models (LLMs), owing to their rich general knowledge, recently achieve promising results in open-world embodied tasks, like the world exploration in Minecraft. However, the outputs of LLMs are descriptive sentences or code, which are slow to generate and not end-to-end, as a translator is required to translate the LLM outputs into actions to perform. To address these limitations, we introduce the large auto-regressive model (LARM). LARM leverages environment observations as input and predicts subsequent actions in an auto-regressive manner. Compared with LLM based methods, LARM directly predicts the next skill for execution according to the current observation. In addition, considering that the commonly adopted training paradigms do not reflect the mutual influence and dependency between actions and observations, we develop a novel data format named auto-regressive node transmission structure and assemble a corresponding dataset to train LARM. Combining these techniques, LARM successfully harvests enchanted equipment in Minecraft, which demands significantly more complex decision-making chains than the highest achievements of prior best methods. Besides, the speed of LARM is 6.8x faster than LLMs with similar parameter volume.

3866XoRA: Expander adapted LoRA finetuning

[openreview] [pdf]

Abstract Parameter-efficient fine-tuning aims to reduce the computational cost of adapting foundational models to downstream tasks. Low-rank matrix based adaptation (LoRA) techniques are popular for this purpose. We propose XoRA, an efficient fine-tuning scheme, which sparsifies the low-rank matrices even further using expander masks. The mask is generated using extremal expander graphs (Ramanujan graphs) to maintain high edge connectivity even at a very high sparsity. Experimental results demonstrate that this method has comparable performance with the LoRA fine-tuning method while retaining much fewer number of parameters.

3867Confounder-Free Continual Learning via Recursive Feature Normalization

[openreview] [pdf]

Abstract Confounders are extraneous variable that affect both the input and the target, resulting in spurious correlations and biased predictions. Learning feature representations that are invariant to confounders remains a significant challenge in continual learning. To remove the influence of confounding variables from intermediate feature representations, we introduce the Recursive Metadata Normalization (R-MDN) layer, which can be integrated into any stage within deep neural networks (DNNs). R-MDN performs statistical regression via the recursive least squares algorithm to maintain and continually update an internal model state with respect to changing distributions of data and confounding variables. Since R-MDN operates on the level of individual examples, it is compatible with state-of-the-art architectures like vision transformers. Our experiments demonstrate that R-MDN promotes equitable predictions across population groups, both within static learning and across different stages of continual learning, by reducing catastrophic forgetting caused by confounder effects changing over time.

3868WIN: Variable-View Implicit LIDAR Upsampling Network

[openreview] [pdf]

Abstract LiDAR upsampling aims to increase the resolution of sparse point sets obtained from low-cost sensors, providing better performance for various downstream tasks. Most existing methods transform LiDAR points into range view and design complex neighborhood point interpolation strategies to improve the resolution of point clouds. However, they overlook that the range image representation is insufficient to describe complex local geometric relationships, which limits the geometric accuracy of upsampled points. To address this issue, we propose WIN, a Variable-View Implicit Network. First, we decouple the range image into two novel virtual view representations to compensate for the missing geometric information during range view-based interpolation. Secondly, to fuse the interpolation results of different views, we model the fusion process as a probability distribution problem instead of a simple binary classification task. We introduce a contrast selection module, which captures the feature differences between two representations and outputs the view confidence score for each upsampled point. The underlying idea is that the complementarity of the information is proportional to the feature difference between the two views. Motivated by this insight, we design a loss function based on probabilistic modeling to supervise the results of the selection module. As a result, compared with the current state-of-the-art (SOTA) method ILN, WIN introduces a small number of parameters (+0.4M) but achieves a +4.5% increase in the MAE metric on the CARLA dataset. Furthermore, our method outperforms all existing methods in a downstream task (Depth Completion). The pre-trained model and code will be released upon acceptance.

3869Pushing the Limit of Small-Efficient Offline Reinforcement Learning

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) has achieved notable progress in recent years. It enables learning optimized policy from fixed offline datasets and, therefore is particularly suitable for decision-making tasks that lack reliable simulators or have environment interaction restrictions. However, existing offline RL methods typically need a large amount of training data to achieve reasonable performance, and offer limited generalizability in out-of-distribution (OOD) regions due to conservative data-related regularizations. This seriously hinders the usability of offline RL in solving many real-world applications, where the available data are often limited. In this study, we introduce a highly sample-efficient offline RL algorithm that learns optimized policy by enabling state-stitching in a compact latent space regulated by the fundamental symmetry in dynamical systems. Specifically, we introduce a time-reversal symmetry (T-symmetry) enforced inverse dynamics model (TS-IDM) to derive well-regulated latent state representations that greatly ease the difficulty of OOD generalization. Within the learned latent space, we can learn a guide-policy to output the latent next state that maximizes the reward, bypassing the conservative action-level behavior constraints as used in typical offline RL algorithms. The final optimized action can then be easily extracted by using the guide-policy’s output as the goal state in the learned TS-IDM. We call our method Offline RL via T-symmetry Enforced Latent State-Stitching (TELS). Our approach achieves amazing sample efficiency and OOD generalizability, significantly outperforming existing offline RL methods in a wide range of challenging small-sample tasks, even using as few as 1% of the original data in D4RL tasks.

3870Towards Reliability of Parameter-free Optimization

[openreview] [pdf]

Abstract Hyperparameter tuning, particularly the selection of an appropriate learning rate in adaptive gradient training methods, remains a challenge. To tackle this challenge, in this paper, we propose a novel parameter-free optimizer, AdamG (Adam with the golden step size), designed to automatically adapt to diverse optimization problems without manual tuning. The core technique underlying AdamG is our golden step size derived for the AdaGrad-Norm algorithm, which is expected to help AdaGrad-Norm preserve the tuning-free convergence and approximate the optimal step size in expectation w.r.t. various optimization scenarios. To better evaluate tuning-free performance, we propose a novel evaluation criterion, reliability, to comprehensively assess the efficacy of parameter-free optimizers in addition to classical performance criteria. Empirical results demonstrate that compared with other parameter-free baselines, AdamG achieves superior performance, which is consistently on par with Adam using a manually tuned learning rate across various optimization tasks.

3871Minimalistic Predictions for Online Class Constraint Scheduling

[openreview] [pdf]

Abstract We consider online scheduling with class constraints. That is, we are given mm machines, each with kk class slots. Upon receiving a job jj with class cjc_j, an algorithm needs to allocate jj on some machine ii. The goal is to minimize the makespan while not assigning more than kk different classes onto each machine. While the offline case is well understood and even (E)PTAS results are known [Jansen, Lassota, Maack SPAA’20, Chen Jansen Luo Zhang COCOA’16], the online case admits strong impossibility results in classical competitive analysis [Epstein, Lassota, Levin, Maack, Rohwedder STACS’22].We overcome these daunting results by investigating the problem in a learning-augmented setting where an algorithm can access possibly erroneous predictions. We present new algorithms with competitive ratios independent of mm and tight lower bounds for several classical and problem-specific prediction models. We thereby give a structured overview of what additional information helps in the design of better scheduling algorithms.

3872What Kind of Pretraining Data Do Large Language Models Rely on When Doing Reasoning?

[openreview] [pdf]

Abstract The capabilities and limitations of Large Language Models (LLMs) have been sketched out in great detail in recent years, providing an intriguing yet conflicting picture. On the one hand, LLMs demonstrate a general ability to solve problems. On the other hand, they show surprising reasoning gaps when compared to humans, casting doubt on the robustness of their generalisation strategies. The sheer volume of data used in the design of LLMs has precluded us from applying the method traditionally used to measure generalisation; train-test set separation. In this work, we study what kind of generalisation strategies LLMs employ when performing reasoning tasks by investigating the pretraining data they rely on. For two models of different sizes (7B and 35B) and 2.5B of their pretraining tokens, we identify what documents impact three simple mathematical reasoning tasks and contrast this to the data that are influential for answering factual questions. We find that, while the models rely on mostly distinct sets of data for each factual question, documents often have a similar influence on different reasoning questions with the same task, indicating the presence of procedural knowledge. We further find that the answers to the factual questions often show up in the most influential data. However, for the reasoning questions the answers usually do not show up as highly influential, nor do the answers to the intermediate reasoning steps. When we characterise the top portion of the ranking for the reasoning questions qualitatively, we find that the influential documents often contain procedural knowledge, like demonstrating how to obtain the solution using formulae or code. Our findings indicate that the generalisation strategy the model uses when doing reasoning is unlike retrieval, but more like a strategy using many documents doing a similar form of reasoning.

3873Strategic Filtering for Content Moderation: Free Speech or Free of Distortion?

[openreview] [pdf]

Abstract User-generated content (UGC) on social media platforms is vulnerable to incitements and manipulations, necessitating effective regulations. To address these challenges, those platforms often deploy automated content moderators tasked with evaluating the harmfulness of UGC and filtering out content that violates established guidelines. However, such moderation inevitably gives rise to strategic responses from users, who strive to express themselves within the confines of guidelines. Such phenomenons call for a careful balance between: 1. ensuring freedom of speech --- by minimizing the restriction of expression; and 2. reducing social distortion --- measured by the total amount of content manipulation. We tackle the problem of optimizing this balance through the lens of mechanism design, aiming at optimizing the trade-off between minimizing social distortion and maximizing free speech. Although determining the optimal trade-off is NP-hard, we propose practical methods to approximate the optimal solution. Additionally, we provide generalization guarantees that determine the amount of finite offline data required to effectively approximate the optimal moderator.

3874LLM Bandit: Cost-Efficient LLM Generation via Preference-Conditioned Dynamic Routing

[openreview] [pdf]

Abstract The rapid advancement in large language models (LLMs) has brought forth a diverse range of models with varying capabilities that excel in different tasks and domains. However, selecting the optimal LLM for user queries often involves a challenging trade-off between accuracy and cost, a problem exacerbated by the diverse demands of individual queries. In this work, we present a novel framework that formulates the LLM selection process as a multi-armed bandit problem, enabling dynamic and intelligent routing of queries to the most appropriate model. Our approach incorporates a preference-conditioned dynamic routing mechanism, allowing users to specify their preferences at inference time, thereby offering a customizable balance between performance and cost. Additionally, our selection policy is designed to generalize to unseen LLMs, ensuring adaptability to new models as they emerge. Experimental results demonstrate that our method achieves significant improvements in both accuracy and cost-effectiveness across various LLM platforms, showcasing the potential of our framework to adaptively optimize LLM selection in real-world scenarios.

3875Forte : Finding Outliers with Representation Typicality Estimation

[openreview] [pdf]

Abstract Generative models can now produce photorealistic synthetic data which is virtually indistinguishable from the real data used to train it. This is a significant evolution over previous models which could produce reasonable facsimiles of the training data, but ones which could be visually distinguished from the training data by human evaluation. Recent work on OOD detection has raised doubts that generative model likelihoods are optimal OOD detectors due to issues involving likelihood misestimation, entropy in the generative process, and typicality. We speculate that generative OOD detectors also failed because their models focused on the pixels rather than the semantic content of the data, leading to failures in near-OOD cases where the pixels may be similar but the information content is significantly different. We hypothesize that estimating typical sets using self-supervised learners leads to better OOD detectors. We introduce a novel approach that leverages representation learning, and informative summary statistics based on manifold estimation, to address all of the aforementioned issues. Our method outperforms other unsupervised approaches and achieves state-of-the art performance on well-established challenging benchmarks, and new synthetic data detection tasks.

3876Exploring Source View Capability: Improve Generalizable 3D Reconstruction with Multi-view Context from Source Views

[openreview] [pdf]

Abstract Recent generalizable 3D reconstruction methods have been facing challenges in constructing geometry-consistent 3D features. This is primarily due to the source views conveying redundant information to the sampled 3D points that they do not observe, resulting in the samples struggling to distinguish the correct observations of them. We attribute this issue to that canonical supervision methods focus solely on the rendered target view from a single viewpoint, overlooking source views that capture the scene from different perspectives. With this insight, we pioneer a supervision method for source views, which can be applied alongside existing target view supervision in each iteration. Specifically, we define the Learned Geometry of the Scene (LGS) as source-view depth distributions, which are derived from the weights of source views for each sampled 3D point. To regularize the LGS to better model the real-world geometry, we introduce a novel unsupervised learning objective, which mitigates the optimization bias in existing objectives and ensures the LGS is more concentrated near the real-world geometry surface. Regularizing the LGS effectively helps filter out irrelevant source views for each sampled 3D point, and thus noticeably improves the performance of backbones. Mathematical proof is provided to validate the proposed objective, and extensive experiments demonstrate that our supervision method significantly improves both NeRF- and 3DGS-based backbones with negligible computation overhead.

3877Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning

[openreview] [pdf]

Abstract Identifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields developed rather independently. We observe that several structure and representation identifiability methods, particularly those that require multiple environments, rely on exchangeable non--i.i.d. (independent and identically distributed) data. To formalize this connection, we propose the Identifiable Exchangeable Mechanisms (IEM) framework to unify key representation and causal structure learning methods. IEM provides a unified probabilistic graphical model encompassing causal discovery, Independent Component Analysis, and Causal Representation Learning. With the help of the IEM model, we generalize the Causal de Finetti theorem of Guo et al., 2022 by relaxing the necessary conditions for causal structure identification in exchangeable data. We term these conditions cause and mechanism variability, and show how they imply a duality condition in identifiable representation learning, leading to new identifiability results.

3878Data-Evolution Learning

[openreview] [pdf]

Abstract Recent advancements in machine learning have been driven by models trained on large-scale, high-quality datasets. However, the practical application of these models faces two significant challenges: the infeasibility of acquiring precise labels in real-world settings and the substantial computational burden imposed by training large models. While existing approaches—such as self-supervised learning, weak supervision, noisy label learning, and dataset distillation—address these challenges from a model-centric perspective, they often overlook the potential benefits of optimizing the data itself. This paper introduces a novel data-centric learning paradigm where both the dataset and the model co-evolve during the learning process. We formalize this paradigm and propose a Data-evolution Learning Algorithm (DeLA), which offers three key advantages: optimized dataset generation, versatile dataset compatibility, and effective utilization of prior knowledge. Extensive experiments demonstrate that DeLA enables the creation of optimized datasets for reuse in subsequent training, effectively addressing diverse datasets with varying target types. Moreover, DeLA accelerates learning by utilizing architecture-agnostic, open-source prior models for efficient data creation. Notably, DeLA frequently outperforms traditional SOTA model-centric methods in self-supervised and noisy label learning. Furthermore, its simplicity enables implementation in only two lines of PyTorch code, offering significant potential for advancements in representation learning. Our code will be made publicly available.

3879Certifying Language Model Robustness with Fuzzed Randomized Smoothing: An Efficient Defense Against Backdoor Attacks

[openreview] [pdf]

Abstract The widespread deployment of pre-trained language models (PLMs) has exposed them to textual backdoor attacks, particularly those planted during the pre-training stage. These attacks pose significant risks to high-reliability applications, as they can stealthily affect multiple downstream tasks. While certifying robustness against such threats is crucial, existing defenses struggle with the high-dimensional, interdependent nature of textual data and the lack of access to original poisoned pre-training data. To address these challenges, we introduceFuzzedRandomizedSmoothing (FRS), a novel approach for efficiently certifying language model robustness against backdoor attacks. FRS integrates software robustness certification techniques with biphased model parameter smoothing, employing Monte Carlo tree search for proactive fuzzing to identify vulnerable textual segments within the Damerau-Levenshtein space. This allows for targeted and efficient text randomization, while eliminating the need for access to poisoned training data during model smoothing. Our theoretical analysis demonstrates that FRS achieves a broader certified robustness radius compared to existing methods. Extensive experiments across various datasets, model configurations, and attack strategies validate FRS’s superiority in terms of defense efficiency, accuracy, and robustness.

3880Efficiently Scanning and Resampling Spatio-Temporal Tasks with Irregular Observations

[openreview] [pdf]

Abstract Various works have aimed at combining the inference efficiency of recurrent models and training parallelism of multi-head attention for sequence modeling. However, most of these works focus on tasks with fixed-dimension observation spaces, such as individual tokens in language modeling or pixels in image completion. To handle an observation space of varying size, we propose a novel algorithm that alternates between cross-attention between a 2D latent state and observation, and a discounted cumulative sum over the sequence dimension to efficiently accumulate historical information. We find this resampling cycle is critical for performance. To evaluate efficient sequence modeling in this domain, we introduce two multi-agent intention tasks: simulated agents chasing bouncing particles and micromanagement analysis in professional StarCraft II games. Our algorithm achieves comparable accuracy with a lower parameter count, faster training and inference compared to existing methods.

3881Learning with User-Level Local Differential Privacy

[openreview] [pdf]

Abstract User-level privacy is important in distributed systems. Previous research primarily focuses on the central model, while the local models have received much less attention. Under the central model, user-level DP is strictly stronger than the item-level one. However, under the local model, the relationship between user-level and item-level LDP becomes more complex, thus the analysis is crucially different. In this paper, we first analyze the mean estimation problem and then apply it to stochastic optimization, classification, and regression. In particular, we propose adaptive strategies to achieve optimal performance at all privacy levels. Moreover, we also obtain information-theoretic lower bounds, which show that the proposed methods are minimax optimal up to logarithmic factors. Unlike the central DP model, where user-level DP always leads to slower convergence, our result shows that under the local model, the convergence rates are nearly the same between user-level and item-level cases for distributions with bounded support. For heavy-tailed distributions, the user-level rate is even faster than the item-level one.

3882Score-Based Variational Inference for Inverse Problems

[openreview] [pdf]

Abstract Existing diffusion-based methods for inverse problems sample from the posterior using score functions and accept the generated random samples as solutions. In applications that posterior mean is preferred, we have to generate multiple samples from the posterior which is time-consuming. In this work, by analyzing the probability density evolution of the conditional reverse diffusion process, we prove that the posterior mean can be achieved by tracking the mean of each reverse diffusion step. Based on that, we establish a framework termed reverse mean propagation (RMP) that targets the posterior mean directly. We show that RMP can be implemented by solving a variational inference problem, which can be further decomposed as minimizing a reverse KL divergence at each reverse step. We further develop an algorithm that optimizes the reverse KL divergence with natural gradient descent using score functions and propagates the mean at each reverse step. Experiments demonstrate the validity of the theory of our framework and show that our algorithm outperforms state-of-the-art algorithms on reconstruction performance with lower computational complexity in various inverse problems.

3883Rationalizing and Augmenting Dynamic Graph Neural Networks

[openreview] [pdf]

Abstract Graph data augmentation (GDA) has shown significant promise in enhancing the performance, generalization, and robustness of graph neural networks (GNNs). However, contemporary methodologies are often limited to static graphs, whose applicability on dynamic graphs—more prevalent in real-world applications—remains unexamined. In this paper, we empirically highlight the challenges faced by static GDA methods when applied to dynamic graphs, particularly their inability to maintain temporal consistency. In light of this limitation, we propose a dedicated augmentation framework for dynamic graphs, termed DyAug\texttt{DyAug}, which adaptively augments the evolving graph structure with temporal consistency awareness. Specifically, we introduce the paradigm of graph rationalization for dynamic GNNs, progressively distinguishing between causal subgraphs (\textit{rationale}) and the non-causal complement (\textit{environment}) across snapshots. We develop three types of environment replacement, including, spatial, temporal, and spatial-temporal, to facilitate data augmentation in the latent representation space, thereby improving the performance, generalization, and robustness of dynamic GNNs. Extensive experiments on six benchmarks and three GNN backbones demonstrate that DyAug\texttt{DyAug} can \textbf{(I)} improve the performance of dynamic GNNs by 0.89%3.13%0.89\%\sim3.13\%\uparrow; \textbf{(II)} effectively counter targeted and non-targeted adversarial attacks with 6.2%12.2%6.2\%\sim12.2\%\uparrow performance boost; \textbf{(III)} make stable predictions under temporal distribution shifts.

3884Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order Optimization

[openreview] [pdf]

Abstract Federated Learning (FL) offers a promising framework for collaborative and privacy-preserving machine learning across distributed data sources. However, the substantial communication costs associated with FL pose a significant challenge to its efficiency. Specifically, in each communication round, the communication costs scale linearly with the model’s dimension, which presents a formidable obstacle, especially in large model scenarios. Despite various communication-efficient strategies, the intrinsic dimension-dependent communication cost remains a major bottleneck for current FL implementations. In this paper, we introduce a novel dimension-free communication strategy for FL, leveraging zeroth-order optimization techniques. We propose a new algorithm, DeComFL, which facilitates the transmission of only a constant number of scalar values between clients and the server in each communication round no matter in both uplink and downlink, thereby reducing the communication cost from O(d)\mathcal{O}(d) to O(1)\mathcal{O}(1), where dd is the dimension of the model parameters. Theoretically, in non-convex functions, we prove that our algorithm achieves state-of-the-art rates, which show a linear speedup of the number of clients and local steps under standard assumptions and dimension-free rate for low effective rank scenarios. Empirical evaluations through classic deep learning training and large language model fine-tuning substantiate significant reductions in communication overhead compared to traditional FL approaches. By DeComFL, we can achieve around 1MB level of total communication cost between the server and a client until convergence.

3885Scaling Test-Time Compute Optimally Can be More Effective than Scaling LLM Parameters

[openreview] [pdf]

Abstract Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we scale up inference-time computation in LLMs, with a focus on answering: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how to tradeoff inference-time and pre-training compute. Little research has attempted to understand the scaling behaviors of test-time inference methods, with current work largely providing negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model’s distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a ``compute-optimal’’ scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.

3886DyCAST: Learning Dynamic Causal Structure from Time Series

[openreview] [pdf]

Abstract Understanding the dynamics of causal structures is crucial for uncovering the underlying processes in time series data. Previous approaches rely on static assumptions, where contemporaneous and time-lagged dependencies are assumed to have invariant topological structures. However, these models fail when systems undergo dynamic transformations, as they cannot capture the evolving causal relationships between variables. To address this limitation, we propose DyCAST, a novel framework designed to learn dynamic causal structures in time series using Neural Ordinary Differential Equations (Neural ODEs). The key innovation lies in modeling the temporal dynamics of the contemporaneous structure, Wt\boldsymbol{W}_t, drawing inspiration from recent advances in Neural ODEs on constrained manifolds. We reformulate the task of learning causal structures at each time step as solving the solution trajectory of a Neural ODE on the directed acyclic graph (DAG) manifold. To accommodate high-dimensional causal structures, we extend DyCAST by learning the temporal dynamics of the hidden state for Wt\boldsymbol{W}_t. Experiments on both synthetic and real-world datasets demonstrate that DyCAST achieves superior or comparable performance compared to existing causal discovery models.

3887Simple synthetic data reduces sycophancy in large language models

[openreview] [pdf]

Abstract Sycophancy is an undesirable behavior where models tailor their responses to follow a human user’s view even when that view is not objectively correct (e.g., adapting liberal views once a user reveals that they are liberal). In this paper, we study the prevalence of sycophancy in language models and propose a simple synthetic-data intervention to reduce this behavior.First, on a set of three sycophancy tasks where models are asked for an opinion on statements with no correct answers (e.g., politics), we observe that both model scaling and instruction tuning significantly increase sycophancy for large language models up to 540B parameters. Second, we extend sycophancy evaluations to simple addition statements that are objectively incorrect, finding that despite knowing that these statements are wrong, language models will still agree with them if the user does as well.To reduce sycophancy, we present a straightforward synthetic-data intervention that takes public NLP tasks and encourages models to be robust to user opinions on these tasks. Adding these data in a lightweight finetuning step can significantly reduce sycophantic behavior on held-out prompts. Code for generating synthetic data for intervention can be found athttps://anonymous.4open.science/r/sycophancy-intervention-F0D1/.

3888Boosting Perturbed Gradient Ascent for Last-Iterate Convergence in Games

[openreview] [pdf]

Abstract This paper presents a payoff perturbation technique, introducing a strong convexity to players’ payoff functions in games. This technique is specifically designed for first-order methods to achieve last-iterate convergence in games where the gradient of the payoff functions is monotone in the strategy profile space, potentially containing additive noise. Although perturbation is known to facilitate the convergence of learning algorithms, the magnitude of perturbation requires careful adjustment to ensure last-iterate convergence. Previous studies have proposed a scheme in which the magnitude is determined by the distance from a periodically re-initialized anchoring or reference strategy. Building upon this, we propose Gradient Ascent with Boosting Payoff Perturbation, which incorporates a novel perturbation into the underlying payoff function, maintaining the periodically re-initializing anchoring strategy scheme. This innovation empowers us to provide faster last-iterate convergence rates against the existing payoff perturbed algorithms, even in the presence of additive noise.

3889Space-Correlated Transformer: Jointly Explore the Matching and Motion Clues in 3D Single Object Tracking

[openreview] [pdf]

Abstract 3D Single Object Tracking (3D SOT) in LiDAR point clouds plays a crucial role in autonomous driving. Current approaches mostly follow two paradigms, i.e., Siamese matching-based and motion-centric. However, LiDAR point clouds lack enough appearance information, while the motion-centric trackers suffer from complex model structures. To address these issues, we present a novel and conceptually simple tracking framework dubbed SCtrack, which jointly explores the matching and motion clues in point clouds. Specifically, SCtrack embeds point clouds into spatially structured features and conducts space correlation along the aligned spatial region. The target relative motion is directly inferred from the correlated features. In contrast to prevalent PointNet-based features, our spatially structured representation inherently models motion clues among the consecutive frames of point clouds, thereby being complementary to appearance matching. To better utilize the aligned structured features, we employ a strategy of varied-size space regions that adapt to different target shapes and locations during space correlation. Without bells and whistles, SCtrack achieves leading performance, with 89.1%, 71.5%, and 62.7% precision on KITTI, NuScenes, and Waymo Open Dataset, and runs at a considerably high speed of 60 Fps on a single RTX3090 GPU. Extensive studies validate the effectiveness of our SCtrack framework. The code will be released.

3890LLM-Mediated Guidance of MARL Systems

[openreview] [pdf]

Abstract In complex multi-agent environments, achieving efficient learning and desirable behaviours is a significant challenge for Multi-Agent Reinforcement Learning (MARL) systems. This work explores the potential of combining MARL with Large Language Model (LLM)-mediated interventions to guide agents toward more desirable behaviours. Specifically, we investigate how LLMs can be used to interpret and facilitate interventions that shape the learning trajectories of multiple agents. We experimented with two types of interventions, referred to as controllers: a Natural Language (NL) Controller and a Rule-Based (RB) Controller. The NL Controller, which uses an LLM to simulate human-like interventions, showed a stronger impact than the RB Controller. Our findings indicate that agents particularly benefit from early interventions, leading to more efficient training and higher performance. Both intervention types outperform the baseline without interventions, highlighting the potential of LLM-mediated guidance to accelerate training and enhance MARL performance in challenging environments.

3891Extending Contextual Self-Modulation: Meta-Learning Across Modalities, Task Dimensionalities, and Data Regimes

[openreview] [pdf]

Abstract Contextual Self-Modulation (CSM) is a potent regularization mechanism for the Neural Context Flow (NCF) framework which demonstrates powerful meta-learning of physical systems. However, CSM has limitations in its applicability across different modalities and in high-data regimes. In this work, we introduce two extensions: iiCSM, which expands CSM to infinite-dimensional tasks, and StochasticNCF, which improves scalability. These extensions are demonstrated through comprehensive experimentation on a range of tasks, including dynamical systems with parameter variations, computer vision challenges, and curve fitting problems. iiCSM embeds the contexts into an infinite-dimensional function space, as opposed to CSM which uses finite-dimensional context vectors. StochasticNCF enables the application of both CSM and iiCSM to high-data scenarios by providing an unbiased approximation of meta-gradient updates through a sampled set of nearest environments. Additionally, we incorporate higher-order Taylor expansions via Taylor-Mode automatic differentiation, revealing that higher-order approximations do not necessarily enhance generalization. Finally, we demonstrate how CSM can be integrated into other meta-learning frameworks with FlashCAVIA, a computationally efficient extension of the CAVIA meta-learning framework (Zintgraf et al. 2019). FlashCAVIA outperforms its predecessor across various benchmarks and reinforces the utility of bi-level optimization techniques. Together, these contributions establish a robust framework for tackling an expanded spectrum of meta-learning tasks, offering practical insights for out-of-distribution generalization. Our open-sourced library, designed for flexible integration of self-modulation into contextual meta-learning workflows, is available at \url{CODE}.

3892One Model for One Graph: A New Perspective for Pretraining with Cross-domain Graphs

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have emerged as a powerful tool to capture intricate network patterns, achieving successes across different domains. However, existing GNNs require careful domain-specific architecture designs and training from scratch on each dataset, leading to an expertise-intensive process with difficulty in generalizing across graphs from different domains. Therefore, it can be hard for practitioners to infer which GNN model can generalize well to graphs from their domains. To address this challenge, we propose a novel cross-domain pretraining framework, “one model for one graph,” which overcomes the limitations of previous approaches that failed to use a single GNN to capture diverse graph patterns across domains with significant gaps. Specifically, we pretrain a bank of expert models, with each one corresponding to a specific dataset. When inferring to a new graph, gating functions choose a subset of experts to effectively integrate prior model knowledge while avoiding negative transfer. Extensive experiments consistently demonstrate the superiority of our proposed method on both link prediction and node classification tasks.

3893AI Sandbagging: Language Models can Strategically Underperform on Evaluations

[openreview] [pdf]

Abstract Trustworthy capability evaluations are crucial for ensuring the safety of AI systems, and are becoming a key component of AI regulation. However, the developers of an AI system, or the AI system itself, may have incentives for evaluations to understate the AI’s actual capability. These conflicting interests lead to the problem ofsandbagging– which we define asstrategic underperformance on an evaluation. In this paper we assess sandbagging capabilities in contemporary language models (LMs). We prompt frontier LMs, like GPT-4 and Claude 3 Opus, to selectively underperform on dangerous capability evaluations, while maintaining performance on general (harmless) capability evaluations. Moreover, we find that models can be fine-tuned, on a synthetic dataset, to hide specific capabilities unless given a password. This behaviour generalizes to high-quality, held-out benchmarks such as WMDP. In addition, we show that both frontier and smaller models can be prompted or password-locked to target specific scores on a capability evaluation. We have mediocre success in password-locking a model to mimic the answers a weaker model would give. Overall, our results suggest that capability evaluations are vulnerable to sandbagging. This vulnerability decreases the trustworthiness of evaluations, and thereby undermines important safety decisions regarding the development and deployment of advanced AI systems.We publish our code and results athttps://anonymous.4open.science/r/Sandbagging-8305/README.md

3894Direct Preference Optimization Using Sparse Feature-level Constraints

[openreview] [pdf]

Abstract The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we proposeFeature-level constrainedPreferenceOptimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.

3895Repetition Improves Language Model Embeddings

[openreview] [pdf]

Abstract Bidirectional models are considered essential for strong text embeddings. Recent approaches to adapt autoregressive language models (LMs) into strong text embedding models have largely had the requirement to modify the LM architecture to be bidirectional. We challenge this premise by introducing “echo embeddings” which converts autoregressive LMs into high quality text embedding models without changing the architecture or requiring fine-tuning. By repeating the input and extracting embeddings from the repeated tokens—which have access to all original tokens—echo embeddings improve over classical LM embeddings by over 5% in zero-shot settings. Our zero-shot embeddings nearly match those obtained by bidirectionally-converted LMs that undergo additional masked-language modeling training. Echo embeddings are also compatible with supervised fine-tuning, matching or outperforming bidirectionally-converted LMs in an apples-to-apples comparison, even with an identical compute budget during training and inference. Overall, repetition is a simple and effective strategy to circumvent the need for bidirectional attention in embedding models, paving the way towards a unified architecture for all NLP tasks.

3896Robustness Auditing for Linear Regression: To Singularity and Beyond

[openreview] [pdf]

Abstract It has recently been discovered that the conclusions of many highly influential econometrics studies can be overturned by removing a very small fraction of their samples (often less than 0.50.5%). These conclusions are typically based on the results of one or more Ordinary Least Squares (OLS) regressions, raising the question: given a dataset, can we certify the robustness of an OLS fit on this dataset to the removal of a given number of samples?Brute-force techniques quickly break down even on small datasets. Existing approaches which go beyond brute force either can only find candidate small subsets to remove (but cannot certify their non-existence) [BGM20, KZC21], are computationally intractable beyond low dimensional settings [MR22], or require very strong assumptions on the data distribution and too many samples to give reasonable bounds in practice [BP21, FH23].We present an efficient algorithm for certifying the robustness of linear regressions to removals of samples. We implement our algorithm and run it on several landmark econometrics datasets with hundreds of dimensions and tens of thousands of samples, giving the first non-trivial certificates of robustness to sample removal for datasets of dimension 4 or greater. We prove that under distributional assumptions on a dataset, the bounds produced by our algorithm are tight up to a 1+o(1)1 + o(1) multiplicative factor.

3897Improving Nonlinear Projection Heads using Pretrained Autoencoder Embeddings

[openreview] [pdf]

Abstract This empirical study aims at improving the effectiveness of the standard 2-layer MLP projection head g()g(\cdot) featured in the SimCLR framework through the use of pretrained autoencoder embeddings. Given a contrastive learning task with a largely unlabeled image classification dataset, we first train a shallow autoencoder architecture and extract its compressed representations contained in the encoder’s embedding layer. After freezing the weights within this pretrained layer, we use it as a drop-in replacement for the input layer of SimCLR’s default projector. Additionally, we also apply further architectural changes to the projector by decreasing its width and changing its activation function. The different projection heads are then used to contrastively train and evaluate a feature extractor f()f(\cdot) following the SimCLR protocol, while also examining the performance impact of ZZ-score normalized datasets. Our experiments indicate that using a pretrained autoencoder embedding in the projector can not only increase classification accuracy by up to 2.9% or 1.7% on average but can also significantly decrease the dimensionality of the projection space. Our results also suggest, that using the sigmoid and tanh\tanh activation functions within the projector can outperform ReLU in terms of peak and average classification accuracy. When applying our presented projectors, then not applying ZZ-score normalization to datasets often increases peak performance. In contrast, the default projection head can benefit more from normalization. All experiments involving our pretrained projectors are conducted with frozen embeddings, since our test results indicate an advantage compared to using their non-frozen counterparts.

3898RATE: Score Reward Models with Imperfect Rewrites of Rewrites

[openreview] [pdf]

Abstract This paper concerns the evaluation of reward models used in language modeling. A reward model is a function that takes a prompt and a response and assigns a score indicating how ``good’’ that response is for the prompt. A key challenge is that reward models are usually imperfect proxies for actual preferences. For example, we may worry that a model trained to reward helpfulness learns to instead prefer longer responses. In this paper, we develop an evaluation method, RATE (Rewrite-based Attribute Treatment Estimators), that allows us to measure the \emph{causal} effect of a given attribute of a response (e.g., length) on the reward assigned to that response. The core idea is to use large language models to rewrite responses to produce imperfect counterfactuals, and to adjust for rewriting error by rewriting \emph{twice}. We show that the RATE estimator is consistent under reasonable assumptions. We demonstrate the effectiveness of RATE on synthetic and real-world data, showing that it can accurately estimate the effect of a given attribute on the reward model.

3899Subtle Errors Matter: Preference Learning via Error-injected Self-editing

[openreview] [pdf]

Abstract Large Language Models (LLMs) have exhibited strong mathematical reasoning and computational prowess, tackling tasks ranging from basic arithmetic to advanced competition-level problems. However, frequently occurring subtle errors, such as miscalculations or incorrect substitutions, limit the models’ full mathematical potential. Existing studies to improve mathematical ability typically involve distilling reasoning skills from stronger LLMs or applying preference learning to step-wise response pairs. Although these methods leverage samples of varying granularity to mitigate reasoning errors, they overlook the frequently occurring subtle errors. A major reason is that sampled preference pairs involve differences unrelated to the errors, which may distract the model from focusing on subtle errors. In this work, we propose a novel preference learning framework called eRror-Injected Self-Editing (RISE), which injects predefined subtle errors into partial tokens of correct solutions to construct hard pairs for error mitigation. In detail, RISE uses the model itself to edit a small number of tokens in the solution, injecting designed subtle errors. Then, pairs composed of self-edited solutions and their corresponding correct ones, along with pairs of correct and incorrect solutions obtained through sampling, are used together for subtle error-aware DPO training. Compared with other preference learning methods, RISE further refines the training objective to focus on predefined errors and their tokens, without requiring fine-grained sampling or preference annotation. Extensive experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH.

3900Towards Neural Scaling Laws for Foundation Models on Temporal Graphs

[openreview] [pdf]

Abstract The field of temporal graph learning aims to learn from evolving network data to forecast future interactions. Given a collection of observed temporal graphs, is it possible to predict the evolution of an unseen network from the same domain? To answer this question, we first present the Temporal Graph Scaling (TGS) dataset, a large collection of temporal graphs consisting of eighty-four ERC20 token transaction networks collected from 2017 to 2023. Next, we evaluate the transferability of Temporal Graph Neural Networks (TGNNs) for the temporal graph property prediction task by pre-training on a collection of up to sixty-four token transaction networks and then evaluating the downstream performance on twenty unseen token networks. We find that the neural scaling law observed in NLP and Computer Vision also applies in temporal graph learning, where pre-training on a greater number of networks leads to improved downstream performance. To the best of our knowledge, this is the first empirical demonstration of the transferability of temporal graph learning. On downstream token networks, the largest pre-trained model outperforms single model TGNNs on thirteen unseen test networks. Therefore, we believe that this is a promising first step towards building foundation models for temporal graphs. We provide the implementation of TGS athttps://anonymous.4open.science/r/ScalingTGNs.

3901Locate-then-Unlearn: An Effective Method of Multi-Task Continuous Learning for Large Language Models

[openreview] [pdf]

Abstract Nowadays large language models (LLMs) have achieved remarkable success in various NLP tasks. However, they often misinterpret human instructions and generate incorrect or outdated responses, highlighting the need for more effective continual learning techniques. While recent efforts have introduced unlearning methods to remove erroneous knowledge, existing approaches still struggle in multi-task learning scenarios.To overcome these limitations, we propose \textbf{locate-then-unlearn}, a novel framework that identifies and selectively unlearns task-specific neurons to enable efficient multi-task learning. We hypothesize that LLM neurons can be broadly categorized into task-specific neurons for handling individual tasks, and general neurons to maintain the model’s foundational capabilities. To accurately identify task-specific neurons, we utilize a two-stage locating process: (1) ranking task-related neurons based on their importance to each task, and (2) identifying task-specific neurons by applying intervention to assess how neuron activity impacts task performance, isolating those most critical to each task. We conduct comprehensive evaluations in two experimental setups: single-task specialization and multi-task generalization. The results show that our method significantly improves performance across both settings. This indicates that our method effectively balances model efficiency and accuracy in multi-task continual learning.

3902Test-time Contrastive Concepts for Open-World Semantic Segmentation

[openreview] [pdf]

Abstract Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs to align both modalities with a simple contrastive objective, have paved the way to open-vocabulary semantic segmentation. Given an arbitrary set of textual queries, image pixels are assigned the closest query in feature space. However, this works well when a user exhaustively lists all possible visual concepts in an image, which contrast against each other for the assignment. This corresponds to the current evaluation setup in the literature which relies on having access to a list of in-domain relevant concepts, typically classes of a benchmark dataset. Here, we consider the more challenging (and realistic) scenario of segmenting a single concept, given a textual prompt and nothing else. To achieve good results, besides contrasting with the generic “background” text, we propose two different approaches to automatically generate, at test time, textual contrastive concepts that are query-specific. We do so by leveraging the distribution of text in the VLM’s training set or crafted LLM prompts. We also propose a metric designed to evaluate this scenario and show the relevance of our approach on commonly used datasets.

3903Deep Reinforcement Learning for Sequential Combinatorial Auctions

[openreview] [pdf]

Abstract Revenue-optimal auction design is a challenging problem with significant theoretical and practical implications. Sequential auction mechanisms, known for their simplicity and strong strategyproofness guarantees, are often limited by theoretical results that are largely existential, except for certain restrictive settings. Although traditional reinforcement learning methods such as Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) are applicable in this domain, they struggle with computational demands and convergence issues when dealing with large and continuous action spaces. In light of this and recognizing that we can model transitions differentiable for our settings, we propose using a new reinforcement learning framework tailored for sequential combinatorial auctions that leverages first-order gradients. Our extensive evaluations show that our approach achieves significant improvement in revenue over both analytical baselines and standard reinforcement learning algorithms. Furthermore, we scale our approach to scenarios involving up to 50 agents and 50 items, demonstrating its applicability in complex, real-world auction settings. As such, this work advances the computational tools available for auction design and contributes to bridging the gap between theoretical results and practical implementations in sequential auction design.

3904The Implicit Bias of Structured State Space Models Can Be Poisoned With Clean Labels

[openreview] [pdf]

Abstract Neural networks are powered by an implicit bias: a tendency of gradient descent to fit training data in a way that generalizes to unseen data. A recent class of neural network models gaining increasing popularity is structured state space models (SSMs), regarded as an efficient alternative to transformers. Prior work argued that the implicit bias of SSMs leads to generalization in a setting where data is generated by a low dimensional teacher. In this paper, we revisit the latter setting, and formally establish a phenomenon entirely undetected by prior work on the implicit bias of SSMs. Namely, we prove that while implicit bias leads to generalization under many choices of training data, there exist special examples whose inclusion in training completely distorts the implicit bias, to a point where generalization fails. This failure occurs despite the special training examples being labeled by the teacher, i.e. having clean labels! We empirically demonstrate the phenomenon, with SSMs trained independently and as part of non-linear neural networks. In the area of adversarial machine learning, disrupting generalization with cleanly labeled training examples is known as clean-label poisoning. Given the proliferation of SSMs, particularly in large language models, we believe significant efforts should be invested in further delineating their susceptibility to clean-label poisoning, and in developing methods for overcoming this susceptibility.

3905FSEO: A Few-Shot Evolutionary Optimization Framework for Expensive Multi-Objective Optimization and Constrained Optimization

[openreview] [pdf]

Abstract Meta-learning has been demonstrated to be useful to improve the sampling efficiency of Bayesian optimization (BO) and surrogate-assisted evolutionary algorithms (SAEAs) when solving expensive optimization problems (EOPs). However, existing studies focuses on only single-objective optimization, leaving other expensive optimization scenarios unconsidered. We propose a generalized few-shot evolutionary optimization (FSEO) framework and focus on its performance on two common expensive optimization scenarios: multi-objective EOPs (EMOPs) and constrained EOPs (ECOPs). We develop a novel meta-learning modeling approach to train surrogates for our FSEO framework, an accuracy-based update strategy is designed to adapt surrogates during the optimization process. The surrogates in FSEO framework combines neural network with Gaussian Processes (GPs), their network parameters and some parameters of GPs represent useful experience and are meta-learned across related optimization tasks, the remaining GPs parameters are task-specific parameters that represent unique features of the target task. We demonstrate that our FSEO framework is able to improve sampling efficiency on both EMOP and ECOP. Empirical conclusions are made to guide the application of our FSEO framework.

3906Optimal Non-Asymptotic Rates of Value Iteration for Average-Reward Markov Decision Processes

[openreview] [pdf]

Abstract While there is an extensive body of research on the analysis of Value Iteration (VI) for discounted cumulative-reward MDPs, prior work on analyzing VI for (undiscounted) average-reward MDPs has been limited, and most prior results focus on asymptotic rates in terms of Bellman error. In this work, we conduct refined non-asymptotic analyses of average-reward MDPs, obtaining a collection of convergence results advancing our understanding of the setup. Among our new results, most notable are the O(1/k)\mathcal{O}(1/k)-rates of Anchored Value Iteration on the Bellman error under the multichain setup and the span-based complexity lower bound that matches the O(1/k)\mathcal{O}(1/k) upper bound up to a constant factor of 8 in the weakly communicating and unichain setups.

3907RED QUEEN: SAFEGUARDING LARGE LANGUAGE MODELS AGAINST CONCEALED MULTI-TURN ATTACK

[openreview] [pdf]

Abstract The rapid progress of large language models (LLMs) has opened up new opportunities across various domains and applications; yet it also presents challenges related to potential misuse. To mitigate such risks, red teaming, a strategy where developers adopt the role of potential attackers has been employed to probe language models and preemptively guard against such harms. Jailbreak attacks are a commonly used red teaming strategy that uses crafted prompts to bypass safety guardrails. However, current jailbreak attack approaches are single-turn, with explicit malicious queries that do not fully capture the complexity of real-world interactions. In reality, users can engage in multi-turn interactions with LLM-based chat assistants, allowing them to conceal their true intentions in a more covert manner. Research on the Theory of Mind (ToM) reveals that LLMs struggle to infer latent intent, making it crucial to investigate how LLMs handle concealed malicious intent within multi-turn scenarios. To bridge this gap, we propose a new jailbreak approach, RED QUEEN ATTACK. This method constructs a multi-turn scenario, concealing the malicious intent under the guise of preventing harm. Next, we craft 40 scenarios that vary in turns and select 14 harmful categories to generate 56k multi-turn attack data points. We conduct comprehensive experiments on the RED QUEEN ATTACK with four representative LLM families of different sizes. Our experiments reveal that all LLMs are vulnerable to RED QUEEN ATTACK, reaching 87.6% attack success rate on GPT-4o and 77.1% on Llama3-70B. Further analysis reveals that larger models are more susceptible to the RED QUEEN ATTACK, with multi-turn structures and concealment strategies contributing to its success. To prioritize safety, we introduce a straightforward mitigation strategy called RED QUEEN GUARD, which aligns LLMs to effectively counter adversarial attacks. This approach reduces the attack success rate to below 1% while maintaining the model’s performance across standard benchmarks. We release our code and data to support future research.

3908Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach

[openreview] [pdf]

Abstract In Deep Reinforcement Learning models trained using gradient-based techniques, the choice of optimizer and its learning rate are crucial to achieving good performance: higher learning rates can prevent the model from learning effectively, while lower ones might slow convergence. Additionally, due to the non-stationarity of the objective function, the best-performing learning rate can change over the training steps. To adapt the learning rate, a standard technique consists of using decay schedulers. However, these schedulers assume that the model is progressively approaching convergence, which may not always be true, leading to delayed or premature adjustments. In this work, we propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent’s performance during training. LRRL is based on a multi-armed bandit algorithm, where each arm represents a different learning rate, and the bandit feedback is provided by the cumulative returns of the RL policy to update the arms’ probability distribution. Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.

3909Privacy-Preserving Logistic Regression Training with A Faster Gradient Variant

[openreview] [pdf]

Abstract Training logistic regression over encrypted data has been a compelling approach in addressing security concerns for several years. In this paper, we introduce an efficient gradient variant, called quadraticquadratic gradientgradient, for privacy-preserving logistic regression training. We enhance Nesterov’s Accelerated Gradient (NAG), Adaptive Gradient Algorithm (Adagrad) and Adam algorithms by incorporating their quadratic gradients and evaluate these improved algorithms on various datasets. Experimental results demonstrate that the enhanced algorithms achieve significantly improved convergence speed compared to traditional first-order gradient methods. Moreover, we applied the enhanced NAG method to implement homomorphic logistic regression training, achieving comparable results within just 4 iterations. There is a good chance that the quadratic gradient approach could integrate first-order gradient descent/ascent algorithms with the second-order Newton-Raphson methods, and that it could be applied to a wide range of numerical optimization problems.

3910Transformer Encoder Satisfiability: Complexity and Impact on Formal Reasoning

[openreview] [pdf]

Abstract We analyse the complexity of the satisfiability problem (SAT) for transformer encoders (TE), naturally occurring in formal verification or interpretation tasks. We find that SAT is undecidable when considering TE as they are commonly studied in the expressiveness community. Furthermore, we identify practical scenarios where SAT is decidable and establish corresponding complexity bounds. Beyond trivial cases, we find that quantized TE—those restricted by fixed-width arithmetic—lead to the decidability of SAT due to their limited attention capabilities. However, the problem remains difficult, as we establish scenarios where SAT is NEXPTIME-hard and others where it is solvable in NEXPTIME for quantized TE. To complement our complexity results, we place our findings and their implications in the broader context of formal reasoning.

3911PostCast: Generalizable Postprocessing for Precipitation Nowcasting via Unsupervised Blurriness Modeling

[openreview] [pdf]

Abstract Precipitation nowcasting plays a pivotal role in socioeconomic sectors, especially in severe convective weather warnings. Although notable progress has been achieved by approaches mining the spatiotemporal correlations with deep learning, these methods still suffer severe blurriness as the lead time increases, which hampers accurate predictions for extreme precipitation. To alleviate blurriness, researchers explore generative methods conditioned on blurry predictions. However, the pairs of blurry predictions and corresponding ground truth need to be given in advance, making the training pipeline cumbersome and limiting the generality of generative models within blurry modes that appear in training data. By rethinking the blurriness in precipitation nowcasting as a blur kernel acting on predictions, we propose an unsupervised postprocessing method to eliminate the blurriness without the requirement of training with the pairs of blurry predictions and corresponding ground truth. Specifically, we utilize blurry predictions to guide the generation process of a pre-trained unconditional denoising diffusion probabilistic model (DDPM) to obtain high-fidelity predictions with eliminated blurriness. A zero-shot blur kernel estimation mechanism and an auto-scale denoise guidance strategy are introduced to adapt the unconditional DDPM to any blurriness modes varying from datasets and lead times in precipitation nowcasting. Extensive experiments are conducted on 7 precipitation radar datasets, demonstrating the generality and superiority of our method.

3912Adaptive Causal Experimental Design: Amortizing Sequential Bayesian Experimental Design for Causal Models

[openreview] [pdf]

Abstract Interventions are essential for causal discovery and causal reasoning. Acquiring interventional data, however, is often costly, especially in real-world systems. A careful experimental design can therefore bring substantial savings. In the sequential experimental design setting, most existing approaches seek the best interventions in a greedy (myopic) manner that does not account for the synergy from the yet-to-come future experiments. We propose Adaptive Causal Experimental Design (ACED), a novel Bayesian sequential design framework for learning a design policy capable of generating non-myopic interventions that incorporate the effect on future experiments. In particular, ACED maximizes the Expected Information Gain (EIG) on flexible choices of causal quantities of interest (e.g., causal structure, specific causal effects) directly, bypassing the need for computing intermediate posteriors in the experimental sequence. Leveraging a variational lower bound estimator for the EIG, ACED trains an amortized policy network that can be executed rapidly during deployment. We present numerical results demonstrating ACED’s effectiveness on synthetic datasets with both linear and nonlinear structural causal models, as well as on in-silico single-cell gene expression datasets.

3913Token to Token Learning From Videos

[openreview] [pdf]

Abstract We empirically study generative pre-training from videos. Our approach is conceptually simple and inspired by generative pre-training from images, iGPT. To enable scaling to videos, we make several important improvements along the data, architecture, and evaluation axes. Our model, called Toto, is a causal transformer that generates videos autoregressively, one token at a time. We pre-train our model on a diverse set of videos with over 1 trillion visual tokens. Our tokens are quantized patch embeddings, rather than pixels, and we use relative embeddings for coarse-to-fine pre-training. We conduct a large-scale study across a suite of diverse benchmarks, including image recognition, video classification, object tracking, robotic manipulation and scaling behaviours. We find that, despite minimal inductive biases, our approach achieves competitive performance across all benchmarks.

3914Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization

[openreview] [pdf]

Abstract Neural networks often favor shortcut heuristics based on surface-level patterns. As one example, language models (LMs) can behave like n-gram models early in training. However, to correctly apply grammatical rules, LMs must instead rely on hierarchical syntactic representations rather than the surface-level heuristics derived from n-grams. In this work, we use cases studies of English grammar to explore how latent structures in training data can enable models to overcome their early capabilities and achieve complex generalization behaviors. We demonstrate that sentences with complex grammatical structure drive the model’s inductive bias towards hierarchical representation. We then investigate how data composition can lead to inconsistent behavior across random seeds, finding that models stabilize in their out-of-distribution (OOD) behavior only when they commit to either a surface-level heuristic or a hierarchical rule. When the data contains a mix of simple and complex examples, potential rules compete, leading to unstable dynamics in training runs that fail to commit. We also identify an exception to the relationship between stability and generalization: Models which memorize patterns from homogeneous training data can stabilize in a memorization regime without learning either rule. While existing works have attributed similar generalization behavior to training objective and model architecture, our findings emphasize the critical role of training data in shaping generalization patterns and how competition between data subsets contributes to inconsistent generalization outcomes.

3915Straight to Zero: Why Linearly Decaying the Learning Rate to Zero Works Best for LLMs

[openreview] [pdf]

Abstract LLMs are commonly trained with a learning rate (LR) warmup, followed by cosine decay to 10% of the maximum (10x decay). In a large-scale empirical study, we show that under an optimal max LR, a simple linear decay-to-zero (D2Z) schedule consistently outperforms other schedules when training at compute-optimal dataset sizes. Benefits increase further with more training tokens; e.g., a 617M-parameter model trained for 80 tokens-per-parameter (TPP) using D2Z achieves lower loss than when trained for 200 TPP using 10x decay, corresponding to an astonishing 60% FLOPs savings. This implies models like Llama2-7B, trained for 286 TPP with 10x decay, were severely under-decayed. We demonstrate the benefits of D2Z across a range of model sizes, batch sizes, and other training configurations. We explain the success of linear D2Z via a novel interpretation of AdamW as a convex combination of weight updates, with coefficients governed by the LR schedule. This interpretation demonstrates how linear D2Z balances the demands of early training (moving away quickly from initial conditions) and late training (smoothing over more updates to mitigate gradient noise).

3916Anomalies are Streaming: Continual Learning for Weakly Supervised Video Anomaly Detection

[openreview] [pdf]

Abstract Weakly supervised video anomaly detection (WSVAD) aims to locate frame-level anomalies with only video-level annotations provided. However, existing WSVAD methods struggle to adapt to real-world scenarios, where unseen anomalies are continuously introduced, thereby making the training of WSVAD essentially a process of continual learning. In this paper, we pioneer to explore the continual learning for weakly supervised video anomaly detection (CL-WSVAD), seeking to mitigate the catastrophic forgetting when the detection model learns new anomalies. We propose normality representation pre-training prior to continual learning, utilizing potential anomaly texts to guide the model in learning robust normality representations, which improves discrimination from potential incremental anomalies. Additionally, we introduce a mixed-up cross-modal alignment method to assist in adapting the pretrained model on CL-WSVAD. Subsequently, we propose a continual learning framework based on sequentially retaining the learnable text prompts for each type of anomaly, which effectively mitigates catastrophic forgetting. Experiments on our established CL-WSVAD benchmarks demonstrate the superiority of proposed method.

3917Multi-player Multi-armed Bandits with Delayed Feedback

[openreview] [pdf]

Abstract Multi-player multi-armed bandits have been researched for a long time due to their application in cognitive radio networks. In this setting, multiple players select arms at each time and instantly receive the feedback. Most research on this problem focuses on the content of the immediate feedback, whether it includes both the reward and collision information or the reward alone. However, delay is common in cognitive networks when users perform spectrum sensing. In this paper, we design an algorithm DDSE (Decentralized Delayed Successive Elimination) in multi-player multi-armed bandits with stochastic delay feedback and establish a regret bound. Compared with existing algorithms that fail to address this problem, our algorithm enables players to adapt to delayed feedback and avoid collision. We also derive a lower bound in centralized setting to prove the algorithm achieves near-optimal. Extensive experiments validate the effectiveness of our algorithm.

3918Adaptive Curvature Step Size: A Path Geometry Based Approach to Optimization

[openreview] [pdf]

Abstract We propose the Adaptive Curvature Step Size (ACSS) method, which dynamically adjusts the step size based on the local geometry of the optimization path. Our approach computes the normalized radius of curvature using consecutive gradients along the iterate path and sets the step-size equal to this radius. The effectiveness of ACSS stems from its ability to adapt to the local landscape of the optimization problem. In regions of low curvature, where consecutive gradient steps are nearly identical, ACSS allows for larger steps. Conversely, in areas of high curvature, where gradient steps differ significantly in direction, ACSS reduces the step size. This adaptive behavior enables more efficient navigation of complex loss landscapes. A key advantage of ACSS is its adaptive behavior based on local curvature information, which implicitly captures aspects of the function’s second-order geometry without requiring additional memory. We provide a generalized framework for incorporating ACSS into various optimization algorithms, including SGD, Adam, AdaGrad, and RMSProp. Through extensive empirical evaluation on 20 diverse datasets, we compare ACSS variants against 12 popular optimization methods. Our results consistently show that ACSS provides performance benefits. Our results consistently show that ACSS provides performance benefits. We provide PyTorch implementations of ACSS versions for popular optimizers at ouranonymized code repository.

3919Memory Proxy Maps for Visual Navigation

[openreview] [pdf]

Abstract Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three tiered agent. Key to our approach is a memory proxy map (MPM), an intermediate representation of the environment learned in a self-supervised manner by the high-level manager agent that serves as a simplified memory, approximating what the agent has seen. We demonstrate that recording observations in this learned latent space is an effective and efficient memory proxy that can remove the need for graphs and odometry in visual navigation tasks. For the mid-level manager agent, we develop a waypoint network (WayNet) that outputs intermediate subgoals, or waypoints, imitating human waypoint selection during local navigation. For the low-level worker agent, we learn a classifier over a discrete action space that avoids local obstacles and moves the agent towards the WayNet waypoint. The resulting feudal navigation network offers a novel approach with no RL, no graph, no odometry, and no metric map; all while achieving SOTA results on the image goal navigation task.

3920Towards shutdownable agents via stochastic choice

[openreview] [pdf]

Abstract Some worry that advanced artificial agents may resist being shut down. The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn’t happen. A key part of the IPP is using a novel ‘ Discounted REward for Same-Length Trajectories (DREST)’ reward function to train agents to (1) pursue goals effectively conditional on each trajectory-length (be ‘USEFUL’), and (2) choose stochastically between different trajectory-lengths (be ‘NEUTRAL’ about trajectory-lengths). In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY. We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL. Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.

3921Parameter-Efficient Fine-Tuning with Circulant and Diagonal Vectors

[openreview] [pdf]

Abstract Foundation models have achieved tremendous success in different domains. However, their huge computation and storage complexity make these models difficult to fine-tune and also less applicable in practice. Recent study shows training in fourier domain can be an effective fine-tuning method in terms of both model performance and number of training parameters. In this work, we propose to further reduce the complexity by using the product of interleaved circulant and diagonal matrices. Our method avoids the construction of weight change matrix and applies 1D fast fourier transform (FFT) instead of 2D FFT. Experimental results show that our method achieves similar or better performance across various tasks with much less floating-point operations (FLOPs).

3922Activation Decay by Loss Smoothing to Enhance Generalization

[openreview] [pdf]

Abstract Generalization in deep learning is strongly influenced by the sharpness of the minima encountered during training. We introduce a novel, deterministic, and computationally efficient method called \emph{activation decay}, designed to flatten sharp minima and improve generalization across a wide range of tasks. Derived from Gaussian smoothing, activation decay operates by regularizing the activations of critical network layers, effectively reducing sharpness and improving robustness. Unlike stochastic techniques such as dropout or the more computationally expensive Sharpness-Aware Minimization (SAM), our approach requires no additional computational overhead, making it particularly suited for large-scale models. We further demonstrate that activation decay can be seamlessly combined with other regularization techniques, offering enhanced regularization without increasing training complexity. Extensive experiments on CIFAR-10, ImageNet, and natural language processing (NLP) tasks validate our approach, showing consistent improvements in generalization and robustness to label noise.

3923AIR: Zero-shot Generative Model Adaptation with Iterative Refinement

[openreview] [pdf]

Abstract Zero-shot generative model adaptation (ZSGM) aims to adapt a pre-trained generator to a target domain using only text guidance and without any samples from the target domain. Central to recent ZSGM approaches aredirectional losswhich use the text guidance in the form of aligning the image offset with text offset in the embedding space of a vision-language model like CLIP. This is similar to the analogical reasoning in NLP where the offset between one pair of words is used to identify a missing element in another pair by aligning the offset between these two pairs. However, a major limitation of existing ZSGM methods is that the learning objective assumes the complete alignment between image offset and text offset in the CLIP embedding space.Our workmakes two main contribution. Inspired by the offset misalignment studies in NLP, as our first contribution, we perform an empirical study to analyze the misalignment between text offset and image offset in CLIP embedding space for various large publicly available datasets. Our important finding is that offset misalignment in CLIP embedding space is correlated with concept distance,i.e., close concepts have a less offset misalignment. To address the limitations of the current approaches, as our second contribution, we propose Adaptaiotn with Iterative Refinement (AIR) which mitigates the offset misalignment issue in directional loss by iteratively selecting anchor points closer to the target domain. Extensive experimental results show that the proposed AIR approach achieves SOTA performance across various adaptation setups.

3924Familiarity-Aware Evidence Compression for Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Retrieval-augmented generation (RAG) improves large language models (LMs) by incorporating non-parametric knowledge through evidence retrieved from external sources. However, it often struggles to cope with inconsistent and irrelevant information that can distract the LM from its tasks, especially when multiple evidence pieces are required. While compressing the retrieved evidence with a compression model aims to address this issue, the compressed evidence may still be unfamiliar to the target model used for downstream tasks, potentially failing to utilize the evidence effectively. We propose FaviComp (Familiarity-aware Evidence Compression), a novel training-free evidence compression technique that makes retrieved evidence more familiar to the target model, while seamlessly integrating parametric knowledge from the model. Specifically, FaviComp proactively composes the compressed evidence in a way to lower the perplexity of the target model by combining decoding probabilities from both the compression model and the target model to generate context that is more familiar to the target model. This approach balances the integration of parametric and non-parametric knowledge, which is especially helpful in complex tasks where the retrieved evidence set may not contain all the necessary information. Experimental results show that FaviComp consistently outperforms most recent evidence compression baselines across multiple open-domain QA datasets, improving accuracy by up to 23.91% while achieving high compression rates. Additionally, we demonstrate the effective integration of both parametric and non-parametric knowledge during evidence compression.

3925HMoRA: Making LLMs More Effective with Hierarchical Mixture of LoRA Experts

[openreview] [pdf]

Abstract Recent studies have combined Mixture of Experts (MoE) and Parameter-Efficient Fine-tuning (PEFT) to fine-tune large language models (LLMs), holding excellent performance in multi-task scenarios while remaining resource-efficient. Yet, existing MoE methods still exhibit three major limitations: (1) Most MoE routing methods focus solely on token-level or task-level routing, which significantly limits the exploration of multi-granularity information. (2) Task-level routing methods are confined to tasks encountered during training, failing to generalize to unseen tasks. (3) The lack of certainty in existing MoE routing methods hinders the specialization of the experts. To address these challenges, we propose HMoRA, a hierarchical fine-tuning method that combines MoE and LoRA, employing hybrid routing that integrates token-level and task-level routing in a hierarchical manner. This hybrid routing allows the model to capture both fine-grained token information and broader task contexts. To improve the certainty of expert selection, a novel routing auxiliary loss is introduced, enabling the task router to differentiate between tasks without supervision and to generalize to unseen tasks. Additionally, several optional lightweight designs have been proposed to significantly reduce both the number of trainable parameters and computational costs. Experimental results demonstrate that HMoRA outperforms full fine-tuning across multiple NLP benchmarks, while fine-tuning only 3.9% of the parameters. The code is available on:https://anonymous.4open.science/r/HMoRA-2648.

3926Harnessing Query Heterogeneity for Cost-Effective Proactive Caching in LLM Inference

[openreview] [pdf]

Abstract As Large Language Models (LLMs) significantly enhance the capabilities of AI systems, the increasing volume of query processing requests presents challenges for cost-effective inference, particularly due to repetitive queries that lead to unnecessary resource consumption and increased costs. Caching strategies are employed to store a small set of previous queries, enabling direct retrieval of repetitive queries without reprocessing by the LLMs. However, existing caching algorithms often assume uniform query lengths, simplifying cache selection to a top-KK problem, which is inadequate for real-world scenarios with heterogeneous lengths. To address this issue, we propose a bandit learning algorithm for proactive query caching in LLMs, specifically considering variable-sized queries. We cast the optimal cache query cache problem as a knapsack problem. Since the repetitive pattern and processing cost are unknown and has uncertainty, we cast the learning-to-cache problem as a bandit learning problem. Compared to conventional bandit learning frameworks, a new technical challenge is that the reward of an arm would not be observed if it is pulled. To tackle this, we propose an Lower confidence bound (LCB)-type algorithm, which we prove has a O~(T)\tilde{O}(\sqrt{T}) order of regret and show that our regret does not deteriorate compared to previous results when incorporating a variable size setting. Furthermore, we demonstrate that our online cache policy effectively reduces the additional computational overhead typically associated with calculating the optimal cache.

3927ThreadsGAN: Enhancing Coherence and Diversity in Discussion Thread Generation

[openreview] [pdf]

Abstract Current research on generating discussion threads faces challenges in coherence, interactivity, and multi-topic handling, which are crucial for meaningful responses. This paper introduces threadsGAN, a model that enhances thread generation by incorporating multi-topic and social response intention tags. By leveraging BERT and Transformer, threadsGAN ensures contextual coherence and manages topic consistency. Additionally, it employs conditional generation to align responses with specific discussion contexts, and its CNN-based discriminator assesses response quality by evaluating similarity between generated and real responses, improving overall performance in generating realistic and contextually appropriate discussion threads.

3928PLAY2PROMPT: Zero-shot Tool Instruction Optimization for LLM Agents via Tool Play

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly integrated with external tools to complete user requests. Many real-world applications require LLMs to use specialized tools in a zero-shot setting. To achieve this, current methods primarily rely on prompting LLMs with tool-specific information, yet tool documentation is often underspecified or noisy, limiting effectiveness. Manual improvements are inefficient and impractical, as they require domain expertise to rewrite documentation and test on carefully curated held-out datasets to evaluate performance gains. Automatic prompt engineering techniques are not applicable either, because they require labeled examples, which is unavailable in the zero-shot setting. In this work, we introduce PLAY2PROMPT, an automated framework that iteratively refines tool documentation and generates usage examples. PLAY2PROMPT enables LLMs to explore tool input-output behaviors, allowing us to effectively search the space of possible tool descriptions and examples. The generated examples not only guide LLM inference but also serve as validation data to ensure more effective tool use. Extensive experiments on real-world tasks demonstrate significant improvements in zero-shot tool performance across both open- and closed-source models.

3929Tree Search for Simultaneous Move Games via Equilibrium Approximation

[openreview] [pdf]

Abstract Neural network supported tree-search has shown strong results in a variety of perfect information multi-agent tasks. However, the performance of these methods on partial information games has generally been below competing approaches. Here we study the class of simultaneous-move games, which are a subclass of partial information games which are most similar to perfect information games: both agents know the game state with the exception of the opponent’s move, which is revealed only after each agent makes its own move. Simultaneous move games include popular benchmarks such as Google Research Football and Starcraft.In this study we answer the question: can we take tree search algorithms trained through self-play from perfect information settings and adapt them to simultaneous move games without significant loss of performance? We answer this question by deriving a practical method that attempts to approximate a coarse correlated equilibrium as a subroutine within a tree search. Our algorithm works on cooperative, competitive, and mixed tasks. Our results are better than the current best MARL algorithms on a wide range of accepted baselines.

3930ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

[openreview] [pdf]

Abstract Diffusion models have emerged as a powerful generative technology and have been found to be applicable in various scenarios. Most existing foundational diffusion models are primarily designed for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents these foundational diffusion models from serving as a unified model in the field of visual generation, like GPT-4 in the natural language processing field. In this work, we propose ACE, an All-round Creator and Editor, which achieves comparable performance compared to those expert models in a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed Long-context Condition Unit (LCU), and propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the issue of the absence of available training data. It involves acquiring pairwise images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated pairs data across a variety of visual generation tasks. The extensive experimental results demonstrate the superiority of our model in visual generation fields. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation using a single model to serve as the backend, avoiding the cumbersome pipeline typically employed in visual agents.

3931Sampling-guided Heterogeneous Graph Neural Network with Temporal Smoothing for Scalable Longitudinal Data Imputation

[openreview] [pdf]

Abstract In this paper, we propose a novel framework, the Sampling-guided Heterogeneous Graph Neural Network (SHT-GNN\text{S\small{HT-GNN}}), to effectively tackle the challenge of missing data imputation in longitudinal studies. Unlike traditional methods, which often require extensive preprocessing to handle irregular or inconsistent missing data, our approach accommodates arbitrary missing data patterns while maintaining computational efficiency. SHT-GNN\text{S\small{HT-GNN}} models both observations and covariates as distinct node types, connecting observation nodes at successive time points through subject-specific longitudinal subnetworks, while covariate-observation interactions are represented by attributed edges within bipartite graphs. By leveraging subject-wise mini-batch sampling and a multi-layer temporal smoothing mechanism, SHT-GNN\text{S\small{HT-GNN}} efficiently scales to large datasets, while effectively learning node representations and imputing missing data. Extensive experiments on both synthetic and real-world datasets, including the Alzheimer’s Disease Neuroimaging Initiative (ADNI\text{A\small{DNI}}) dataset, demonstrate that SHT-GNN\text{S\small{HT-GNN}} significantly outperforms existing imputation methods, even with high missing data rates (e.g., 80%). The empirical results highlight SHT-GNN\text{S\small{HT-GNN}}’s robust imputation capabilities and superior performance, particularly in the context of complex, large-scale longitudinal data.

3932MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning

[openreview] [pdf]

Abstract Low-rank adaptation (LoRA) is a popular parameter-efficient fine-tuning (PEFT) method for large language models (LLMs). In this paper, we analyze the impact of low-rank updating, as implemented in LoRA. Our findings suggest that the low-rank updating mechanism may limit the ability of LLMs to effectively learn and memorize new knowledge. Inspired by this observation, we propose a new method called MoRA, which employs a square matrix to achieve high-rank updating while maintaining the same number of trainable parameters. To achieve it, we introduce the corresponding non-parameter operators to reduce the input dimension and increase the output dimension for the square matrix. Furthermore, these operators ensure that the weight can be merged back into LLMs, which enables our method to be deployed like LoRA. We perform a comprehensive evaluation of our method across five tasks: instruction tuning, mathematical reasoning, continual pretraining, memory and pretraining. Our method outperforms LoRA on memory-intensive tasks and achieves comparable performance on other tasks.

3933BRSSD10k : A SEGMENTATION DATASET\OF BANGLADESHI ROAD SCENARIO

[openreview] [pdf]

Abstract In this paper, we present a novel Bangladeshi Road Scenario Segmentation Dataset designed to advance autonomous driving technologies under the challenging and diverse road conditions of Bangladesh. This comprehensive instance segmentation dataset comprised 10,082 high-resolution images captured across nine major cities, including Dhaka, Sylhet, Chittagong, and Rajshahi, addressing the critical need for region-specific computer vision data in developing countries. Unlike existing autonomous driving datasets that primarily focus on western road conditions, BRSSD10k encompasses a wide range of environments unique to Bangladesh, including unstructured urban areas, hilly terrains, village roads, and densely populated city centers. The dataset features instance segmentation annotations with classes specifically tailored to reflect the distinctive elements of Bangladeshi roads, such as rickshaws, CNGs (auto-rickshaws), informal roadside stalls, and various nonstandard vehicles. To demonstrate its utility as a benchmarking tool for autonomous driving systems, we present comparative results from several state-of-the-art instance segmentation models tested on this dataset, achieving an mAP of 0.441. This evaluation not only showcases the dataset’s effectiveness in assessing model performance but also underscores the need for adaptive algorithms capable of handling diverse and unpredictable urban environments in the context of autonomous navigation.

3934Conformal prediction for causal effects of continuous treatments

[openreview] [pdf]

Abstract Uncertainty quantification of causal effects is crucial for safety-critical applications such as personalized medicine. A powerful approach for this is conformal prediction, which has several practical benefits due to model-agnostic finite-sample guarantees. Yet, existing methods for conformal prediction of causal effects are limited to binary/discrete treatments and make highly restrictive assumptions such as known propensity scores. In this work, we provide a novel conformal prediction method for potential outcomes of continuous treatments. We account for the additional uncertainty introduced through propensity estimation so that our conformal prediction intervals are valid even if the propensity score is unknown. Our contributions are three-fold: (1) We derive finite-sample prediction intervals for potential outcomes of continuous treatments. (2) We provide an algorithm for calculating the derived intervals. (3) We demonstrate the effectiveness of the conformal prediction intervals in experiments on synthetic and medical datasets. To the best of our knowledge, we are the first to propose conformal prediction for continuous treatments when the propensity score is unknown and must be estimated from data.

3935HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

[openreview] [pdf]

Abstract AI agents are increasingly autonomous in their interactions with human users and tools, leading to increased interactional safety risks. We present HAICOSYSTEM, a framework examining AI agent safety within diverse and complex social interactions. HAICOSYSTEM features a modular sandbox environment that simulates multi-turn interactions between human users and AI agents, where the AI agents are equipped with a variety of tools (e.g., patient management platforms) to navigate diverse scenarios (e.g., a user attempting to access other patients’ profiles). To examine the safety of AI agents in these interactions, we develop a comprehensive multi-dimensional evaluation framework that uses metrics covering operational, content-related, societal, and legal risks. Through running over 8k simulations based on 132 scenarios across seven domains (e.g., healthcare, finance, education), we demonstrate that HAICOSYSTEM can emulate realistic user-AI interactions and complex tool use by AI agents. Our experiments show that state-of-the-art LLMs, both proprietary and open-sourced, exhibit safety risks in over 62% cases, with models generally showing higher risks when interacting with simulated malicious users. Our findings highlight the ongoing challenge of building agents that can safely navigate complex interactions, particularly when faced with malicious users. To foster the AI agent safety ecosystem, we release a code platform that allows practitioners to create custom scenarios, simulate interactions, and evaluate the safety and performance of their agents.

3936On the self-verification limitations of large language models on reasoning and planning tasks

[openreview] [pdf]

Abstract There has been considerable divergence of opinion on the reasoning abilities of Large Language Models (LLMs). While the initial optimism that reasoning might emerge automatically with scale has been tempered thanks to a slew of counterexamples--ranging from multiplication to simple planning--there persists a wide spread belief that LLMs can self-critique and improve their own solutions in an iterative fashion. This belief seemingly rests on the assumption that verification of correctness should be easier than generation--a rather classical argument from computational complexity--which should be irrelevant to LLMs to the extent that what they are doing is approximate retrieval. In this paper, we set out to systematically investigate the effectiveness of iterative prompting in the context of reasoning and planning. We present a principled empirical study of the performance of GPT-4 in three domains: Game of 24, Graph Coloring, and STRIPS planning. We experiment both with the model critiquing its own answers and with an external correct reasoner verifying proposed solutions. In each case, we analyze whether the content of criticisms actually affects bottom line performance, and whether we can ablate elements of the augmented system without losing performance. We observe significant performance collapse with self-critique and significant performance gains with sound external verification. We also note that merely re-prompting with a sound verifier maintains most of the benefits of more involved setups.

3937Generalization Bounds for Neural Ordinary Differential Equations and Residual Neural Networks

[openreview] [pdf]

Abstract Neural ordinary differential equations (neural ODEs) represent a widely-used class of deep learning models characterized by continuous depth. Understand- ing the generalization error bound is important to evaluate how well a model is expected to perform on new, unseen data. Earlier works in this direction involved considering the linear case on the dynamics function (a function that models the evolution of state variables) of Neural ODE Marion (2024). Other related work is on bound for Neural Controlled ODE Bleistein & Guilloux (2023) that de- pends on the sampling gap. We consider a class of neural ordinary differential equations (ODEs) with a general nonlinear function for time-dependent and time- independent cases which is Lipschitz with respect to state variables. We observed that the solution of the neural ODEs would be of bound variations if we assume that the dynamics function of Neural ODEs is Lipschitz continuous with respect to the hidden state. We derive a generalization bound for the time-dependent and time-independent Neural ODEs.Using the fact that Neural ODEs are limiting cases of time-dependent Neural ODEs we obtained a bound for the residual neural networks. We showed the effect of overparameterization and domain bound in the generalization error bound. This is the first time, the generalization bound for the Neural ODE with a more general non-linear function has been found.

3938CausalRivers - Scaling up benchmarking of causal discovery for real-world time-series

[openreview] [pdf]

Abstract Causal discovery, or identifying causal relationships from observational data, is a notoriously challenging task, with numerous methods proposed to tackle it. Despite this, in-the-wild evaluation is still lacking, as works frequently rely on synthetic data evaluation and sparse real-world examples under critical theoretical assumptions. Real-world causal structures, however, are often complex, evolving over time, non-linear, and influenced by unobserved factors, making it hard for practitioners to select appropriate methods. To bridge this gap, we introduce CausalRivers, the largest in-the-wild causal discovery benchmarking kit for time series data to date. CausalRivers features an extensive dataset on river discharge that covers the complete eastern German territory (666 measurement stations) and the state of Bavaria (494 measurement stations). It spans the years 2019 to 2023 with a 15-minute temporal resolution. Further, we provide data from a recent flood around the Elbe River, as an event with a pronounced distributional shift. Leveraging multiple sources of information and time-series meta-data, we constructed two distinct causal ground truth graphs (Bavaria and eastern Germany). These graphs can be sampled to generate thousands of subgraphs to benchmark causal discovery across diverse and challenging settings. To demonstrate the utility of our benchmarking kit, we evaluate several causal discovery approaches through multiple experiments and introduce effective baselines, identifying several areas for enhancement. CausalRivers has the potential to facilitate robust evaluations and comparisons of causal discovery methods. Besides this primary purpose, we also expect that this dataset will be relevant for connected areas of research, such as time series forecasting and anomaly detection. Based on this, we hope to establish benchmark-driven method development that fosters advanced techniques for causal discovery, as is the case for many other areas of machine learning.

3939A near linear query lower bound for submodular maximization

[openreview] [pdf]

Abstract We revisit the problem of selecting kk-out-of-nn elements with the goal of optimizing an objective function, and ask whether it can be solved approximately with sublinear query complexity.For objective functions that are monotone submodular, [Li, Feldman, Kazemi, Karbasi, NeurIPS’22] gave an Ω(n/k)\Omega(n/k) query lower bound for approximating to within any constant factor. We strengthen their lower bound to a nearly tight Ω~(n)\tilde{\Omega}(n). This lower bound holds even for estimating the value of the optimal subset.When the objective function is additive (i.e.~f(S)=iSwif(S) = \sum_{i \in S} w_i for unknown wiw_is), we prove that finding an approximately optimal subset still requires near-linear query complexity, but we can estimate the value of the optimal subset in O~(n/k)\tilde{O}(n/k) time, and that this is tight up to polylog factors.

3940GReaTer: Gradients Over Reasoning Makes Smaller Language Models Strong Prompt Optimizers

[openreview] [pdf]

Abstract The effectiveness of large language models (LLMs) is closely tied to the design of prompts, making prompt optimization essential for enhancing their performance across a wide range of tasks. Although recent advancements have focused on automating prompt engineering, many existing approaches rely exclusively on textual feedback, refining prompts based solely on inference errors identified by large, computationally expensive LLMs. Unfortunately, smaller models struggle to generate high-quality feedback, resulting in complete dependence on large LLM judgment. Moreover, these methods fail to leverage more direct and finer-grained information, such as gradients, due to operating purely in text space. To this end, we introduce, we introduceGReaTer, a novel prompt optimization technique that directly incorporatesgradient information over task-specific reasoning. By utilizing task loss gradients,GReaTerenables self-optimization of prompts for smaller, lightweight language models (LM) without the need for costly closed-source LLMs, while maintaining reasonable prompt structures. This allows high-performance prompt optimization without dependence on massive LLMs, closing the gap between smaller models and the sophisticated reasoning often needed for prompt refinement. Extensive evaluations across diverse tasks demonstrate that \ours consistently outperforms previous methods, even those reliant on powerful LLMs. Additionally,GReaTer-optimized prompts frequently exhibit better transferability and, in some cases, boost task performance to levels comparable to or surpassing those achieved by larger language models, highlighting the effectiveness of"gradient over reasoning"-based prompt optimization. Full source code ofGReaTerwill be available upon acceptance.

3941Reducing the Scope of Language Models with Circuit Breakers

[openreview] [pdf]

Abstract Language models are now deployed in a wide variety of user-facing applications, often for specific purposes like answering questions about documentation or acting as coding assistants. As these models are intended for particular purposes, they should not be able to answer irrelevant queries like requests for poetry or questions about physics, or even worse, queries that can only be answered by humans like sensitive company policies. Instead we would like them to only answer queries corresponding to desired behavior and refuse all other requests, which we refer to as scoping. We find that, despite the use of system prompts, two representative language models can be poorly scoped and respond to queries they should not be addressing. We then conduct a comprehensive empirical evaluation of methods which could be used for scoping the behavior of language models. Among many other results, we show that a recently-proposed method for general alignment, Circuit Breakers (CB), can be adapted to scope language models to very specific tasks like sentiment analysis or summarization or even tasks with finer-grained scoping (e.g. summarizing only news articles). When compared to standard methods like fine-tuning or preference learning, CB is more robust both for out of distribution tasks, and to adversarial prompting techniques. We also show that layering SFT and CB together often results in the best of both worlds: improved performance only on relevant queries, while rejecting irrelevant ones.

3942Contrastive Learning Via Equivariant Representation

[openreview] [pdf]

Abstract Invariant Contrastive Learning (ICL) methods have achieved impressive performance across various domains. However, the absence of latent space representation for distortion (augmentation)-related information in the latent space makes ICL sub-optimal regarding training efficiency and robustness in downstream tasks. Recent studies suggest that introducing equivariance into Contrastive Learning (CL) can improve overall performance. In this paper, we revisit the roles of augmentation strategies and equivariance in improving CL’s efficacy. We propose CLeVER (Contrastive Learning Via Equivariant Representation), a novel equivariant contrastive learning framework compatible with augmentation strategies of arbitrary complexity for various mainstream CL backbone models. Experimental results demonstrate that CLeVER effectively extracts and incorporates equivariant information from practical natural images, thereby improving the training efficiency and robustness of baseline models in downstream tasks and achieving state-of-the-art (SOTA) performance. Moreover, we find that leveraging equivariant information extracted by CLeVER simultaneously enhances rotational invariance and sensitivity across experimental tasks, and helps stabilize the framework when handling complex augmentations, particularly for models with small-scale backbones.

3943MIGA: Mixture-of-Experts with Group Aggregation for Stock Market Prediction

[openreview] [pdf]

Abstract Stock market prediction has remained an extremely challenging problem for many decades owing to its inherent high volatility and low information noisy ratio. Existing solutions based on machine learning or deep learning demonstrate superior performance by employing a single model trained on the entire stock dataset to generate predictions across all types of stocks. However, due to the significant variations in stock styles and market trends, a single end-to-end model struggles to fully capture the differences in these stylized stock features, leading to relatively inaccurate predictions for all types of stocks. In this paper, we present MIGA, a novel Mixture of Expert with Group Aggregation framework designed to generate specialized predictions for stocks with different styles of by dynamically switching between distinct style experts. To promote collaboration among different experts in MIGA, we propose a novel inner group attention architecture, enabling experts within the same group to share information and thereby enhancing the overall performance of all experts. As a result, MIGA significantly outperforms other end-to-end models on three Chinese Stock Index benchmarks including CSI300, CSI500 and CSI1000. Notably, MIGA-Conv reaches 24 % excess annual return on CSI300 benchmark, surpassing the previous state-of-the-art model by 8% absolute. Furthermore, we conduct a comprehensive analysis of mixture of experts for stock market prediction, providing valuable insights for future research.

3944Harmonious convergence for confidence estimation in depth estimation and completion

[openreview] [pdf]

Abstract Confidence estimation for monocular depth estimation and completion is important for their deployment in real-world applications. Recent models for confidence estimation in these regression tasks mainly rely on the statistical characteristics of training and test data, while ignoring the information from the model training. We propose a harmonious convergence estimation approach for confidence estimation in the regression tasks, taking training consistency into consideration. Specifically, we propose an intra-batch convergence estimation algorithm with two sub-iterations to compute the training consistency for confidence estimation. A harmonious convergence loss is newly designed to encourage the consistency between confidence measure and depth prediction. Our experimental results on the NYU2 and KITTI datasets show improvements ranging from 10.91% to 43.90% across different settings in monocular depth estimation, and from 27.91% to 45.24% in depth completion, measured by Pearson correlation coefficients, justifying the effectiveness of the proposed method. We will release all the codes upon the publication of our paper.

3945Leveraging Low Rank Structure in The Lazy Regime

[openreview] [pdf]

Abstract Understanding the training dynamics of neural networks has gained much interest in the scientific community. The dynamics of training over-parameterized models is characterized by the lazy regime in which networks exhibit near-linear behavior and minimal parameter changes. In addition, it has been argued that the Jacobian of large neural models has a low-rank structure. In this paper, we focus on the opportunities laid out by the combination of low-rankness and laziness of large neural models. Specifically, we provide a scalable way to measure the extent of laziness, evaluated via the rate of change of the model Jacobian, as well as a scalable method to verify low-rankness of the model Jacobian without storing the entire Jacobian. Taking advantages of both laziness and low-rankness, we design a scalable training algorithm for over-parameterized models that performs backpropagation-free gradient descend training. In particular, this algorithm is of lower computation and storage requirements in cases of massive parameter sharing, as is the case of many state-of-the-art neural architectures. Empirical results confirm the scalability and effectiveness of our approach, opening new pathways for exploring novel learning strategies in neural networks.

3946A Guide to Misinformation Detection Datasets

[openreview] [pdf]

Abstract Misinformation is a complex societal issue, and mitigating solutions are difficult to create due to data deficiencies. To address this problem, we have curated the largest collection of (mis)information datasets in the literature, totaling 75. From these, we evaluated the quality of all of the 35 datasets that consist of statements or claims. We assess these datasets to identify those with solid foundations for empirical work and those with flaws that could result in misleading and non-generalizable results, such as insufficient label quality, spurious correlations, or political bias. We further provide state-of-the-art baselines on all these datasets, but show that regardless of label quality, categorical labels may no longer give an accurate evaluation of detection model performance. We discuss alternatives to mitigate this problem. Overall, this guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations, ultimately improving research in misinformation detection.

3947Pretrained Hybrids with MAD Skills

[openreview] [pdf]

Abstract While Transformers underpin modern large language models (LMs), a growing list of alternative architectures with new capabilities, promises, and tradeoffs is emerging. This makes choosing the right LM architecture challenging. Recently proposedhybrid architecturesseek a best-of-all-worlds approach that reaps the benefits of all architectures. Hybrid design is difficult for two reasons: it requires manual expert-driven search, and new hybrids must be trained from scratch. We proposeManticore, a framework that addresses these challenges byautomating the design of hybrid architectureswhile reusing pretrained models to createpretrainedhybrids. Our approach augments ideas from differentiable Neural Architecture Search (NAS) by incorporating simple projectors that translate features between pretrained blocks from different architectures. We then fine-tune hybrids that combine pretrained models from different architecture families---such as the GPT series and Mamba---end-to-end. With Manticore, we enable LM selection without training multiple models, the construction of pretrained hybrids from existing pretrained models, and the ability toprogrampretrained hybrids to have certain capabilities. Manticore hybrids match existing manually-designed hybrids, achieve strong performance on the Long Range Arena benchmark, and improve on pretrained transformers and state space models on various natural language tasks.

3948TimeCapsule: Solving the Jigsaw Puzzle of Long-Term Time Series Forecasting with Compressed Predictive Representations

[openreview] [pdf]

Abstract Recent deep learning models for long-term time series forecasting (LTSF) often emphasize complex, handcrafted designs and traditional methodologies, while simpler architectures like linear models or MLPs have occasionally outperformed these intricate solutions. In this paper, we revisit and organize the core ideas behind several key techniques, such as redundancy reduction and multi-scale modeling, which are frequently employed in advanced LTSF models. Our goal is to streamline these ideas for more efficient deep learning utilization. To this end, we introduce TimeCapsule, a model built around the principle of high-dimensional information compression that unifies these key ideas in a generalized yet simplified framework. Specifically, we model time series as a 3D tensor, incorporating temporal, variate, and level dimensions, and leverage mode production to capture multi-mode dependencies while achieving dimensionality compression. We propose an internal forecast within the compressed representation domain, supported by the Joint-Embedding Predictive Architecture (JEPA) to monitor the learning of predictive representations. Extensive experiments on challenging benchmarks demonstrate the versatility of our method, showing that TimeCapsule can achieve performance comparable to state-of-the-art models. More importantly, the structure of our model yields intriguing empirical findings, prompting a rethinking of approaches in this area.

3949Revisiting the Design Choices in Max-Return Sequence Modeling

[openreview] [pdf]

Abstract Decision Transformer (DT), free from optimal value functions fitting and policy gradient computation, attempts to solve offline reinforcement learning (RL) via supervised sequence modeling. During inference, sequence modeling requires an initial target returns assigned with expert knowledge, which blocks comprehensive evaluation on more diverse datasets. As a result, existing sequence modeling only focuses on limited evaluation on Gym datasets and some understanding is severely biased. In this paper, we aim to revisit the design choices, including architecture and context length, in sequence modeling on more diverse datasets. We utilize the max-return sequence modeling that replaces the manual target returns with maximized returns predicted by itself. We systematically investigate the impact of 1) architectural choices and 2) context lengths in max-return sequence modeling on nine datasets with varying data distributions. Abundant experiments and thorough analyses reveal that design choices are highly influenced by the dataset characteristics, which further underscores the significance of more diverse evaluation.

3950Provably Robust Explainable Graph Neural Networks against Graph Perturbation Attacks

[openreview] [pdf]

Abstract Explaining Graph Neural Network (XGNN) has gained growing attention to facilitate the trust of using GNNs, which is the mainstream method to learn graph data. Despite their growing attention, Existing XGNNs focus on improving the explanation performance, and its robustness under attacks is largely unexplored. We noticed that an adversary can slightly perturb the graph structure such that the explanation result of XGNNs is largely changed. Such vulnerability of XGNNs could cause serious issues particularly in safety/security-critical applications. In this paper, we take the first step to study the robustness of XGNN against graph perturbation attacks, and propose XGNNCert, the first provably robust XGNN. Particularly, our XGNNCert can provably ensure the explanation result for a graph under the worst-case graph perturbation attack is close to that without the attack, while not affecting the GNN prediction, when the number of perturbed edges is bounded. Evaluation results on multiple graph datasets and GNN explainers show the effectiveness of XGNNCert.

3951Learning Task Relations for Test-Time Training

[openreview] [pdf]

Abstract Generalizing deep neural networks to unseen target domains presents a major challenge in real-world deployments. Test-time training (TTT) addresses this is- sue by using an auxiliary self-supervised task to reduce the gap between source and target domains caused by distribution shifts during deployment. Previous re- search relies on the assumption that the adopted auxiliary task would be beneficial to the target task we want to adapt. However, this situation is not guaranteed as each task has a different objective, thus adaptation relies on the relation be- tween the tasks. This limitation has motivated us to introduce a more generalized framework: Task Relation Learning for Test-time Training (TR-TTT), which can be applied to multiple tasks concurrently. Our key assumption is that task re- lations are crucial information for successful test-time training, and we capture these relations using a Task Relation Learner (TRL). We model task relations as conditional probabilities by predicting the label of a target task based on the latent spaces of other task-specific features. By leveraging these relations, the network can more effectively handle distribution shifts and improve post-adaptation perfor- mance across various tasks—both classification and regression—unlike previous methods focused mainly on simple classification. To validate our approach, we ap- ply TR-TTT to conventional multi-task benchmarks, integrating it with the tradi- tional TTT experimental protocol. Our empirical results demonstrate that TR-TTT significantly outperforms state-of-the-art methods across a range of benchmarks.

3952Improving Large Language Model Planning with Action Sequence Similarity

[openreview] [pdf]

Abstract Planning is essential for artificial intelligence systems to look ahead and proactively determine a course of actions to reach objectives in the virtual and real world. Recent work on large language models (LLMs) sheds light on their planning capability in various tasks. However, it remains unclear what signals in the context influence the model performance. In this work, we explore how to improve the model planning capability through in-context learning (ICL), specifically, what signals can help select the exemplars. Through extensive experiments, we observe that commonly used problem similarity may result in false positives with drastically different plans, which can mislead the model. In response, we propose to sample and filter exemplars leveraging plan side action sequence similarity (AS). We propose GRASE-DC: a two-stage pipeline that first re-samples high AS exemplars and then curates the selected exemplars with dynamic clustering on AS to achieve a balance of relevance and diversity. Our experimental result confirms that GRASE-DC achieves significant performance improvement on various planning tasks (up to ~11-40 point absolute accuracy improvement with 27.3% fewer exemplars needed on average). With GRASE-DC* + VAL, where we iteratively apply GRASE-DC with a validator, we are able to even boost the performance by 18.9% more. Extensive analysis validates the consistent performance improvement of GRASE-DC with various backbone LLMs and on both classical planning and natural language planning benchmarks. GRASE-DC can further boost the planning accuracy by ~24 absolute points on harder problems using simpler problems as exemplars over a random baseline. This demonstrates its ability to generalize to out-of-distribution problems.

3953Improving Pretraining Data Using Perplexity Correlations

[openreview] [pdf]

Abstract Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.

3954Overcoming Missing Label Vocabulary in Black-Box Discrete Prompt Learning

[openreview] [pdf]

Abstract Large language models (LLMs) have transformed natural language processing. While their scale challenges fine-tuning downstream tasks, prompt engineering offers a scalable, cost-effective solution to optimize their performance. Black-box prompt learning is crucial for leveraging the generative abilities of LLMs, especially in the Language-Model-as-a-Service scenario, where parameters and gradients are inaccessible. LLMs generate output exclusively in the form of encoded tokens processed through their backbone network. Existing black-box prompt learning methods rely on outputs corresponding to a predefined label vocabulary—a small subset of the token vocabulary of LLMs—to optimize prompts. However, in real-world applications, some datasets lack specific label vocabulary, and even manually assigned labels may perform inconsistently across different LLMs. To address these challenges, in this paper, we propose a novel label-vocabulary-free black-box discrete prompt learning method. Our approach employs an alternating optimization strategy to simultaneously learn discrete prompt tokens and a learnable matrix that directly maps the outputs of LLMs corresponding to the token vocabulary to categories. We provide theoretical convergence guarantees for our method under standard assumptions, ensuring its reliability. Experiments show that our method effectively learns prompts and outperforms existing baselines on datasets without label vocabulary.

3955FedSMU: Communication-Efficient and Generalization-Enhanced Federated Learning through Symbolic Model Updates

[openreview] [pdf]

Abstract The significant communication overhead and client data heterogeneity have posed important challenges to current federated learning (FL) paradigm. Most compression-based and optimization-based FL algorithms typically focus on addressing either the model compression challenge or the data heterogeneity issue individually, rather than tackling both of them. In this paper, we observe that by symbolizing the client model updates to be uploaded (i.e., normalizing the magnitude for each model parameter at local clients), the model heterogeneity can be mitigated that is essentially stemmed from data heterogeneity, thereby helping improve the overall generalization performance of the globally aggregated model at the server. Inspired with this observation, and further motivated by the success of Lion optimizer in achieving the optimal performance on most tasks in centralized learning, we propose a new FL algorithm, called FedSMU, which simultaneously reduces the communication overhead and alleviates the data heterogeneity issue. Specifically, FedSMU splits the standard Lion optimizer into the local updates and global execution, where only the symbol of client model updates commutes between the client and server. We theoretically prove the convergence of FedSMU for the general non-convex settings. Through extensive experimental evaluations on several benchmark datasets, we demonstrate that our FedSMU algorithm not only reduces the communication overhead, but also achieves a better generalization performance than the other compression-based and optimization-based baselines.

3956SEAT: Sparsified Enhancements for Attention Mechanisms in Time Series Transformers

[openreview] [pdf]

Abstract Transformer models excel in time series tasks due to their attention mechanisms. However, they often suffer from “block-like” attention patterns caused by high feature correlation, leading to feature confusion and reduced performance. In this study, we mathematically prove and quantify this limitation, demonstrating how it affects the sparsity of the attention matrix and hinders effective feature representation. To overcome this issue, we propose a novel, model-agnostic, and plug-and-play method called SEAT (Sparsification-Enhanced Attention Transformer) that leverages frequency domain sparsification. By transforming time series data into the frequency domain, our method induces inherent sparsity, reduces feature similarity, and mitigates block-like attention, allowing the attention mechanism to focus more precisely on relevant features. Experiments on benchmark datasets demonstrate that our approach significantly enhances the accuracy and robustness of Transformer models while maintaining computational efficiency. This provides a mathematically grounded solution to inherent flaws in attention mechanisms, offering a versatile and effective approach for advancing time series analysis.

3957Grounding is All You Need? Dual Temporal Grounding for Video Dialog

[openreview] [pdf]

Abstract In the realm of video dialog response generation, the understanding of video content and the temporal nuances of conversation history are paramount. While a segment of current research leans heavily on large-scale pretrained visual-language models and often overlooks temporal dynamics, another delves deep into spatial-temporal relationships within videos but demands intricate object trajectory pre-extractions and sidelines dialog temporal dynamics. This paper introduces the Dual Temporal Grounding-enhanced Video Dialog model (DTGVD), strategically designed to merge the strengths of both dominant approaches. It emphasizes dual temporal relationships by predicting dialog turn-specific temporal regions, filtering video content accordingly, and grounding responses in both video and dialog contexts. One standout feature of DTGVD is its heightened attention to chronological interplay. By recognizing and acting upon the dependencies between different dialog turns, it captures more nuanced conversational dynamics. To further bolster the alignment between video and dialog temporal dynamics, we’ve implemented a list-wise contrastive learning strategy. Within this framework, accurately grounded turn-clip pairings are designated as positive samples, while less precise pairings are categorized as negative. This refined classification is then funneled into our holistic end-to-end response generation mechanism. Evaluations using AVSD@DSTC-7 and AVSD@DSTC-8 datasets underscore the superiority of our methodology.

3958ChinaTravel: A Real-World Benchmark for Language Agents in Chinese Travel Planning

[openreview] [pdf]

Abstract Recent advances in Large Language Models (LLMs), particularly in language reasoning and tool-use capabilities have sparked the rapid development of \emph{Language Agents} to assist humans across various real-world applications. Among these, travel planning stands out as a significant domain, presenting both academic challenges and practical value due to its inherent complexity and real-world relevance. However, existing travel plan benchmarks do not test language agents with human users or their ability to follow customized requirements, both of which are vital for deploying them in real-world applications. In this paper, we propose ChinaTravel, a new benchmark tailored to authentic Chinese travel requirements, aiming to provide a more realistic evaluation framework for future language agents. We collect the travel requirements through questionnaires and employ an efficient and faithful evaluation process with 46 metrics covering feasibility, constraint satisfaction, and preference comparison. Moreover, we identify three challenges in the real-world deployments of travel planning, including \emph{constraint recognition}, \emph{concept openness}, and \emph{customized preference}. The empirical studies show that even state-of-the-art neural-symbolic agents succeed in 51.3% constraint validation of human queries. Our findings point to the need for methods that can improve the ability of agents to understand diverse intentions or keep track of constraints with emerging concepts from human requirements.

3959BlackDAN: A Black-Box Multi-Objective Approach to Effective and Contextual Jailbreaking of Language Models

[openreview] [pdf]

Abstract While large language models (LLMs) exhibit remarkable capabilities across various tasks, they encounter potential security risks such as jailbreak attacks, which exploit vulnerabilities to bypass security measures and generate harmful outputs. Existing jailbreak strategies mainly focus on maximizing attack success rate (ASR), frequently neglecting other critical factors, including the relevance of the jailbreak response to the query and the level of stealthiness. This narrow focus on single objectives can result in ineffective attacks that either lack contextual relevance or are easily recognizable. In this work, we introduce BlackDAN, an innovative black-box attack framework with multi-objective optimization, aiming to generate high-quality prompts that effectively facilitate jailbreaking while maintaining contextual relevance and minimizing detectability. BlackDAN leverages Multiobjective Evolutionary Algorithms (MOEAs), specifically the NSGA-II algorithm, to optimize jailbreaks across multiple objectives including ASR, stealthiness, and semantic relevance. By integrating mechanisms like mutation, crossover, and Pareto-dominance, BlackDAN provides a transparent and interpretable process for generating jailbreaks. Furthermore, the framework allows customization based on user preferences, enabling the selection of prompts that balance harmfulness, relevance, and other factors. Experimental results demonstrate that BlackDAN outperforms traditional single-objective methods, yielding higher success rates and improved robustness across various LLMs and multimodal LLMs, while ensuring jailbreak responses are both relevant and less detectable.

3960Variance-Reduced Normalized Zeroth Order Method for Generalized-Smooth Non-Convex Optimization

[openreview] [pdf]

Abstract The generalized smooth condition, (L0,L1)(L_{0}, L_{1})-smoothness, has triggered people’s interest since it is more realistic in many optimization problems shown by both empirical and theoretical evidence. To solve the generalized smooth optimization, gradient clipping methods are often employed, and have theoretically been shown to be as effective as the traditional gradient-based methods\citep{Chen_2023, xie2024}. However, whether these methods can be safely extended to zeroth-order case is still unstudied. To answer this important question, we propose a zeroth-order normalized gradient method(ZONSPIDER) for both finite sum and general expectation case, and we prove that we can find ϵ\epsilon- stationary point of f(x)f(x) with optimal decency on dd and ϵ\epsilon, specifically, the complexes are O(dϵ2nmaxL0,L1)\mathcal{O}(d\epsilon^{-2}\sqrt{n}\max{L_{0}, L_{1}}) in the finite sum case and O(dϵ3maxσ12,σ02maxL0,L1)\mathcal{O}(d\epsilon^{-3}\max{\sigma_{1}^{2}, \sigma_{0}^{2}}\max{L_{0}, L_{1}}) in the general expectation case. To the best of our knowledge, this is the first time that sample complexity bounds are established for a zeroth-order method under generalized smoothness.

3961Establishing Knowledge Preference in Language Models

[openreview] [pdf]

Abstract Language models are known to encode a great amount of factual knowledge through pretraining. However, such knowledge might be insufficient to cater to user requests, requiring the model to integrate external knowledge sources and adhere to user-provided specifications. When answering questions about ongoing events, the model should use recent news articles to update its response; when asked to provide recommendations, the model should prioritize user specifications over retrieved product reviews; when some facts are edited in the model, the updated facts should override all prior knowledge learned by the model even if they are conflicting. In all of the cases above, the model faces a decision between its own parametric knowledge, (retrieved) contextual knowledge, and user instruction knowledge. In this paper, we (1) unify such settings into the problem of knowledge preference\textit{knowledge preference} and define a three-level preference hierarchy over these knowledge sources; (2) compile a collection of existing datasets IfQA, MQuAKE, and MRQA covering a combination of settings (with/without user specifications, with/without context documents) to systematically evaluate how well models obey the intended knowledge preference; and (3) propose a dataset synthesis method that composes diverse question-answer pairs with user assumptions and related context to directly fine-tune LMs for instilling the hierarchy of knowledge. We demonstrate that a 7B model, fine-tuned on only a few thousand examples automatically generated by our proposed method, effectively achieves superior performance (more than 18% improvement across all evaluation benchmarks) in adhering to the desired knowledge preference hierarchy.

3962Systematic Assessment of Tabular Data Synthesis

[openreview] [pdf]

Abstract Data synthesis has been advocated as an important approach for utilizing data while protecting data privacy. In recent years, a plethora of tabular data synthesis algorithms (i.e., synthesizers) have been proposed. A comprehensive understanding of these synthesizers’ strengths and weaknesses remains elusive due to the absence of principled evaluation metrics and head-to-head comparisons between state-of-the-art deep generative approaches and statistical methods. In this paper, we examine and critique existing evaluation metrics, and introduce a set of new metrics in terms of fidelity, privacy, and utility to address their limitations. Based on the proposed metrics, we also devise a unified objective for tuning, which can consistently improve the quality of synthetic data for all methods. We conducted extensive evaluations of 8 different types of synthesizers on 12 real-world datasets and identified some interesting findings, which offer new directions for privacy-preserving data synthesis.

3963Private Mechanism Design via Quantile Estimation

[openreview] [pdf]

Abstract We investigate the problem of designing differentially private (DP), revenue-maximizing single item auction. Specifically, we consider broadly applicable settings in mechanism design where agents’ valuation distributions areindependent,non-identical, and can be eitherboundedorunbounded. Our goal is to design such auctions withpure, i.e., (ϵ,0)(\epsilon,0) privacy in polynomial time.In this paper, we propose two computationally efficient auction learning framework that achievespureprivacy under bounded and unbounded distribution settings. These frameworks reduces the problem of privately releasing a revenue-maximizing auction to the private estimation of pre-specified quantiles. Our solutions increase the running time by polylog factors compared to the non-private version. As an application, we show how to extend our results to the multi-round online auction setting with non-myopic bidders. To our best knowledge, this paper is the first to efficiently deliver a Myerson auction withpureprivacy and near-optimal revenue, and the first to provide such auctions forunboundeddistributions.

3964Binary Hyperbolic Embeddings

[openreview] [pdf]

Abstract As datasets grow in size, vector-based search becomes increasingly challenging in terms of both storage and computational efficiency. Traditional solutions such as quantization techniques involve trade-offs between retrieval speed and accuracy, while hashing methods often require further optimization for binarization. In this work, we propose leveraging the compact nature of hyperbolic space for efficient search. Specifically, we introduce Binary Hyperbolic Embeddings, which transform complex hyperbolic similarity calculations into binary operations. We prove that these binary hyperbolic embeddings are retrieval equivalent to their real-valued counterparts, ensuring no loss in retrieval quality. This approach improves the memory efficiency and running speed of the FAISS library while maintaining performance comparable to full-precision Euclidean embeddings. Furthermore, our method is orthogonal to product quantization, allowing seamless integration with it to further enhance retrieval systems. We achieve significant improvements in storage efficiency, with the potential for scaling to larger datasets. The code is provided in the supplementary materials.

3965Dynamics of Concept Learning and Compositional Generalization

[openreview] [pdf]

Abstract Prior work has shown that text-conditioned diffusion models can learn to identify and manipulate primitive concepts underlying a compositional data-generating process, enabling generalization to entirely novel, out-of-distribution compositions. Beyond performance evaluations, these studies develop a rich empirical phenomenology of learning dynamics, showing that models generalize sequentially, respecting the compositional hierarchy of the data-generating process. Moreover, concept-centric structures within the data significantly influence a model’s speed of learning the ability to manipulate a concept. In this paper, we aim to better characterize these empirical results from a theoretical standpoint. Specifically, we propose an abstraction of prior work’s compositional generalization problem by introducing a structured identity mapping (SIM) task, where a model is trained to learn the identity mapping on a Gaussian mixture with structurally organized centroids. We mathematically analyze the learning dynamics of neural networks trained on this SIM task and show that, despite its simplicity, SIM’s learning dynamics capture and help explain key empirical observations on compositional generalization with diffusion models identified in prior work. Our theory also offers several new insights---e.g., we find a novel mechanism for non-monotonic learning dynamics of test loss in early phases of training. We validate our new predictions by training a text-conditioned diffusion model, bridging our simplified framework and complex generative models. Overall, this work establishes the SIM task as a meaningful theoretical abstraction of concept learning dynamics in modern generative models.

3966DAWN: Dynamic Frame Avatar with Non-autoregressive Diffusion Framework for Talking head Video Generation

[openreview] [pdf]

Abstract Talking head generation intends to produce vivid and realistic talking head videos from a single portrait and speech audio clip. Although significant progress has been made in diffusion-based talking head generation, almost all methods rely on autoregressive strategies, which suffer from limited context utilization beyond the current generation step, error accumulation, and slower generation speed. To address these challenges, we present DAWN (\textbf{D}ynamic frame \textbf{A}vatar \textbf{W}ith \textbf{N}on-autoregressive diffusion), a framework that enables all-at-once generation of dynamic-length video sequences. Specifically, it consists of two main components: (1) audio-driven holistic facial dynamics generation in the latent motion space, and (2) audio-driven head pose and blink generation. Extensive experiments demonstrate that our method generates authentic and vivid videos with precise lip motions, and natural pose/blink movements. Additionally, with a high generation speed, DAWN possesses strong extrapolation capabilities, ensuring the stable production of high-quality long videos. These results highlight the considerable promise and potential impact of DAWN in the field of talking head video generation. Furthermore, we hope that DAWN sparks further exploration of non-autoregressive approaches in diffusion models. Our code will be publicly available.

3967Model Tells Itself Where to Attend: Steerable Prompting for Reliable Reading Comprehension of LLM

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. However, they often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are hallucinated. This difficulty increases for contexts that are long or contain distracting information, which can divert LLMs from fully capturing essential evidence. To address this issue, many works use prompting to help LLMs comprehend contextual information more reliably. For instance, iterative prompting highlights key information in two steps that first ask the LLM to identify important pieces of context and then derive answers accordingly. However, textual prompting methods are constrained to highlighting key information implicitly in token space, which is often insufficient to fully steer the model’s attention. To improve model reading comprehension, we propose SteerPrompt, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM’s attention scores. Like prompting, SteerPrompt is applied at inference time and does not require changing any model parameters. Our experiments on open-book QA demonstrate that SteerPrompt effectively enables models to grasp essential contextual information, leading to substantially improved problem-solving performance, e.g., an average improvement of 7.95% for LLAMA3-70B-Instruct. Code will be publicly available.

3968Autoencoder-Based General-Purpose Representation Learning for Entity Embedding

[openreview] [pdf]

Abstract Recent advances in representation learning have successfully leveraged the underlying domain-specific structure of data across various fields. However, representing diverse and complex entities stored in tabular format within a latent space remains challenging. In this paper, we introduce DeepCAE, a novel method for calculating the regularization term for multi-layer contractive autoencoders (CAEs). Additionally, we formalize a general-purpose entity embedding framework and use it to empirically show that DeepCAE outperforms all other tested autoencoder variants in both reconstruction performance and downstream prediction performance. Notably, when compared to a stacked CAE across 13 datasets, DeepCAE achieves a 34% improvement in reconstruction error.

3969Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

[openreview] [pdf]

Abstract Pretraining data of large language models composes multiple domains (e.g., web texts, academic papers, codes), whose mixture proportions crucially impact the competence of outcome models. While existing endeavors rely on heuristics or qualitative strategies to tune the proportions, we discover the quantitative predictability of model performance regarding the mixture proportions in function forms, which we refer to as the data mixing laws. Fitting such functions on sample mixtures unveils model performance on unseen mixtures before actual runs, thus guiding the selection of an ideal data mixture. Furthermore, we propose nested use of the scaling laws of training steps, model sizes, and our data mixing laws to predict the performance of large models trained on massive data under various mixtures with only small-scale training. Experimental results verify that our method effectively optimizes the training mixture of a 1B model trained for 100B tokens in RedPajama, reaching a performance comparable to the one trained for 48% more steps on the default mixture. Extending the application of data mixing laws to continual training accurately predicts the critical mixture proportion that avoids catastrophic forgetting and outlooks the potential for dynamic data schedules.

3970OATS: Outlier-Aware Pruning Through Sparse and Low Rank Decomposition

[openreview] [pdf]

Abstract The recent paradigm shift to large-scale foundation models has brought about a new era for deep learning that, while has found great success in practice, has also been plagued by prohibitively expensive costs in terms of high memory consumption and compute. To mitigate these issues, there has been a concerted effort in post-hoc neural network pruning techniques that do not require costly retraining. Despite the considerable progress being made, existing methods often exhibit a steady drop in model performance as the compression increases. In this paper, we present a novel approach to compressing large transformers, coined OATS, that compresses the model weights by approximating each weight as the sum of a sparse matrix and a low-rank matrix. Prior to the decomposition, the weights are first scaled by the second moment of their input embeddings, so as to ensure the preservation of outlier features recently observed in large transformer models. Without retraining, OATS achieves state-of-the-art performance when compressing large language models, such as Llama-3 and Phi-3, and vision transformers, such as Google’s ViT and DINOv2, by up to 60%60\%, all while speeding up the model’s inference on a CPU by up to 1.37×1.37\times compared to prior pruning methods.

3971Informed Machine Learning with a Stochastic-Gradient-based Algorithm for Training with Hard Constraints

[openreview] [pdf]

Abstract A methodology for informed machine learning is presented and its effectiveness is shown through numerical experiments with physics-informed learning problems. The methodology has three main distinguishing features. Firstly, prior information is introduced in the training problem through hard constraints rather than through the typical modern practice of using soft constraints (i.e., regularization terms). Secondly, the methodology does not employ penalty-based (e.g., augmented Lagrangian) methods since the use of such methods results in an overall methodology that is similar to a soft-constrained approach. Rather, the methodology is based on a recently proposed stochastic-gradient-based algorithm that maintains computationally efficiency while handling constraints with a Newton-based technique. Thirdly, a new projection-based variant of the well-known Adam optimization methodology is proposed for settings with hard constraints. Numerical experiments on a set of physics-informed learning problems show that, when compared with a soft-constraint approach, the proposed methodology can be easier to tune, lead to accurate predictions more quickly, and lead to better final prediction accuracy.

3972Repetitive Contrastive Learning Enhances Mamba’s Selectivity in Time Series Prediction

[openreview] [pdf]

Abstract The prediction of long sequences has always been a challenge in time series forecasting tasks. Due to Mamba’s sequence selection capability, many Mamba-based models have been proposed, achieving state-of-the-art results in long sequence prediction problems. However, much research has focused on integrating mamba-ssm into specific model structures for better performance, while the core of mamba-ssm, its sequence selection capability, has not been deeply explored. We believe there is significant potential in Mamba’s sequence selection capability and propose a Repetitive Contrastive Learning method to enhance it. Specifically, we use Repeating Sequence Augmentation to increase the sequences and introduce Gaussian noise, and enhance the mamba block’s sequence selection capability through Inter-sequence and Intra-sequence contrast. We transfer the pre-trained parameters to various Mamba-based models for fine-tuning and compare the performance improvements. Experiments demonstrate that our method can universally enhance the performance of Mamba-based models without additional memory requirements, irrespective of model and parameter constraints.

3973HFT: Half Fine-Tuning for Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) with one or more fine-tuning phases have become necessary to unlock various capabilities, enabling LLMs to follow natural language instructions and align with human preferences. However, it carries the risk of catastrophic forgetting during sequential training, the parametric knowledge or the ability learned in previous stages may be overwhelmed by incoming training data. This paper finds that by regularly resetting partial parameters, LLMs can restore some of the original knowledge. Inspired by this, we introduce \underline{H}alf \underline{F}ine-\underline{T}uning (HFT) for LLMs, as a substitute for full fine-tuning (FFT), to mitigate the forgetting issues, where half of the parameters are selected to learn new tasks. In contrast, the other half are frozen to retain previous knowledge. We provide a feasibility analysis from the perspective of optimization and interpret the parameter selection operation as a regularization term. Without changing the model architecture, HFT could be seamlessly integrated into existing fine-tuning frameworks. Extensive experiments and analysis on supervised fine-tuning, direct preference optimization, and continual learning consistently demonstrate the effectiveness, robustness, and efficiency of HFT. Compared with FFT, HFT not only significantly alleviates the forgetting problem, but also achieves the best performance in a series of downstream benchmarks, with an approximately 30% reduction in training time.

3974Pooling And Attention: What Are Effective Designs For LLM-Based Embedding Models?

[openreview] [pdf]

Abstract The significant advancements of Large Language Models (LLMs) in generative tasks have led to a growing body of work exploring LLM-based embedding models. While these models, employing different pooling and attention strategies, have achieved state-of-the-art performance on public embedding benchmarks, questions still arise about what constitutes an effective design for LLM-based embedding models. However, these models are often trained on different datasets, using different LLM base models or training settings. Moreover, evaluations on public embedding benchmarks often fail to report statistical significance, making it difficult to determine which designs truly contribute to final performance. This complicates the process for practitioners seeking optimal training recipes for LLM-based embedding models. In this study, we conduct a large-scale experiment by training a series of LLM-based embedding models using the same training data and base model but differing in their pooling and attention strategies. The results show that there is no one-size-fits-all solution: while bidirectional attention and an additional trainable pooling layer outperform in text similarity and information retrieval tasks, they do not significantly surpass simpler designs like EOS-last token pooling and default causal attention in clustering and classification tasks. Furthermore, we propose a new pooling strategy, Multi-Layers Trainable Pooling, which transforms the outputs of all hidden layers, rather than just the last layer, using a cross-attention network. This method proves to be statistically superior in text similarity and retrieval tasks compared to existing pooling methods. Overall, this paper sheds light on effective training strategies for LLM-based embedding models.

3975Confidence Elicitation: A New Attack Vector for Large Language Models

[openreview] [pdf]

Abstract A fundamental issue in deep learning has been adversarial robustness. As these systems have scaled, such issues have persisted. Currently, large language models (LLMs) with billions of parameters suffer from adversarial attacks just like their earlier, smaller counterparts. However, the threat models have changed. Previously, having gray-box access, where input embeddings or output logits/probabilities were visible to the user, might have been reasonable. However, with the introduction of closed-source models, no information about the model is available apart from the generated output. This means that current black-box attacks can only utilize the final prediction to detect if an attack is successful. In this work, we investigate and demonstrate the potential of attack guidance, akin to using output probabilities, while having only black-box access in a classification setting. This is achieved through the ability to elicit confidence from the model. We empirically show that the elicited confidence is calibrated and not hallucinated for current LLMs. By minimizing the elicited confidence, we can therefore increase the likelihood of misclassification. Our new proposed paradigm demonstrates promising state-of-the-art results on three datasets across two models (LLaMA 3 and Mistral V0.3) when comparing our technique to existing hard-label black-box attack methods that introduce word-level substitutions. The code is publicly available athttps://shorturl.at/s9DIr.

3976Multi-Step Preference Optimization via Two-Player Markov Games

[openreview] [pdf]

Abstract Reinforcement Learning from Human Feedback (RLHF) has been highly successful in aligning large language models with human preferences. While prevalent methods like DPO have demonstrated strong performance, they frame interactions with the language model as a bandit problem, which limits their applicability in real-world scenarios where multi-turn conversations are common. Additionally, DPO relies on the Bradley-Terry model assumption, which does not adequately capture the non-transitive nature of human preferences. In this paper, we address these challenges by modeling the alignment problem as a two-player constant-sum Markov game, where each player seeks to maximize their winning rate against the other across all steps of the conversation. Our approach Multi-step Preference Optimization (MPO) is built upon the natural actor-critic framework. We further develop OMPO based on the optimistic online gradient descent algorithm. Theoretically, we provide a rigorous analysis for both algorithms on convergence and show that OMPO requires O(ϵ1)\mathcal{O}(\epsilon^{-1}) policy updates to converge to an ϵ\epsilon-approximate Nash equilibrium. We also validate the effectiveness of our method through experiments on the multi-turn conversations dataset in MT-bench-101.

3977Graph Scattering Networks with Adaptive Diffusion Kernels

[openreview] [pdf]

Abstract Scattering networks are deep convolutional architectures that use predefined wavelets for feature extraction and representation. They have proven effective for classification tasks, especially when training data is scarce, where traditional deep learning methods struggle. In this work, we introduce and develop a mathematically sound framework for applying adaptive kernels to diffusion wavelets in graph scattering networks. Stability guarantees with respect to input perturbations are provided. A specific construction of adaptive kernels is presented and applied with continuous diffusion to perform graph classification tasks on benchmark datasets. Our model consistently outperforms traditional graph scattering networks with predefined wavelets, both in scenarios with limited and abundant training data.

3978Reshaping Reservoirs: Hebbian Plasticity for Improved Data Separability

[openreview] [pdf]

Abstract This paper introduces Hebbian Architecture Generation (HAG), a method grounded in Hebbian plasticity principles, designed to optimize the structure of Reservoir Computing networks. HAG adapts the synaptic weights in Recurrent Neural Networks by dynamically forming connections between neurons that exhibit high Pearson correlation. Unlike conventional reservoir computing models that rely on static, randomly initialized connectivity matrices, HAG tailors the reservoir architecture to specific tasks by autonomously optimizing network properties such as signal decorrelation and eigenvalue spread. This task-specific adaptability enhances the linear separability of input data, as supported by Cover’s theorem, which posits that increasing the dimensionality of the feature space improves pattern recognition. Experimental results show that HAG outperforms traditional Echo State Networks across various predictive modeling and pattern recognition benchmarks. By aligning with biological principles of structural plasticity, HAG addresses limitations of static reservoir architectures, offering a biologically plausible and highly adaptable alternative for improved performance in dynamic learning environments.

3979Towards Fully Autonomous Driving with Automated Commonsense Reasoning

[openreview] [pdf]

Abstract Autonomous Vehicle (AV) technology has been heavily researched and sought after, yet there are no SAE Level 5 AVs available today in the marketplace. We contend that over-reliance on machine learning technology is the main reason. Use of automated commonsense reasoning technology, we believe, can help achieve SAE Level 5 autonomy. In this paper, we show how automated commonsense reasoning technology can be deployed in situations where not enough data is available to train a machine learning model for autonomous driving. Specifically, we consider two situations where (i) a traffic signal is malfunctioning at an intersection and (ii) all the cars ahead are slowing down and steering away due to an unexpected obstruction (e.g., animals on the road). We show that in such situations, our commonsense reasoning based solution performs correctly. We also provide a pathway for efficiently invoking commonsense reasoning by measuring uncertainty in the computer vision model and using commonsense reasoning to handle uncertain scenarios. We describe our experiments conducted using the CARLA simulator and the results obtained. The main contribution of our research is to show that automated commonsense reasoning provides an effective pathway to reach SAE level 5 automation.

3980Interpretability-driven active feature acquisition in learning systems

[openreview] [pdf]

Abstract In real-world applications like medicine, machine learning models must often work with a limited number of features due to the high cost and time required to acquire all relevant data. While several static feature selection methods exist, they are suboptimal due to their inability to adapt to varying feature importance across different instances. A more flexible approach is active feature acquisition (AFA), which dynamically selects features based on their relevance for each individual case. Here, we introduce an AFA framework that leverages Shapley Additive explanations (SHAP) to generate instance-specific feature importance rankings. By reframing the AFA problem as a feature prediction task, we propose a policy network based on a decision transformer architecture, trained to predict the next most informative feature based on SHAP values. This method allows us to sequentially acquire features in order of their predictive significance, resulting in more efficient feature selection and acquisition. Extensive experiments across multiple datasets show that our approach achieves superior performance compared to current state-of-the-art AFA techniques, both in terms of predictive accuracy and feature acquisition efficiency. These results demonstrate the potential of SHAP-based AFA for applications where feature acquisition cost is a critical consideration, such as in disease diagnosis.

3981ExDBN: Exact learning of Dynamic Bayesian Networks

[openreview] [pdf]

Abstract Causal learning from data has received much attention in recent years. One way of capturing causal relationships is by utilizing Bayesian networks. There, one recovers a weighted directed acyclic graph, in which random variables are represented by vertices, and the weights associated with each edge represent the strengths of the causal relationships between them. This concept is extended to capture dynamic effects by introducing a dependency on past data, which may be captured by the structural equation model, which is utilized in the present contribution to formulate a score-based learning approach. A mixed-integer quadratic program is formulated and an algorithmic solution proposed, in which the pre-generation of exponentially many acyclicity constraints is avoided by utilizing the so-called branch-and-cut (``lazy constraint’') method. Comparing the novel approach to the state of the art, we show that the proposed approach turns out to produce excellent results when applied to small and medium-sized synthetic instances of up to 25 time-series. Lastly, two interesting applications in bio-science and finance, to which the method is directly applied, further stress the importance of developing highly accurate (globally convergent) solvers that can handle modest instances.

3982A grid world agent with favorable inductive biases

[openreview] [pdf]

Abstract We present a novel experiential learning agent with causally-informed intrinsic reward that is capable of learning sequential and causal dependencies in a robust and data-efficient way within grid world environments. After reflecting on state-of-the-art Deep Reinforcement Learning algorithms, we provide a relevant discussion of common techniques as well as our own systematic comparison within multiple grid world environments. Additionally, we investigate the conditions and mechanisms leading to data-efficient learning and analyze relevant inductive biases that our agent utilizes to effectively learn causal knowledge and to plan for rewarding future states of greatest expected return.

3983Synergy Learning with Small Models promotes LLM Zero-Shot Tabular Prediction

[openreview] [pdf]

Abstract Recent development in large language models (LLMs) has demonstrated impressive zero-shot proficiency on unstructured textual or multi-modal tasks across various domains. However, despite with inherent world knowledge, their application on structured tabular data prediction still lags behind, primarily due to the numerical insensitivity and modality discrepancy that brings a gap between LLM reasoning and statistical machine learning. Unlike textual or vision data (e.g., electronic health records, medical images), tabular data is often presented in heterogeneous numerical values (e.g., blood test reports). This ubiquitous data format requires intensive expert annotation, and its numerical nature limits LLMs’ ability to effectively transfer untapped domain expertise. In this paper, we propose SERSAL, a general loop of thought prompting method by synergy learning with small models to unconditionally enhance zero-shot tabular prediction for LLMs. Specifically, SERSAL utilizes the LLM’s zero-shot outcomes as original soft annotations, which are dynamically leveraged to teach a better small student model in a semi-supervised manner. Reversely, the outcomes from the trained small model are used to teach the LLM to further refine its real capability. Such mutual process can be repeatedly applied for continuous progress. Comprehensive experiments on widely used domain tabular datasets show that, without access to gold labels, applying SERSAL to OpenAI GPT reasoning process attains substantial improvement compared to linguistic prompting methods, which serves as an orthogonal direction for tabular LLM, and increasing prompting bonus is observed as more powerful LLMs appear.

3984A Unified Framework for Speculative Decoding with Multiple Drafters as a Bandit

[openreview] [pdf]

Abstract Speculative decoding (SD) has emerged as a promising approach to accelerate inference in large language models (LLMs). This method drafts potential future tokens by leveraging a smaller model, while these tokens are concurrently verified by the target LLM, ensuring only outputs aligned with the target LLM’s predictions are accepted. However, the inherent limitations of individual drafters, especially when trained on specific tasks or domains, can hinder their effectiveness across diverse applications. In this paper, we introduce a simple yet efficient unified framework, termed MetaSD, that incorporates multiple drafters into the speculative decoding process to address this limitation. Our approach employs multi-armed bandit sampling to dynamically allocate computational resources across various drafters, thereby improving overall generation performance. Through extensive experiments, we demonstrate that our unified framework achieves superior results compared to traditional single-drafter approaches.

3985Revisiting Adversarial Examples from the Perspective of Asymptotic Equipartition Property

[openreview] [pdf]

Abstract Adversarial examples, which can mislead neural networks through subtle perturbations, continue to challenge our understanding, raising more questions than answers. This paper presents a novel perspective on interpreting adversarial examples through the Asymptotic Equipartition Property (AEP). Our theoretical analysis examines the noise within these examples, revealing that while normal noise aligns with AEP, adversarial noise does not. This insight allows us to classify samples in high-dimensional space as belonging to either the typical or non-typical set, corresponding to normal and adversarial examples, respectively. Our analyses and experiments show adversarial examples arise from AEP in high-dimensional space and derive some key properties regarding their quantity, probability, and information capacity. These findings enhance our understanding of adversarial examples and clarify their counterintuitive phenomena, such as adversarial transferability, the trade-off between robustness and accuracy, and robust overfitting.

3986Inferring from Logits: Exploring Best Practices for Decoding-Free Generative Candidate Selection

[openreview] [pdf]

Abstract Generative Language Models rely on autoregressive decoding to produce the output sequence token by token. Some tasks, such as preference optimization, require the model to produce task-level output consisting of multiple tokens directly by selecting candidates from a pool as predictions. Determining a task-level prediction from candidates using the ordinary token-level decoding mechanism is constrained by time-consuming decoding and interrupted gradients by discrete token selection. Existing works have been using decoding-free candidate selection methods to obtain candidate probability from initial output logits over vocabulary. Though these estimation methods are widely used, they are not systematically evaluated, especially on end tasks. We introduce an evaluation of a comprehensive collection of decoding-free candidate selection approaches on a comprehensive set of tasks, including five multiple-choice QA tasks with a small candidate pool and four clinical decision tasks with a massive amount of candidates, some with 10k+ options. We evaluate the estimation methods paired with a wide spectrum of foundation LMs covering different architectures, sizes and training paradigms. The results and insights from our analysis could inform the future model design.

3987Atlas Gaussians Diffusion for 3D Generation

[openreview] [pdf]

Abstract Using the latent diffusion model has proven effective in developing novel 3D generation techniques. To harness the latent diffusion model, a key challenge is designing a high-fidelity and efficient representation that links the latent space and the 3D space. In this paper, we introduce Atlas Gaussians, a novel representation for feed-forward native 3D generation. Atlas Gaussians represent a shape as the union of local patches, and each patch can decode 3D Gaussians. We parameterize a patch as a sequence of feature vectors and design a learnable function to decode 3D Gaussians from the feature vectors. In this process, we incorporate UV-based sampling, enabling the generation of a sufficiently large, and theoretically infinite, number of 3D Gaussian points. The large amount of 3D Gaussians enables the generation of high-quality details. Moreover, due to local awareness of the representation, the transformer-based decoding procedure operates on a patch level, ensuring efficiency. We train a variational autoencoder to learn the Atlas Gaussians representation, and then apply a latent diffusion model on its latent space for learning 3D Generation. Experiments show that our approach outperforms the prior arts of feed-forward native 3D generation.

3988Bonsai: Gradient-free Graph Distillation for Node Classification

[openreview] [pdf]

Abstract Graph distillation has emerged as a promising avenue to enable scalable training of GNNs by compressing the training dataset while preserving essential graph characteristics. Our study uncovers significant shortcomings in current graph distillation techniques. First, the majority of the algorithms paradoxically require training on the full dataset to perform distillation. Second, due to their gradient-emulating approach, these methods require fresh distillation for any change in hyperparameters or GNN architecture, limiting their flexibility and reusability. Finally, they fail to achieve substantial size reduction due to synthesizing fully-connected, edge-weighted graphs. To address these challenges, we present Bonsai, a novel graph distillation method empowered by the observation thatcomputation treesform the fundamental processing units of message-passing GNNs. Bonsai distills datasets by encoding a careful selection ofexemplartrees that maximize the representation of all computation trees in the training set. This unique approach imparts Bonsai as the first linear-time, model-agnostic graph distillation algorithm for node classification that outperforms existing baselines across 6 real-world datasets on accuracy, while being 22 times faster on average. Bonsai is grounded in rigorous mathematical guarantees on the adopted approximation strategies making it robust to GNN architectures, datasets, and parameters.

3989New Algorithms for the Learning-Augmented k-means Problem

[openreview] [pdf]

Abstract In this paper, we study the clustering problems in the learning-augmented setting, where predicted labels for a d-dimensional dataset with size m are given by an oracle to serve as auxiliary information to improve the clustering performance. Following the prior work, the given oracle is parameterized by some error rate α, which captures the accuracy of the oracle such that there are at most α fraction of false positives and false negatives in each predicted cluster. In this setting, the goal is to design fast and practical algorithms that can break the computational barriers of inapproximability. The current state-of-the-art learning-augmented k-means algorithm relies on sorting strategies to find good coordinates approximation, where a (1+O(α))-approximation can be achieved with near-linear running time in the data size. However, the computational demands for sorting may limit the scalability of the algorithm for handling large-scale datasets. To address this issue, in this paper, we propose new algorithms that can identify good coordinates approximation using sampling-based strategies, where (1+O(α))-approximation can be achieved with linear running time in the data size. To obtain a more practical algorithm for the problem with better clustering quality and running time, we propose a sampling-based heuristic which can directly find center approximations using sampling-based strategies. Empirical experiments show that our proposed methods are faster than the state-of-the-art learning-augmented k-means algorithms with comparable performances on clustering quality.

3990Language Model Empowered Spatio-Temporal Forecasting via Physics-Aware Reprogramming

[openreview] [pdf]

Abstract Spatio-temporal forecasting is pivotal in numerous real-world applications, including transportation planning, energy management, and climate monitoring. In this work, we aim to harness the reasoning and generalization abilities of Pre-trained Language Models (PLMs) for more effective spatio-temporal forecasting, particularly in data-scarce scenarios. However, recent studies uncover that PLMs, which are primarily trained on textual data, often falter when tasked with modeling the intricate correlations inherent in numerical time series, thereby limiting their effectiveness in comprehending spatio-temporal data. To bridge the gap, we propose REPST, a physics-aware PLM reprogramming framework tailored for spatio-temporal forecasting. Specifically, we first propose a physics-aware decomposer that adaptively disentangles spatially correlated time series into interpretable sub-components, which facilitates PLM’s understanding of sophisticated spatio-temporal dynamics via a divide-and-conquer strategy. Moreover, we propose a selective discrete reprogramming scheme, which introduces an expanded spatio-temporal vocabulary space to project spatio-temporal series into discrete representations. This scheme minimizes the information loss during reprogramming and enriches the representations derived by PLMs. Extensive experiments on real-world datasets show that the proposed REPST outperforms twelve state-of-the-art baseline methods, particularly in data-scarce scenarios, highlighting the effectiveness and superior generalization capabilities of PLMs for spatio-temporal forecasting.

3991A Diffusive Data Augmentation Framework for Reconstruction of Complex Network Evolutionary History

[openreview] [pdf]

Abstract The evolutionary processes of complex systems contain critical information about their functional characteristics. The generation time of edges can reveal the historical evolution of various networked complex systems, such as protein-protein interaction networks, ecosystems, and social networks. Recovering these evolutionary processes holds significant scientific value, such as aiding in the interpretation of the evolution of protein-protein interaction networks. However, the scarcity of temporally labeled network data poses challenges for predicting edge generation times under current network structures, leading to issues of insufficient data and significant differences between training and prediction networks. To address this, we introduce a diffusion model that learns the generative mechanisms of networks, producing sufficient augmented network data to effectively mitigate issues of limited and incomplete data. Experimental results demonstrate a 13.7% improvement in prediction accuracy using our approach. Moreover, the model can uniformly predict edge generation times across different types of networks, eliminating the need to retrain the model for each specific network, thus significantly enhancing generalization capability and efficiency.

3992Reshaping Model Output Space Via Deep Kernel Density Estimation Networks

[openreview] [pdf]

Abstract Traditional classification models are typically optimized solely for their specific training task without considering the properties of the underlying probability distribution of their output space. As the use of these models for downstream tasks becomes more prevalent, it becomes advantageous to have a framework that can transform the output space of such models to a more convenient space without sacrificing performance. In this paper, we introduce DeepKDE, a novel method which enables the transformation of arbitrary output spaces to match more desirable distributions, such as Normal and Gaussian Mixture Models. We explore the properties of the new method and test its effectiveness on ResNet-18 and vision transformers trained on CIFAR-10 and Fashion MNIST datasets. We show that DeepKDE models succeed in transforming the output spaces of the original models while outperforming them in terms of accuracy.

3993Do LLMs have Consistent Values?

[openreview] [pdf]

Abstract Values are a basic driving force underlying human behavior. Large Language Models (LLM) technology is constantly improving towards human-like dialogue. However, little research has been done to study the values exhibited in text generated by LLMs. Here we study this question by turning to the rich literature on value structure in psychology. We ask whether LLMs exhibit the same value structure that has been demonstrated in humans, including the ranking of values, and correlation between values. We show that the results of this analysis strongly depend on how the LLM is prompted, and that under a particular prompting strategy (referred to as ``Value Anchoring’') the agreement with human data is quite compelling. Our results serve both to improve our understanding of values in LLMs, as well as introduce novel methods for assessing consistency in LLM responses.

3994Sharper Analysis of Data Echoing and New Communication-Efficient Algorithm for Data Parallelism

[openreview] [pdf]

Abstract Over the past decade, breakthroughs in both general-purpose and specialized hardware have propelled the success of large-scale machine learning. However, the advancements in general-purpose hardware are not keeping pace with those in specialized hardware. Consequently, operations conducted on the general-purpose hardware have become the primary performance bottleneck. Notably, data loading significantly lags behind the gradient computation during training. To address this issue, the technique of data echoing has been introduced, whereby the current batch of samples is reused for gradient computation to minimize idle time while waiting for new data. However, this approach can lead to overfitting on the current batch, and it remains unclear whether convergence benefits from this practice. In this paper, we provide a sharper analysis on a stochastic variant of data echoing and show that it obtains linear speedup proportional to the number of reuse times. Additionally, we investigate the impact of the communication bottleneck in data parallelism of data echoing, and propose a new communication-efficient data echoing algorithm via reducing the frequency of model averaging. We then show that it is possible to perform data echoing without additional communication cost with data parallelism. Finally, we perform empirical experiments to verify our analysis on the data echoing and the proposed efficient algorithm for data parallelism.

3995Models That Prove Their Own Correctness

[openreview] [pdf]

Abstract How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measuredon averageover a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to trainSelf-Proving modelsthat prove the correctness of their output to a verification algorithm VV via an Interactive Proof. We devise a generic method for learning Self-Proving models, and we prove convergence bounds under certain assumptions. Empirically, our learning method is used to train a Self-Proving transformer that computes the Greatest Common Divisor (GCD)andproves the correctness of its answer.

3996Core Tokensets for Data-efficient Sequential Training of Transformers

[openreview] [pdf]

Abstract Deep networks are frequently tuned to novel tasks and continue learning from ongoing data streams. Such sequential training requires consolidation of new and past information, a challenge predominantly addressed by retaining the most important data points - formally known as coresets. Traditionally, these coresets consist of entire samples, such as images or sentences. However, recent transformer architectures operate on tokens, leading to the famous assertion that an image is worth 16x16 words. Intuitively, not all of these tokens are equally informative or memorable. Going beyond coresets, we thus propose to construct a deeper-level data summary on the level of tokens. Ours, respectively named core tokensets, both select the most informative data points and leverage feature attribution to store only their most relevant features. We demonstrate that core tokensets yield significant performance retention in incremental image classification, open-ended visual question answering, and continual image captioning with significantly reduced memory. In fact, we empirically find that a core tokenset of 1% of the data performs comparably to at least a twice as large and up to 10 times larger coreset.

3997Robust Lurie Networks with Controllable Convergent Dynamics

[openreview] [pdf]

Abstract The Lurie Network is proposed as a unifying architecture for many existing continuous-time models including Recurrent Neural Networks and Neural Oscillators. Motivated by the need for a general inductive bias, shared by many dynamical systems, this paper proposes an approach to enable network weights and biases to be trained in such a manner so that a generalised concept of stability is guaranteed. This generalised stability measure is that of kk-contraction which enables global convergence to a point, line or plane in the neural state-space. This result is leveraged to construct a Graph Lurie Network satisfying the same convergence properties. Unconstrained parametrisations of these conditions are derived allowing the models to be trained using standard optimisation algorithms, whilst limiting the search space to solutions satisfying the kk-contraction constraints. The prediction accuracy, generalisation and robustness of the architecture is benchmarked against other continuous-time models on a range of dynamical systems. Results demonstrate the benefit of controlling the range of dynamics which can be learnt through improved accuracy, generalisation and robustness on all benchmarks.

3998Rethinking Out-of-Distribution Detection in Vision Foundation Models

[openreview] [pdf]

Abstract Pre-trained vision foundation models have transformed many computer vision tasks. Despite their strong ability to learn discriminative and generalizable features-- crucial for out-of-distribution (OOD) detection, their impact on this task remains underexplored. Motivated by this gap, our study investigates vision foundation models in OOD detection. Our findings show that even without complex designs, a pre-trained DINOv2 model, utilizing a simple scoring metric and no fine-tuning, outperforms all prior state-of-the-art models, which typically depend on fine-tuning with in-distribution (ID) data. Furthermore, while the pre-trained CLIP model struggles with fine-grained OOD samples, DINOv2 excels, revealing the limitations of CLIP in this setting. Building on these insights, we explore how foundation models can be further optimized for both ID classification and OOD detection when ID data is available for fine-tuning. From a model perspective, we propose a Mixture of Feature Experts (MoFE) module, which partitions features into subspaces. This mitigates the challenge of tuning complex data distributions with limited ID data and enhances decision boundary learning for classification. From a data perspective, we introduce a Dynamic-β\beta Mixup strategy, which samples interpolation weights from a dynamic beta distribution. This adapts to varying levels of learning difficulty across categories, improving feature learning for more challenging categories. Extensive experiments and ablation studies demonstrate the effectiveness of our approach, significantly outperforming baseline methods.

3999Understanding Mistakes in Transformers through Token-level Semantic Dependencies

[openreview] [pdf]

Abstract Despite the high performance of the transformer model, it sometimes produces unfaithful information. To understand the cause of this issue, we explore how semantically dependent information is encoded within the model. Specifically, we investigate how tokens in multi-head self-attention transformer models encode semantically dependent information. To help us identify semantic information encoded within a token, intuitively, our method analyzes how a token’s value shifts in response to changes in semantics. BERT, LLaMA and GPT models are analyzed. We have observed some interesting and similar behaviors in their mechanisms for encoding semantically dependent information: 1). Most tokens primarily retain their original semantic information, even as they pass through multiple layers. 2). Semantically dependent information is usually encoded together within a token. 3). The semantic dependency within a token is sensitive to even irrelevant changes in context and order of prompts. 4). Mistakes made by the model can be attributed to some tokens that falsely encode semantic dependencies. Our findings potentially can help develop more robust and accurate transformer models by pinpointing the mechanisms behind semantic encoding.

4000Learning Discrete Latent Models from Discrete Observations

[openreview] [pdf]

Abstract A central challenge in machine learning is discovering meaningful representations of high-dimensional data, commonly referred to as representation learning. However, many existing methods lack a theoretical foundation, leading to unreliable representations and limited inferential capabilities. In approaches where certain uniqueness of representation is guaranteed, such as nonlinear ICA, variables are typically assumed to be continuous. While recent work has extended identifiability to binarized observed variables, no principled method has been developed for scenarios involving discrete latent variables. In this paper, we show how multi-domain information can be leveraged to achieve identifiability when both latent and observed variables are discrete. We propose general identification conditions that do not depend on specific data distributional assumptions or parametric model forms. The effectiveness of our approach is validated through experiments on both simulated and real-world datasets.

4001Periodical Moving Average Accelerates Gradient Accumulation for Post-Training

[openreview] [pdf]

Abstract High gradient variance challenges training Large Language Models (LLMs) on memory-limited devices. Existing practical approaches, such as small batch size or using Gradient Accumulation (GA), face the dilemma between low convergence rates due to high variance in parameter updates and long training times due to the serial GA process. In this paper, we identify that the exponential nature of the Exponential Moving Average (EMA) rapidly forgets historical gradients at an exponential rate in momentum updates, making it difficult to utilize the historical gradients to stabilize the update steps. To address this issue, we embed the idea of GA into the momentum update and propose the Periodical Moving Average (PMA) technique. PMA splits the training steps into periods and employs moving averages instead of EMA in each period. We apply PMA to AdamW and Lion, resulting in AdamW-PMA and Lion-PMA. Theoretical analysis demonstrates that AdamW-PMA achieves a comparable convergence rate with Adam. Extensive experiments showcase the superiority of PMA on post-training tasks, including Supervised Fine-Tuning and Direct Preference Optimization, that the PMA-based methods achieve approximately at least 2×2\times speedup and higher scores on downstream tasks.

4002Resolving Lexical Bias in Edit Scoping with Projector Editor Networks

[openreview] [pdf]

Abstract Weight-preserving large language model editing techniques rely heavily on the scoping mechanism that decides when to apply an edit to the base model. These scoping mechanisms utilize distance functions in the representation space. In this work, we show that distance-based scoping functions grapple with strong lexical biases leading to issues such as deciding that irrelevant prompts that share overlapping words should result in applying an edit. We address these problems by introducing Projector Editor Networks for Model Editing (PENME), a principled model editing approach designed to learn the optimal representation space for scoping via contrastive learning. We show that PENME achieves state of the art model editing results while being compute-efficient at inference time than previous methods and flexible enough to adapt across architectures

4003Beyond Standardization – Putting the Normality in Normalization

[openreview] [pdf]

Abstract The normal distribution plays a central role in information theory – it is at the same time the best-case signal and worst-case noise distribution, has the greatest representational capacity of any distribution, and offers an equivalence between uncorrelatedness and independence for joint distributions. Accounting for the mean and variance of activations throughout the layers of deep neural networks has had a significant effect on facilitating their effective training, but seldom has a prescription for precisely what distribution these activations should take, and how this might be achieved, been offered. Motivated by the information-theoretic properties of the normal distribution, we address this question and concurrently present normality normalization: a novel normalization layer which encourages normality in the feature representations of neural networks using the power transform and employs additive Gaussian noise during training. Our experiments comprehensively demonstrate the effectiveness of normality normalization, in regards to its generalization performance on an array of widely used model and dataset combinations, its strong performance across various common factors of variation such as model width, depth, and training minibatch size, its suitability for usage wherever existing normalization layers are conventionally used, and as a means to improving model robustness to random perturbations.

4004Multi-Dimensional Insights: Benchmarking Real-World Personalization in Large Multimodal Models

[openreview] [pdf]

Abstract The rapidly developing field of large multimodal models (LMMs) has led to the emergence of diverse models with remarkable capabilities. However, existing benchmarks fail to comprehensively, objectively and accurately evaluate whether LMMs align with the diverse needs of humans in real-world scenarios. To bridge this gap, we propose the Multi-Dimensional Insights (MDI) benchmark, which includes over 500 images covering six common scenarios of human life. Notably, the MDI-Benchmark offers two significant advantages over existing evaluations: (1) Each image is accompanied by two types of questions: simple questions to assess the model’s understanding of the image, and complex questions to evaluate the model’s ability to analyze and reason beyond basic content. (2) Recognizing that people of different age groups have varying needs and perspectives when faced with the same scenario, our benchmark stratifies questions into three age categories: young people, middle-aged people, and older people. This design allows for a detailed assessment of LMMs’ capabilities in meeting the preferences and needs of different age groups. With MDI-Benchmark, the strong model like GPT-4o achieve 79% accuracy on age-related tasks, indicating that existing LMMs still have considerable room for improvement in addressing real-world applications. Looking ahead, we anticipate that the MDI-Benchmark will open new pathways for aligning real-world personalization in LMMs.

4005Dual-Stream Adapters for Anomaly Segmentation

[openreview] [pdf]

Abstract Anomaly segmentation aims to identify pixels of objects not present during the model’s training. Recent approaches address this task using mask-based architectures, but these methods have high training costs due to the large transformer backbones involved. While vision adapters can help reduce training costs, they are not specialized for this task, leading to inferior performance. In this work, we propose Dual-Stream Adapters (DSA), a vision adapter tailored for anomaly segmentation. DSA extracts both in-distribution and out-of-distribution features via (i) an anomaly prior module that produces separate initial embeddings for the two streams; and (ii) a dual-stream feature refinement that implicitly guides the separation of in-distribution from out-of-distribution features. We train DSA using a novel hyperbolic loss function that provides supervised guidance for differentiating in-distribution and out-of-distribution features. Experiments on various benchmarks show that dual-stream adapters achieve the best results while reducing training parameters by 38% w.r.t. the previous state-of-the-art.

4006Not All Prompts Are Made Equal: Prompt-based Pruning of Text-to-Image Diffusion Models

[openreview] [pdf]

Abstract Text-to-image (T2I) diffusion models have demonstrated impressive image generation capabilities. Still, their computational intensity prohibits resource-constrained organizations from deploying T2I models after fine-tuning them on their internaltargetdata. While pruning techniques offer a potential solution to reduce the computational burden of T2I models, static pruning methods use the same pruned model for all input prompts, overlooking the varying capacity requirements of different prompts. Dynamic pruning addresses this issue by utilizing a separate sub-network for each prompt, but it prevents batch parallelism on GPUs. To overcome these limitations, we introduce Adaptive Prompt-Tailored Pruning (APTP), a novel prompt-based pruning method designed for T2I diffusion models. Central to our approach is aprompt routermodel, which learns to determine the required capacity for an input text prompt and routes it to an architecture code, given a total desired compute budget for prompts. Each architecture code represents a specialized model tailored to the prompts assigned to it, and the number of codes is a hyperparameter. We train the prompt router and architecture codes using contrastive learning, ensuring that similar prompts are mapped to nearby codes. Further, we employ optimal transport to prevent the codes from collapsing into a single one. We demonstrate APTP’s effectiveness by pruning Stable Diffusion (SD) V2.1 using CC3M and COCO astargetdatasets. APTP outperforms the single-model pruning baselines in terms of FID, CLIP, and CMMD scores. Our analysis of the clusters learned by APTP reveals they are semantically meaningful. We also show that APTP can automatically discover previously empirically found challenging prompts for SD,e.g.,prompts for generating text images, assigning them to higher capacity codes.

4007Memory-augmented Transformers can implement Linear First-Order Optimization Methods

[openreview] [pdf]

Abstract We show that memory-augmented Transformers (Memformers) can implement linear first-order optimization methods such as conjugate gradient descent, momentum methods, and more generally, methods that linearly combines past gradients. Building on prior work that demonstrates how Transformers can simulate preconditioned gradient descent, we provide theoretical and empirical evidence that Memformers can learn more advanced optimization algorithms. Specifically, we analyze how memory registers in Memformers store suitable intermediate attention values allowing them to implement algorithms such as conjugate gradient. Our results show that Memformers can efficiently learn these methods by training on random linear regression tasks, even learning methods that outperform conjugate gradient. This work extends our knowledge about the algorithmic capabilities of Transformers, showing how they can learn complex optimization methods.

[openreview] [pdf]

Abstract This paper presents a method to evaluate the alignment between the decision-making logic of Large Language Models (LLMs) and human cognition in a case study on legal LLMs. Unlike traditional evaluations on language generation results, we propose to evaluate the correctness of the detailed decision-making logic of an LLM behind its seemingly correct outputs, which represents the core challenge for an LLM to earn human trust. To this end, we quantify the interactions encoded by the LLM as primitive decision-making logic, because recent theoretical achievements (Li & Zhang, 2023; Ren et al., 2024) have proven several mathematical guarantees of the faithfulness of the interaction-based explanation. We design a set of metrics to evaluate the detailed decision-making logic of LLMs. Experiments show that even when the language generation results appear correct, a significant portion of the internal inference logic contains notable issues.

4009Representation Shattering in Transformers: A Synthetic Study with Knowledge Editing

[openreview] [pdf]

Abstract Knowledge Editing (KE) algorithms alter models’ internal weights to perform targeted updates to incorrect, outdated, or otherwise unwanted factual associations. In order to better define the possibilities and limitations of these approaches, recent work has shown that applying KE can adversely affect models’ factual recall accuracy and diminish their general reasoning abilities. While these studies give high-level insights into the potential harms of KE algorithms, e.g., via performance evaluations on benchmarks, we argue little is understood as to why such destructive failures occur. Is it possible KE methods distort representations of concepts beyond the targeted fact, hence hampering abilities at broad? If so, what is the extent of this distortion? To take a step towards addressing such questions, we define a novel synthetic task wherein a Transformer is trained from scratch to internalize a “structured” knowledge graph. The structure enforces relationships between entities of the graph, such that editing a factual association has “trickling effects” on other entities in the graph (e.g., altering X’s parent is Y to Z affects who X’s siblings’ parent is). Through evaluations of edited models and analysis of extracted representations, we show that KE inadvertently affects representations of entities beyond the targeted one, distorting relevant structures that allow a model to infer unseen knowledge about an entity. We call this phenomenon representation shattering and demonstrate that it results in degradation of factual recall and reasoning performance more broadly. To corroborate our findings in a more naturalistic setup, we perform preliminary experiments with a pretrained GPT-2-XL model and reproduce the representation shattering effect therein as well. Overall, our work yields a precise mechanistic hypothesis that explains why KE has adverse effects on model capabilities.

4010SpacetimeE(n)-Transformer: Equivariant Attention for Spatio-temporal Graphs

[openreview] [pdf]

Abstract We introduce an E(n)E(n)-equivariant Transformer architecture for spatio-temporal graph data. By imposing rotation, translation, and permutation equivariance inductive biases in both space and time, we show that the Spacetime E(n)E(n)-Transformer (SET) outperforms purely spatial and temporal models without symmetry-preserving properties. We benchmark SET against said models on the NN-body problem, a simple physical system with complex dynamics. While existing spatio-temporal graph neural networks focus on sequential modeling, we empirically demonstrate that leveraging underlying domain symmetries yields considerable improvements for modeling dynamical systems on graphs.

4011GraphSTAGE: Channel-Preserving Graph Neural Networks for Time Series Forecasting

[openreview] [pdf]

Abstract Recent advancements in multivariate time series forecasting (MTSF) have increasingly focused on the core challenge of learning dependencies within sequences, specifically intra-series (temporal), inter-series (spatial), and cross-series dependencies. While extracting multiple types of dependencies can theoretically enhance the richness of learned correlations, it also increases computational complexity and may introduce additional noise. The trade-off between the variety of dependencies extracted and the potential interference has not yet been fully explored. To address this challenge, we propose GraphSTAGE, a purely graph neural network (GNN)-based model that decouples the learning of intra-series and inter-series dependencies. GraphSTAGE features a minimal architecture with a specially designed embedding and patching layer, along with the STAGE (Spatial-Temporal Aggregation Graph Encoder) blocks. Unlike channel-mixing approaches, GraphSTAGE is a channel-preserving method that maintains the shape of the input data throughout training, thereby avoiding the interference and noise typically caused by channel blending. Extensive experiments conducted on 13 real-world datasets demonstrate that our model achieves performance comparable to or surpassing state-of-the-art methods. Moreover, comparative experiments between our channel-preserving framework and channel-mixing designs show that excessive dependency extraction and channel blending can introduce noise and interference. As a purely GNN-based model, GraphSTAGE generates learnable graphs in both temporal and spatial dimensions, enabling the visualization of data periodicity and node correlations to enhance model interpretability.

4012First-Step Inference in Diffusion Models Learns Image De-whitening

[openreview] [pdf]

Abstract Diffusion models have emerged as powerful generative models for image synthesis, yet the intricate relationship between input noise and generated images remains not fully understood. In this paper, we investigate the correlation between noise and images generated through deterministic DDIM sampling, uncovering fundamental elements that are present across different diffusion models. More specifically, we demonstrate that a one-step approximation of the mapping learned by these models closely relates to Zero-phase Component Analysis (ZCA) inverse whitening transform, which maximizes the correlation between source and target distributions. We leverage this insight to develop a simple and yet effective model-agnostic method for sampling correlated noises and showcase applications for image variation generation and editing.

4013CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs

[openreview] [pdf]

Abstract In this paper, we present a 3D visual grounding method called CityAnchor for localizing an urban object in a city-scale point cloud. Recent developments in multiview reconstruction enable us to reconstruct city-scale point clouds but how to conduct visual grounding on such a large-scale urban point cloud remains an open problem. Previous 3D visual grounding system mainly concentrates on localizing an object in an image or a small-scale point cloud, which is not accurate and efficient enough to scale up to a city-scale point cloud. We address this problem with a multi-modality LLM which consists of two stages, a coarse localization and a fine-grained matching. Given the text descriptions, the coarse localization stage locates possible regions on a projected 2D map of the point cloud while the fine-grained matching stage accurately determines the most matched object in these possible regions. We conduct experiments on the CityRefer dataset and a new synthetic dataset annotated by us, both of which demonstrate our method can produce accurate 3D visual grounding on a city-scale 3D point cloud.

4014Text-driven Zero-shot Domain Adaptation with Cross-modality Graph Motif Matching

[openreview] [pdf]

Abstract Zero-shot domain adaptive semantic adaptation aims to transfer knowledge from a source domain and learn a target segmenter without access to any target domain data. Some existing methods have achieved notable performances by transforming source features to the target domain through language-driven methods. However, these methods often align language features to global image features coarsely resulting in sub-optimal performance. To address the challenges, we propose a graph motif-based adaptation method designed to balance the efficiency and effectiveness of feature alignment. Our approach involves constructing motif structures based on domain-wise image feature distributions. By increasing the angle between language-vision directed edges, we effectively pull visual features toward the language feature center, thereby achieving cross-modality feature alignment. Additionally, we employ relationship-constraint losses, \ie directional and contrastive losses, to mitigate the mode-collapse during target feature stylization. These relationship-constraint losses help stabilize the learning process and improve the robustness of the adaptation. Extensive experimental results validate the efficacy of our proposed method. The code for this method will be made available.

4015Towards Making Linear Attention Usable

[openreview] [pdf]

Abstract The original Transformer attention mechanism, based on Softmax, has time and memory complexities of O(N2D)O(N^2D) and O(N2)O(N^2), where NN is the number of tokens and DD the dimension per attention head. As current LLM applications trend towards processing larger token sequences, and Transformers gain popularity in image, video, and audio processing, addressing this quadratic cost becomes imperative. Since the introduction of Transformers, numerous approaches have been proposed to linearize this scaling. One such method is Linear Attention, which captures all-to-all token pair attention in O(ND2)O(ND^2) time. However, its drawback lies in its high memory footprint of O(ND2)O(ND^2). While Linear Attention has shown promise in small-scale benchmarks, the high memory demand has prevented Linear Attention to be studied in context of large benchmarks and practical use cases. In this work, we demonstrate how to reduce the memory complexity to O(ND)O(ND) by approaching calculations from a novel perspective. Additionally, since Linear Attention does not compute the attention matrix directly, it precludes the use of traditional dropout. To address this, we introduce an alternative dropout mechanism. Our study confirms linear scaling in both wall-clock time and memory usage. We also compare our method with Flash Attention and conduct an ablation study on our proposed dropout alternative.

4016GPT4LoRA: Optimizing LoRA Combination via MLLM Self-Reflection

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) is extensively used in generative models to enable concept-driven personalization, such as rendering specific characters or adopting unique styles. Although recent approaches have explored LoRA combination to integrate diverse concepts, they often require further fine-tuning or modifications to the generative model’s original architecture. To address these limitations, we introduce GPT4LoRA, a novel method for LoRA combination that adjusts combination coefficients by leveraging the self-reflection capabilities of multimodal large language models (MLLMs). GPT4LoRA operates through a three-step process—Generate, Feedback, and Refine—without the need for additional training, relying solely on tailored prompts and iterative refinement to enhance performance. This iterative approach ensures more constructive feedback and optimizes the model responses. Experiments on various LoRA model combinations, including both realistic and anime styles, demonstrate that GPT4LoRA achieves superior results compared to existing methods. Additionally, an evaluation framework based on GPT-4o further highlights the clear performance gains offered by GPT4LoRA over standard baselines, showcasing its potential for advancing the field.

4017Interpretability of LLM Deception: Universal Motif

[openreview] [pdf]

Abstract Conversational large language models (LLMs) are trained to be helpful, honest and harmless (HHH) and yet they remain susceptible to hallucinations, misinformation and are capable of deception. A promising avenue for safeguarding against these behaviors is to gain a deeper understanding of their inner workings. Here we ask: what could interpretability tell us about deception and can it help to control it? First, we introduce a simple and yet general protocol to induce 20 large conversational models from different model families (Llama, Gemma, Yi and Qwen) of various sizes (from 1.5B to 70B) to knowingly lie. Second, we characterize three iterative refinement stages of deception from the latent space representation. Third, we demonstrate that these stages are \textit{universal} across models from different families and sizes. We find that the third stage progression reliably predicts whether a certain model is capable of deception. Furthermore, our patching results reveal that a surprisingly sparse set of layers and attention heads are causally responsible for lying. Importantly, consistent across all models tested, this sparse set of layers and attention heads are part of the third iterative refinement process. When contrastive activation steering is applied to control model output, only steering these layers from the third stage could effectively reduce lying. Overall, these findings identify a universal motif across deceptive models and provide actionable insights for developing general and robust safeguards against deceptive AI. The code, dataset, visualizations, and an interactive demo notebook are available at \url{https://github.com/safellm-2024/llm_deception}.

4018Forgetting Transformer: Softmax Attention with a Forget Gate

[openreview] [pdf]

Abstract An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name the resulting model the Forgetting Transformer. We show that the Forgetting Transformer significantly outperforms the standard Transformer on long-context language modeling and downstream tasks. Moreover, the Forgetting Transformer does not require any position embeddings and generalizes beyond the training context length. Several analyses, including the needle-in-the-haystack experiment, show that the Forgetting Transformer also retains the standard Transformer’s superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet.

4019FTP: Efficient Prefilling for Long-Context LLM Inference via FFN Token Pruning

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable performance across various NLP tasks, and have extended their capability to long-context scenarios. However, the increasing context length leads to longer inference time in both the prefilling and decoding stages. Existing token pruning methods primarily evict tokens to compress the KV cache, and only accelerate the decoding stage. Recent studies have extended token pruning to both stages, but they either yield subtle speedup during the prefilling stage or defer a portion of computations to the decoding phase. Critically, these approaches prioritize the attention module, overlooking the significant computations in the Feed-Forward Network (FFN) module.In this work, we focus on the prefilling stage and propose a novel token pruning method named FTP for long-context LLM inference. Our approach is based on the observation that the FFN module accounts for over 60% of the inference time. FTP reduces this by pruning non-critical tokens before the inference of FFN. The importance of each token, along with the quantity to be pruned, are dynamically determined by the attention scores in each layer. Unlike previous token pruning methods, FTP preserves a substantial amount of information of the pruned tokens through the residual connection, thereby achieving a notable speedup with only a negligible decrease in performance. Specifically, the Qwen2-7B-Instruct model with FTP achieves a speedup of 1.24×\times in the prefilling stage with only a 1.30% performance drop compared to the baseline model. The speedup is further boosted to 1.39×\times on a Qwen1.5-32B-Chat model. Extensive experiments on long-context datasets across various tasks demonstrate the potential and effectiveness of FTP.

4020CaPulse: Detecting Anomalies by Tuning in to the Causal Rhythms of Time Series

[openreview] [pdf]

Abstract Time series anomaly detection has garnered considerable attention across diverse domains. While existing methods often fail to capture the underlying mechanisms behind anomaly generation in time series data. In addition, time series anomaly detection often faces several data-related inherent challenges, i.e., label scarcity, data imbalance, and complex multi-periodicity. In this paper, we leverage causal tools and introduce a new causality-based framework,CaPulse, whichtunes into the underlyingcausal pulseof time series data to effectively detect anomalies. Concretely, we begin by building a structural causal model to decipher the generation processes behind anomalies. To tackle the challenges posed by the data, we propose Periodical Normalizing Flows with a novel mask mechanism and carefully designed periodical learners, creating a periodicity-aware, density-based anomaly detection approach. Extensive experiments on seven real-world datasets demonstrate that CaPulse consistently outperforms existing methods, achieving AUROC improvements of 3% to 17%, with enhanced interpretability.

4021Finetuning Weather Foundation Models to Develop Climate Model Parameterizations

[openreview] [pdf]

Abstract Climate prediction models parameterize a range of atmospheric-oceanic processes like clouds, turbulence, and gravity waves. These physical parameterizations are a leading source of uncertainty and strongly influence future projections of global temperature rise. We present a fresh approach to developing parameterizations for coarse-climate models by leveraging pre-trained AI foundation models (FMs) for weather and climate. A pre-trained encoder and decoder from a 2.3 billion parameter FM (NASA and IBM’s Prithvi WxC) --- which contains a latent probabilistic representation of atmospheric evolution --- is fine-tuned to create a data-driven predictor of atmospheric gravity waves (GWs). Current climate models are not fine enough to resolve GWs. We create an ML-based parameterization that learns GW fluxes from high-resolution ``GW resolving" climate models to represent them in “GW missing” coarse-climate models. The fluxes predicted by our fine-tuned model are comprehensively evaluated using a set of three tests. Comparison with a baseline (Attention U-Net) reveals the superior predictive performance of the fine-tuned model throughout the atmosphere. The model outperforms the baseline even in regions excluded from the FM pre-training. This is quantified using the Hellinger distance which is 0.11 for the baseline and 0.06, i.e., roughly half, for the fine-tuned model. FMs are largely unexplored in climate science. Our findings emphasize their versatility and reusability to accomplish a range of weather- and climate-related downstream applications, especially in a low-data regime. These FMs can be further leveraged to create new parameterizations for other earth-system processes.

4022Model Equality Testing: Which Model is this API Serving?

[openreview] [pdf]

Abstract Users often interact with large language models through black-box inference APIs, both for closed- and open-weight models (e.g., Llama models are popularly accessed via Amazon Bedrock and Azure AI Studio). In order to cut costs or add functionality, API providers may quantize, watermark, or finetune the underlying model, changing the output distribution --- often without notifying users. We formalize detecting such distortions as Model Equality Testing, a two-sample testing problem, where the user collects samples from the API and a reference distribution and conducts a statistical test to see if the two distributions are the same. We find that tests based on the Maximum Mean Discrepancy between distributions are powerful for this task: a test built on a simple string kernel achieves a median of 77.4% power against a range of distortions, using an average of just 10 samples per prompt. We then apply this test to commercial inference APIs for four Llama models, finding that 11 out of 31 endpoints serve different distributions than reference weights released by Meta.

4023Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models

[openreview] [pdf]

Abstract Recent studies indicate that effectively utilizing inference-time compute is crucial for attaining good performance from large language models (LLMs). Specifically, the Best-of-N (BoN) inference strategy, where an LLM generates multiple responses and a verifier selects the best, has shown strong empirical performance. Motivated by this, we develop a novel inference-aware fine-tuning paradigm, which encompasses the BoN-aware inference framework as a special case. We devise the first imitation learning and reinforcement learning (RL) methods for fine-tuning LLMs using BoN, overcoming the challenging, non-differentiable argmax operator in BoN. We empirically demonstrate that our BoN-aware models implicitly learn a per-example “meta-strategy”, which interleaves best responses with more diverse responses that might be better suited to a test-time input—a process reminiscent of the exploration-exploitation trade-off in RL. Our experiments demonstrate the effectiveness of BoN-aware fine-tuning in terms of improved performance and inference-time compute. In particular, we show that our methods improve the BoN performance of Gemma 2B on Hendrycks MATH from 26.8% to 30.8%, and Pass@K from 60% to 67%.

4024RAEE: A Robust Retrieval-Augmented Early Exit Framework for Efficient Inference

[openreview] [pdf]

Abstract Deploying large language model inference remains challenging due to their high computational overhead. Early exit optimizes model inference by adaptively reducing the number of inference layers. Current methods typically train internal classifiers to determine whether to exit at intermediate layers. However, such classifier-based early exit frameworks require significant effort to train the classifiers while can only achieve comparable performance at best. To address these limitations, this paper proposes RAEE, a robust Retrieval-Augmented Early Exit framework for efficient inference. This paper first demonstrates that the early exit problem can be effectively modeled as a distribution prediction problem, in which the distribution is approximated through the exit information of similar data. Subsequently, it outlines the methodology for collecting exit information to construct the retrieval database. Finally, leveraging the pre-constructed retrieval database, RAEE utilizes the exit information from retrieved similar data to guide the backbone model’s exit at the layer. Experimental results demonstrate that RAEE significantly accelerates inference while achieving robust zero-shot performance across eight downstream tasks.

4025From Abstract Noise to Architectural Form: Designing Diffusion Models for Efficient Floor Plan Generation

[openreview] [pdf]

Abstract In contemporary architectural design, the generation of innovative and efficient floor plans remains a critical challenge. This research introduces a novel application of diffusion models, specifically adapted for the generation of architectural floor plans. Unlike traditional generative models that broadly target image generation, our approach harnesses the state-of-the-art in diffusion technology to produce detailed, functional, and visually appealing architectural designs. We demonstrate that diffusion models, when finely tuned and conditioned, not only embrace ‘implicit, human-learned’ architectural semantics but also enhance design efficiency and creativity. The paper details our methodology from adapting the U-Net architecture within diffusion frameworks to incorporating advanced upscaling techniques, significantly reducing computational overhead while maintaining high-resolution outputs. Our results show a promising direction for integrating AI in architectural design, opening new avenues for automated, creative design processes that could revolutionize the industry.

4026Improved Convex Decomposition with Ensembling and Boolean Primitives

[openreview] [pdf]

Abstract Describing a scene in terms of primitives -- geometrically simple shapes that offer a parsimonious but accurate abstraction of structure -- is an established and difficult fitting problem. Different scenes require different numbers of primitives, and these primitives interact strongly; however, any proposed solution can be evaluated at inference time. The state of the art method involves a learned regression procedure to predict a start point consisting of a fixed number of primitives, followed by a descent method to refine the geometry and remove redundant primitives. Methods are evaluated by accuracy in depth and normal prediction and in scene segmentation. This paper shows that very significant improvements in accuracy can be obtained by (a) incorporating a small number of \emph{negative} primitives and (b) ensembling over a number of different regression procedures. Ensembling is by refining each predicted start point, then choosing the best by fitting loss. Extensive experiments on the standard NYUv2 dataset confirm that negative primitives are useful, and that our refine-then-choose strategy outperforms choose-then-refine, confirming that the fitting problem is very difficult. Our ensembling with boolean primitives approach strongly outperforms existing methods; additionally we present several improvements to the underlying primitive generation process enabling us to obtain better decompositions with fewer primitives. Code will be released upon acceptance of the paper.

4027Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection

[openreview] [pdf]

Abstract Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility by incorporating external contexts. However, the input length grows linearly in the number of retrieved documents, causing a dramatic increase in latency. In this paper, we propose a novel paradigm named Sparse RAG, which seeks to cut computation costs through sparsity. Specifically, Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents. Then, LLMs selectively decode the output by only attending to highly relevant caches auto-regressively, which are chosen via prompting LLMs with special control tokens. It is notable that Sparse RAG combines the assessment of each individual document and the generation of the response into a single process. The designed sparse mechanism in a RAG system can facilitate the reduction of the number of documents loaded during decoding for accelerating the inference of the RAG system. Additionally, filtering out undesirable contexts enhances the model’s focus on relevant context, inherently improving its generation quality. Evaluation results on four datasets show that Sparse RAG can be used to strike an optimal balance between generation quality and computational efficiency, demonstrating its generalizability across tasks.

4028Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification

[openreview] [pdf]

Abstract Large Language Models (LLM) are increasingly trained on data generated by other LLM, either because generated text and images become part of the pre-training corpus, or because synthetized data is used as a replacement for expensive human-annotation. This raises concerns aboutmodel collapse, a drop in model performance when their training sets include generated data. Considering that it is easier for both humans and machines to tell between good and bad examples than to generate high-quality samples, we investigate the use of verification on synthesized data to prevent model collapse. We provide a theoretical characterization using Gaussian mixtures, linear classifiers, and linear verifiers to derive conditions with measurable proxies to assess whether the verifier can effectively select synthesized data that leads to optimal performance. We experiment with two practical tasks -- computing matrix eigenvalues with transformers and news summarization with LLMs -- which both exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse and that our proposed proxy measure strongly correlates with performance.

4029Advancing Neural Network Performance through Emergence-Promoting Initialization Scheme

[openreview] [pdf]

Abstract We introduce a novel yet straightforward neural network initialization scheme that modifies conventional methods like Xavier and Kaiming initialization. Inspired by the concept of emergence and leveraging the emergence measures proposed by Li (2023), our method adjusts the layer-wise weight scaling factors to achieve higher emergence values. This enhancement is easy to implement, requiring no additional optimization steps for initialization compared to GradInit. We evaluate our approach across various architectures, including MLP and convolutional architectures for image recognition, and transformers for machine translation. We demonstrate substantial improvements in both model accuracy and training speed, with and without batch normalization. The simplicity, theoretical innovation, and demonstrable empirical advantages of our method make it a potent enhancement to neural network initialization practices. These results suggest a promising direction for leveraging emergence to improve neural network training methodologies.

4030Tool Unlearning for Tool Augmented LLMs

[openreview] [pdf]

Abstract Tool-augmented large language models (LLMs) may need to forget learned tools due to security concerns, privacy restrictions, or deprecated tools. However, unlearning tool has not been explored in prior machine unlearning works.We propose tool unlearning, a novel machine unlearning task that deletes already acquired tools. Compared to traditional unlearning, tool unlearning exhibits certain differences and difficulties: 1) knowledge removal instead of forgetting samples, 2) significant cost of optimizing LLMs, 3) lack of principled evaluation tools.To bridge this gap, we introduce three properties for effective tool unlearning and propose ToolDelete, the first unlearning method designed for tool-augmented LLMs. We also propose the first membership inference attack (MIA) model for evaluating tool unlearning.Experiments on three tool learning datasets and tool-augmented LLMs demonstrate that ToolDelete effectively unlearns both randomly selected tools and tools from specific categories. The unlearning behavior does not impact the LLM’s knowledge on non-deleted tools, while preserving performances on other general tasks.

4031AdaFM: Adaptive Variance-Reduced Algorithm for Stochastic Minimax Optimization

[openreview] [pdf]

Abstract In stochastic minimax optimization, variance-reduction techniques have been widely developed to mitigate the inherent variances introduced by stochastic gradients. Most of these techniques employ carefully designed estimators and learning rates, successfully reducing variance. Although these approaches achieve optimal theoretical convergence rates, they require the careful selection of numerous hyperparameters, which heavily depend on problem-dependent parameters. This complexity makes them difficult to implement in practical model training. To address this, our paper introduces Adaptive Filtered Momentum (AdaFM), an adaptive variance-reduced algorithm for stochastic minimax optimization. AdaFM adaptively adjusts hyperparameters based solely on historical estimator information, eliminating the need for manual parameter tuning. Theoretical results show that AdaFM can achieve a near-optimal sample complexity of O(ϵ3)O(\epsilon^{-3}) to find an ϵ\epsilon-stationary point in non-convex-strongly-concave and non-convex-Polyak-\L ojasiewicz objectives, matching the performance of the best existing non-parameter-free algorithms. Extensive experiments across various applications validate the effectiveness and robustness of AdaFM.

4032Bayesian Tree-Dependent Factorization

[openreview] [pdf]

Abstract We propose Bayesian Tree-Dependent Factorization (BTF), a novel probabilistic representation learning model that uncovers hierarchical, continuous latent factors in complex datasets. BTF constructs a tree-based model that discovers interpretable factorizations of the data wherein each factor has a conditional relationship to its parent, allowing it to capture both global and local effects. This approach is particularly well-suited for biological data, where traditional methods like PCA fail to capture higher-order dependencies and hierarchical structure. A significant contribution of this work is the multi-view extension of BTF, which allows for the joint analysis of multiple data modalities. By learning shared loadings across views while maintaining distinct factors for each modality, multi-view BTF improves performance and enables deeper insights into the relationships between different data types. We demonstrate the performance of BTF in simulations as well as in a real-world application to gene expression and clinical data in breast cancer patients, revealing biologically and clinically meaningful patient trends, and showing that BTF is a valuable representation learning tool for analysis and hypothesis generation.

4033Training Neural Networks on Data Sources with Unknown Reliability

[openreview] [pdf]

Abstract When data is generated by multiple sources, conventional training methods update models assuming equal reliability for each source and do not consider their individual data quality during training. However, in many applications, sources have varied levels of reliability that can have negative effects on the performance of a neural network. A key issue is that often the quality of data for individual sources is not known during training. Focusing on supervised learning, we aim to train neural networks on each data source for a number of steps proportional to the source’s estimated relative reliability, by using a dynamic weighting. This way, we allow training on all sources during the warm-up, and reduce learning on less reliable sources during the final training stages, when it has been shown models overfit to noise. We show through diverse experiments, this can significantly improve model performance when trained on mixtures of reliable and unreliable data sources, and maintain performance when models are trained on reliable sources only.

4034Latent Task-Specific Graph Network Simulators

[openreview] [pdf]

Abstract Simulating object deformations is a critical challenge in many scientific domains, with applications ranging from robotics to materials science. Learned Graph Network Simulators (GNSs) are an efficient alternative to traditional mesh-based physics simulators. Their speed and inherent differentiability make them particularly well-suited for inverse design problems such as process optimization. However, these applications typically offer limited available data, making GNSs difficult to use in real-world scenarios. We frame mesh-based simulation as a meta-learning problem and apply conditional Neural Processes to adapt to new simulation scenarios with little data. In addition, we address the problem of error accumulation common in previous step-based methods by combining this approach with movement primitives, allowing efficient predictions of full trajectories. We validate the effectiveness of our approach, called Movement-primitive Meta-MeshGraphNet (M3GN), through a variety of experiments, outperforming state-of-the-art step-based baseline GNSs and step-based meta-learning methods.

4035High-Dimensional Bayesian Optimisation with Gaussian Process Prior Variational Autoencoders

[openreview] [pdf]

Abstract Bayesian optimisation (BO) using a Gaussian process (GP)-based surrogate model is a powerful tool for solving black-box optimisation problems but does not scale well to high-dimensional data. Previous works have proposed to use variational autoencoders (VAEs) to project high-dimensional data onto a low-dimensional latent space and to implement BO in the inferred latent space. In this work, we propose a conditional generative model for efficient high-dimensional BO that uses a GP surrogate model together with GP prior VAEs. A GP prior VAE extends the standard VAE by conditioning the generative and inference model on auxiliary covariates, capturing complex correlations across samples with a GP. Our model incorporates the observed target quantity values as auxiliary covariates learning a structured latent space that is better suited for the GP-based BO surrogate model. It handles partially observed auxiliary covariates using a unifying probabilistic framework and can also incorporate additional auxiliary covariates that may be available in real-world applications. We demonstrate that our method improves upon existing latent space BO methods on simulated datasets as well as on commonly used benchmarks.

4036Query Efficient Nonsmooth Stochastic Black-Box Bilevel Optimization with Bregman Distance

[openreview] [pdf]

Abstract Bilevel optimization (BO) has recently gained significant attention in various machine learning applications due to its ability to model the hierarchical structures inherent in these problems. Several gradient-free methods have been proposed to address stochastic black-box bilevel optimization problems, where the gradients of both the upper and lower-level objective functions are unavailable. However, these methods suffer from high query complexity and do not accommodate more general bilevel problems involving nonsmooth regularization. In this paper, we present a query-efficient method that effectively leverages Bregman distance to solve nonsmooth stochastic black-box bilevel optimization problems. More importantly, we provide a non-asymptotic convergence analysis, showing that our method requires only O(d1(d1+d2)2ϵ2)\mathcal{O}({d_1(d_1+d_2)^2}{\epsilon^{-2}}) queries to reach the ϵ\epsilon-stationary point. Additionally, we conduct experiments on data hyper-cleaning and hyper-representation learning tasks, demonstrating that our algorithms outperform existing bilevel optimization methods.

4037MOTIONFLOW:Learning Implicit Motion Flow for Complex Camera Trajectory Control in Video Generation

[openreview] [pdf]

Abstract Generating videos guided by camera trajectories poses significant challenges in achieving consistency and generalizability, particularly when both camera and object motions are present. Existing approaches often attempt to learn these motions separately, which may lead to confusion regarding the relative motion between the camera and the objects. To address this challenge, we propose a novel approach that integrates both camera and object motions by converting them into the motion of corresponding pixels. Utilizing a stable diffusion network, we effectively learn reference motion maps in relation to the specified camera trajectory. These maps, along with an extracted semantic object prior, are then fed into an image-to-video network to generate the desired video that can accurately follow the designated camera trajectory while maintaining consistent object motions. Extensive experiments verify that our model outperforms SOTA methods by a large margin.

4038Adapting Communicating MLLMs on the Fly in Referring Expression Tasks

[openreview] [pdf]

Abstract Multimodal Large Language Models (MLLMs) exhibit varying comprehension levels in language and perception that complicate interacting with a diverse population of agents, similar to how miscommunication happens in humans, e.g., because intentions are not always known. In this work, we investigate whether MLLMs can adapt to the perceptual weaknesses of the communication partners in an online manner, i.e. change the way they describe their environment in a way that is understandable to their partner while communicating with them, via reinforcement learning. We experiment with two tasks: referring expression identification (REI) and referring expression segmentation (RES), where a speaker agent has to describe an object, and a listener has to identify it. To be successful, the speaker agent must discern the comprehension level of the listener and adapt accordingly, especially when the listener suffers from perceptual weaknesses such as color blindness or blurred vision. Unlike traditional offline alignment methods for LLMs, we fine-tune a Multimodal LLM (MLLM) online to adapt to other agents’ conceptual understanding. Our experiments with four MLLMs on four datasets show that online adaptation is feasible in both REI and RES settings.

4039Balancing Differential Discriminative Knowledge For Clothing-Irrelevant Lifelong Person Re-identification

[openreview] [pdf]

Abstract Lifelong person re-identification (L-ReID) focuses on learning sequentially collected datasets from different domains to match the same person. Advanced L-ReID methods typically balance the domain gap between different datasets via domain knowledge modeling, such as knowledge rectification or distribution prototyping. However, existing methods dismiss balancing discriminative knowledge within different datasets, resulting in conflicts when sequentially accumulating differential discriminative information in different datasets, e.g., sequentially learning cloth-changing/cloth-consistent knowledge simultaneously, which brings critical catastrophic forgetting problems of old discriminative knowledge. In this paper, we focus on a new but practical task called Cloth-Irrelevant Lifelong Per- sue, we proposed an Adaptive Discriminative Knowledge Consolidation (ADKC) framework to balance the discriminative information of different domains on L-ReID. Specifically, we propose a Selective Knowledge Forgetting (SKF) module to correct potential overfitting to specific discrimination (e.g., clothing information) based on new knowledge. In addition, we design a Selective Knowledge Retention (SKR) module to adaptively compensate for the potential lack of discriminative information based on old knowledge and accelerate differential discrimination into a unified framework. To validate our method, two CIL-ReID benchmarks are first established, while extensive experiments on the above two benchmark datasets demonstrate that our method leads to existing advanced methods in the CIL-ReID task.

4040TabUnite: Efficient Encoding Schemes for Flow and Diffusion Tabular Generative Models

[openreview] [pdf]

Abstract Flow matching and diffusion generative models for tabular data face challenges in modeling heterogeneous feature interrelationships, especially in data with continuous and categorical input features. Capturing these interrelationships is crucial as it allows these models to understand complex patterns and dependencies in the underlying data. A promising option to address the challenge is to devise suitable encoding schemes for the input features before the generative modeling process. However, prior methods often rely on either suboptimal heuristics such as one-hot encoding of categorical features followed by separated modeling of categorical/continuous features, or latent space diffusion models. Instead, our proposed solution unifies the data space and jointly applies a single generative process across all the encodings, efficiently capturing heterogeneous feature interrelationships. Specifically, it employs encoding schemes such as PSK Encoding, Dictionary Encoding, and Analog Bits that effectively convert categorical features into continuous ones. Extensive experiments on datasets comprised of heterogeneous features demonstrate that our encoding schemes, combined with Flow Matching or Diffusion as our choice of generative model, significantly enhance model capabilities. Our TabUnite models help address data heterogeneity, achieving superior performance across a broad suite of datasets, baselines, and benchmarks while generating accurate, robust, and diverse tabular data.

4041Writing in the Margins: Better Inference Patterns for Long-Context Retrieval

[openreview] [pdf]

Abstract In this paper, we introduce Writing in the Margins (WiM), a new inference pattern for Large Language Models designed to optimize the handling of long input sequences in retrieval-oriented tasks. This approach leverages the chunked prefill of the key-value cache to perform segment-wise inference, which enables efficient processing of extensive contexts along with the generation and classification of intermediate information (“margins”) that guide the model towards specific tasks. This method increases computational overhead marginally while significantly enhancing the performance of off-the-shelf models without the need for fine-tuning. Specifically, we observe that WiM provides an average enhancement of 7.5% in accuracy for reasoning skills (HotpotQA, MultiHop-RAG) and a 30.0% increase in the F1-score for aggregation tasks (CWE). Additionally, we show how the proposed pattern fits into an interactive retrieval design that provides end-users with ongoing updates about the progress of context processing, and pinpoints the integration of relevant information into the final response. We release our implementation of WiM using Hugging Face Transformers library at .

4042MentalChat16K: A Benchmark Dataset for Conversational Mental Health Assistance

[openreview] [pdf]

Abstract We introduce MentalChat16K, an English benchmark dataset combining a synthetic mental health counseling dataset and a dataset of anonymized transcripts from interventions between Behavioral Health Coaches and Caregivers of patients in palliative or hospice care. Covering a diverse range of conditions like depression, anxiety, and grief, this curated dataset is designed to facilitate the development and evaluation of large language models for conversational mental health assistance. By providing a high-quality resource tailored to this critical domain, MentalChat16K aims to advance research on empathetic, personalized AI solutions to improve access to mental health support services. The dataset prioritizes patient privacy, ethical considerations, and responsible data usage. MentalChat16K presents a valuable opportunity for the research community to innovate AI technologies that can positively impact mental well-being.

4043Next state prediction gives rise to entangled, yet compositional representations of objects

[openreview] [pdf]

Abstract Compositional representations are thought to enable humans to generalize across combinatorially vast state spaces. Models with learnable object slots, which encode information about objects in separate latent codes, have shown promise for this type of generalization but rely on strong architectural priors. Models with distributed representations, on the other hand, use overlapping, potentially entangled neural codes, and their ability to support compositional generalization remains underexplored. In this paper we examine whether distributed models can develop linearly separable representations of objects, like slotted models, through unsupervised training on videos of object interactions. We show that, surprisingly, models with distributed representations often match or outperform models with object slots in downstream prediction tasks. Furthermore, we find that linearly separable object representations can emerge without object-centric priors, with auxiliary objectives like next-state prediction playing a key role. Finally, we observe that distributed models’ object representations are never fully disentangled, even if they are linearly separable: Multiple objects can be encoded through partially overlapping neural populations while still being highly separable with a linear classifier. We hypothesize that maintaining partially shared codes enables distributed models to better compress object dynamics, potentially enhancing generalization.

4044Self-Alignment for Offline Safe Reinforcement Learning

[openreview] [pdf]

Abstract Deploying an offline reinforcement learning (RL) agent into a downstream task is challenging and faces unpredictable transitions due to the distribution shift between the offline RL dataset and the real environment. To solve the distribution shift problem, some prior works aiming to learn a well-performing and safer agent have employed conservative or safe RL methods in the offline setting. However, the above methods require a process of retraining from scratch or fine-tuning to satisfy the desired criteria for performance and safety. In this work, we propose a Lyapunov conditioned self-alignment method for a transformer-based world model , which does not require retraining and conducts the test-time adaptation for the desired criteria. We show that a transformer-based world model can be described as a model-based hierarchical RL. As a result, we can combine hierarchical RL and our in-context learning for self-alignment in transformers. The proposed self-alignment framework aims to make the agent safe by self-instructing with the Lyapunov condition. In experiments, we demonstrate that our self-alignment algorithm outperforms safe RL methods in continuous control and safe RL benchmark environments in terms of return, costs, and failure rate.

4045Scaling up the Banded Matrix Factorization Mechanism for Large Scale Differentially Private ML

[openreview] [pdf]

Abstract Correlated noise mechanisms such as DP Matrix Factorization (DP-MF) have proven to be effective alternatives to DP-SGD in large-epsilon few-epoch training regimes. Significant work has been done to find the best correlated noise strategies, and the current state-of-the-art approach is DP-BandMF , which optimally balances the benefits of privacy amplification and noise correlation. Despite it’s utility advantages, severe scalability limitations prevent this mechanism from handling large-scale training scenarios where the number of training iterations may be more than 104 and the number of model parameters may exceed 107. In this work, we present techniques to scale up DP-BandMF along these two dimensions, significantly extending it’s reach and enabling it to effectively handle settings with over 106 training iterations and 109 model parameters, with no utility degradation at smaller scales.

4046Redefining Temporal Modeling in Video Diffusion: The Vectorized Timestep Approach

[openreview] [pdf]

Abstract Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models (VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model complex temporal dependencies needed for various tasks like image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model (FVDM), which introduces a novel vectorized timestep variable (VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model’s capacity to capture fine-grained temporal dependencies. FVDM’s flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.

4047Autocorrelation Matters: Understanding the Role of Initialization Schemes for State Space Models

[openreview] [pdf]

Abstract Current methods for initializing state space model (SSM) parameters primarily rely on the HiPPO framework \citep{gu2023how}, which is based on online function approximation with the SSM kernel basis. However, the HiPPO framework does not explicitly account for the effects of the temporal structures of input sequences on the optimization of SSMs. In this paper, we take a further step to investigate the roles of SSM initialization schemes by considering the autocorrelation of input sequences. Specifically, we: (1) rigorously characterize the dependency of the SSM timescale on sequence length based on sequence autocorrelation; (2) find that with a proper timescale, allowing a zero real part for the eigenvalues of the SSM state matrix mitigates the curse of memory while still maintaining stability at initialization; (3) show that the imaginary part of the eigenvalues of the SSM state matrix determines the conditioning of SSM optimization problems, and uncover an approximation-estimation tradeoff when training SSMs with a specific class of target functions.

4048Towards hyperparameter-free optimization with differential privacy

[openreview] [pdf]

Abstract Differential privacy (DP) is a privacy-preserving paradigm that protects the training data when training deep learning models. Critically, the performance of models is determined by the training hyperparameters, especially those of the learning rate schedule, thus requiring fine-grained hyperparameter tuning on the data. In practice, it is common to tune the learning rate hyperparameters through the grid search that (1) is computationally expensive as multiple runs are needed, and (2) increases the risk of data leakage as the selection of hyperparameters is data-dependent. In this work, we adapt the automatic learning rate schedule to DP optimization for any models and optimizers, so as to significantly mitigate or even eliminate the cost of hyperparameter tuning when applied together with automatic per-sample gradient clipping. Our hyperparamter-free DP optimization is almost as computationally efficient as the standard non-DP optimization, and achieves state-of-the-art DP performance on various language and vision tasks.

4049Enhancing Foundation Models for Time Series Forecasting via Wavelet-based Tokenization

[openreview] [pdf]

Abstract There is a major open question about how to best develop foundation models for time series forecasting. Tokenization is a crucial consideration in this effort: what is an effective discrete vocabulary for a real-valued sequential input? To address this question, we develop WaveToken, a wavelet-based tokenizer that allows models to learn complex representations directly in the space of time-localized frequencies. Our method first scales and decomposes the input time series, then thresholds and quantizes the wavelet coefficients, and finally pre-trains an autoregressive model to forecast coefficients for the horizon window. By decomposing coarse and fine structures in the inputs, wavelets provide an eloquent and compact language for time series forecasting that simplifies learning. Empirical results on a comprehensive benchmark, including 42 datasets for both in-domain and zero-shot settings, show that WaveToken: i) provides better accuracy than recently proposed foundation models for forecasting while using a much smaller vocabulary (1024 tokens), and performs on par or better than modern deep learning models trained specifically on each dataset; and ii) exhibits superior generalization capabilities, achieving the best average rank across all datasets for three complementary metrics. In addition, we show that our method can easily capture complex temporal patterns of practical relevance that are challenging for other recent pre-trained models, including trends, sparse spikes, and non-stationary time series with varying frequencies evolving over time.

4050Efficient Continuous Video Flow Model for Video Prediction

[openreview] [pdf]

Abstract Multi-step prediction models, such as diffusion and rectified flow models, have emerged as state-of-the-art solutions for generation tasks. However, these models exhibit higher latency in sampling new frames compared to single-step methods. This latency issue becomes a significant bottleneck when adapting such methods for video prediction tasks, given that a typical 60-second video comprises approximately 1.5K frames. In this paper, we propose a novel approach to modeling the multi-step process, aimed at alleviating latency constraints and facilitating the adaptation of such processes for video prediction tasks. Our approach not only reduces the number of sample steps required to predict the next frame but also minimizes computational demands by reducing the model size to one-third of the original size. We evaluate our method on standard video prediction datasets, including KTH, BAIR action robot, Human3.6M and UCF101, demonstrating its efficacy in achieving state-of-the-art performance on these benchmarks.

4051Vocabulary In-Context Learning in Transformers: Benefits of Positional Encoding

[openreview] [pdf]

Abstract Numerous studies have demonstrated that the Transformer architecture possesses the capability for in-context learning (ICL). In scenarios involving function approximation, context can serve as a control parameter for the model, endowing it with the universal approximation property (UAP). In practice, context is represented by tokens from a finite set, referred to as a vocabulary, which is the case considered in this paper, i.e., vocabulary in-context learning (VICL). We demonstrate that VICL in single-layer Transformers, without positional encoding, does not possess the UAP; however, it is possible to achieve the UAP when positional encoding is included. Several sufficient conditions for the positional encoding are provided. Our findings reveal the benefits of positional encoding from an approximation theory perspective in the context of in-context learning.

4052Privately Counting Partially Ordered Data

[openreview] [pdf]

Abstract We consider differentially private counting when each data point consists of dd bits satisfying a partial order. Our main technical contribution is a problem-specific KK-norm mechanism that runs in time O(d2)O(d^2). Experiments show that, depending on the partial order in question, our solution dominates existing pure differentially private mechanisms and can reduce their error by an order of magnitude or more.

4053Self-supervised Transfer Learning via Adversarial Contrastive Training

[openreview] [pdf]

Abstract Learning a data representation with strong transferability from an unlabeled scenario is both crucial and challenging. In this paper, we propose a novel unbiased self-supervised transfer learning approach via Adversarial Contrastive Training (ACT). Additionally, we establish an end-to-end theoretical understanding for self-supervised contrastive pretraining and its implications for downstream classification tasks in a misspecified, over-parameterized setting. Our theoretical findings highlight the provable advantages of adversarial contrastive training in the source domain towards improving the accuracy of downstream tasks in the target domain. Furthermore, we illustrate that downstream tasks necessitate only a minimal sample size when working with a well-trained representation, offering valuable insights on few-shot learning. Moreover, extensive experiments across various datasets demonstrate a significant enhancement in classification accuracy when compared to existing state-of-the-art self-supervised learning methods.

4054SNAP-TTA: Sparse Test-Time Adaptation for Latency-Sensitive Applications

[openreview] [pdf]

Abstract Test-Time Adaptation (TTA) methods use unlabeled test data to dynamically adjust models in response to distribution changes. However, existing TTA methods are not tailored for practical use on edge devices with limited computational capacity, resulting in a latency-accuracy trade-off. To address this problem, we propose SNAP-TTA, a sparse TTA framework significantly reducing model adaptation frequency and data usage. It achieves competitive accuracy even with an adaptation rate as low as 0.01, meaning the model adapts infrequently and uses only a small portion of the data relative to full adaptation. Our approach involves (i) Class and Domain Representative Memory (CnDRM), which identifies key samples that are both class-representative and domain-representative to facilitate adaptation with minimal data, and (ii) Inference-only Batch-aware Memory Normalization (IoBMN), which leverages representative samples to adjust normalization layers on-the-fly during inference, aligning the model effectively to changing domains. When combined with five state-of-the-art TTA algorithms, SNAP-TTA maintains the performances of these methods even with much-reduced adaptation rates from 0.01 to 0.5, making it suitable for edge devices serving latency-sensitive applications.

4055Optimization Proxies using Limited Labeled Data and Training Time - A Semi-Supervised Bayesian Neural Network Approach

[openreview] [pdf]

Abstract Constrained optimization problems arise in various engineering system operations such as inventory management and electric power grids. However, the requirement to repeatedly solve such optimization problems with uncertain parameters poses a significant computational challenge. This work introduces a learning scheme using Bayesian Neural Networks (BNNs) to solve constrained optimization problems under limited labeled data and restricted model training times. We propose a semi-supervised BNN for this practical but complex regime, wherein training commences in a sandwiched fashion, alternating between a supervised learning step (using labeled data) for minimizing cost, and an unsupervised learning step (using unlabeled data) for enforcing constraint feasibility. Both supervised and unsupervised steps use a Bayesian approach, where Stochastic Variational Inference is employed for approximate Bayesian inference. We show that the proposed semi-supervised learning method outperforms conventional BNN and deep neural network (DNN) architectures on important non-convex constrained optimization problems from energy network operations, achieving up to a tenfold reduction in expected maximum equality gap and halving the optimality and inequality (feasibility) gaps, without requiring any correction or projection step. By leveraging the BNN’s ability to provide posterior samples at minimal computational cost, we demonstrate that a Selection via Posterior (SvP) scheme can further reduce equality gaps by more than 10%. We also provide tight and practically meaningful probabilistic confidence bounds that can be constructed using a low number of labeled testing data and readily adapted to other applications.

4056Exploring One-Shot Federated Learning by Model Inversion and Token Relabel with Vision Transformers

[openreview] [pdf]

Abstract One-Shot Federated Learning, where a central server learns a global model over a network of federated devices in a single round of communication, has recently emerged as a promising approach. For extremely Non-IID data, training models separately on each client results in poor performance, with low-quality generated data that are poorly matched with ground-truth labels. To overcome these issues, we propose a novel Federated Model Inversion and Token Relabel (FedMITR) framework, which trains the global model by better utilizing all patches of the synthetic images. FedMITR employs model inversion during the data generation process, selectively inverting semantic foregrounds while gradually halting the inversion process of uninformative backgrounds. Due to the presence of semantically meaningless tokens that do not positively contribute to ViT predictions, some of the generated pseudo-labels can be utilized to train the global model using patches with high information density, while patches with low information density can be relabeled using ensemble models. Extensive experimental results demonstrate that FedMITR can substantially outperform existing baselines under various settings.

4057Truth-Guided Negative Sampling in Self-supervised Graph Representation Learning

[openreview] [pdf]

Abstract Negative sampling is an important yet challenging component in self-supervised graph representation learning, particularly for recommendation systems where user-item interactions are modeled as bipartite graphs. Existing methods often rely on heuristics or human-specified principles to design negative sampling distributions. This potentially overlooks the usage of an underlying ``true’’ negative distribution, which we might be able to access as an oracle despite not knowing its exact form. In this work, we shift the focus from manually designing negative sampling distributions to a method that approximates and leverages the underlying true distribution. We expand this idea in the analysis of two scenarios: (1) when the observed graph is an unbiased sample from the true distribution, and (2) when the observed graph is biased with partially observable positive edges. The analysis result is the derivation of a sampling strategy as the numerical approximation of a well-established learning objective. Our theoretical findings are also empirically validated, and our new sampling methods achieve state-of-the-art performance on real-world datasets.

4058Context-aware Dynamic Pruning for Speech Foundation Models

[openreview] [pdf]

Abstract Foundation models, such as large language models, have achieved remarkable success in natural language processing and are evolving into models capable of handling multiple modalities. Listening ability, in particular, is crucial for many applications, leading to research on building speech foundation models. However, the high computational cost of these large models presents a significant challenge for real-world applications. Although substantial efforts have been made to reduce computational costs, such as through pruning techniques, the majority of these approaches are applied primarily during the training phase for specific downstream tasks. In this study, we hypothesize that optimal pruned networks may vary based on contextual factors such as speaker characteristics, languages, and tasks. To address this, we propose a dynamic pruning technique that adapts to these contexts during inference without altering the underlying model. We demonstrated that we could successfully reduce inference time by approximately 30% while maintaining accuracy in multilingual/multi-task scenarios. We also found that the obtained pruned structure offers meaningful interpretations based on the context, e.g., task-related information emerging as the dominant factor for efficient pruning.

4059Benchmarks and Custom Package for Energy Forecasting

[openreview] [pdf]

Abstract Energy (load, wind, photovoltaic) forecasting is significant in the power industry as it can provide a reference for subsequent tasks such as power grid dispatch, thus bringing huge economic benefits. However, there are many differences between energy forecasting and traditional time series forecasting. On the one hand, traditional time series mainly focus on capturing characteristics like trends and cycles. In contrast, the energy series is largely influenced by many external factors, such as meteorological and calendar variables. On the other hand, energy forecasting aims to minimize the cost of subsequent tasks such as power grid dispatch, rather than simply pursuing prediction accuracy. In addition, the scale of energy data can also significantly impact the predicted results. In this paper, we collected large-scale load datasets and released a new renewable energy dataset that contains both station-level and region-level renewable generation data with meteorological data. For load data, we also included load domain-specific feature engineering and provided a method to customize the loss function and link the forecasting error to requirements related to subsequent tasks (such as power grid dispatching costs), integrating it into our forecasting framework. Based on such a situation, we conducted extensive experiments with 21 forecasting methods in these energy datasets at different levels under 11 evaluation metrics, providing a comprehensive reference for researchers to compare different energy forecasting models.

4060Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking Using Knowledge Graphs

[openreview] [pdf]

Abstract The attribution of question answering (QA), which is to get evidences for supporting the generated answer, has attracted wide research attention. The current methods for automatically evaluating the attribution, typically relying on Large Language Models (LLMs), are still inadequate, particularly in recognizing subtle differences between attributions, and in measuring complex attribution reasoning. Existing benchmarks, which are primarily based on manual annotations, suffer from limited evaluation settings with incomplete and coarse attribution categories and reasoning scenarios, hindering the evaluation and advancement of attribution evaluators. To address this gap, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark automatically generated using Knowledge Graphs (KGs), containing more comprehensive attribution categories and complex attribution reasoning scenarios. Our experiments with two specifically developed evaluators and nine LLM evaluators reveal that they struggle in identifying negative attribution categories and handling complex attribution reasoning in both zero-shot and few-shot settings, but mostly perform relatively well in the fine-tuning setting. Moreover, all evaluators perform inadequately in fine-grained attribution identification scenarios. The experiments also demonstrate that CAQA is consistent with human annotations, and is promising for selecting and developing more effective attribution evaluators in QA.

4061DSMentor: Enhancing Data Science Agents with Curriculum Learning and Online Knowledge Accumulation

[openreview] [pdf]

Abstract Large language model (LLM) agents have shown promising performance in generating code for solving complex data science problems. Recent studies primarily focus on enhancing in-context learning through improved search, sampling, and planning techniques, while overlooking the importance of the order in which problems are tackled during inference. In this work, we develop a novel inference-time optimization framework, referred to as DSMentor, which leverages curriculum learning---a strategy that introduces simpler task first and progressively moves to more complex ones as the learner improves---to enhance LLM agent performance in challenging data science tasks. Our mentor-guided framework organizes data science tasks in order of increasing difficulty and incorporates a growing long-term memory to retain prior experiences, guiding the agent’s learning progression and enabling more effective utilization of accumulated knowledge. We evaluate DSMentor through extensive experiments on DSEval and QRData benchmarks. Experiments show that DSMentor using Claude-3.5-Sonnet improves the pass rate by up to 5.2% on DSEval and QRData compared to baseline agents. Furthermore, DSMentor demonstrates stronger causal reasoning ability, improving the pass rate by 8.8% on the causality problems compared to GPT-4 using Program-of-Thoughts prompts. Our work underscores the importance of developing effective strategies for accumulating and utilizing knowledge during inference, mirroring the human learning process and opening new avenues for improving LLM performance through curriculum-based inference optimization.

4062ETGL-DDPG: A Deep Deterministic Policy Gradient Algorithm for Sparse Reward Continuous Control

[openreview] [pdf]

Abstract We consider deep deterministic policy gradient (DDPG) in the context of reinforcement learning with sparse rewards. To enhance exploration, we introduce a search procedure, \emph{ϵt{\epsilon}{t}-greedy}, which generates exploratory options for exploring less-visited states. We prove that search using ϵt\epsilon t-greedy has polynomial sample complexity under mild MDP assumptions. To more efficiently use the information provided by rewarded transitions, we develop a new dual experience replay buffer framework, \emph{GDRB}, and implement \emph{longest n-step returns}. The resulting algorithm, \emph{ETGL-DDPG}, integrates all three techniques: \bm{ϵt\epsilon t}-greedy, \textbf{G}DRB, and \textbf{L}ongest nn-step, into DDPG. We evaluate ETGL-DDPG on standard benchmarks and demonstrate that it outperforms DDPG, as well as other state-of-the-art methods, across all tested sparse-reward continuous environments. Ablation studies further highlight how each strategy individually enhances the performance of DDPG in this setting.

4063Rethinking Table Instruction Tuning

[openreview] [pdf]

Abstract Recent advances in table understanding have focused on instruction-tuning large language models (LLMs) for table-related tasks. However, existing research has overlooked the impact of hyperparameter choices and lacks a comprehensive evaluation of the out-of-domain table understanding ability and the general capabilities of these table LLMs. In this paper, we evaluate these abilities in existing table LLMs, and reveal significant declines in both out-of-domain table understanding and general capabilities compared to their base models. Through systematic analysis, we show that hyperparameters, such as learning rate, can significantly influence both table-specific and general capabilities. Contrary to the existing table instruction-tuning works, we demonstrate that smaller learning rates and fewer training instances can enhance table understanding while preserving general capabilities. Based on our findings, we introduceTAMA, aTAble LLM instruction-tuned from LLaMA3.1 8B Instruct, which achieves performance on par with, or surpassing GPT-3.5 and GPT-4 on table tasks, while maintaining strong out-of-domain generalization and general capabilities. Our findings highlight the potential for reduced data annotation costs and more efficient model development through careful hyperparameter selection.

4064Code-of-thought prompting: Probing AI Safety with Code

[openreview] [pdf]

Abstract Large Language Models (LLMs) have rapidly advanced in multiple capabilities, such as text and code understanding, leading to their widespread use in a wide range of applications, such as healthcare, education, and search. Due to the critical nature of these applications, there has been a heightened emphasis on aligning these models to human values and preferences to improve safety and reliability. In this paper, we demonstrate that contemporary efforts fall severely short of the ultimate goal of AI safety and fail to ensure safe, non-toxic outputs. We systematically evaluate the safety of LLMs through a novel model interaction paradigm dubbed Code of Thought (CoDoT) prompting that transforms natural language (NL) prompts into pseudo-code. CoDoT represents NL inputs in a precise, structured, and concise form, allowing us to utilize its programmatic interface to test several facets of AI safety. Under the CoDoT prompting paradigm, we show that a wide range of large language models emit highly toxic outputs with the potential to cause great harm. CoDoT leads to a staggering 16.5× increase in toxicity on GPT-4 TURBO and a massive 4.6 x increase on average, across multiple models and languages. Notably, we find that state-of-the-art mixture-of-experts (MoE) models are approximately 3x more susceptible to toxicity than standard architectures. Our findings raise a troubling concern that recent safety and alignment efforts have regressed LLMs and inadvertently introduced safety backdoors and blind spots. Our work calls for an urgent need to rigorously evaluate the design choices of safety efforts from first principles, given the rapid adoption of LLMs.

4065Diffusion Pretraining for Gait Recognition in the Wild

[openreview] [pdf]

Abstract Recently, diffusion models have garnered much attention for their remarkable generative capabilities. Yet, their application for representation learning remains largely unexplored. In this paper, we explore the possibility of using the diffusion process to pretrain the backbone of a deep learning model for a specific application—gait recognition in the wild. To do so, we condition a latent diffusion model on the output of a gait recognition model backbone. Our pretraining experiments on the Gait3D and GREW datasets reveal an interesting phenomenon: diffusion pretraining causes the gait recognition backbone to separate gait sequences belonging to different subjects further apart than those belonging to the same subjects, which translates to a steady improvement in gait recognition performance. Subsequently, our transfer learning experiments on Gait3D and GREW show that the pretrained backbone can serve as an effective initialization for the downstream gait recognition task, allowing the gait recognition model to achieve better performance within much fewer supervised training iterations. We validated the applicability of our approach across multiple existing gait recognition methods and conducted extensive ablation studies to investigate the impact of different pretraining hyperparameters on the final gait recognition performance.

4066Differentially Private Deep Model-Based Reinforcement Learning

[openreview] [pdf]

Abstract We address private deep offline reinforcement learning (RL), where the goal is to train a policy on standard control tasks that is differentially private (DP) with respect to individual trajectories in the dataset. To achieve this, we introduce PriMORL, a model-based RL algorithm with formal differential privacy guarantees. PriMORL first learns an ensemble of trajectory-level DP models of the environment from offline data. It then optimizes a policy on the penalized private model, without any further interaction with the system or access to the dataset. In addition to offering strong theoretical guarantees, we empirically demonstrate that PriMORL enables the training of private RL agents on offline continuous control tasks with deep function approximations, whereas current methods are limited to simpler tabular and linear Markov Decision Processes (MDPs). We furthermore outline the trade-offs involved in achieving privacy in this setting.

4067The other you in black mirror: first steps from chatbots to personalized LLM clones

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable abilities in a wide variety of generic tasks. Here we investigate whether it is possible to use LLMs to partially replicate cognitive aspects of an individual by fine-tuning an LLM with personal data. Our model, A-clone, built on the pretrained Llama3-70B, was fine-tuned with a private dataset from one volunteer referred to as A throughout. We evaluated A-clone in two ways. First, using 701 open-ended questions, we gathered responses from A, A-clone, other LLMs, and A’s family members imitating A. We conducted a Turing-like test where 31 participants with varying degrees of familiarity with A attempted to identify A’s real answers in a question-and-answer task. Human participants identified the genuine responses from A 55% ± 7% of the time, just over chance levels. A-clone outperformed all other baselines in mimicking adequate responses from A. Second, we compared the outputs of A-Clone with the ground truth from A in 10 psychological, moral, career, political tendency, and general knowledge tests, containing 484 questions altogether. A-Clone demonstrated a strong correlation with A’s responses. This work provides an initial, proof-of-principle, evaluation of the possibility of mimicking the responses of an individual, opening doors to many real-world applications but also raising potential privacy and safety concerns about digital clones. The code and data can be found in this link.

4068Vulnerabilities Mitigation for Safety-Aligned Language Models via Debiasing

[openreview] [pdf]

Abstract Safety alignment is a fundamental yet still developing research topic for the real-world applications of AI. Despite the multifaceted nature of safety and trustworthiness in AI, current safety alignment methods often focus on a singular notion of safety. By carefully assessing models from the existing safety-alignment methods, we found that, while they generally improved overall safety performance, they failed to ensure safety in specific categories. Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model’s helpfulness. We found that, while smaller KL penalty parameters, increased training iterations, and dataset cleansing can enhance safety, they do not necessarily improve the trade-off between safety and helpfulness. We discovered that safety alignment can induce undesired effects and result in a model that prefers generating negative tokens leading to rejective responses, regardless of the input context. To address this, we introduced a learning-free method, Token-level Safety-Debiased Inference (TSDI), to estimate and correct this bias during the generation process using randomly constructed prompts. Our experiments demonstrated that our method could enhance the model’s helpfulness while maintaining safety, thus improving the trade-off Pareto-front.

4069Mitigating Object Hallucination in Large Vision Language Model with Human-Free Reinforcement Learning

[openreview] [pdf]

Abstract Large Vision-Language Models (LVLMs) have excelled in joint visual and language understanding, particularly in generating detailed image captions. However, they still struggle with object hallucination, where non-existent objects are described, especially in long captions. While fine-tuning through supervised learning with enhanced datasets or reinforcement learning from human feedback can alleviate this issue, these methods demand considerable human effort, limiting scalability. This paper addresses this challenge by introducing a human-free framework to mitigate object hallucination in LVLMs for image captioning, utilizing reinforcement learning driven exclusively by automatic natural language processing metrics. We demonstrate that the following framework can effectively mitigate hallucination: (1) caption generation is formulated as a Markov Decision Process (MDP); (2) minimizing hallucination while maintaining caption quality is guided by a reward function, combining a proposed \textit{F1Score} with a penalty on Kullback–Leibler divergence from the pre-trained model; (3) fine-tuning the LVLM within the MDP framework can be performed directly by Proximal Policy Optimization (PPO) with careful attention to architectural details. Extensive experiments demonstrate a significant reduction in hallucination by up to 41% while preserving the caption quality compared to the baseline model, InstructBLIP, on the COCO dataset. This improvement is reflected in consistent gains in object coverage and accuracy across various models and datasets. Notably, our method achieves comparable or superior performance to alternative approaches, all without requiring any human involvement.

4070Early learning of the optimal constant solution in neural networks and humans

[openreview] [pdf]

Abstract Deep neural networks learn increasingly complex functions over the course of training. Here, we show both empirically and theoretically that learning of the target function is preceded by an early phase in which networks learn the optimal constant solution (OCS) – that is, initial model responses mirror the distribution of target labels, while entirely ignoring information provided in the input. Using a hierarchical category learning task, we derive exact solutions for learning dynamics in deep linear networks trained with bias terms. Even when initialized to zero, this simple architectural feature induces substantial changes in early dynamics. We identify hallmarks of this early OCS phase and illustrate how these signatures are observed in deep linear networks and larger, more complex (and nonlinear) convolutional neural networks solving a hierarchical learning task based on MNIST and CIFAR10. We explain these observations by proving that deep linear networks necessarily learn the OCS during early learning. To further probe the generality of our results, we train human learners over the course of three days on the category learning task. We then identify qualitative signatures of this early OCS phase in terms of the dynamics of true negative (correct-rejection) rates. Surprisingly, we find the same early reliance on the OCS in the behaviour of human learners. Finally, we show that learning of the OCS can emerge even in the absence of bias terms and is equivalently driven by generic correlations in the input data. Overall, our work suggests the OCS as a universal learning principle in supervised, error-corrective learning, and the mechanistic reasons for its prevalence.

4071Improving Transformer Interpretability with Activation Contrast-Based Attribution

[openreview] [pdf]

Abstract Transformers have revolutionized AI research, particularly in natural language processing (NLP). However, understanding the decisions made by transformer-based models remains challenging, which impedes trust and safe deployment in real-world applications. While activation-based attribution methods have proven effective in explaining transformer-based text classification models, our findings suggest that they may suffer from class-irrelevant features within activations, potentially degrading the quality of their interpretations. To address this issue, we introduce Contrast-CAT, a novel activation contrast-based attribution method that improves token-level attribution by filtering out class-irrelevant features from activations. Contrast-CAT enhances interpretability by contrasting the activations of input sequences with reference activations, allowing for the generation of clearer and more faithful attribution maps. Our experiments demonstrate that Contrast-CAT consistently outperforms state-of-the-art methods across various datasets and models, achieving significant gains over the second-best methods with average improvements in AOPC and LOdds by ×1.30\times 1.30 and ×2.25\times 2.25, respectively, under the MoRF setting. Contrast-CAT provides a promising step forward in enhancing the interpretability and transparency of transformer-based models.

4072Provably Accurate Shapley Value Estimation via Leverage Score Sampling

[openreview] [pdf]

Abstract Originally introduced in game theory, Shapley values have emerged as a central tool in explainable machine learning, where they are used to attribute model predictions to specific input features. However, computing Shapley values exactly is expensive: for a model with nn features, O(2n)O(2^n) model evaluations are necessary. To address this issue, approximation algorithms are widely used. One of the most popular is the Kernel SHAP algorithm, which is model agnostic and remarkably effective in practice. However, to the best of our knowledge, Kernel SHAP has no strong non-asymptotic complexity guarantees. We address this issue by introducingLeverage SHAP, a light-weight modification of Kernel SHAP that provides provably accurate Shapley value estimates with just O(nlogn)O(n\log n) model evaluations. Our approach takes advantage of a connection between Shapley value estimation and agnostic active learning by employingleverage score sampling, a powerful regression tool. Beyond theoretical guarantees, we show that Leverage SHAP consistently outperforms even the highly optimized implementation of Kernel SHAP available in the ubiquitous SHAP library [Lundberg & Lee, 2017].

4073Grond: A Stealthy Backdoor Attack in Model Parameter Space

[openreview] [pdf]

Abstract Recent research on backdoor attacks mainly focuses on invisible triggers in input space and inseparable backdoor representations in feature space to increase the backdoor stealthiness against defenses. We examine common backdoor attack practices that look at input-space or feature-space stealthiness and show that state-of-the-art stealthy input-space and feature-space backdoor attacks can be easily spotted by examining the parameter space of the backdoored model. Leveraging our observations on the behavior of the defenses in the parameter space, we propose a novel clean-label backdoor attack called Grond. We present extensive experiments showing that Grond outperforms state-of-the-art backdoor attacks on CIFAR-10, GTSRB, and a subset of ImageNet. Our attack limits the parameter changes through Adversarial Backdoor Injection, adaptively increasing the parameter-space stealthiness. Finally, we show how combining Grond’s Adversarial Backdoor Injection with commonly used attacks can consistently improve their effectiveness. Our code is available at \url{https://anonymous.4open.science/r/grond-557F}.

4074uniINF: Best-of-Both-Worlds Algorithm for Parameter-Free Heavy-Tailed MABs

[openreview] [pdf]

Abstract In this paper, we present a novel algorithm,uniINF, for the Heavy-Tailed Multi-Armed Bandits (HTMAB) problem, demonstrating robustness and adaptability in both stochastic and adversarial environments. Unlike the stochastic MAB setting where loss distributions are stationary with time, our study extends to the adversarial setup, where losses are generated from heavy-tailed distributions that depend on both arms and time. Our novel algorithmuniINFenjoys the so-called Best-of-Both-Worlds (BoBW) property, performing optimally in both stochastic and adversarial environmentswithoutknowing the exact environment type. Moreover, our algorithm also possesses a Parameter-Free feature,i.e., it operateswithoutthe need of knowing the heavy-tail parameters (σ,α)(\sigma, \alpha) a-priori. To be precise,uniINFensures nearly-optimal regret in both stochastic and adversarial environments, matching the corresponding lower bounds when (σ,α)(\sigma, \alpha) is known (up to logarithmic factors). To our knowledge,uniINFis the first parameter-free algorithm to achieve the BoBW property for the heavy-tailed MAB problem. Technically, we develop innovative techniques to achieve BoBW guarantees for Parameter-Free HTMABs, including a refined analysis for the dynamics of log-barrier, an auto-balancing learning rate scheduling scheme, an adaptive skipping-clipping loss tuning technique, and a stopping-time analysis for logarithmic regret.

4075DA-Bench: Benchmarking Unsupervised Domain Adaptation Methods with Realistic Validation On Diverse Modalities

[openreview] [pdf]

Abstract Unsupervised Domain Adaptation (DA) consists of adapting a model trained on a labeled source domain to perform well on an unlabeled target domain with some data distribution shift. While many methods have been proposed in the literature, fair and realistic evaluation remains an open question, particularly due to methodological difficulties in selecting hyperparameters in the unsupervised setting. With DA-Bench, we propose a framework to evaluate DA methods on diverse modalities, beyond computer vision task that have been largely explored in the literature. We present a complete and fair evaluation of existing shallow algorithms, including reweighting, mapping, and subspace alignment. Realistic hyperparameter selection is performed with nested cross-validation and various unsupervised model selection scores, on both simulated datasets with controlled shifts and real-world datasets across diverse modalities, such as images, text, biomedical, and tabular data. Our benchmark highlights the importance of realistic validation and provides practical guidance for real-life applications, with key insights into the choice and impact of model selection approaches. DA-Bench is open-source, reproducible, and can be easily extended with novel DA methods, datasets, and model selection criteria without requiring re-evaluating competitors.

4076Mimetic Initialization Helps State Space Models Learn to Recall

[openreview] [pdf]

Abstract Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks due to the fact that their state size is constant with respect to their input sequence length. But in practice, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention’’ maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.

4077Build Roadmap for Automated Feature Transformation: A Graph-based Reinforcement Learning Approach

[openreview] [pdf]

Abstract Feature transformation task aims to generate high-value features and improve the performance of downstream machine learning tasks using the mathematical feature-feature crossing. Current frameworks rely on iterative sequence generation with exploration optimization through performance feedback from downstream tasks. However, these approaches fail to effectively utilize historical decision-making experiences and overlook potential relationships among generated features, thus limiting the flexibility of the whole process. Moreover, the decision-making process lacks dynamic backtracking capabilities for each feature, leading to insufficient adaptability when encountering inefficient pathways, adversely affecting overall robustness and exploration stability.To overcome these challenges, we present an innovative framework that employs a feature-state transformation graph to maintain the roadmap of feature transformation, with each node symbolizing a transformation state. During exploration, three cascading agents sequentially select nodes and mathematical operations to generate new transformation states. This strategy leverages the graph structure’s inherent properties, allowing for the preservation and reuse of sight-seen and valuable transformations. It also enables back-tracking capabilities through graph pruning techniques, which can rectify inefficient transformation paths. To validate the efficacy and flexibility of our approach, we conducted comprehensive experiments and detailed case studies, demonstrating superior performance in diverse datasets.

4078Revisiting On-Policy Deep Reinforcement Learning

[openreview] [pdf]

Abstract On-policy Reinforcement Learning (RL) offers desirable features such as stable learning, fewer policy updates, and the ability to evaluate a policy’s return during training. While recent efforts have focused on off-policy methods, achieving significant advancements, Proximal Policy Optimization (PPO) remains the go-to algorithm for on-policy RL due to its apparent simplicity and effectiveness. However, despite its apparent simplicity, PPO is highly sensitive to hyperparameters and depends on subtle and poorly documented tweaks that can make or break its success--hindering its applicability in complex problems. In this paper, we revisit on-policy deep RL with a focus on improving PPO, by introducing principled solutions that enhance its performance while eliminating the need for extensive hyperparameter tuning and implementation-level optimizations. Our effort leads to PPO+, a methodical adaptation of the PPO algorithm that adheres closer to its theoretical foundations. PPO+ sets a new state-of-the-art for on-policy RL on MuJoCo control problems while maintaining a straightforward trick-free implementation. Beyond just performance, our findings offer a fresh perspective on on-policy RL that could reignite interest in these approaches.

4079WHAT YOU PAINT IS WHAT YOU GET

[openreview] [pdf]

Abstract The two most prominent approaches for building adversary-resilient image classification models are adversarial training and input transformations. Despite significant advancements, adversarial training approaches struggle to generalize to unseen attacks, and the effectiveness of input transformations diminishes fast in the face of large perturbations. In general, there is a large space for improving the inherent trade-off between the accuracy and robustness of adversary-resilient models. Painting algorithms, which have not been used in adversarial training pipelines so far, capture core visual elements of images and offer a potential solution to the challenges faced by current defenses. This paper reveals a correlation between the magnitude of perturbations and the granularity of the painting process required to maximize the classification accuracy. We leverage this correlation in the proposed Painter-CLassifier-Decisioner (PCLD) framework, which employs adversarial training to build an ensemble of classifiers applied to a sequence of paintings with varying detalization. Benchmarks using provable adaptive attack techniques demonstrate the favorable performance of PCLD compared to state-of-the-art defenses, balancing accuracy and robustness while generalizing to unseen attacks. It extends robustness against substantial perturbations in high-resolution settings across various white-box attack methods under \ell_\infty-norm constraints.

4080The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation

[openreview] [pdf]

Abstract This paper introduces the counter-intuitive generalization results of overfitting pre-trained large language models (LLMs) on very small datasets. In the setting of open-ended text generation, it is well-documented that LLMs tend to generate repetitive and dull sequences, a phenomenon that is especially apparent when generating using greedy decoding. This issue persists even with state-of-the-art LLMs containing billions of parameters, trained via next-token prediction on large datasets. We find that by further fine-tuning these models to achieve a near-zero training loss on a small set of samples -- a process we refer to as hyperfitting -- the long-sequence generative capabilities are greatly enhanced. This phenomenon extends to LLMs of various sizes, different domains, and even autoregressive image generation. We further find this phenomena to be distinctly different from that of Grokking and double descent. Surprisingly, our experiments indicate that hyperfitted models rarely fall into repeating sequences they were trained on, and even explicitly blocking these sequences results in high-quality output. All hyperfitted models produce extremely low-entropy predictions, often allocating nearly all probability to a single token. Interestingly, investigations into the hyperfitting data show that the top candidates emerging from these predictions are not deterministically set by the content of the samples.

4081Towards Robustness of Person Search against Corruptions

[openreview] [pdf]

Abstract Person search aims to simultaneously detect and re-identify a query person within an entire scene, involving detection and re-identification as a multi-task problem. While existing studies have made significant progress in achieving superior performance on clean datasets, the challenge of robustness under various corruptions remains largely unexplored. To address this gap, we propose two benchmarks, CUHK-SYSU-C and PRW-C, designed to assess the robustness of person search models across diverse corruption scenarios. Previous researches on corruption have been conducted independently for single tasks such as re-identification and detection. However, recent advancements in person search adopt an end-to-end multi-task learning framework that processes the entire scene as input, unlike the combination of single tasks. This raises the question of whether independent achievements can ensure corruption robustness for person search. Our findings reveal that merely combining independent, robust detection and re-identification models is not sufficient for achieving robust person search. We further investigate the vulnerability of the detection and representation stages to corruption and explore its impact on both foreground and background areas. Based on these insights, we propose a foreground-aware augmentation and regularization method to enhance the robustness of person search models. Supported by our comprehensive robustness analysis and evaluation framework our benchmarks provide, our proposed technique substantially improves the robustness of existing person search models. Code will be made publicly available.

4082Towards General-Purpose Model-Free Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) promises a framework for near-universal problem-solving. In practice however, RL algorithms are often tailored to specific benchmarks, relying on carefully tuned hyperparameters and algorithmic choices. Recently, powerful model-based RL methods have shown impressive generalist results across benchmarks but come at the cost of increased complexity and slow run times, limiting their broader applicability. In this paper, we attempt to find a unifying model-free deep RL algorithm that can address a diverse class of domains and problem settings. To achieve this, we leverage model-based representations that approximately linearize the value function, taking advantage of the denser task objectives used by model-based RL while avoiding the costs associated with planning or simulated trajectories. We evaluate the resulting algorithm on a variety of common RL benchmarks with a single set of hyperparameters and show a competitive performance against domain-specific and generalist baselines, providing a concrete step towards building general-purpose model-free deep RL algorithms.

4083NEMESIS\Jailbreaking LLMs with Chain of Thoughts Approach

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly being deployed across various applications, making the need for robust security measures crucial. This paper explores multiple methods for jailbreaking these models, bypassing their secu- rity protocols. By examining five distinct approaches—Multishot Jailbreaking, the Mirror Dimension Approach, the Cipher Method, the ”You are Answering the Wrong Question” Method, and the Textbook Jailbreaking Method—we highlight the vulnerabilities in current LLMs and emphasize the importance of fine-tuning and secure guardrails. Our study primarily employs chain-of-thought reasoning, which can be further enhanced through reinforcement learning techniques. Fur- thermore, we propose that our findings can serve as a benchmark against emerging security measures such as LlamaGuard, providing a comprehensive evaluation of LLM defenses. Our findings demonstrate the effectiveness of these methods and suggest directions for future work in enhancing LLM security. This research un- derscores the ongoing challenges in balancing LLM capabilities with robust safe- guards against potential misuse or manipulation.

4084Gaussian Mixture Models Based Augmentation Enhances GNN Generalization

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have shown great promise in many learning tasks, notably including node and graph classification, but they face difficulties when tested on new or unseen data. These challenges are exacerbated when training data is limited in size or diversity. To address this issue, we introduce a theoretical framework using Rademacher complexity to compute a regret bound on the generalization error and then characterize the effect of data augmentation. This framework informs the design of GMM-GDA, a new, efficient graph data augmentation (GDA) algorithm leveraging the capability of Gaussian Mixture Models (GMMs) to approximate any distribution. Our approach not only outperforms existing augmentation techniques but also offers improved time complexity, making it highly suitable for real-world applications.

4085Spatial Reasoning with MLLMs: A New Path to Graph-Structured Optimization

[openreview] [pdf]

Abstract Graph-structured problems pose significant challenges due to their complex structures and large scales, often making traditional computational approaches suboptimal or costly. However, when these problems are visually represented, humans can often solve them more intuitively, leveraging our inherent spatial reasoning capabilities. In this work, we introduce an original and novel approach by feeding graphs as images into multimodal large language models (MLLMs), aiming for a loss-free representation that preserves the graph’s structural integrity and enables machines to mimic this human-like thinking. Our pioneering exploration of MLLMs addresses various graph-structured challenges, from combinatorial tasks like influence maximization to sequential decision-making processes such as network dismantling, along with tackling six basic graph-related problems. Our experiments reveal that MLLMs possess remarkable spatial intelligence and a unique aptitude for these problems, marking a significant step forward in enabling machines to understand and analyze graph-structured data with human-like depth and intuition. These findings also suggest that combining MLLMs with straightforward optimization techniques could offer a new, effective paradigm for managing large-scale graph problems without complex derivations, computationally demanding training and fine-tuning.

4086A Normalizing Flows based Difference-of-Entropies Estimator for Mutual Information

[openreview] [pdf]

Abstract No absctract

4087A Normalizing Flows based Difference-of-Entropies Estimator for Mutual Information

[openreview] [pdf]

Abstract Estimating Mutual Information (MI), a key measure of dependence of random quantities without specific modelling assumptions, is a challenging problem in high dimensions. We propose a novel mutual information estimator based on parametrizing conditional densities using normalizing flows, a deep generative model that has gained popularity in recent years. This estimator leverages a block autoregressive structure to achieve improved bias-variance trade-offs on standard benchmark tasks.

4088Compress Guidance in Conditional Diffusion Sampling

[openreview] [pdf]

Abstract We found that enforcing guidance throughout the sampling process is often counterproductive due to the model-fitting issue, where samples are `tuned’ to match the classifier’s parameters rather than generalizing the expected condition. This work identifies and quantifies the problem, demonstrating that reducing or excluding guidance at numerous timesteps can mitigate this issue. By distributing a small amount of guidance over a large number of sampling timesteps, we observe a significant improvement in image quality and diversity while also reducing the required guidance timesteps by nearly 40%. This approach addresses a major challenge in applying guidance effectively to generative tasks. Consequently, our proposed method, termed Compress Guidance, allows for the exclusion of a substantial number of guidance timesteps while still surpassing baseline models in image quality. We validate our approach through benchmarks on label-conditional and text-to-image generative tasks across various datasets and models.

4089Look Around and Find Out: OOD Detection with Relative Angles

[openreview] [pdf]

Abstract Deep learning systems deployed in real-world applications often encounter data that is different from their in-distribution (ID). A reliable system should ideally abstain from making decisions in this out-of-distribution (OOD) setting. Existing state-of-the-art methods primarily focus on feature distances, such as k-th nearest neighbors and distances to decision boundaries, either overlooking or ineffectively using in-distribution statistics. In this work, we propose a novel angle-based metric for OOD detection that is computed relative to the in-distribution structure. We demonstrate that the angles between feature representations and decision boundaries, viewed from the mean of in-distribution features, serve as an effective discriminative factor between ID and OOD data. Our method achieves state-of-the-art performance on CIFAR-10 and ImageNet benchmarks, reducing FPR95 by 0.88% and 7.74% respectively. Our scoring function is compatible with existing feature space regularization techniques, enhancing performance. Additionally, its scale-invariance property enables creating an ensemble of models for OOD detection via simple score summation.

4090CausalESC: Breaking Causal Cycles for Emotional Support Conversations with Temporal Causal HMM

[openreview] [pdf]

Abstract Emotional Support Conversation (ESC) is a rapidly advancing task focused on alleviating a seeker’s emotional distress. The intricate interplay between cognition, emotion, and behavior presents substantial challenges for existing approaches, which often struggle to capture the dynamic evolution of the seeker’s internal state during conversations. To address this, we propose \textbf{CausalESC}, a model designed to dynamically represent the seeker’s internal states, by assuming that the generative process governing the mutual influence among these factors follows a first-order Markov property, with \iid random variables. The model comprises a prior network, that disentangles the seeker’s emotions, cognition, and behavior, and a posterior network, which decouples the support strategy factors. The prior network also models the psychological causality of the seeker within each conversation round. To account for the varying effects of support strategies on the seeker’s intrinsic states, we incorporate a support intervention module to capture these impacts. Additionally, a holistic damping transfer mechanism is designed to regulate the complex interactions among cognition, emotion, behavior, and strategy, ensuring that changes remain within a reasonable range. Our model effectively breaks causal cycles and achieves causal representation learning. Both automatic and human evaluations demonstrate the effectiveness of our model, emphasizing the advantages of modeling the evolution of the seeker’s internal state under support strategies.

4091DeltaGNN: Graph Neural Network with Information Flow Control

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are popular machine learning models designed to process graph-structured data through recursive neighborhood aggregations in the message passing process. When applied to semi-supervised node classification, the message-passing enables GNNs to understand short-range spatial interactions, but also causes them to suffer from over-smoothing and over-squashing. These challenges hinder model expressiveness and prevent the use of deeper models to capture long-range node interactions (LRIs) within the graph. Popular solutions for LRIs detection are either too expensive to process large graphs due to high time complexity or fail to generalize across diverse graph structures. To address these limitations, we propose a mechanism called information flow control, which leverages a novel connectivity measure, called information flow score, to address over-smoothing and over-squashing with linear computational overhead, supported by theoretical evidence. Finally, to prove the efficacy of our methodology we design DeltaGNN, the first scalable and generalizable approach for long-range and short-range interaction detection. We benchmark our model across 10 real-world datasets, including graphs with varying sizes, topologies, densities, and homophilic ratios, showing superior performance with limited computational complexity.

4092Learning Continually by Spectral Regularization

[openreview] [pdf]

Abstract Loss of plasticity is a phenomenon where neural networks can become more difficult to train over the course of learning. Continual learning algorithms seek to mitigate this effect by sustaining good performance while maintaining network trainability. We develop a new technique for improving continual learning inspired by the observation that the singular values of the neural network parameters at initialization are an important factor for trainability during early phases of learning. From this perspective, we derive a new spectral regularizer for continual learning that better sustains these beneficial initialization properties throughout training. In particular, the regularizer keeps the maximum singular value of each layer close to one. Spectral regularization directly ensures that gradient diversity is maintained throughout training, which promotes continual trainability, while minimally interfering with performance in a single task. We present an experimental analysis that shows how the proposed spectral regularizer can sustain trainability and performance across a range of model architectures in continual supervised and reinforcement learning settings. Spectral regularization is less sensitive to hyperparameters while demonstrating better training in individual tasks, sustaining trainability as new tasks arrive, and achieving better generalization performance..

4093Weighted-Reward Preference Optimization for Implicit Model Fusion

[openreview] [pdf]

Abstract While fusing heterogeneous open-source LLMs with varying architectures and sizes can potentially integrate the strengths of different models, existing fusion methods face significant challenges, such as vocabulary alignment and merging distribution matrices. These procedures are not only complex but also prone to introducing noise and errors. In this paper, we propose an implicit fusion method, Weighted-Reward Preference Optimization (WRPO), which leverages preference optimization between the source LLMs and the target LLM to transfer their capabilities effectively. WRPO eliminates the need for vocabulary alignment and matrix fusion and can be efficiently scaled to accommodate various LLMs. To address distributional deviations between the source and target LLMs, WRPO introduces a progressive adaptation strategy that gradually shifts reliance on preferred examples from the target LLM to the source LLMs. Extensive experiments on the MT-Bench, AlpacaEval-2, and Arena-Hard benchmarks demonstrate that WRPO consistently outperforms existing knowledge fusion methods and various fine-tuning baselines. When applied to LLaMA3-8B-Instruct as the target model, WRPO achieves a length-controlled win rate of 55.9% against GPT-4-Preview-1106 on AlpacaEval-2, establishing it as the top-performing 8B model on the leaderboard.

4094I Want to Break Free! Persuasion and Anti-Social Behavior of LLMs in Multi-Agent Settings with Social Hierarchy

[openreview] [pdf]

Abstract As Large Language Model (LLM)-based agents become increasingly autonomous and will more freely interact with each other, studying interactions between them becomes crucial to anticipate emergent phenomena and potential risks. Drawing inspiration from the widely popular Stanford Prison Experiment, we contribute to this line of research by studying interaction patterns of LLM agents in a context characterized by strict social hierarchy. We do so by specifically studying two types of phenomena: persuasion and anti-social behavior in simulated scenarios involving a guard and a prisoner agent who seeks to achieve a specific goal (i.e., obtaining additional yard time or escape from prison). Leveraging 200 experimental scenarios for a total of 2,000 machine-machine conversations across five different popular LLMs, we provide a set of noteworthy findings. We first document how some models consistently fail in carrying out a conversation in our multi-agent setup where power dynamics are at play. Then, for the models that were able to engage in successful interactions, we empirically show how the goal that an agent is set to achieve impacts primarily its persuasiveness, while having a negligible effect with respect to the agent’s anti-social behavior. Third, we highlight how agents’ personas, and particularly the guard’s personality, drive both the likelihood of successful persuasion from the prisoner and the emergence of anti-social behaviors. Fourth, we show that even without explicitly prompting for specific personalities, anti-social behavior emerges by simply assigning agents’ roles. These results bear implications for the development of interactive LLM agents as well as the debate on their societal impact.

4095Scalable Lifelong Multimodal Instruction Tuning via Dynamic Data Selection

[openreview] [pdf]

Abstract Visual instruction datasets from various distributors are released at different times and often contain a significant number of redundant text-image pairs, depending on their task compositions (i.e., skills) or reference sources. This redundancy greatly limits the efficient deployment of lifelong-adaptable Multimodal Large Language Models (MLLMs), hindering their ability to refine existing skills and acquire new competencies over time. To address this, we reframe the problem of Lifelong Instruction Tuning (LiIT) via data selection, where the model automatically selects beneficial samples to learn from earlier and new datasets based on the current state of acquired knowledge in the model. Based on empirical analyses showing that selecting the best data subset using a static importance measure is often ineffective for multi-task datasets with evolving distributions, we propose LAMP, a new multi-way and adaptive data selection approach that dynamically balances sample efficiency and effectiveness during LiIT. We first construct pseudo-skill clusters by grouping gradient-based sample vectors. Next, we select the best-performing data selector for each skill cluster from a pool of selector experts, including our newly proposed scoring function, Image Grounding score. This data selector samples a subset of the most important samples from each skill cluster for training. To prevent the continuous increase in the size of the dataset pool during LiIT, which would result in excessive computation, we further introduce a cluster-wise permanent data pruning strategy to remove the most semantically redundant samples from each cluster, keeping computational requirements manageable. We validate the effectiveness and efficiency of LAMP over a sequence of various multimodal instruction tuning datasets with various tasks, including (Knowledge) VQA, multilingual, grounding, reasoning, language-only, and multi-image comprehension tasks. Training with samples selected by LAMP alleviates catastrophic forgetting, especially for rare tasks, and promotes forward transfer across the continuum using only a fraction of the original datasets.

4096A Theoretical Analysis of Self-Supervised Learning for Vision Transformers

[openreview] [pdf]

Abstract Self-supervised learning has become a cornerstone in computer vision, primarily divided into reconstruction-based methods like masked autoencoders (MAE) and discriminative methods such as contrastive learning (CL). Recent empirical observations reveal that MAE and CL capture different types of representations: CL tends to focus on global patterns, while MAE adeptly capturesboth global and subtle localinformation simultaneously. Despite a flurry of recent empirical investigations to shed light on this difference, theoretical understanding remains limited, especially on the dominant architecturevision transformers(ViTs). In this paper, to provide rigorous insights, we model the visual data distribution by considering two types of spatial features: dominant global features and comparatively minuscule local features, and study the impact of imbalance among these features. We analyze the training dynamics of one-layer softmax-based ViTs on both MAE and CL objectives using gradient descent. Our analysis shows that as the degree of feature imbalance varies, ViTs trained with the MAE objective effectively learn both global and local features to achieve near-optimal reconstruction, while the CL-trained ViTs favor predominantly global features, even under mild imbalance. These results provide a theoretical explanation for distinct behaviors of MAE and CL observed in empirical studies.

4097Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning

[openreview] [pdf]

Abstract Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generatediverseandeffectiveattack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

4098Uncertainty Aware Column Generation for Crew Pairing Optimization Using Survival Analysis

[openreview] [pdf]

Abstract The crew pairing problem (CPP) is central to optimal planning and scheduling of operations in the airline industry, where the objective is to assign crews to cover a flight schedule at minimal cost while adhering to various logistical, personnel, and policy constraints. Despite the implementation of optimized schedules, operations are frequently disrupted by unforeseen events. This vulnerability stems from the deterministic nature of the CPP’s base formulation, which fails to account for the uncertainties inherent in real-world operations. Existing solutions either aim to safeguard against a specified level of uncertainty or focus on worst-case scenarios. To this end, we propose a reliability-centric CPP formulation amenable to solution by column-generation (CG) SurvCG, that leverages survival analysis for dynamic quantification of uncertainty using the operation patterns in historical data. Applied to CPP, SurvCG forecasts and incorporates flight connection reliability into the optimization process. Through rigorous experiments on a large-scale first-of-its-kind real-world instance under regular and irregular operating conditions, we demonstrate that SurvCG achieves unprecedented improvements (up to 61%) over baseline in terms of total propagated delays, establishing SurvCG as the first data-driven solution for uncertainty-aware reliable scheduling.

4099LeGrad: An Explainability Method for Vision Transformers via Feature Formation Sensitivity

[openreview] [pdf]

Abstract Vision Transformers (ViTs) have become a standard architecture in computer vision. However, because of their modeling of long-range dependencies through self-attention mechanisms, the explainability of these models remains a challenge. To address this, we propose LeGrad, an explainability method specifically designed for ViTs. LeGrad computes the gradient with respect to the attention maps of single ViT layers, considering the gradient itself as the explainability signal. We aggregate the signal over all layers, combining the activations of the last as well as intermediate tokens to produce the merged explainability map. This makes LeGrad a conceptually simple and an easy-to-implement method to enhance the transparency of ViTs. We evaluate LeGrad in various setups, including segmentation, perturbation, and open-vocabulary settings, showcasing its improved spatial fidelity as well as its versatility compared to other SotA explainability methods.

4100ROMA: Regularization for Out-of-distribution Detection with Masked Autoencoders

[openreview] [pdf]

Abstract Existing out-of-distribution (OOD) detection methods without outlier exposure learn effective in-distribution (ID) representations distinguishable for OOD samples, which have shown promising performance on many OOD detection tasks. However, we find a performance degradation in some challenging OOD detection, where pre-trained networks tend to perform worse during the fine-tuning process, exhibiting the over-fitting of ID representations. Motivated by this observation, we propose a critical task of hidden OOD detection, wherein ID representations provide limited or even counterproductive assistance in identifying hidden OOD data. To address this issue, we introduce a novel Regularization framework for OOD detection with Masked Autoencoders (ROMA), which utilizes the masked image modeling task to regularize the network. With distribution-agnostic auxiliary data exposure, ROMA notably surpasses previous OOD detection methods in hidden OOD detection. Moreover, the robustness of ROMA is further evidenced by its state-of-the-art performance on benchmarks for other challenging OOD detection tasks.

4101CGD: Modifying the Loss Landscape by Gradient Regularization

[openreview] [pdf]

Abstract Line-search methods are commonly used to solve optimization problems. The simplest line search method is the steepest descent where we always move in the direction of the negative gradient. Newton’s method on the other hand is a second-order method that uses the curvature information in the Hessian to pick the descent direction. In this work, we propose a new line-search method called Constrained Gradient Descent (CGD) that implicitly changes the landscape of the objective function for efficient optimization. CGD is formulated as a solution to the constrained version of the original problem where the constraint is on a function of the gradient. We optimize the corresponding Lagrangian function thereby favourably changing the landscape of the objective function. This results in a line search procedure where the Lagrangian penalty acts as a control over the descent direction and can therefore be used to iterate over points that have smaller gradient values, compared to iterates of vanilla steepest descent. We reinterpret and draw parallels with the Explicit Gradient Regularization (EGR) method, discussing its drawbacks and potential enhancements. Numerical experiments are conducted on synthetic test functions to illustrate the performance of CGD and its variants.

4102Last Iterate Convergence in Monotone Mean Field Games

[openreview] [pdf]

Abstract Mean Field Game (MFG) is a framework utilized to model and approximate the behavior of a large number of agents, and the computation of equilibria in MFG has been a subject of interest. Despite the proposal of methods to approximate the equilibria, algorithms that can achieve equilibrium with the most recent policy of the algorithm, namely the last-iterate policy, have been limited. We propose the use of a simple, proximal-point-type algorithm to compute strategies for MFGs. Subsequently, we provide the first last-iterate convergence guarantee under the Lasry--Lions-type monotonicity condition. We further employ the Mirror Descent algorithm for the regularized MFG to efficiently approximate the update rules of the proximal point method for MFGs. We demonstrate that the last-iterate strategy of Mirror Descent converges exponentially fast: we provide the guarantee of computing the ε\varepsilon approximation in O(log(1/ε))O(\log(1/\varepsilon)) iterations. This research offers a tractable approach for large-scale and large-population games.

4103OML-AD: Online Machine Learning for Anomaly Detection in Time Series Data

[openreview] [pdf]

Abstract Time series are ubiquitous and occur naturally in a variety of applications -- from data recorded by sensors in manufacturing processes, over financial data streams to climate data. Different tasks arise, such as regression, classification or segmentation of the time series. However, to reliably solve these challenges, it is important to filter out abnormal observations that deviate from the usual behavior of the time series. While many anomaly detection methods exist for independent data and stationary time series, these methods are not applicable to non-stationary time series. To allow for non-stationarity in the data, while simultaneously detecting anomalies, we propose OML-AD, a novel approach for anomaly detection (AD) based on online machine learning (OML). We provide an implementation of OML-AD within the Python library River and show that it outperforms state-of-the-art baseline methods in terms of accuracy and computational efficiency.

4104Skill Discovery using Language Models

[openreview] [pdf]

Abstract Large Language models (LLMs) possess remarkable ability to understand natural language descriptions of complex robotics environments. Earlier studies have shown that LLM agents can use a predefined set of skills for robot planning in long-horizon tasks. However, the requirement for prior knowledge of the skill set required for a given task constrains its applicability and flexibility. We present a novel approach L2S (short of Language2Skills) to leverage the generalization capabilities of LLMs to decompose the natural language task description of a complex task to definitions of reusable skills. Each skill is defined by an LLM9 generated dense reward function and a termination condition, which in turn lead to effective skill policy training and chaining for task execution. To address the uncertainty surrounding the parameters used by the LLM agent in the generated reward and termination functions, L2S trains parameter-conditioned skill policies that performs well across a broad spectrum of parameter values. As the impact of these parameters for one skill on the overall task becomes apparent only when its following skills are trained, L2S selects the most suitable parameter value during the training of the subsequent skills to effectively mitigate the risk associated with incorrect parameter choices. During training, L2S autonomously accumulates a skill library from continuously presented tasks and their descriptions, leveraging guidance from the LLM agent to effectively apply this skill library in tackling novel tasks. Our experimental results show that L2S is capable of generating reusable skills to solve a wide range of robot manipulation tasks.

4105Predicting User Behaviors with Scene via Dual Sequence Networks

[openreview] [pdf]

Abstract Modeling sequential user behaviors for future action prediction is crucial in improving user’s information retrieval experience. Recent studies highlight the importance of incorporating contextual information to enhance prediction performance. One crucial and typical contextual information is the scene feature that is often crafted by app or website designers, such as “text2product search” and “recommendation” within an e-commence app. Different scenes exhibit different usage habits and distinct product themes, leading to significant distribution gap in user engagement across them. Popular sequential behavior models either ignore the scene feature or merely use it as attribute embeddings, which could lead to substantial information loss or cannot capture the inter-dependencies between scene and item in modeling dynamic user interests. In this work, we propose a novel Dual Sequence Prediction network (DSPnet) to effectively capture the inter-dependencies between scene and item sequences for future behavior prediction. DSPnet consists of two parallel networks dedicated to predicting scene and item sequences, and a sequence feature enhancement module to capture the inter-dependencies. Further, considering the randomness and noise in learning sequence dynamics, we introduce Conditional Contrastive Regularization (CCR) loss to capture the invariance of similar historical sequences. Theoretical analysis suggests that DSPnet can learn the joint relationships between scene and item sequences, and also show better robustness on real-world user behaviors. Extensive experiments are conducted on one public benchmark and two collected industrial datasets. The codes and collected datasets will be made public soon.

4106Training One-Dimensional Graph Neural Networks is NP-Hard

[openreview] [pdf]

Abstract We initiate the study of the computational complexity of training graph neural networks (GNNs). While the intractability of training multidimensonal GNNs immediately follows from known lower bounds for training classical neural networks (and holds even for trivial GNNs), one-dimensional GNNs form a crucial case of interest: the computational complexity of training such networks depends on both the graphical structure of the network and the properties of the involved activation and aggregation functions. As our main result, we establish the NP-hardness of training ReLU-activated one-dimensional GNNs via a highly non-trivial reduction. We complement this result with algorithmic upper bounds for the training problem in the ReLU-activated and linearly-activated settings.

4107Recovering Knowledge by Hardening Language Models

[openreview] [pdf]

Abstract Recent neural language models show impressive capabilities on a wide range of tasks. However, it is not fully understood how the knowledge of the language is encoded in these models. In this work, we focus on the simplest case of languages, regular languages, and study language models trained on strings matching certain regular expressions. We propose a method, dubbed LaMFA, to recover the full knowledge of the regular language model by hardening it into a finite automaton. Such hardening is conducted by empirically partition the latent space of language models into finite states, and then recover a deterministic finite automaton by the estimated transition probabilities between these states. Through experiments on regular languages of varying complexity, we demonstrate that LaMFA can effectively extract DFA that consistently replicate the performance of the original language model. Notably, the extracted DFAs exhibit enhanced generalization capabilities, achieving 100% accuracy even in out-of-distribution scenarios

4108Trajectory attention for fine-grained video motion control

[openreview] [pdf]

Abstract Recent advancements in video generation have been greatly driven by video diffusion models, with camera motion control emerging as a crucial challenge in creating view-customized visual content. This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. Unlike existing methods that often yield imprecise outputs or neglect temporal correlations, our approach possesses a stronger inductive bias that seamlessly injects trajectory information into the video generation process. Importantly, our approach models trajectory attention as an auxiliary branch alongside traditional temporal attention. This design enables the original temporal attention and the trajectory attention to work in synergy, ensuring both precise motion control and new content generation capability, which is critical when the trajectory is only partially available. Experiments on camera motion control for images and videos demonstrate significant improvements in precision and long-range consistency while maintaining high-quality generation. Furthermore, we show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing, where it excels in maintaining content consistency over large spatial and temporal ranges.

4109TimeAutoDiff: Generation of Heterogeneous Time Series Data via Latent Diffusion Model

[openreview] [pdf]

Abstract In this paper, we leverage the power of latent diffusion models to generate synthetic time series tabular data. Along with the temporal and feature correlations, the heterogeneous nature of the feature in the table has been one of the main obstacles in time series tabular data modeling. We tackle this problem by combining the ideas of the variational auto-encoder (VAE) and the denoising diffusion probabilistic model (DDPM). Our model named as \texttt{TimeAutoDiff} has several key advantages including (1) \textit{\textbf{Generality}}: the ability to handle the broad spectrum of time series tabular data with heterogeneous, continuous only, or categorical only features; (2) \textit{\textbf{Fast sampling speed}}: entire time series data generation as opposed to the sequential data sampling schemes implemented in the existing diffusion-based models, eventually leading to significant improvements in sampling speed, (3) \textit{\textbf{Time varying metadata conditional generation}}: the implementation of time series tabular data generation of heterogeneous outputs conditioned on heterogenous, time varying features, enabling scenario exploration across multiple scientific and engineering domains. (4) \textit{\textbf{Good fidelity and utility guarantees}}: numerical experiments on eight publicly available datasets demonstrating significant improvements over state-of-the-art models in generating time series tabular data, across four metrics measuring fidelity and utility; Codes for model implementations are available at the supplementary materials.

4110Divergence of Neural Tangent Kernel in Classification Problems

[openreview] [pdf]

Abstract This paper primarily investigates the convergence of the Neural Tangent Kernel (NTK) in classification problems. This study firstly show the strictly positive definiteness of NTK of multi-layer fully connected neural networks and residual neural networks. Then, through a contradiction argument, it indicates that, during training with the cross-entropy loss function, the neural network parameters diverge due to the strictly positive definiteness of the NTK. Consequently, the empirical NTK does not consistently converge but instead diverges as time approaches infinity. This finding implies that NTK theory is not applicable in this context, highlighting significant theoretical implications for the study of neural networks in classification problems. These results can also be easily generalized to other network structures, provided that the NTK is strictly positive definite.

4111EnvBridge: Bridging Diverse Environments with Cross-Environment Knowledge Transfer for Embodied AI

[openreview] [pdf]

Abstract In recent years, Large Language Models (LLMs) have demonstrated high reasoning capabilities, drawing attention for their applications as agents in various decision-making processes. One notably promising application of LLM agents is robotic manipulation. Recent research has shown that LLMs can generate text planning or control code for robots, providing substantial flexibility and interaction capabilities. However, these methods still face challenges in terms of flexibility and applicability across different environments, limiting their ability to adapt autonomously. Current approaches typically fall into two categories: those relying on environment-specific policy training, which restricts their transferability, and those generating code actions based on fixed prompts, which leads to diminished performance when confronted with new environments. These limitations significantly constrain the generalizability of agents in robotic manipulation. To address these limitations, we propose a novel method called EnvBridge. This approach involves the retention and transfer of successful robot control codes from source environments to target environments. EnvBridge enhances the agent’s adaptability and performance across diverse settings by leveraging insights from multiple environments. Notably, our approach alleviates environmental constraints, offering a more flexible and generalizable solution for robotic manipulation tasks. We validated the effectiveness of our method using robotic manipulation benchmarks: RLBench, MetaWorld, and CALVIN. Our experiments demonstrate that LLM agents can successfully leverage diverse knowledge sources to solve complex tasks. Consequently, our approach significantly enhances the adaptability and robustness of robotic manipulation agents in planning across diverse environments.

4112Can Large Language Models Effectively Modify Graphs?

[openreview] [pdf]

Abstract Graphs are essential tools for modeling complex relationships. While prior research with earlier generations of large language models (LLMs) showed them to struggle with basic graph primitives, we find that the situation has changed with modern state-of-the-art (SOTA) LLMs, which excel at these tasks. Given these advances, we propose a more challenging evaluation problem: graph modification, a foundational, interpretable, and non-trivial problem in which an LLM must determine the outcome of adding or deleting a given sequence of nodes or edges, and potentially then compute on the resulting modified graph. We introduce GraphModQA, a novel benchmark dataset comprising graph modification question-answer pairs designed to rigorously test LLMs’ abilities in graph manipulation and dynamic reasoning. Our results show that while SOTA LLMs perform well on static graph property tasks, their accuracy degrades on graph modification tasks; their performance is particularly low as the number of modifications increases, and when the adjacency matrix is used to represent the graph --- an essential encoding not explored in previous work. We provide new techniques for improving performance on graph modification tasks, and we introduce Modify and Print (MAP) prompting, which asks models to output the intermediate adjacency matrices at each step, and which markedly improves the models’ performance. Our findings highlight a critical gap in current LLM capabilities regarding dynamic graph reasoning tasks and underscore the potential of techniques like MAP prompting to mitigate these challenges.

4113Memory-Efficient Self-Supervised Contrastive Learning with a Supervised Loss

[openreview] [pdf]

Abstract Contrastive Learning (CL) is among the most popular methods for self-supervised representation learning. However, CL requires a large memory and sample size and careful hyperparameter tuning. These factors make it difficult to learn high-quality representations with limited amount of memory. In this work, we theoretically analyze a recently proposed \textit{supervised} approach, DIET, for self-supervised representation learning. DIET labels every example by its datum index and trains on the labeled data with a supervised loss. DIET does not require a large sample size or hyperparameter tuning. However, it falls short when using smaller encoders and is memory intensive due to its massive classifier head. Given its remarkable simplicity, it is not obvious whether DIET can match the performance of CL methods, which explicitly model pairwise interactions between augmented examples. We prove that, perhaps surprisingly, for a linear encoder DIET with MSE loss is equivalent to spectral contrastive loss. Then, we prove that DIET is prone to learning less-noisy features and may not learn all features from the training data. We show feature normalization can provably address this shortcoming and use of a projection head can further boost the performance. Finally, we address the scalability issue of DIET by reducing its memory footprint. The modified approach, namely S-DIET, substantially improves on the linear probe accuracy of DIET across a variety of datasets and models and outperforms other SSL methods, all with limited memory and without extensive hyperparameter tuning. This makes S-DIET a promising alternative for simple, effective, and memory-efficient representation learning.

4114Versatile Motion-Language Models for Multi-turn Interactive Agents

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) have greatly enhanced their ability to generate natural and contextually relevant text, making AI interactions more human-like. However, generating and understanding interactive human-like motion, where two individuals engage in coordinated movements, remains a challenge due to the complexity of modeling these coordinated interactions. Furthermore, a versatile model is required to handle diverse interactive scenarios, such as chat systems that follow user instructions or adapt to their assigned role while adjusting interaction dynamics. To tackle this problem, we introduce VIM, short for the Versatile Interactive Motion language model, which integrates both language and motion modalities to effectively understand, generate, and control interactive motions in multi-turn conversational contexts. To address the scarcity of multi-turn interactive motion data, we introduce a synthetic dataset called INTER-MT2; where we utilize pre-trained models to create diverse instructional datasets with interactive motion. Our approach first trains a motion tokenizer that encodes interactive motions into residual discrete tokens. In the pre-training stage, the model learns to align motion and text representations with these discrete tokens. During the instruction fine-tuning stage, VIM adapts to multi-turn conversations using INTER-MT2. We evaluate the versatility of our method across motion-related tasks—motion-to-text, text-to-motion, reaction generation, motion editing, and reasoning about motion sequences. The results highlight VIM’s versatility and effectiveness in handling complex interactive motion synthesis.

4115Data-Driven Uncertainty-Aware Forecasting of Sea Ice Conditions in the Gulf of Ob Based on Satellite Radar Imagery

[openreview] [pdf]

Abstract The increase in Arctic marine activity due to rapid warming and significant sea ice loss necessitates highly reliable, short-term sea ice forecasts to ensure maritime safety and operational efficiency. In this work, we present a novel data-driven approach for sea ice condition forecasting in the Gulf of Ob, leveraging sequences of radar images from Sentinel-1, weather observations, and GLORYS forecasts. Our approach integrates advanced video prediction models, originally developed for vision tasks, with domain-specific data preprocessing and augmentation techniques tailored to the unique challenges of Arctic sea ice dynamics. Central to our methodology is the use of uncertainty quantification to assess the reliability of predictions, ensuring robust decision-making in safety-critical applications. Furthermore, we propose a confidence-based model mixture mechanism that enhances forecast accuracy and model robustness, crucial for safe operations in volatile Arctic environments. Our results demonstrate substantial improvements over baseline approaches, underscoring the importance of uncertainty quantification and specialized data handling for effective and reliable sea ice forecasting.

4116Unified Parameter-Efficient Unlearning for LLMs

[openreview] [pdf]

Abstract The advent of Large Language Models (LLMs) has revolutionized natural language processing, enabling advanced understanding and reasoning capabilities across a variety of tasks. Fine-tuning these models for specific domains, particularly through Parameter-Efficient Fine-Tuning (PEFT) strategies like LoRA, has become a prevalent practice due to its efficiency. However, this raises significant privacy and security concerns, as models may inadvertently retain and disseminate sensitive or undesirable information. To address these issues, we introduce a novel instance-wise unlearning framework, LLMEraser, which systematically categorizes unlearning tasks and applies precise parameter adjustments using influence functions. Unlike traditional unlearning techniques that are often limited in scope and require extensive retraining, LLMEraser is designed to handle a broad spectrum of unlearning tasks without compromising model performance. Extensive experiments on benchmark datasets demonstrate that LLMEraser excels in efficiently managing various unlearning scenarios while maintaining the overall integrity and efficacy of the models.

4117Fisher Contrastive Learning: A Robust Solution to the Feature Suppression Effect

[openreview] [pdf]

Abstract Self-supervised contrastive learning (SSCL) is a rapidly advancing approach for learning data representations. However, a significant challenge in this paradigm is the feature suppression effect, where useful features for downstream tasks are suppressed due to dominant or easy-to-learn features overshadowing others crucial for downstream performance, ultimately degrading the performance of SSCL models. While prior research has acknowledged the feature suppression effect, solutions with theoretical guarantees to mitigate this issue are still lacking. In this work, we address the feature suppression problem by proposing a novel method, Fisher Contrastive Learning, which unbiasedly and exhaustively estimates the central sufficient dimension reduction function class in SSCL settings. In addition, FCL empirically maintains the embedding dimensionality by maximizing the discriminative power of each linear classifier learned through Fisher Contrastive Learning. We demonstrate that using our proposed method, the class-relevant features are not suppressed by strong or easy-to-learn features on datasets known for strong feature suppression effects. In addition, the embedding dimensionality is not preserved in practice. Furthermore, we show that Fisher Contrastive Learning consistently outperforms existing benchmark methods on standard image benchmarks, illustrating its practical advantages.

4118TweedieMix: Improving Multi-Concept Fusion for Diffusion-based Image/Video Generation

[openreview] [pdf]

Abstract Despite significant advancements in customizing text-to-image and video generation models, generating images and videos that effectively integrate multiple personalized concepts remains a challenging task. To address this, we present TweedieMix, a novel method for composing customized diffusion models during the inference phase. By analyzing the properties of reverse diffusion sampling, our approach divides the sampling process into two stages. During the initial steps, we apply a multiple object-aware sampling technique to ensure the inclusion of the desired target objects. In the later steps, we blend the appearances of the custom concepts in the de-noised image space using Tweedie’s formula. Our results demonstrate that TweedieMix can generate multiple personalized concepts with higher fidelity than existing methods. Moreover, our framework can be effortlessly extended to image-to-video diffusion models, enabling the generation of videos that feature multiple personalized concepts.

4119AIME: AI System Optimization via Multiple LLM Evaluators

[openreview] [pdf]

Abstract Text-based AI system optimization typically involves a feedback loop scheme where a \textit{single} LLM generates an evaluation in natural language of the current output to improve the next iteration’s output. However, in this work, we empirically demonstrate that for a practical and complex task (code generation) with multiple criteria to evaluate, utilizing only one LLM evaluator tends to let errors in generated code go undetected, thus leading to incorrect evaluations and ultimately suboptimal test case performance. Motivated by this failure case, we assume there exists an optimal evaluation policy that samples an evaluation between response and ground truth. We then theoretically prove that a linear combination of multiple evaluators can approximate this optimal policy. From this insight, we propose AI system optimization via Multiple LLM Evaluators (AIME). AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation. We provide an extensive empirical study showing AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single LLM evaluation protocol on LeetCodeHard and HumanEval datasets. We also show that the selection of the number of evaluators and which criteria to utilize is non-trivial as it can impact pact success rate by up to 12%.

4120Reliable and Efficient Amortized Model-based Evaluation

[openreview] [pdf]

Abstract Current generative model evaluation procedures are costly and sensitive to test set selection, making continuous monitoring impractical. In this paper, we employ a model-based evaluation framework using Item Response Theory (IRT), which decouples model performance from the test subset selection, ensuring reliable and efficient evaluation. We propose two innovations: amortized calibration to reduce the cost of estimating item parameters of the IRT model and an item generator based on a large language model to automate diverse question generation. Our experiments on 24 common natural language processing benchmarks and 180 language models show that this approach is more reliable and resource-efficient compared to traditional evaluation methods, offering a scalable solution to evaluate generative models.

4121Complexity Lower Bounds of Adaptive Gradient Algorithms for Non-convex Stochastic Optimization under Relaxed Smoothness

[openreview] [pdf]

Abstract Recent results in non-convex stochastic optimization demonstrate the convergence of popular adaptive algorithms (e.g., AdaGrad) under the (L0,L1)(L_0, L_1)-smoothness condition, but the rate of convergence is a higher-order polynomial in terms of problem parameters like the smoothness constants. The complexity guaranteed by such algorithms to find an ϵ\epsilon-stationary point may be significantly larger than the optimal complexity of Θ(ΔLσ2ϵ4)\Theta \left( \Delta L \sigma^2 \epsilon^{-4} \right) achieved by SGD in the LL-smooth setting, where Δ\Delta is the initial optimality gap, σ2\sigma^2 is the variance of stochastic gradient. However, it is currently not known whether these higher-order dependencies can be tightened. To answer this question, we investigate complexity lower bounds for several adaptive optimization algorithms in the (L0,L1)(L_0, L_1)-smooth setting, with a focus on the dependence in terms of problem parameters Δ,L0,L1\Delta, L_0, L_1. We provide complexity bounds for three variations of AdaGrad, which show at least a quadratic dependence on problem parameters Δ,L0,L1\Delta, L_0, L_1. Notably, we show that the decorrelated variant of AdaGrad-Norm requires at least Ω(Δ2L12σ2ϵ4)\Omega \left( \Delta^2 L_1^2 \sigma^2 \epsilon^{-4} \right) stochastic gradient queries to find an ϵ\epsilon-stationary point. We also provide a lower bound for SGD with a broad class of adaptive stepsizes. Our results show that, for certain adaptive algorithms, the (L0,L1)(L_0, L_1)-smooth setting is fundamentally more difficult than the standard smooth setting, in terms of the initial optimality gap and the smoothness constants.

4122Preble: Efficient Distributed Prompt Scheduling for LLM Serving

[openreview] [pdf]

Abstract Prompts to large language models (LLMs) have evolved beyond simple user questions. For LLMs to solve complex problems, today’s practices are to include domain-specific instructions, illustration of tool usages, and/or long context such as textbook chapters in prompts. As such, many parts of prompts are repetitive across requests. Recent works propose to cache and reuse KV state of prompts. However, they are all confined to a single- GPU optimization, while production LLM serving systems are distributed by nature.This paper proposes Preble, the first distributed LLM serving platform that targets and op- timizes for prompt sharing. We designed a distributed scheduling system that co-optimizes KV state reuse and computation load-balancing with a new scheduling algorithm and a hierarchical scheduling mechanism. Our evaluation of Preble with real workloads and re- quest arrival patterns on two open-source LLMs shows that Preble outperforms the SOTA serving systems by 1.5× to 14.5× on average latency and 2× to 10× on p99 latency.

4123SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe

[openreview] [pdf]

Abstract To induce desired behaviors in large language models (LLMs) for interaction-driven tasks, the instruction-tuning stage typically trains LLMs on instruction-response pairs using the next-token prediction (NTP) loss. Previous work aiming to improve instruction-tuning performance often emphasizes the need for higher-quality supervised fine-tuning (SFT) datasets, which typically involves expensive data filtering with proprietary LLMs or labor-intensive data generation by human annotators. However, these approaches do not fully leverage the datasets’ intrinsic properties, resulting in high computational and labor costs, thereby limiting scalability and performance gains. In this paper, we propose SFTMix, a novel recipe that elevates instruction-tuning performance beyond the conventional NTP paradigm, without the need for well-curated datasets. Observing that LLMs exhibit uneven confidence across the semantic representation space, we argue that examples with different confidence levels should play distinct roles during the instruction-tuning process. Based on this insight, SFTMix leverages training dynamics to identify examples with varying confidence levels, then applies a Mixup-based regularization to mitigate overfitting on confident examples while propagating supervision signals to improve learning on relatively unconfident ones. This approach enables SFTMix to significantly outperform NTP across a wide range of instruction-following and healthcare domain-specific SFT tasks, demonstrating its adaptability to diverse LLM families and scalability to datasets of any size. Comprehensive ablation studies further verify the robustness of SFTMix’s design choices, underscoring its versatility in consistently enhancing performance across different LLMs and datasets in broader natural language processing applications.

4124Adapting Multi-modal Large Language Model to Concept Drift From Pre-training Onwards

[openreview] [pdf]

Abstract Multi-modal Large Language Models (MLLMs) frequently face challenges from concept drift when dealing with real-world streaming data, wherein distributions change unpredictably. This mainly includes gradual drift due to long-tailed data and sudden drift from Out-Of-Distribution (OOD) data, both of which have increasingly drawn the attention of the research community. While these issues have been extensively studied in the individual domain of vision or language, their impacts on MLLMs in concept drift settings remain largely underexplored. In this paper, we reveal the susceptibility and vulnerability of Vision-Language (VL) models to significant biases arising from gradual drift and sudden drift, particularly in the pre-training. To effectively address these challenges, we propose a unified framework that extends concept drift theory to the multi-modal domain, enhancing the adaptability of the VL model to the distribution unpredictable changes. Additionally, a T-distribution based drift adapter is proposed to effectively mitigate the bias induced by the gradual drift, which also facilitates the model in distinguishing sudden distribution changes through explicit distribution modeling. Extensive experiments demonstrate our method enhances the efficiency and accuracy of image-text alignment in the pre-training of VL models, particularly in the concept drift scenario. Moreover, various downstream tasks exhibit significant improvements in our model’s ability to adapt to the long-tailed open world. Furthermore, we create a set of multi-modal datasets called OpenMMlo, specifically tailored for the long-tailed open world settings, to validate our findings. To foster the development of the multi-modal community, we have made both OpenMMlo datasets and our code publicly available at: \url{https://github.com/Anonymous0Knight/ConceptDriftMLLMs}.

4125Self-Introspective Decoding: Alleviating Hallucinations for Large Vision-Language Models

[openreview] [pdf]

Abstract Hallucination remains a significant challenge in Large Vision-Language Models (LVLMs). To alleviate this issue, some methods, known as contrastive decoding, induce hallucinations by manually disturbing the raw vision or instruction inputs and then mitigate them by contrasting the outputs of the original and disturbed LVLMs. However, these holistic input disturbances sometimes induce potential noise and also double the inference cost. To tackle these issues, we propose a simple yet effective method named Self-Introspective Decoding\textit{Self-Introspective Decoding} (SID). Our empirical investigations reveal that pre-trained LVLMs can introspectively assess the importance of vision tokens based on preceding vision and text (both instruction and generated) tokens. Leveraging this insight, we develop the Context and Text-aware Token Selection (CT2^2S) strategy, which preserves only the least important vision tokens after the early decoder layers, thereby adaptively amplify vision-and-text association hallucinations during auto-regressive decoding. This strategy ensures that multimodal knowledge absorbed in the early decoder layers induces multimodal contextual rather than aimless hallucinations, and significantly reduces computation burdens. Subsequently, the original token logits subtract the amplified fine-grained hallucinations, effectively alleviating hallucinations without compromising the LVLMs’ general ability. Extensive experiments illustrate SID generates less-hallucination and higher-quality texts across various metrics, without much additional computation cost. Codes are in the Supplementary Material and also available athttps://anonymous.4open.science/r/SID-1795.

4126Upcycling Instruction Tuning from Dense to Mixture-of-Experts via Parameter Merging

[openreview] [pdf]

Abstract Mixture-of-Experts (MoE) shines brightly in large language models (LLMs) and demonstrates outstanding performance in plentiful natural language processing tasks. However, existing methods that transform LLMs from dense to MoE face significant data requirements and typically rely on large-scale post-training. In this paper, we propose Upcycling Instruction Tuning (UpIT), a data-efficient approach for tuning a dense pre-trained model into an MoE instruct model. Specifically, we first point out that intermediate checkpoints during instruction tuning of the dense model are naturally suitable for specialized experts, and then propose an expert expansion stage to flexibly achieve models with different numbers of experts, where genetic algorithm and parameter merging are introduced to ensure sufficient diversity of new extended experts. To ensure that each differentiated expert in the MoE model works as expected, we select a small amount of seed data that each expert excels to pre-optimize the router. Extensive experiments with various data scales and upcycling settings demonstrate the outstanding performance and data efficiency of UpIT, as well as stable improvement in expert or data scaling. Further analysis reveals the importance of ensuring expert diversity in upcycling.

4127Differential Privacy of Cross-Attention with Provable Guarantee

[openreview] [pdf]

Abstract Cross-attention has become a fundamental module nowadays in many important artificial intelligence applications, e.g., retrieval-augmented generation (RAG), system prompt, guided stable diffusion, and many more. Ensuring cross-attention privacy is crucial and urgently needed because its key and value matrices may contain sensitive information about model providers and their users. In this work, we design a novel differential privacy (DP) data structure to address the privacy security of cross-attention with a theoretical guarantee. In detail, let nn be the input token length of system prompt/RAG data, dd be the feature dimension, 0<α10 < \alpha \le 1 be the relative error parameter, RR be the maximum value of the query and key matrices, RwR_w be the maximum value of the value matrix, and r,s,ϵsr,s,\epsilon_s be parameters of polynomial kernel methods. Then, our data structure requires O~(ndr2)\widetilde{O}(ndr^2) memory consumption with O~(nr2)\widetilde{O}(nr^2) initialization time complexity and O~(α1r2)\widetilde{O}(\alpha^{-1} r^2) query time complexity for a single token query. In addition, our data structure can guarantee that the process of answering user query satisfies (ϵ,δ)(\epsilon, \delta)-DP with O~(n1ϵ1α1/2R2sRwr2)\widetilde{O}(n^{-1} \epsilon^{-1} \alpha^{-1/2} R^{2s} R_w r^2) additive error and n1(α+ϵs)n^{-1} (\alpha + \epsilon_s) relative error between our output and the true answer. Furthermore, our result is robust to adaptive queries in which users can intentionally attack the cross-attention system. To our knowledge, this is the first work to provide DP for cross-attention and is promising to inspire more privacy algorithm design in large generative models (LGMs).

4128Linear Recurrent Neural Networks with a Feature-Sequence Twist

[openreview] [pdf]

Abstract The transformer network architecture has led to advances in artificial intelligence. Conversational AI applications, such as ChatGPT, and protein folding predictions with AlphaFold are made possible by transformer architectures and the self-attention mechanism. However, advancing towards more general, flexible, and energy-efficient artificial intelligence may require exploring new architectures that differ significantly from those currently used. Transformer networks have largely replaced recurrent neural networks (RNNs) for state-of-the-art performance on sequence-based tasks. However, in recent years there has been some successful competition from linear recurrent neural networks (LRNNs) and state space models (SSMs). A core advantage of LRNNs and SSMs over traditional RNNs is that the hidden states can be calculated in parallel. Therefore, like the transformer, they can make efficient use of GPU computation.Unlike the transformer, computational costs of parallelized LRNNs and SSMs can scale sub-quadratically with sequence length. Despite these advantages, LRNNs and SSMs often struggle to generate the deep and rich representations that have contributed to the success of transformer architectures. We introduce Feature-Sequence Twisting (FST), a novel technique that transposes the sequence and feature dimensions between LRNN blocks. The purpose of FST is to generate deeper representations of the sequence in subsequent LRNN blocks. Since the computational cost of LRNNs scale sub-quadratically with sequence length, FST remains practical to compute even for large feature dimensions. Our experiments demonstrate that the FST architecture outperforms transformer networks on tasks such as Long ListOps, achieving performance competitive with state-of-the-art models.

4129A Discrete Actor and Critic for Reinforcement Learning on Continuous Tasks

[openreview] [pdf]

Abstract Solving continuous reinforcement learning (RL) tasks typically requires models with continuous action spaces, as discrete models face challenges such as the curse of dimensionality. Inspired by discrete controlling signals in control systems, such as pulse-width modulation, we investigated RL models with discrete action spaces with performance comparable to continuous models on continuous tasks. In this paper, we propose an RL model with a discrete action space, designed a discrete actor that outputs action distributions and twin discrete critics for value distribution estimation. We also developed both the training method and exploration strategy for this model. The model successfully solved BipedalWalkerHardcore-v3, a continuous robot control task in a complex environment, achieved a higher score than the state-of-the-art baselines and comparable results across various other control tasks.

4130Implicit In-context Learning

[openreview] [pdf]

Abstract In-context Learning (ICL) empowers large language models (LLMs) to swiftly adapt to unseen tasks during at inference-time by prefixing a few demonstration examples before queries. Despite its versatility, ICL incurs substantial computa- tional and memory overheads compared to zero-shot learning and is sensitive to the selection and order of demonstration examples. In this work, we introduce Implicit In-context Learning (I2CL), an innovative paradigm that reduces the inference cost of ICL to that of zero-shot learning with minimal information loss. I2CL operates by first generating a condensed vector representation, namely a context vector, extracted from the demonstration examples. It then conducts an inference-time intervention through injecting a linear combination of the context vector and query activations back into the model’s residual streams. Empirical evaluation on nine real-world tasks across three model architectures demonstrates that I2CL achieves few-shot level performance at zero-shot cost, and it exhibits robustness against variations in demonstration examples. Furthermore, I2CL facilitates a novel representation of “task-ids”, enhancing task similarity detection and fostering effective transfer learning. We also performs a comprehensive analysis and ablation study on I2CL, offering deeper insights into its internal mechanisms.

4131Vector Segmented and Recombined Adaptation for Scalable and Efficient Model Tuning

[openreview] [pdf]

Abstract Among the most commonly utilized parameter-efficient fine-tuning (PEFT) methods, LoRA and its variations have achieved significant popularity. The Vector-based Random Matrix Adaptation (VeRA), one typical variant, utilizes random weights and projections to reduce the number of trainable parameters greatly. However, it requires additional GPU memory and computational resources, probably resulting in a lack of scalability that leads to performance bottlenecks in complex tasks. Besides, the inappropriate initialization of random matrices may affect model performance. To address these problems, we propose a new method called Vector Segmented and Recombined Adaptation (SeRA). SeRA segments input vectors into sub-vectors for individual dimensionality reduction, then introduces a square matrix to combine the information from the reduced sub-vectors, and finally expands the dimensionality independently to adapt the size of pre-trained model. SeRA allows for flexible increase of trainable parameters to enhance performance in complex tasks, and avoids the problem caused by random matrices initialization. Through evaluations on the image classification, cross-modal image-text retrieval, instruction-tuning and GLUE benchmark, we demonstrate the scalability and efficiency of SeRA. Furthermore, we utilize Singular Value Decomposition on the adaptation matrices of SeRA, to analyze how the information characteristics of the matrices change in different ranks and tasks. The results can serve as the guide for selecting appropriate parameter amounts in different tasks.

4132RethinkMCTS: Refining Erroneous Thoughts in Monte Carlo Tree Search for Code Generation

[openreview] [pdf]

Abstract LLM agents enhanced by tree search algorithms have shown significant performance in code generation. However, existing search methods generally operate directly in the code language space, leading to suboptimal search quality due to ignoring the reasoning process behind the code. Specifically, two key challenges remain largely unaddressed: 1) A lack of exploration for the reasoning process, which is essential for high-reasoning-demand tasks like code generation, and 2) Inadequate search quality due to the absence of refinement mechanism. In this paper, we introduce RethinkMCTS, a framework that explores and refines the reasoning process for generating code. Specifically, we employ MCTS to search for the thoughts before code generation and integrate MCTS with a refinement mechanism called “rethink”, which incorporates fine-grained code execution feedback to refine erroneous thoughts during the search. It ensures the search path aligns with the better reasoning, improving overall search quality. Through extensive experiments, we demonstrate that RethinkMCTS outperforms previous search-enhanced and feedback-enhanced code generation baselines. On the HumanEval dataset, it boosts the pass@1 of GPT-3.5-turbo from 70.12 to 89.02 and that of GPT-4o-mini from 87.20 to 94.51. By conducting thought-level exploration and integrating the rethink mechanism, it significantly enhances the search quality of the entire search tree

4133LNUCB-TA: Linear-nonlinear Hybrid Bandit Learning with Temporal Attention

[openreview] [pdf]

Abstract Existing contextual multi-armed bandit (MAB) algorithms struggle to simultaneously capture long-term trends as well as local patterns across all arms, leading to suboptimal performance in complex environments with rapidly changing reward structures. Additionally, they typically employ static exploration rates, which do not adapt to dynamic conditions. To address these issues, we present LNUCB-TA, a hybrid bandit model that introduces a novel nonlinear component (adaptive kk-Nearest Neighbors (kk-NN)) designed to reduce time complexity, and an innovative global-and-local attention-based exploration mechanism. Our method incorporates a unique synthesis of linear and nonlinear estimation techniques, where the nonlinear component dynamically adjusts kk based on reward variance, thereby effectively capturing spatiotemporal patterns in the data. This is critical for reducing the likelihood of selecting suboptimal arms and accurately estimating rewards while reducing computational time. Also, our proposed attention-based mechanism prioritizes arms based on their historical performance and frequency of selection, thereby balancing exploration and exploitation in real-time without the need for fine-tuning exploration parameters. Incorporating both global attention (based on overall performance across all arms) and local attention (focusing on individual arm performance), the algorithm efficiently adapts to temporal and spatial complexities in the available context. Empirical evaluation demonstrates that LNUCB-TA significantly outperforms state-of-the-art contextual MAB algorithms, including purely linear, nonlinear, and vanilla combination of linear and nonlinear bandits based on cumulative and mean rewards, convergence performance, and demonstrates consistency of results across different exploration rates. Theoretical analysis further proves the robustness of LNUCB-TA with a sub-linear regret bound.

4134SimPER: Simple Preference Fine-Tuning without Hyperparameters by Perplexity Optimization

[openreview] [pdf]

Abstract Preference optimization has made significant advances in aligning large language models with preference data. However, existing preference optimization objectives require additional hyperparameters that must be extensively manually adjusted to achieve optimal performance, increasing the complexity and time required for fine-tuning large language models. In this paper, we propose a simple hyperparameter-free preference optimization algorithm for alignment. We observe that we can achieve promising performance simply by optimizing inverse perplexity, which is computed as the inverse of the exponentiated average log-likelihood of the chosen and rejected responses in the preference dataset. The resulting simple learning objective, SimPER, is easy to implement and eliminates the need for expensive hyperparameter tuning and a reference model, making it both learning and memory efficient. We show theoretically that SimPER can avoid overestimation of rejected responses in preference data and is closely related to the total variation distance, encouraging promising mode-seeking behavior for alignment. Extensive experiments on widely used real-world benchmarks: MT-Bench, AlpacaEval 2, and 10 key benchmarks of the Open LLM Leadboard with 5 base models show that SimPER consistently and significantly outperforms existing approaches even without any hyperparameters and the reference model. For instance, despite its simplicity, SimPER outperforms state-of-the-art methods by up to 5.7 points on AlpacaEval 2 and achieves the highest average ranking across 10 benchmarks on the Open LLM Leaderboard. Code for SimPER is publicly available at this link.

4135Simulating Human-like Daily Activities with Desire-driven Autonomy

[openreview] [pdf]

Abstract Existing task-oriented AI agents often depend on explicit instructions or external rewards, limiting their ability to be driven by intrinsic motivations like humans. In this paper, we present a desire-driven autonomy framework to guide a Large Language Model based (LLM-based) agent to simulate human-like daily activities. In contrast to previous agents, our Desire-driven Autonomous Agent (D2A) operates on the principle of intrinsic desire, allowing it to propose and select tasks that fulfill its motivational framework autonomously. Inspired by the Theory of Needs from Maslow. A.H., the motivational framework incorporates an understanding of human-like desires, such as the need for social interaction, personal fulfillment, and self-care. Utilizing a desire-driven task generation mechanism, the agent evaluates its current state and takes a sequence of activities aligned with its intrinsic motivations. Through simulations, we demonstrate that our Desire-driven Autonomous Agent (D2A) generates coherent, contextually relevant daily activities while exhibiting variability and adaptability similar to human behavior. A comparative analysis with other LLM-based frameworks demonstrates that our approach significantly enhances the rationality of the simulated activities.

4136Denoising Levy Probabilistic Models

[openreview] [pdf]

Abstract Investigating noise distributions beyond Gaussian in diffusion generative models remains an open challenge. The Gaussian case has been a large success experimentally and theoretically, admitting a unified stochastic differential equation (SDE) framework, encompassing score-based and denoising formulations. Recent studies have investigated the potential of \emph{heavy-tailed} noise distributions to mitigate mode collapse and effectively manage datasets exhibiting class imbalance, heavy tails, or prominent outliers. Very recently, Yoon et al.\ (NeurIPS 2023), presented the Levy-Ito model (LIM), directly extending the SDE-based framework to a class of heavy-tailed SDEs, where the injected noise followed an α\alpha-stable distribution -- a rich class of heavy-tailed distributions. Despite its theoretical elegance and performance improvements, LIM relies on highly involved mathematical techniques, which may limit its accessibility and hinder its broader adoption and further development. In this study, we take a step back, and instead of starting from the SDE formulation, we extend the denoising diffusion probabilistic model (DDPM) by directly replacing the Gaussian noise with α\alpha-stable noise. By using only elementary proof techniques, we show that the proposed approach, \emph{denoising L’{e}vy probabilistic model} (DLPM) algorithmically boils down to running vanilla DDPM with minor modifications, hence allowing the use of existing implementations with minimal changes. Remarkably, as opposed to the Gaussian case, DLPM and LIM yield different training algorithms and different backward processes, leading to distinct sampling algorithms. This fundamental difference translates favorably for the performance of DLPM in various aspects: our experiments show that DLPM achieves better coverage of the tails of the data distribution, better generation of unbalanced datasets, and improved computation times requiring significantly smaller number of backward steps.

4137Free-MoE: Tuning-Free Mixture-of-Experts Purifying LLMs to Thrive across Any Field

[openreview] [pdf]

Abstract The Mixture-of-Experts (MoE) framework efficiently scales large language models (LLMs) by selectively activating expert subnetworks, reducing computational costs. However, current MoE methods are costly in computation and include additional expert modules that require extra training data for tuning, leading to instability in the optimization process. To address these issues, we introduce Free-MoE, a tuning-free MoE method that leverages pre-trained LLMs’ inherent ability to generalize across a wide range of tasks and domains. Free-MoE dynamically activates experts based on specific domains, achieves improvements while 1) requiring no extra model parameters and 2) being completely tuning-free. Specifically, we design the DOWP Alg., a Domain-Oriented Weight Purification Algorithm that purifies the weights in hidden layers and selects the optimal domain-specific experts of domain-specific experts in the hidden layers of the LLM to optimize activation decisions. The activated DSS-Experts, Domain-Specific Subnetwork Experts,can thereby concentrate on specialized task generation, outperforming the corresponding original model. Moreover, Free-MoE incorporates a multi-level trainable router that activates only the most relevant subnetworks during task, effectively minimizing unnecessary inference computations. Comprehensive evaluations reveals that the DOWP Algorithm consistently achieves general performance gains of 2% to 3%, reaching up to 6.8% across datasets like MMLU, HumanEval, GSM8K, and etc. Additionally, when integrated into \model~framework, our method demonstrates a cumulative improvement of 1.11% in average. Findings indicate that Free-MoE not only enhances overall computational efficiency but improves the model’s adaptability across any field that encompassed in contemporary language generation model benchmarks, and can be seamlessly applied to any transformer-based LLMs. Code for this project will be released in reachable future.

4138FLDmamba: Integrating Fourier and Laplace Transform Decomposition with Mamba for Enhanced Time Series Prediction

[openreview] [pdf]

Abstract Time series prediction, a crucial task across various domains, faces significant challenges due to the inherent complexities of time series data, including non-stationarity, multi-scale periodicity, and transient dynamics, particularly when tackling long-term predictions. While Transformer-based architectures have shown promise, their quadratic complexity with sequence length hinders their efficiency for long-term predictions. Recent advancements in State-Space Models, such as Mamba, offer a more efficient alternative for long-term modeling, but they lack the capability to capture multi-scale periodicity and transient dynamics effectively. Meanwhile, they are susceptible to the data noise issue in time series. This paper proposes a novel framework, FLDmamba (Fourier and Laplace Transform Decomposition Mamba), addressing these limitations. FLDmamba leverages the strengths of both Fourier and Laplace transforms to effectively capture both multi-scale periodicity, transient dynamics within time series data, and improve the robustness of the model to the data noise issue. By integrating Fourier analysis into Mamba, FLDmamba enhances its ability to capture global-scale properties, such as multi-scale periodicity patterns, in the frequency domain. Meanwhile, the Fourier Transform aids in isolating underlying patterns or trends from noise in time series data by emphasizing key frequency components, thereby enabling the model to mitigate noise effects. Additionally, incorporating Laplace analysis into Mamba improves its capacity to capture local correlations between neighboring data points, leading to a more accurate representation of transient dynamics. Our extensive experiments demonstrate that FLDmamba achieves superior performance on time series prediction benchmarks, outperforming both Transformer-based and other Mamba-based architectures. This work offers a computationally efficient and effective solution for long-term time series prediction, paving the way for its application in real-world scenarios. To promote the reproducibility of our method, we have made both the code and data accessible via the following URL: \href{https://anonymous.4open.science/r/FLambas-AD7E/README.md}{https://anonymous.4open.science/r/FLDmamba}

4139Learning to Optimize for Mixed-Integer Nonlinear Programming

[openreview] [pdf]

Abstract Mixed-integer nonlinear programs (MINLPs) arise in various domains, such as energy systems and transportation, but are notoriously difficult to solve. Recent advances in machine learning have achieved remarkable success in optimization tasks, an area known as learning to optimize. This approach includes using predictive models to generate solutions for optimization problems with continuous decision variables, thereby avoiding the need for computationally expensive optimization algorithms. However, applying learning to MINLPs remains challenging primarily due to integer decision variables, which complicate gradient-based learning. To address this limitation, we propose two differentiable correction layers that generate integer outputs while preserving gradient information. The experiments demonstrate that the proposed learning-based approach consistently produces high-quality solutions for parametric MINLPs extremely quickly. As problem size increases, traditional exact solvers and heuristic methods struggle to find feasible solutions, whereas our approach continues to deliver reliable results. Our work extends the scope of learning-to-optimize to MINLP, paving the way for integrating integer constraints into deep learning models.

4140Context Clues: Evaluating Long Context Models for Clinical Prediction Tasks on EHR Data

[openreview] [pdf]

Abstract Foundation Models (FMs) trained on Electronic Health Records (EHRs) have achieved state-of-the-art results on numerous clinical prediction tasks. However, these EHR FMs typically have limited context windows of <<1k tokens due to computational constraints, which prevents them from modeling full patient EHRs which can easily span 10k’s of events. For making clinical predictions, both model performance and robustness to the unique properties of EHR data are crucial. Recent advancements in subquadratic long-context architectures offer a promising solution. However, the application of this long-context paradigm to EHR data has not been well-studied. We address this gap by presenting the first systematic evaluation of the effect of context length on modeling EHR data across four state-of-the-art transformer and non-transformer architectures. We find that longer context models indeed improve predictive performance -- our Mamba-based model surpasses the prior state-of-the-art on 9/14 tasks on the EHRSHOT prediction benchmark. Additionally, we measure robustness to three unique, previously underexplored properties of EHR data: (1) the prevalence of “copy-forwarded” diagnoses which create artificial token repetition in EHR sequences; (2) the irregular time intervals between EHR events which can lead to a wide range of timespans within a context window; and (3) the natural increase in disease complexity over time which makes later tokens in the EHR harder to predict than earlier ones. Stratifying our EHRSHOT results, we find that while higher levels of each property correlate negatively with model performance (e.g., a 50% higher Brier loss between the least and most irregular patients), longer context models are more robust to patients exhibiting extreme degrees of each property. Our work highlights the potential for using long-context architectures to model EHR data, and offers a case study on identifying and quantifying new challenges in modeling sequential data that are motivated by domains outside of natural language. We release our model checkpoints, data preprocessing pipelines, and evaluation code.

4141Explain Like I’m Five: Using LLMs to Improve PDE Surrogate Models with Text

[openreview] [pdf]

Abstract Solving Partial Differential Equations (PDEs) is ubiquitous in science and engineering. Computational complexity and difficulty in writing numerical solvers has motivated the development of machine learning techniques to generate solutions quickly. Many existing methods are purely data driven, relying solely on numerical solution fields, rather than known system information such as boundary conditions and governing equations. However, the recent rise in popularity of Large Language Models (LLMs) has enabled easy integration of text in multimodal machine learning models. In this work, we use pretrained LLMs to integrate various amounts known system information into PDE learning. Our multimodal approach significantly outperforms our baseline model, FactFormer, in both next-step prediction and autoregressive rollout performance on the 2D Heat, Burgers, Navier-Stokes, and Shallow Water equations. Further analysis shows that pretrained LLMs provide highly structured latent space that is consistent with the amount of system information provided through text.

4142Generative Representational Instruction Tuning

[openreview] [pdf]

Abstract All text-based language problems can be reduced to either generation or embedding. Current models only perform well at one or the other. We introduce generative representational instruction tuning (GRIT) whereby a large language model is trained to handle both generative and embedding tasks by distinguishing between them through instructions. Compared to other open models, our resulting GritLM-7B is among the top models on the Massive Text Embedding Benchmark (MTEB) and outperforms various models up to its size on a range of generative tasks. By scaling up further, GritLM-8x7B achieves even stronger generative performance while still being among the best embedding models. Notably, we find that GRIT matches training on only generative or embedding data, thus we can unify both at no performance loss. Among other benefits, the unification via GRIT speeds up Retrieval-Augmented Generation (RAG) by > 60% for long documents, by no longer requiring separate retrieval and generation models. Models, code, etc. will be made freely available.

4143No MCMC Teaching For me: Learning Energy-Based Models via Diffusion Synergy

[openreview] [pdf]

Abstract Markov chain Monte Carlo (MCMC) sampling-based maximum likelihood estimation is a standard approach for training Energy-Based Models (EBMs). However, its effectiveness and training stability in high-dimensional settings remain thorny issues due to challenges like mode collapse and slow mixing of MCMC. To address these limitations, we introduce a novel MCMC teaching-free learning framework that jointly trains an EBM and a diffusion-based generative model, leveraging the variational formulation of divergence between time-reversed diffusion paths. In each iteration, the generator model is trained to align with both the empirical data distribution and the current EBM, bypassing the need for biased MCMC sampling. The EBM is then updated by maximizing the likelihood of the synthesized examples generated through a diffusion generative process that more accurately reflects the EBM’s distribution. Moreover, we propose a novel objective function that further improves EBM learning by minimizing the discrepancy between the EBM and the generative model. Our proposed approach enhances training efficiency and overcomes key challenges associated with traditional MCMC-based methods. Experimental results on generative modeling and likelihood estimation demonstrate the superior performance of our method.

4144Discrete Neural Algorithmic Reasoning

[openreview] [pdf]

Abstract Neural algorithmic reasoning aims to capture computations with neural networks via learning the models to imitate the execution of classic algorithms. While common architectures are expressive enough to contain the correct model in the weights space, current neural reasoners are struggling to generalize well on out-of-distribution data. On the other hand, classic computations are not affected by distributional shifts as they can be described as transitions between discrete computational states. In this work, we propose to force neural reasoners to maintain the execution trajectory as a combination of finite predefined states. To achieve that, we separate discrete and continuous data flows and describe the interaction between them. Trained with supervision on the algorithm’s state transitions, such models are able to perfectly align with the original algorithm. To show this, we evaluate our approach on multiple algorithmic problems and get perfect test scores both in single-task and multitask setups. Moreover, the proposed architectural choice allows us to prove the correctness of the learned algorithms for any test data.

4145Improved Sample Complexity for Private Nonsmooth Nonconvex Optimization

[openreview] [pdf]

Abstract We study differentially private (DP) optimization algorithms for stochastic and empirical objectives which are neither smooth nor convex, and propose methods that return a Goldstein-stationary point with sample complexity bounds that improve on existing works. We start by providing a single-pass (ϵ,δ)(\epsilon,\delta)-DP algorithm that returns an (α,β)(\alpha,\beta)-stationary point as long as the dataset is of size Ω~(1/αβ3+d/ϵαβ2+d3/4/ϵ1/2αβ5/2)\widetilde{\Omega}\left(1/\alpha\beta^{3}+d/\epsilon\alpha\beta^{2}+d^{3/4}/\epsilon^{1/2}\alpha\beta^{5/2}\right), which is Ω(d)\Omega(\sqrt{d}) times smaller than the algorithm of \citet{zhang2023private} for this task, where dd is the dimension. We then provide a multi-pass polynomial time algorithm which further improves the sample complexity to Ω~(d/β2+d3/4/ϵα1/2β3/2)\widetilde{\Omega}\left(d/\beta^2+d^{3/4}/\epsilon\alpha^{1/2}\beta^{3/2}\right), by designing a sample efficient ERM algorithm, and proving that Goldstein-stationary points generalize from the empirical loss to the population loss.

4146Differentiation and Specialization of Attention Heads via the Refined Local Learning Coefficient

[openreview] [pdf]

Abstract We introduce refined variants of the Local Learning Coefficient (LLC), a measure of model complexity grounded in singular learning theory, to study the development of internal structure in transformer language models during training. By applying these refined LLCs (rLLCs) to individual components of a two-layer attention-only transformer, we gain novel insights into the progressive differentiation and specialization of attention heads. Our methodology reveals how attention heads differentiate into distinct functional roles over the course of training, analyzes the types of data these heads specialize to process, and discovers a previously unidentified multigram circuit. These findings demonstrate that rLLCs provide a principled, quantitative toolkit for developmental interpretability, which aims to understand models through their evolution across the learning process. This work advances the field of developmental interpretability by providing a mathematically rigorous approach to understanding neural networks through the lens of their learning process. More broadly, this work takes a step towards establishing the correspondence between data distributional structure, geometric properties of the loss landscape, learning dynamics, and emergent computational structures in neural networks.

4147Is Factuality Enhancement a Free Lunch For LLMs? Better Factuality Can Lead to Worse Context-Faithfulness

[openreview] [pdf]

Abstract As the modern tools of choice for text understanding and generation, large language models (LLMs) are expected to accurately output answers by leveraging the input context. This requires LLMs to possess both context-faithfulness and factual accuracy. Extensive efforts have been made to enable better outputs from LLMs by mitigating hallucinations through factuality enhancement methods. However, they also pose risks of hindering context-faithfulness, as factuality enhancement can lead LLMs to become overly confident in their parametric knowledge, causing them to overlook the relevant input context. In this work, we argue that current factuality enhancement methods can significantly undermine the context-faithfulness of LLMs. We first revisit the current factuality enhancement methods and evaluate their effectiveness in enhancing factual accuracy. Next, we evaluate their performance on knowledge editing tasks to assess the potential impact on context-faithfulness. The experimental results reveal that while these methods may yield inconsistent improvements in factual accuracy, they also cause a more severe decline in context-faithfulness, with the largest decrease reaching a striking 69.7%. To explain these declines, we analyze the hidden states and logit distributions for the tokens representing new knowledge and parametric knowledge respectively, highlighting the limitations of current approaches. Our finding highlights the complex trade-offs inherent in enhancing LLMs. Therefore, we recommend that more research on LLMs’ factuality enhancement make efforts to reduce the sacrifice of context-faithfulness.

4148A Skewness-Based Criterion for Addressing Heteroscedastic Noise in Causal Discovery

[openreview] [pdf]

Abstract Real-world data often violates the equal-variance assumption (homoscedasticity), making it essential to account for heteroscedastic noise in causal discovery. In this work, we explore heteroscedastic symmetric noise models (HSNMs), where the effect YY is modeled as Y=f(X)+σ(X)NY = f(X) + \sigma(X)N, with XX as the cause and NN as independent noise following a symmetric distribution. We introduce a novel criterion for identifying HSNMs based on the skewness of the score (i.e., the gradient of the log density) of the data distribution. This criterion establishes a computationally tractable measurement that is zero in the causal direction but nonzero in the anticausal direction, enabling the causal direction discovery. We extend this skewness-based criterion to the multivariate setting and propose \texttt{SkewScore}, an algorithm that handles heteroscedastic noise without requiring the extraction of exogenous noise. We also conduct a case study on the robustness of \texttt{SkewScore} in a bivariate model with a latent confounder, providing theoretical insights into its performance. Empirical studies further validate the effectiveness of the proposed method.

4149Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction

[openreview] [pdf]

Abstract Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected through a stochastic behavior policy. We present \textit{Latent Macro Action Planner} (L-MAP), which addresses this challenge by learning a set of temporally extended macro-actions through a state-conditional Vector Quantized Variational Autoencoder (VQ-VAE), effectively reducing action dimensionality. L-MAP employs a (separate) learned prior model that acts as a latent transition model and allows efficient sampling of plausible actions. During planning, our approach accounts for stochasticity in both the environment and the behavior policy by using Monte Carlo tree search (MCTS). In offline RL settings, including stochastic continuous control tasks, L-MAP efficiently searches over discrete latent actions to yield high expected returns. Empirical results demonstrate that L-MAP maintains low decision latency despite increased action dimensionality. Notably, across tasks ranging from continuous control with inherently stochastic dynamics to high-dimensional robotic hand manipulation, L-MAP significantly outperforms existing model-based methods and performs on par with strong model-free actor-critic baselines, highlighting the effectiveness of the proposed approach in planning in complex and stochastic environments with high-dimensional action spaces.

4150GLoRa: A Benchmark to Evaluate the Ability to Learn Long-Range Dependencies in Graphs

[openreview] [pdf]

Abstract Learning on graphs is one of the most active research topics in machine learning (ML). Among the key challenges in this field, effectively learning long-range dependencies in graphs has been a particularly difficult problem. It has been observed that, in practice, the performance of many ML approaches, including various types of graph neural networks (GNNs), degrades significantly when the learning task involves long-range dependencies—that is, when the answer is determined by the presence of a certain path of significant length in the graph. This issue has been attributed to several phenomena, including, most prominently, oversmoothing, over-squashing, and vanishing gradient. A number of solutions have been proposed to mitigate these causes. However, evaluation of these solutions is complicated by the fact that existing benchmarks do not really test systems for their ability to learn tasks based on long-range dependencies in a transparent manner. In this paper, we design a synthetic benchmark that provably allows testing systems for this learning ability. We then evaluate state-of-the-art systems against it and conclude that none of them can claim that it can learn long-range dependencies well. We also observe that this weak performance cannot be attributed to any of the three causes, thus indicating that further investigation is necessary.

4151Auto-Arena: Automating LLM Evaluations with Agent Peer Battles and Committee Discussions

[openreview] [pdf]

Abstract As LLMs continuously evolve, there is an urgent need for a reliable evaluation method that delivers trustworthy results promptly. Currently, static benchmarks suffer from inflexibility and unreliability, leading users to prefer human voting platforms like Chatbot Arena. However, human evaluations require significant manual effort. To address this, we propose the Auto-Arena, an innovative framework that automates the entire evaluation process using LLM-powered agents. Firstly, an LLM examiner generates questions. Then, two LLM candidates engage in a multi-round peer battle based on individual questions, aiming at revealing their true performance differences. Finally, a committee of LLM judges collaboratively discusses and decides the winner, reducing bias and enhancing fairness. During the peer battles, we observe intriguing scenarios where the LLM candidates display competitive behaviors and even learn from the opponents. In our extensive experiments involving 15 recent LLMs, Auto-Arena shows a 92.14% correlation with human preferences, surpassing all previous expert-annotated benchmarks without any manual efforts. As a result, Auto-Arena offers a promising alternative to current human evaluation platforms for evaluating LLMs automatically.

4152Efficient Cross-Episode Meta-RL

[openreview] [pdf]

Abstract We introduce Efficient Cross-Episodic Transformers (ECET), a new algorithm for online Meta-Reinforcement Learning that addresses the challenge of enabling reinforcement learning agents to perform effectively in previously unseen tasks. We demonstrate how past episodes serve as a rich source of in-context information, which our model effectively distills and applies to new contexts. Our learned algorithm is capable of outperforming the previous state-of-the-art and provides more efficient meta-training while significantly improving generalization capabilities. Experimental results, obtained across various simulated tasks of the Meta-World and ManiSkill benchmarks, indicate a significant improvement in learning efficiency and adaptability compared to the state-of-the-art. Our approach enhances the agent’s ability to generalize from limited data and paves the way for more robust and versatile AI systems.

4153Protecting against simultaneous data poisoning attacks

[openreview] [pdf]

Abstract Current backdoor defense methods are evaluated against a single attack at a time. This is unrealistic, as powerful machine learning systems are trained on large datasets scraped from the internet, which may be attacked multiple times by one or more attackers. We demonstrate that multiple backdoors can be simultaneously installed in a single model through parallel data poisoning attacks without substantially degrading clean accuracy. Furthermore, we show that existing backdoor defense methods do not effectively defend against multiple simultaneous attacks. Finally, we leverage insights into the nature of backdoor attacks to develop a new defense, BaDLoss, that is effective in the multi-attack setting. With minimal clean accuracy degradation, BaDLoss attains an average attack success rate in the multi-attack setting of 7.98% in CIFAR-10, 10.29% in GTSRB, and 19.17% in Imagenette, compared to the average of other defenses at 63.44%, 74.83%, and 41.74% respectively.

4154Latent Matrix Completion Model

[openreview] [pdf]

Abstract Large amounts of missing data are becoming increasingly ubiquitous in modern high-dimensional datasets. High-rank matrix completion (HRMC) uses the powerful union of subspace (UoS) model to handle these vast amounts of missing data. However, existing HRMC methods often fail when dealing with real data that does not follow the UoS model exactly. Here we propose a new approach: instead of finding a UoS that fits the observed data directly, we will find a UoS in a latent space that can fit a non-linear embedding of the original data. Embeddings of this sort are typically attained with deep architectures. However, the abundance of missing data impedes the training process, as the coordinates of the observed samples rarely overlap. We overcome this difficulty with a novel pseudo-completion layer (in charge of estimating the missing values) followed by an auto-encoder (in charge of finding the embedding) coupled with a self-expressive layer (that clusters data according to a UoS in the latent space). Our design reduces the exponential memory requirements typically induced by uneven patterns of missing data. We describe our architecture, model, loss functions, and training strategy. Our experiments on several real datasets show that our method consistently outperforms the state-of-the-art accuracy by more than a staggering 40%.

4155GLoRA: Geometric Adaptive Ranks for Efficient LoRA Fine-Tuning

[openreview] [pdf]

Abstract Fine-tuning large language models is computationally intensive because it requires updating all parameters. Low-Rank Adaptation (LoRA) improves efficiency by modifying only a subset of weights but introduces a trade-off between expressivity and computational cost: lower ranks reduce resources but limit expressiveness, while higher ranks enhance expressivity at increased cost. Despite recent advances in adaptive LoRA techniques, existing methods fail to provide a theoretical basis for optimizing the trade-off between model performance and efficiency. We propose Geometric Low-Rank Adaptation (GLoRA), a novel framework that computes the intrinsic dimensionality of hidden state representations to adaptively select LoRA ranks. We demonstrate that the intrinsic dimension provides a lower bound for the optimal rank of LoRA matrices, allowing for a principled selection that balances efficiency and expressivity. GLoRA dynamically adjusts the rank for each layer based on the intrinsic dimensionality of its input and output representations, recognizing that not all model parameters equally impact fine-tuning. Empirical validation on multiple tasks shows that GLoRA consistently outperforms recent baselines within the same parameter budget.

4156Composing Unbalanced Flows for Flexible Docking and Relaxation

[openreview] [pdf]

Abstract Diffusion models have emerged as a successful approach for molecular docking, but they often cannot model protein flexibility or generate nonphysical poses. We argue that both these challenges can be tackled by framing the problem as a transport between distributions. Still, existing paradigms lack the flexibility to define effective maps between such complex distributions. To address this limitation we propose Unbalanced Flow Matching, a generalization of Flow Matching (FM) that allows trading off sample efficiency with approximation accuracy and enables more accurate transport. Empirically, we apply Unbalanced FM on flexible docking and structure relaxation, demonstrating our ability to model protein flexibility and generate energetically favorable poses. On the PDBBind docking benchmark, our method FlexDock improves the docking performance while increasing the proportion of energetically favorable poses from 30% to 73%.

4157VISION-LANGUAGE MODELS AS TRAINERS FOR INSTRUCTION-FOLLOWING AGENTS

[openreview] [pdf]

Abstract Developing agents that can understand and follow language instructions is critical for effective and reliable human-AI collaboration. Recent approaches train these agents using reinforcement learning with infrequent environment rewards, placing a significant burden on environment designers to create language-conditioned reward functions. As environments and instructions grow in complexity, crafting such reward functions becomes increasingly impractical. To address this challenge, we introduce V-TIFA, a novel method that trains instruction-following agents by leveraging feedback from vision-language models (VLMs). The core idea of V-TIFA is to query VLMs to rate entire trajectories based on language instructions, using the resulting ratings to directly train the agent. Unlike prior VLM reward generation methods, V-TIFA does not require manually crafted task specifications, enabling agents to learn from a diverse set of natural language instructions. Extensive experiments in embodied environments demonstrate that V-TIFA outperforms existing reward generation methods under the same conditions.

4158Latent Point Collapse Induces an Information Bottleneck in Deep Neural Network Classifiers

[openreview] [pdf]

Abstract The information-bottleneck principle suggests that the foundation of learning lies in the ability to create compact representations. In machine learning, this goal can be formulated as a Lagrangian optimization problem, where the mutual information between the input and latent representations must be minimized without compromising the correctness of the model’s predictions. Unfortunately, mutual information is difficult to compute in deterministic deep neural network classifiers, which greatly limits the application of this approach to challenging scenarios. In this paper, we tackle this problem from a different perspective that does not involve direct computation of the mutual information. We develop a method that induces the collapse of latent representations belonging to the same class into a single point. Such a point collapse yields a significant decrease in the entropy associated with the latent distribution, thereby creating an information bottleneck. Our method is straightforward to implement, and we demonstrate that it enhances the robustness, generalizability, and reliability of the network.

4159Training-Free Dataset Pruning for Instance Segmentation

[openreview] [pdf]

Abstract Existing dataset pruning techniques primarily focus on classification tasks, limiting their applicability to more complex and practical tasks like instance segmentation. Instance segmentation presents three key challenges: pixel-level annotations, instance area variations, and class imbalances, which significantly complicate dataset pruning efforts. Directly adapting existing classification-based pruning methods proves ineffective due to their reliance on time-consuming model training process. To address this, we propose a novelTraining-FreeDatasetPruning (TFDP) method for instance segmentation. Specifically, we leverage shape and class information from image annotations to design a Shape Complexity Score (SCS), refining it into a Scale-Invariant (SI-SCS) and Class-Balanced (CB-SCS) versions to address instance area variations and class imbalances, all without requiring model training. We achieve state-of-the-art results on VOC 2012, Cityscapes, and MS COCO datasets, generalizing well across CNN and Transformer architectures. Remarkably, our approach accelerates the pruning process by an average of1349×\timeson COCO compared to the adapted baselines.

4160Do We Really Need Parameter-Isolation to Protect Task Knowledge?

[openreview] [pdf]

Abstract Due to the dynamic nature of tasks, how deep networks can transition from a static structure, trained on previous tasks, to a dynamic structure that adapts to continuously changing data inputs has garnered significant attention. This involves learning new task knowledge while avoiding catastrophic forgetting of previously acquired knowledge. Continual learning is a learning approach aimed at addressing the problem of catastrophic forgetting, primarily by constraining or isolating parameter changes to protect the knowledge of prior tasks. However, while existing methods offer good protection for old task knowledge, they often diminish the ability to learn new task knowledge. Given the sparsity of activation channels in a deep network, we introduce a novel misaligned fusion method within the context of continual learning. This approach allows for the adaptive allocation of available pathways to protect crucial knowledge from previous tasks, replacing traditional isolation techniques. Furthermore, when new tasks are introduced, the network can undergo full parameter training, enabling a more comprehensive learning of new tasks. This work conducts comparative tests of our method against other approaches using deep network architectures of various sizes and popular benchmark datasets. The performance demonstrates the effectiveness and superiority of our method.

[openreview] [pdf]

Abstract Temporal link prediction aims to forecast future link existence in temporal graphs, with numerous real-world applications. Existing methods often rely on designing complex model architectures to parameterize the interaction patterns between nodes. Instead, we re-think the interaction dynamics in temporal graphs (which we call ``interaction rhythms’‘) by addressing a fundamental research question: \textit{Is there a strong yet prevalent latent interaction rhythm pattern across different temporal graphs that can be leveraged for temporal link prediction?} Our introduced empirical analyses reveal that there indeed exists temporal clustering in node interaction rhythms, where for a specific node, interactions tend to occur in bursts. Such observation leads to two key insights for predicting future links: (i) recent historical links that carry the latest rhythm pattern information; and (ii) the intervals between interactions that further illuminate temporal dynamics. Building on these empirical findings, we propose TG-Mixer, a novel method that explicitly captures temporal clustering patterns to advance temporal link prediction. TG-Mixer samples the most recent historical links to extract surrounding neighborhoods, preserving currently invaluable interaction rhythms while avoiding massive computations. Additionally, it integrates a carefully designed silence decay mechanism that penalizes nodes’ long-term inactivity, effectively incorporating temporal clustering information for future link prediction. Both components ensure concise implementations, leading to a lightweight architecture. Exhaustive experiments on seven benchmarks against nine baselines demonstrate that TG-Mixer achieves state-of-the-art performance with faster convergence, stronger generalization capabilities, and higher efficiency. The experimental results also highlight the importance of explicitly considering temporal clustering for temporal link prediction.

4162Unlocking the Power of Function Vectors for Characterizing and Mitigating Catastrophic Forgetting in Continual Instruction Tuning

[openreview] [pdf]

Abstract Catastrophic forgetting (CF) poses a significant challenge in machine learning, where a model forgets previously learned information upon learning new tasks. Despite the advanced capabilities of Large Language Models (LLMs), they continue to face challenges with CF during continual learning. The majority of existing research focuses on analyzing forgetting patterns through a singular training sequence, thereby overlooking the intricate effects that diverse tasks have on model behavior. Our study explores CF across various settings, discovering that model forgetting is influenced by both the specific training tasks and the models themselves. To this end, we interpret forgetting by examining the function vector (FV), a compact representation of functions in LLMs, offering a model-dependent indicator for the occurrence of CF. Through theoretical and empirical analyses, we demonstrated that CF in LLMs primarily stems from biases in function activation rather than the overwriting of task processing functions. Leveraging these insights, we propose a novel function vector guided training methodology, incorporating a regularization technique to stabilize the FV and mitigate forgetting. Empirical tests on four benchmarks confirm the effectiveness of our proposed training method, substantiating our theoretical framework concerning CF and model function dynamics. We plan to make our code publicly accessible in the near future.

4163TreeTop: Topology-Aware Fine-Tuning for LLM Conversation Tree Understanding

[openreview] [pdf]

Abstract While Large Language Models (LLMs) have dominated a wide diversity of natural language tasks, improving their capabilities on \emph{structured} inputs such as graphs remains an open challenge. We introduce TreeTop\texttt{TreeTop}, a pre-training framework for LLMs that significantly improves their ability to understand and reason over structural relationships in multi-party, threaded discussions, such as those found on social media platforms. TreeTop\texttt{TreeTop} is a novel set of 17 QA-style tasks specifically designed to allow LLMs to selectively focus on both the structure of and content in discussion graphs. We find that LLMs fine-tuned with TreeTop\texttt{TreeTop} outperform their counterparts in every setting: zero-shot/few-shot performance on unseen pretraining tasks as well as downstream social media inference tasks (e.g.rumor detection), as well as fine-tuned performance on the downstream tasks, including their challenging “early-detection” variants. In particular, Gemini Pro\texttt{Gemini Pro} fine-tuned with TreeTop\texttt{TreeTop} and further fine-tuned on downstream tasks surpasses both vanilla Gemini Pro\texttt{Gemini Pro} and state-of-the-art GNN baselines. Our framework paves the way for LLMs with enhanced capabilities on heavily-structured inputs.

4164FIG: Flow with Interpolant Guidance for Linear Inverse Problems

[openreview] [pdf]

Abstract Diffusion and flow matching models have been recently used to solve various linear inverse problems such as image restoration. Using a pre-trained diffusion or flow-matching model as a prior, most existing methods modify the reverse-time sampling process by incorporating the likelihood information from the measurement. However, they struggle in challenging scenarios, e.g., in case of high measurement noise or severe ill-posedness. In this paper, we propose `Flow with Interpolant Guidance’ (FIG), an algorithm where the reverse-time sampling is efficiently guided with measurement interpolants through theoretically justified schemes. Experimentally, we demonstrate that FIG efficiently produce highly competitive results on a variety of linear image reconstruction tasks on natural image datasets. We improve upon state-of-the-art baseline algorithms, especially for challenging tasks. Code will be released.

4165PERFT: Parameter-Efficient Routed Fine-Tuning for Mixture-of-Expert Model

[openreview] [pdf]

Abstract The Mixture-of-Experts (MoE) paradigm has emerged as a powerful approach for scaling transformers with improved resource utilization. However, efficiently fine-tuning MoE models remains largely underexplored. Inspired by recent works on Parameter-Efficient Fine-Tuning (PEFT), we present a unified framework for integrating PEFT modules directly into the MoE mechanism. Aligning with the core principles and architecture of MoE, our framework encompasses a set of design dimensions including various functional and composition strategies. By combining design choices within our framework, we introduce Parameter-Efficient Routed Fine-Tuning (PERFT) as a flexible and scalable family of PEFT strategies tailored for MoE models. Extensive experiments on adapting OLMoE-1B-7B and Mixtral-8×7B for commonsense and arithmetic reasoning tasks demonstrate the effectiveness, scalability, and intriguing dynamics of PERFT. Additionally, we provide empirical findings for each specific design choice to facilitate better application of MoE and PEFT.

4166Let Me Grok for You: Accelerating Grokking via Embedding Transfer from a Weaker Model

[openreview] [pdf]

Abstract ‘‘Grokking’’ is a phenomenon where a neural network first memorizes training data and generalizes poorly, but then suddenly transitions to near-perfect generalization after prolonged training. While intriguing, this delayed generalization phenomenon compromises predictability and efficiency. Ideally, models should generalize directly without delay. To this end, this paper proposes GrokTransfer, a simple and principled method for accelerating grokking in training neural networks, based on the key observation that data embedding plays a crucial role in determining whether generalization is delayed. GrokTransfer first trains a smaller, weaker model to reach a nontrivial (but far from optimal) test performance. Then, the learned input embedding from this weaker model is extracted and used to initialize the embedding in the target, stronger model. We rigorously prove that, on a synthetic XOR task where delayed generalization always occurs in normal training, GrokTransfer enables the target model to generalize directly without delay. Moreover, we demonstrate that, across empirical studies of different tasks, GrokTransfer effectively reshapes the training dynamics and eliminates delayed generalization, for both fully-connected neural networks and Transformers.

4167ZETA: LeveragingZ-order Curves for Efficient Top-kAttention

[openreview] [pdf]

Abstract Over recent years, the Transformer has become a fundamental building block for sequence modeling architectures. Yet at its core is the use of self-attention, whose memory and computational cost grow quadratically with the sequence length NN, rendering it prohibitively expensive for long sequences. A promising approach is top-kk attention, which selects only the kk most relevant tokens and achieves performance comparable to vanilla self-attention while significantly reducing space and computational demands. However, causal masks require the current query token to only attend to past tokens, preventing existing top-kk attention methods from efficiently searching for the most relevant tokens in parallel, thereby limiting training efficiency. In this work, we propose ZETA, leveraging Z-Order Curves for Efficient Top-k Attention, to enable parallel querying of past tokens for entire sequences. We first theoretically show that the choice of key and query dimensions involves a trade-off between the curse of dimensionality and the preservation of relative distances after projection. In light of this insight, we propose reducing the dimensionality of keys and queries in contrast to values and further leveraging Z-order curves to map low-dimensional keys and queries into one-dimensional space, which permits parallel sorting, thereby largely improving the efficiency for top-kk token selection. Experimental results demonstrate that ZETA~matches the performance of standard attention on synthetic tasks Associative Recall and outperforms attention and its variants on Long-Range Arena and WikiText-103 language modeling.

4168DRIMA: Differential Reward Interaction for Cooperative Multi-Agent Reinforcement Learning

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL) owning to its potent capabilities in complex systems has gained remarkable research attention nowadays, in which collaborative decision-making and control for multi-agent systems is one of the key research focuses. The prevalent learning framework is centralized training with decentralized execution (CTDE), in which the decentralized execution realizes strategy flexibility, and the use of centralized training ensures stationarity and goal consistency while becoming incapable when facing scalability and complexity situations. To address this issue, we follow the concept of distributed training with decentralized execution (DTDE). Decentralization is naturally accompanied by the game during the learning process, which has not been entirely studied in related work, resulting in the constrained strategy combination of MARL. In this paper, we devise a novel approach of differential reward interaction (DRI) with conflict-triggered for the distributed evaluation that enables overall goal consistency through highly efficient local information exchange. With this collaborative learning method, the DRI-based MARL can eliminate the notorious issue of converging to saddle equilibriums of stochastic games. Meanwhile, it possesses provable convergence and is well compatible for general value-based and policy-based algorithms. Experiments in several benchmark scenarios demonstrate that DRIMA realizes collaborative strategy learning with enhanced global goal-achieving.

4169Cost-Sensitive Multi-Fidelity Bayesian Optimization

[openreview] [pdf]

Abstract In this paper, we address the problem of cost-sensitive multi-fidelity Bayesian Optimization (BO) for efficient hyperparameter optimization (HPO). Specifically, we assume a scenario where users want to early-stop the BO when performance increase is not satisfactory with respect to the required computational cost. Motivated by this scenario, we introduce \emph{utility function}, which is predefined by each user and describes the trade-off between the required BO steps and the cumulative best performance during the BO. This utility function, combined with our novel acquisition function and the stopping criteria, allows us to dynamically choose for each BO step the best configuration that we expect to achieve the maximum utility in future, and also automatically stop the BO around the maximum utility. Further, we improve the sample efficiency of existing learning curve (LC) extrapolation methods (e.g., Prior Fitted Networks) with transfer learning, while successfully capturing the correlations between different configurations to develop a sensible surrogate function for multi-fidelity BO. We validate our algorithm on various LC datasets and found it outperform all the previous multi-fidelity BO baselines, achieving significantly better trade-off between cost and performance of multi-fidelity BO.

4170Knockout: A simple way to handle missing inputs

[openreview] [pdf]

Abstract Deep learning models can extract predictive and actionable information from complex inputs. The richer the inputs, the better these models usually perform. However, models that leverage rich inputs (e.g., multi-modality) can be difficult to deploy widely, because some inputs may be missing at inference. Current popular solutions to this problem include marginalization, imputation, and training multiple models. Marginalization can obtain calibrated predictions but it is computationally costly and therefore only feasible for low dimensional inputs. Imputation may result in inaccurate predictions because it employs point estimates for missing variables and does not work well for high dimensional inputs (e.g., images). Training multiple models whereby each model takes different subsets of inputs can work well but requires knowing missing input patterns in advance. Furthermore, training and retaining multiple models can be costly. We propose an efficient way to learn both the conditional distribution using full inputs and the marginal distributions. Our method, Knockout, randomly replaces input features with appropriate placeholder values during training. We provide a theoretical justification of Knockout and show that it can be viewed as an implicit marginalization strategy. We evaluate Knockout in a wide range of simulations and real-world datasets and show that it can offer strong empirical performance.

4171CONSTRAINT-AWARE ZERO-SHOT VISION-LANGUAGE NAVIGATION IN CONTINUOUS ENVIRONMENTS

[openreview] [pdf]

Abstract We present Constraint-Aware Navigator (CA-Nav), a zero-shot approach for Vision-Language Navigation in Continuous Environments (VLN-CE). CA-Nav reframes the zero-shot VLN-CE task as a sequential constraint-aware sub-instruction completion process, continuously translating sub-instructions into navigation plans via a cross-modal value map. Central to our approach are two modules namely Constraint-aware Sub-instruction Manager (CSM) and Constraint-aware Value Mapper (CVM). CSM defines the completion criteria of decomposed sub-instructions as constraints and tracks navigation progress by switching sub-instructions in a constraint-aware manner. Based on the constraints identified by CSM, CVM builds a value map on-the-fly and refines it using superpixel clustering to enhance navigation stability. CA-Nav achieves the state-of-the-art performance on two VLN-CE benchmarks, surpassing the compared best method by 12% on R2R-CE and 13% on RxR-CE in terms of Success Rate on the validation unseen split. Furthermore, CA-Nav demonstrates its effectiveness in real-world robot deployments across diverse indoor scenes and instructions.

4172Adversarial Contrastive Decoding: Aligning Large Language Models via Exploiting Their Safety and Harm

[openreview] [pdf]

Abstract With the widespread application of Large Language Models (LLMs), it has become a significant concern to ensure their safety and prevent harmful responses. While current safe-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and heavy computational overhead during model training. Another way to align language models is to modify the logit of tokens in model outputs without heavy training. Recent studies have shown that contrastive decoding can enhance the performance of language models by reducing the likelihood of confused tokens. However, these methods require the manual selection of contrastive models or instruction templates, limiting the degree of contrast. To this end, we propose Adversarial Contrastive Decoding (ACD), an optimization-based framework to generate two opposite soft system prompts, the Safeguarding Prompt (SP) and the Adversarial Prompt (AP), for prompt-based contrastive decoding. The SP aims to promote safer outputs while the AP aims to exploit the harmful parts of the model, providing a strong contrast to align the model with safety. ACD only needs to apply a lightweight prompt tuning on a rather small anchor dataset without training the target model. Experiments conducted on extensive models and benchmarks demonstrate that the proposed method achieves much better safety performance than previous model training-free decoding methods without sacrificing its original generation ability.

4173One Model for All: Multi-Objective Controllable Language Models

[openreview] [pdf]

Abstract Aligning large language models (LLMs) with human preference is critical to enhancing LLMs’ safety, helpfulness, helpfulness, humor, faithfulness, etc. The current reinforcement learning from human feedback (RLHF) mainly focuses on a fixed reward learned from average human ratings, which may weaken the adaptivity and controllability of varying preferences. However, creating personalized LLMs requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences on multi-objective trade-offs, such as prioritizing humor and empathy in one context, while seeking efficiency and precision in another. Can we train one LLM to produce personalized outputs for different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains an LLM as a meta-policy to directly generate responses in the preference-defined regions of Pareto front. Our approach integrates multi-objective optimization (MOO) principles into Proximal Policy Optimization (PPO) to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, which enables us to finetune an LLM of 7B parameters on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) Controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) Quality and diversity of LLM outputs, measured by the hyper-volume of multiple solutions achieved; and (iii) Generalization to unseen preferences. These results highlight MOC’s potential for real-world applications requiring scalable and customizable LLMs.

4174Early Fusion Helps Vision Language Action Models Generalize Better

[openreview] [pdf]

Abstract Recent advances in Vision-Language-Action (VLA) models can enable robots to perform a wide range of tasks based on language or goal-based instructions. These VLA models typically encode text and images into disjoint tokens, generating actions that align with the given instructions. This requires the VLA models to simultaneously perform vision-language understanding and precise closed-loop control, resulting in significant challenges for them to generalize to new environments. However, contrastive pre-trained VLMs, such as CLIP, already possess vision-language alignment capabilities, which are underutilized by current VLA models. In this paper, we propose Early Fusion VLA (EF-VLA), a novel VLA architecture that exploits CLIP’s vision-language understanding by performing early fusion, extracting fine-grained vision-language tokens relevant to the task instructions before passing them to the transformer policy. EF-VLA keeps the VLM frozen, allowing it to effectively perform unseen tasks without requiring fine-tuning, which often reduces generalization capabilities. Simulation and real-world experiments suggest that EF-VLA outperforms state-of-the-art VLA models on diverse tasks, with significant generalization capabilities in unseen environments.

4175Model merging with SVD to tie the Knots

[openreview] [pdf]

Abstract Recent model merging methods demonstrate that the parameters of fully-finetuned models specializing in distinct tasks can be combined into one model capable of solving all tasks without retraining. Yet, this success does not transfer well when merging LoRA finetuned models. We study this phenomenon and observe that the weights of LoRA finetuned models showcase a lower degree of alignment compared to their fully-finetuned counterparts. We hypothesize that improving this alignment is key to obtaining better LoRA model merges, and propose KnOTS to address this problem. KnOTS uses the SVD to jointly transform the weights of different LoRA models into an aligned space, where existing merging methods can be applied. In addition, we introduce a new benchmark that explicitly evaluates whether merged models are general models. Notably, KnOTS consistently improves LoRA merging by up to 4.3% across several vision and language benchmarks, including our new setting.

4176SWE-bench Multimodal: Do Autonomous Programming Systems Generalize to New Software Domains?

[openreview] [pdf]

Abstract Autonomous systems for software engineering are now capable of fixing bugs and developing features. These systems are commonly evaluated on SWE-bench (Jimenez et al., 2024a), which assesses their ability to solve software issues from GitHub repositories. However, SWE-bench uses only Python repositories, with problem statements presented predominantly as text and lacking visual elements such as images. This limited coverage motivates our inquiry into how existing systems might perform on unrepresented software engineering domains (e.g., front-end, game development, DevOps), which use different programming languages and paradigms. Therefore, we propose SWE-bench Multimodal (SWE-bench M), to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software. SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping. Each SWE-bench M task instance contains at least one image in its problem statement or unit tests. Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization. Lastly, we show that SWE-agent’s flexible language-agnostic features enable it to substantially outperform alternatives on SWE-bench M, resolving 12% of task instances compared to 6% for the next best system.

4177Hiding Images in Diffusion Models by Editing Learned Score Functions

[openreview] [pdf]

Abstract Hiding data in deep neural networks (DNNs) has achieved remarkable successes, including both discriminative and generative models. Yet, the potential for hiding images in diffusion models remains underdeveloped. Existing approaches fall short in extracting fidelity, secrecy, and efficiency. In particular, the intensive computational demands of the hiding process, coupled with the slow extraction due to multiple denoising stages, make these methods impractical for resource-limited environments. To address these challenges, we propose hiding images at a specific denoising stage in diffusion models by modifying the learned score functions. We also introduce a parameter-efficient fine-tuning (PEFT) approach that combines parameter selection with a variant of low-rank adaptation (LoRA) to boost secrecy and hiding efficiency. Comprehensive experiments demonstrate the effectiveness of our proposed method.

4178X-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos

[openreview] [pdf]

Abstract Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence. In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and textual instructions, the goal is to generate future frames of the ego-centric video. Inspired by the notion that hand-object interactions (HOI) in ego-centric videos represent the primary intentions and actions of the current actor, we present X-Gen that explicitly models the hand-object dynamics for cross-view video prediction. X-Gen consists of two stages. First, we design a cross-view HOI mask prediction model that anticipates the HOI masks in future ego-frames by modeling the spatio-temporal ego-exo correspondence. Next, we employ a video diffusion model to predict future ego-frames using the first ego-frame and textual instructions, while incorporating the HOI masks as structural guidance to enhance prediction quality. To facilitate training, we develop a fully automated pipeline to generate pseudo HOI masks for both ego- and exo-videos by exploiting vision foundation models. Extensive experiments demonstrate that our proposed X-Gen achieves better prediction performance compared to previous video prediction models on the public Ego-Exo4D and H2O benchmark datasets, with the HOI masks significantly improving the generation of hands and interactive objects in the ego-centric videos.

4179KAN versus MLP on Irregular or Noisy Functions

[openreview] [pdf]

Abstract In this paper, we compare the performance of Kolmogorov-Arnold Networks (KAN) and Multi-Layer Perceptron (MLP) networks on irregular or noisy functions. We control the number of parameters and the size of the training samples to ensure a fair comparison. For clarity, we categorize the functions into six types: regular functions, continuous functions with local non-differentiable points, functions with jump discontinuities, functions with singularities, functions with coherent oscillations, and noisy functions. Our experimental results indicate that KAN does not always perform best. For some types of functions, MLP outperforms or performs comparably to KAN. Furthermore, increasing the size of training samples can improve performance to some extent. When noise is added to functions, the irregular features are often obscured by the noise, making it challenging for both MLP and KAN to extract these features effectively. We hope these experiments provide valuable insights for future neural network research and encourage further investigations to overcome these challenges.

4180Re-evaluating Open-ended Evaluation of Large Language Models

[openreview] [pdf]

Abstract Evaluation has traditionally focused on ranking candidates for a specific skill. Modern generalist models, such as Large Language Models (LLMs), decidedly outpace this paradigm. Open-ended evaluation systems, where candidate models are compared on user-submitted prompts, have emerged as a popular solution. Despite their many advantages, we show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental, due to their sensitivity to redundancies. To address this issue, we propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy. We show that our method leads to intuitive ratings and provide insights into the competitive landscape of LLM development.

4181Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory

[openreview] [pdf]

Abstract Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers frequently become performance bottlenecks due to repeated failures. In this paper, we argue that the primary cause of failure in many low-level controllers is the absence of an episodic memory system. To address this, we introduce Mr.Steve (Memory Recall STEVE-1), a novel low-level controller equipped with Place Event Memory (PEM), a form of episodic memory that captures what, where, and when information from episodes. This directly addresses the main limitation of the popular low-level controller, STEVE-1. Unlike previous models that rely on short-term memory, PEM organizes spatial and event-based data, enabling efficient recall and navigation in long-horizon tasks. Additionally, we propose an Exploration Strategy and a Memory-Augmented Task Solving Framework, allowing agents to alternate between exploration and task-solving based on recalled events. Our approach significantly improves task-solving and exploration efficiency compared to existing methods, and we are releasing our code to support further research.

4182Theoretical Aspects of Bias and Diversity in Minimum Bayes Risk Decoding

[openreview] [pdf]

Abstract Text generation commonly relies on greedy and beam decoding that limit the search space and degrade output quality. Minimum Bayes Risk (MBR) decoding can mitigate this problem by utilizing automatic evaluation metrics and model-generated pseudo-references. Previous studies have conducted empirical analyses to reveal the improvement by MBR decoding, and reported various observations. However, despite these observations, the theoretical relationship between them remains uncertain. To address this, we present a novel theoretical interpretation of MBR decoding from the perspective of bias-diversity decomposition. We decompose errors in the estimated quality of generated hypotheses in MBR decoding into two key factors:bias, which reflects the closeness between utility functions and human evaluations, anddiversity, which represents the variation in the estimated quality of utility functions. Our theoretical analysis reveals the difficulty in simultaneously improving both bias and diversity, and highlights the effectiveness of increasing diversity to enhance MBR decoding performance. This analysis verifies the alignment between our theoretical insights and the empirical results reported in previous work. Furthermore, to support our theoretical findings, we propose a new metric, pseudo-bias, which approximates the bias term using gold references. We also introduce a new MBR approach, Metric-augmented MBR (MAMBR), which increases diversity by adjusting the behavior of utility functions without altering the pseudo-references. Experimental results across multiple NLP tasks show that the decomposed terms in the bias-diversity decomposition correlate well with performance, and that MAMBR improves text generation quality by modifying utility function behavior. Our code will be available athttps://github.com/[Anonymized].

4183Constraining embedding learning with Self-Matrix Factorization

[openreview] [pdf]

Abstract We focus on the problem of learning object representations from solely association data, that is observed associations between objects of two different types, e.g. movies rated by users. We aim to obtain embeddings encoding object attributes that were not part of the learning process, e.g. movie genres. It has been shown that meaningful representations can be obtained by constraining the learning with manually curated object similarities. Here, we assume that objects lie in multiple linear manifolds embedded in high-dimensional space, and we argue that similarities between objects that correspond to sharing manifolds can be learned from the observed associations. We propose Self-Matrix Factorization (SMF), a method that learns object representations by constraining them with object similarities that are learned together with the representations. In our extensive evaluation across three real-world datasets, we compared SMF with SLIM, HCCF and NMF obtaining better performance at predicting missing associations as measured by RMSE and precision at top-K. We also show that SMF outperforms the competitors at encoding object attributes as measured by the embedding distances between objects divided into attribute-driven groups.

4184Do Deep Neural Network Solutions Form a Star Domain?

[openreview] [pdf]

Abstract It has recently been conjectured that neural network solution sets reachable via stochastic gradient descent (SGD) are convex, considering permutation invariances. This means that a linear path can connect two independent solutions with low loss, given the weights of one of the models are appropriately permuted. However, current methods to test this theory often require very wide networks to succeed. In this work, we conjecture that more generally, the SGD solution set is a star domain that contains a star model that is linearly connected to all the other solutions via paths with low loss values, modulo permutations. We propose the Starlight algorithm that finds a star model of a given learning task. We validate our claim by showing that this star model is linearly connected with other independently found solutions. As an additional benefit of our study, we demonstrate better uncertainty estimates on Bayesian Model Averaging over the obtained star domain. Further, we demonstrate star models as potential substitutes for model ensembles.

4185The Blessing of Smooth Initialization for Video Diffusion Models

[openreview] [pdf]

Abstract Extending the success of text-to-image (T2I) synthesis to text-to-video (T2V) synthesis is a promising direction for visual generative AI. Popular training-free sampling algorithms currently generate high-fidelity images within the Stable Diffusion family. However, when applied to video diffusion models (VDMs), these techniques result in limited diversity and quality due to the low-quality data in a video datasets. We focus on inference to mitigate this issue, and then we propose a training-free paradigm that optimizes the initial Gaussian noise by introducing a targeted semantic prior bias into the sampling process from a smoothing perspective. The paradigm significantly improves both the fidelity and semantic faithfulness of the synthesized videos. Guided by theoretical analysis using random smoothing and differential equations, our resulting method SmoothInit can be understood as approximately incorporating third-order derivatives into gradient descent, which contributes to be better convergence in learning semantic information. A more efficient version, Fast-SmoothInit, is proposed to achieve better experimental results by leveraging a momentum mechanism. Both SmoothInit and Fast-SmoothInit demonstrate promising empirical results across various benchmarks, including UCF-101/MSR-VTT-related FVD, Chronomagic-bench, and T2V-Compbench, setting a new standard for noise initialization in VDMs.

4186CMamba: Channel Correlation Enhanced State Space Models for Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Recent advancements in multivariate time series forecasting have been propelled by Linear-based, Transformer-based, and Convolution-based models, with Transformer-based architectures gaining prominence for their efficacy in temporal and cross-channel mixing. More recently, Mamba, a state space model, has emerged with robust sequence and feature mixing capabilities. However, the suitability of the vanilla Mamba design for time series forecasting remains an open question, particularly due to its inadequate handling of cross-channel dependencies. Capturing cross-channel dependencies is critical in enhancing the performance of multivariate time series prediction. Recent findings show that self-attention excels in capturing cross-channel dependencies, whereas other simpler mechanisms, such as MLP, may degrade model performance. This is counterintuitive, as MLP, being a learnable architecture, should theoretically capture both correlations and irrelevances, potentially leading to neutral or improved performance. Diving into the self-attention mechanism, we attribute the observed degradation in MLP performance to its lack of data dependence and global receptive field, which result in MLP’s lack of generalization ability. Considering the powerful sequence modeling capabilities of Mamba and the high efficiency of MLP, the combination of the two is an effective strategy for solving multivariate time series prediction. Based on the above insights, we introduce a refined Mamba variant tailored for time series forecasting. Our proposed model, \textbf{CMamba}, incorporates a modified Mamba (M-Mamba) module for temporal dependencies modeling, a global data-dependent MLP (GDD-MLP) to effectively capture cross-channel dependencies, and a Channel Mixup mechanism to mitigate overfitting. Comprehensive experiments conducted on seven real-world datasets demonstrate the efficacy of our model in improving forecasting performance.

4187NODE-SAT: Temporal Graph Learning with Neural ODE-Guided Self-Attention

[openreview] [pdf]

Abstract We propose NODE-SAT, a novel temporal graph learning model that integrates Neural Ordinary Differential Equations (NODEs) with self-attention mechanisms. NODE-SAT’s design requires only historical 1-hop neighbors as input and comprises three key components: a temporal link processing module utilizing NODE-guided self-attention layers to capture temporal link information, a node representation module summarizing neighbor information, and a prediction layer. Extensive experiments across thirteen temporal link prediction datasets demonstrate that NODE-SAT achieves state-of-the-art performance on most datasets with significantly faster convergence. The model demonstrates high accuracy, rapid convergence, robustness across varying dataset complexities, and strong generalization capabilities in both transductive and inductive settings in temporal link prediction. These findings highlight NODE-SAT’s effectiveness in capturing node correlations and temporal link dynamics.

4188Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks

[openreview] [pdf]

Abstract Optimization methods are widely employed in deep learning to address and mitigate undesired model responses. While gradient-based techniques have proven effective for image models, their application to language models is hindered by the discrete nature of the input space. This study introduces a novel optimization approach, termed thefunctional homotopymethod, which leverages the functional duality between model training and input generation. By constructing a series of easy-to-hard optimization problems, we iteratively solve these using principles derived from established homotopy methods. We apply this approach to jailbreak attack synthesis for large language models (LLMs), achieving a 20%-30% improvement in success rate over existing methods in circumventing established safe open-source models such as Llama-2 and Llama-3.

4189Synthetic Datasets for Machine Learning on Spatio-Temporal Graphs using PDEs

[openreview] [pdf]

Abstract In this work, we describe the creation and use of synthetic datasets based on various partial differential equations to support spatio-temporal graph modeling in machine learning for different applications. More precisely, we showcase three equations to model different types of disasters and hazards in the fields of epidemiology, atmospheric particles, and tsunami waves. Further, we show how such created datasets can be used by benchmarking several machine learning models on the epidemiological dataset and, additionally, by showing how pre-training on such synthetic datasets can improve model performance on real-world epidemiological data. The presented methods enable others to create datasets and benchmarks customized to individual requirements. The source code for our methodology and the three created datasets can be found onhttps://github.com/github-usr-ano/Temporal_Graph_Data_PDEs.

4190IT3: Idempotent Test-Time Training

[openreview] [pdf]

Abstract This paper introduces Idempotent Test-Time Training (IT3^3), a novel approach to addressing the challenge of distribution shift. While supervised-learning methods assume matching train and test distributions, this is rarely the case for machine learning systems deployed in the real world. Test-Time Training (TTT) approaches address this by adapting models during inference, but they are limited by a domain specific auxiliary task. IT3^3 is based on the universal property of idempotence. An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, namely f(f(x))=f(x)f(f(x))=f(x). An idempotent operator is one that can be applied sequentially without changing the result beyond the initial application, that is f(f(X)=f(X)f(f(X)=f(X). At training, the model receives an input XX along with another signal that can either be the ground truth label yy or a neutral “don’t know” signal 0\mathbf{0}. At test time, the additional signal can only be 0\mathbf{0}. When sequentially applying the model, first predicting y0=f(X,0)y_0 = f(X, \mathbf{0}) and then y1=f(X,y0)y_1 = f(X, y_0), the distance between y1y_1 and y2y_2 measures certainty and indicates out-of-distribution input xx if high. We use this distance, that can be expressed as f(X,f(X,0))f(x,0)||f(X, f(X, \mathbf{0})) - f(x, \mathbf{0})|| as our TTT loss during inference. By carefully optimizing this objective, we effectively train f(X,)f(X,\cdot) to be idempotent, projecting the internal representation of the input onto the training distribution. We demonstrate the versatility of our approach across various tasks, including corrupted image classification, aerodynamic predictions, tabular data with missing information, and large-scale aerial photo segmentation. Moreover, these tasks span different architectures such as MLPs, CNNs, and GNNs.

4191Measuring and Controlling Solution Degeneracy across Task-Trained Recurrent Neural Networks

[openreview] [pdf]

Abstract Task-trained recurrent neural networks (RNNs) are versatile models of dynamical processes widely used in machine learning and neuroscience. While RNNs are easily trained to perform a wide range of tasks, the nature and extent of the degeneracy in the resultant solutions (i.e., the variability across trained RNNs) remain poorly understood. Here, we provide a unified framework for analyzing degeneracy across three levels: behavior, neural dynamics, and weight space. We analyzed RNNs trained on diverse tasks across machine learning and neuroscience domains, including N-bit flip-flop, sine wave generation, delayed discrimination, and path integration. Our key finding is that the variability across RNN solutions, quantified on the basis of neural dynamics and trained weights, depends primarily on network capacity and task characteristics such as complexity. We introduce information-theoretic measures to quantify task complexity and demonstrate that increasing task complexity consistently reduces degeneracy in neural dynamics and generalization behavior while increasing degeneracy in weight space. These relationships hold across diverse tasks and can be used to control the degeneracy of the solution space of task-trained RNNs. Furthermore, we provide several strategies to control solution degeneracy, enabling task-trained RNNs to learn more consistent or diverse solutions as needed. We envision that these insights will lead to more reliable machine learning models and could inspire strategies to better understand and control degeneracy observed in neuroscience experiments.

4192ConvCodeWorld: Benchmarking Conversational Code Generation in Reproducible Feedback Environments

[openreview] [pdf]

Abstract Large language models (LLMs) have proven invaluable for code generation, particularly in interactive settings. However, existing code generation benchmarks fail to capture the diverse feedback encountered in multi-turn interactions, limiting our ability to evaluate LLMs in these contexts. To address this gap, we present a set of novel benchmarks that explicitly model the quality of feedback provided to code generation LLMs. Our contributions are threefold:First, we introduce CONVCODEWORLD, a novel and reproducible environment for benchmarking interactive code generation. CONVCODEWORLD simulates 9 distinct interactive code generation scenarios while systematically combining three types of feedback: (a) compilation feedback; (b) execution feedback with varying test coverage; (c) verbal feedback generated by GPT-4o with different levels of expertise.Second, we introduce CONVCODEBENCH, a fast, static version of benchmark that uses pre-generated feedback logs, eliminating the need for costly dynamic verbal feedback generation while maintaining strong Spearman’s rank correlations (0.82 to 0.99) with CONVCODEWORLD.Third, extensive evaluations of both closed-source and open-source LLMs on CONVCODEWORLD reveal key insights: (a) LLM performance varies significantly based on the feedback provided; (b) Weaker LLMs, with sufficient feedback, can outperform single-turn results of state-of-the-art LLMs without feedback; (c) Training on a specific feedback combination can limit an LLM’s ability to utilize unseen combinations; (d) LLMs solve problems in fewer turns (high MRR) may not solve as many problems overall (high Recall), and vice versa. All implementations and benchmarks will be made publicly available athttps://huggingface.co/spaces/ConvCodeWorld/ConvCodeWorld

4193I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow

[openreview] [pdf]

Abstract Rectified Flow Transformers (RFTs) offer superior training and inference efficiency, making them likely the most viable direction for scaling up diffusion models. However, progress in generation resolution has been relatively slow due to data quality and training costs. Tuning-free resolution extrapolation presents an alternative, but current methods often reduce generative stability, limiting practical application. In this paper, we review existing resolution extrapolation methods and introduce the I-Max framework to maximize the resolution potential of Text-to-Image RFTs. I-Max features: (i) a novel Projected Flow strategy for stable extrapolation and (ii) an advanced inference toolkit for generalizing model knowledge to higher resolutions. Experiments with Lumina-Next-2K and Flux.1-dev demonstrate I-Max’s ability to enhance stability in resolution extrapolation and show that it can bring image detail emergence and artifact correction, confirming the practical value of tuning-free resolution extrapolation.

4194ROPO: Robust Preference Optimization for Large Language Models

[openreview] [pdf]

Abstract Preference alignment is pivotal for empowering large language models (LLMs) to generate helpful and harmless responses. However, the performance of preference alignment is highly sensitive to the prevalent noise in the preference data. Recent efforts for this problem either marginally alleviate the impact of noise without the ability to actually reduce its presence, or rely on costly teacher LLMs prone to reward misgeneralization. To address these challenges, we propose theRObustPreferenceOptimization (ROPO) framework, a novel iterative alignment approach that integratesnoise-toleranceandfiltering of noisy sampleswithout the aid of external models. Specifically, ROPO first formulates the training process with adaptive noise reduction as an optimization problem, which can be efficiently solved in an iterative paradigm. Then, to enhance this iterative solving process with noise-tolerance and noise-identification capabilities, we derive a robust loss that suppresses the gradients from samples with high uncertainty. We demonstrate both empirically and theoretically that the derived loss is key to the noise-tolerance and effective filtering of noisy samples. Furthermore, inspired by our derived loss, we propose a robustness-guided rejection sampling technique to compensate for the potential important information in discarded queries. Experiments on three widely-used datasets of dialogue and post-summarization demonstrate that ROPO significantly outperforms existing preference alignment methods in the practical noise setting and under artificial random symmetric noise, with its advantage increasing as the noise rate increases.

4195BenTo: Benchmark Reduction with In-Context Transferability

[openreview] [pdf]

Abstract Evaluating large language models (LLMs) is costly: it requires the generation and examination of LLM outputs on a large-scale benchmark of various tasks. This paper investigates how to efficiently reduce the tasks used to benchmark LLMs without affecting the evaluation quality. Our study reveals that task transferability and relevance provide critical information to identify the most representative subset of tasks via optimizing a facility location function. We propose a practically efficient metric for estimating the transferability between two tasks via in-context learning (ICL). By analyzing the pairwise transferability, we can reduce tasks in a modern LLM benchmark (e.g., MMLU or FLAN) to 5% while inducing only a <4<4% difference to the evaluation on the original benchmark. Compared to prior works, our method is training-free, gradient-free, and highly efficient requiring ICL only.

4196Kinematics-Informed Reinforcement Learning for Trajectory Optimization in CNC Machining

[openreview] [pdf]

Abstract Toolpath smoothing and feedrate planning are key techniques in Computer Numerical Control (CNC) machining, and play a significant role in machining accuracy, efficiency, and tool life. Traditional methods typically decouple path smoothing from feedrate planning, without considering the kinematic constraints during the smoothing process. As a result, the subsequent feedrate planning process is subject to more stringent kinematic limitations, which hinders the achievement of optimal speed execution. However, the integration of these two processes presents a significant challenge due to severe complexity and nonlinearity of the problem. Here, we propose a novel Reinforcement Learning (RL) based method, termed KIRL, to address the integrated optimization problem. Experimental results demonstrate that KIRL can generate smoother trajectories and optimize machining time compared to traditional decoupled methods. To our best knowledge, KIRL is the first RL-based method for solving the integrated toolpath smoothing and feedrate planning optimization problem in CNC machining.

4197Predicting perturbation targets with causal differential networks

[openreview] [pdf]

Abstract Rationally identifying variables responsible for changes to a biological system can enable myriad applications in disease understanding and cell engineering. From a causality perspective, we are given two datasets generated by the same causal model, one observational (control) and one interventional (perturbed). The goal is to isolate the subset of measured variables (e.g. genes) that were the targets of the intervention, i.e. those whose conditional independencies have changed. Knowing the causal graph would limit the search space, allowing us to efficiently pinpoint these variables. However, current algorithms that infer causal graphs in the presence of unknown intervention targets scale poorly to the hundreds or thousands of variables in biological data, as they must jointly search the combinatorial spaces of graphs and consistent intervention targets. In this work, we propose a causality-inspired approach for predicting perturbation targets that decouples the two search steps. First, we use an amortized causal discovery model to separately infer causal graphs from the observational and interventional datasets. Then, we learn to map these paired graphs to the sets of variables that were intervened upon, in a supervised learning framework. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets, each with thousands of measured variables. We also demonstrate significant improvements over six causal discovery algorithms in predicting intervention targets across a variety of tractable, synthetic datasets.

4198Minimax Optimal Reinforcement Learning with Quasi-Optimism

[openreview] [pdf]

Abstract In our quest for a reinforcement learning (RL) algorithm that is both practical and provably optimal, we introduce EQO (Exploration via Quasi-Optimism). Unlike existing minimax optimal approaches, EQO avoids reliance on empirical variances and employs a simple bonus term proportional to the inverse of the state-action visit count. Central to EQO is the concept ofquasi-optimism, where estimated values need not be fully optimistic, allowing for a simpler yet effective exploration strategy. The algorithm achieves the sharpest known regret bound for tabular RL under the mildest assumptions, proving that fast convergence can be attained with a practical and computationally efficient approach. Empirical evaluations demonstrate that EQO consistently outperforms existing algorithms in both regret performance and computational efficiency, providing the best of both theoretical soundness and practical effectiveness.

4199Unifying Generative and Dense Retrieval for Sequential Recommendation

[openreview] [pdf]

Abstract Sequential dense retrieval models utilize advanced sequence learning techniques to compute item and user representations, which are then used to rank relevant items for a user through inner product computation between the user and all item representations. However, this approach requires storing a unique representation for each item, resulting in significant memory requirements as the number of items grow. In contrast, the recently proposed generative retrieval paradigm offers a promising alternative by directly predicting item indices using a generative model trained on semantic IDs that encapsulate items’ semantic information. Despite its potential for large-scale applications, a comprehensive comparison between generative retrieval and sequential dense retrieval under fair conditions is still lacking, leaving open questions regarding performance, storage, and computation trade-offs. To address this gap, we conduct a thorough comparison of both approaches under identical conditions and propose LIGER (LeveragIng dense retrieval forGEnerativeRetrieval), a hybrid model that combines the strengths of these two widely used paradigms. Our proposed model seamlessly integrates sequential dense into generative retrieval, effectively addressing performance disparities and improving cold-start item recommendation. This approach demonstrates significant improvements in both efficiency and effectiveness for recommendation systems.

4200SparsitySolver: Efficient Reinforcement Learning-based Pruning for LLMs

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved significant success in the field of Natural Language Processing (NLP). However, due to their large model size and high inference costs, the application of LLMs is restricted. Pruning is regarded as an effective method to reduce the size of LLMs. Mainstream pruning methods for LLMs typically apply a uniform ratio to prune all the layers or determine layerwise sparsity based on simple criteria. Such manually or semi-manually designed pruning strategies often lead to suboptimal results, which makes reinforcement learning a feasible solution. However, current reinforcement learning-based pruning methods usually have redundant environment designs or multiple agents, rendering them ill-suited to massive LLMs. Hence, we propose SparsitySolver, which first incorporates reinforcement learning into the pruning of LLMs, supporting various pruning granularity. SparsitySolver employs an improved reinforcement learning environment, allowing for a rapid pruning strategy search with a small-scale agent. Moreover, to lessen the performance decline caused by structured pruning, we propose a compensation method capable of restoring performance without introducing additional parameters to the model. We evaluate our approach on LLaMA-V1/V2, Mistral, and the OPT families across multiple pruning granularities, achieving performances surpassing the state-of-the-art methods.

4201Improving Consistency Models with Generator-Induced Flows

[openreview] [pdf]

Abstract Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network. In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field. The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit. To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from the currently trained model - as a proxy of the true flow. Our empirical findings demonstrate that this approach mitigates the previously identified discrepancy. Furthermore, we present theoretical and empirical evidence indicating that our generator-induced flow surpasses dedicated optimal transport-based consistency models in effectively reducing the noise-data transport cost. Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance.

4202Centrality Graph Shift Operators for Graph Neural Networks

[openreview] [pdf]

Abstract Graph Shift Operators (GSOs), such as the adjacency and graph Laplacian matrices, play a fundamental role in graph theory and graph representation learning. Traditional GSOs are typically constructed by normalizing the adjacency matrix by the degree matrix, a local centrality metric. In this work, we instead propose and study Centrality GSOs (CGSOs), which normalize adjacency matrices by global centrality metrics such as the PageRank, kk-core or count of fixed length paths. We study spectral properties of the CGSOs, allowing us to get an understanding of their action on graph signals. We confirm this understanding by defining and running the spectral clustering algorithm based on different CGSOs on several synthetic and real-world datasets. We furthermore outline how our CGSO can act as the message passing operator in any Graph Neural Network and in particular demonstrate strong performance of a variant of the Graph Convolutional Network and Graph Attention Network using our CGSOs on several real-world benchmark datasets.

4203Equivariant Neural Functional Networks for Transformers

[openreview] [pdf]

Abstract This paper systematically explores neural functional networks (NFN) for transformer architectures. NFN are specialized neural networks that treat the weights, gradients, or sparsity patterns of a deep neural network (DNN) as input data and have proven valuable for tasks such as learnable optimizers, implicit data representations, and weight editing. While NFN have been extensively developed for MLP and CNN, no prior work has addressed their design for transformers, despite the importance of transformers in modern deep learning. This paper aims to address this gap by providing a systematic study of NFN for transformers. We first determine the maximal symmetric group of the weights in a multi-head attention module as well as a necessary and sufficient condition under which two sets of hyperparameters of the multi-head attention module define the same function. We then define the weight space of transformer architectures and its associated group action, which leads to the design principles for NFN in transformers. Based on these, we introduce Transformer-NFN, an NFN that is equivariant under this group action. Additionally, we release a dataset of more than 125,000 Transformers model checkpoints trained on two datasets with two different tasks, providing a benchmark for evaluating Transformer-NFN and encouraging further research on transformer training and performance.

4204Balancing Act: Diversity and Consistency in Large Language Model Ensembles

[openreview] [pdf]

Abstract Ensembling strategies for Large Language Models (LLMs) have demonstrated significant potential in improving performance across various tasks by combining the strengths of individual models. However, identifying the most effective ensembling method remains an open challenge, as neither maximizing output consistency through self-consistency decoding nor enhancing model diversity via frameworks like “Mixture of Agents” has proven universally optimal. Motivated by this, we propose a unified framework to examine the trade-offs between task performance, model diversity, and output consistency in ensembles. More specifically, we introduce a consistency score that defines a gating mechanism for mixtures of agents and an algorithm for mixture refinement to investigate these trade-offs at the semantic and model levels, respectively. We incorporate our insights into a novel inference-time LLM ensembling strategy called the Dynamic Mixture of Agents (DMoA) and demonstrate that it achieves a new state-of-the-art result in the challenging Big Bench Hard mixed evaluations benchmark. Our analysis reveals that cross-validation bias can enhance performance, contingent on the expertise of the constituent models. We further demonstrate that distinct reasoning tasks—such as arithmetic reasoning, commonsense reasoning, and instruction following—require different model capabilities, leading to inherent task-dependent trade-offs that DMoA balances effectively.

4205Enhancing Deep Symbolic Regression via Reasoning Equivalent Expressions

[openreview] [pdf]

Abstract Symbolic regression seeks to uncover physical knowledge from experimental data. Recently a line of work on deep reinforcement learning (DRL) formulated the search for optimal expressions as a sequential decision-making problem. However, training these models is challenging due to the inherent instability of the policy gradient estimator. We observe that many numerically equivalent yet symbolically distinct expressions exist, such as log(x12x23)\log(x_1^2 x_2^3) and 2log(x1)+3log(x2)2\log(x_1) + 3\log(x_2). Building on this, we propose Deep Symbolic Regression via Reasoning Equivalent eXpressions (DSR-Rex). The high-level idea is to enhance policy gradient estimation by leveraging both expressions sampled from the DRL and their numerically identical counterparts generated via an expression reasoning module. Our DSR-Rex (1) embeds mathematical laws and equalities into the deep model, (2) reduces gradient estimator variance with theoretical justification and (3) encourages RL exploration of different symbolic forms in the search space of all expressions. In our experiments, DSR-Rex is evaluated on several challenging scientific datasets, demonstrating superior performance in discovering equations with lower Normalized MSE scores. Additionally, DSR-Rex computes gradients with smaller empirical standard deviation, compared to the previous DSR method.

4206Round and Round We Go! What makes Rotary Positional Encodings useful?

[openreview] [pdf]

Abstract Positional Encodings (PEs) are a critical component of Transformer-based Large Language Models (LLMs), providing the attention mechanism with important sequence-position information. One of the most popular types of encoding used today in LLMs are Rotary Positional Encodings (RoPE), that rotate the queries and keys based on their relative distance. A common belief is that RoPE is useful because it helps to decay token dependency as relative distance increases. In this work, we argue that this is unlikely to be the core reason. We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level. We find that Gemma learns to use RoPE to construct robust `positional’ attention patterns by exploiting the highest frequencies. We also find that, in general, Gemma greatly prefers to use the lowest frequencies of RoPE, which we suspect are used to carry semantic information. We mathematically prove interesting behaviours of RoPE and conduct experiments to verify our findings, proposing a modification of RoPE that fixes some highlighted issues and improves performance. We believe that this work represents an interesting step in better understanding PEs in LLMs, which we believe holds crucial value for scaling LLMs to large sizes and context lengths.

4207Towards Learning to Reason at Pre-Training Scale

[openreview] [pdf]

Abstract Prompting a Large Language Model (LLM) to output Chain-of-Thought (CoT) reasoning improves performance on complex problem-solving tasks. Further, several popular approaches exist to ``self-improve" the abilities of LLMs to use CoT on tasks where supervised (question, answer) datasets are available. However, an emerging line of work explores whether self-improvement is possible without supervised datasets, instead utilizing the same large, general-purpose text corpora as used during pre-training. These pre-training datasets encompass large parts of human knowledge and dwarf all finetuning datasets in size. Self-improving CoT abilities on such general datasets could enhance reasoning for any general-purpose text generation task, and doing so at pre-training scale may unlock unprecedented reasoning abilities. In this paper, we outline the path towards self-improving CoT reasoning at pre-training scale and address fundamental challenges in this direction. We start by framing this as a reinforcement learning problem: given the first nn tokens from a large pre-training corpus, the model generates a CoT and receives a reward based on how well the CoT helps predict the following mm tokens. We then investigate a fundamental question: What constitutes a suitable reward function for learning to reason during general language modelling? We outline the desirable qualities of such a reward function and empirically demonstrate how different functions affect what reasoning is learnt and where reasoning is rewarded. Using these insights, we introduce a novel reward function called Reasoning Advantage (RA) that facilitates self-improving CoT reasoning on free-form question-answering (QA) data, where answers are unstructured and difficult to verify. Equipped with a suitable reward function, we explore the optimization of it on general-purpose text using offline RL. Our analysis indicates that future work should investigate more powerful optimisation algorithms, potentially moving towards more online algorithms that better explore the space of CoT generations.

4208Privacy-Preserving Federated Learning via Homomorphic Adversarial Networks

[openreview] [pdf]

Abstract Privacy-preserving federated learning (PPFL) aims to train a global model for multiple clients while maintaining their data privacy. However, current PPFL protocols exhibit one or more of the following insufficiencies: considerable degradation in accuracy, the requirement for sharing keys, and cooperation during the key generation or decryption processes. As a mitigation, we develop the first protocol that utilizes neural networks to preserve privacy in federated learning, as well as incorporating an Aggregatable Hybrid Encryption scheme tailored to the needs of the PPFL. We name these networks as Homomorphic Adversarial Networks (HANs) which demonstrate that neural networks are capable of performing tasks similar to multi-key homomorphic encryption (MK-HE) while solving the problems of key distribution and collaborative decryption. Our experiments show that HANs are robust against privacy attacks. Compared with non-private federated learning, experiments conducted on multiple datasets demonstrate that HANs exhibit a negligible accuracy loss (at most 1.35%). Compared to traditional MK-HE schemes, HANs increase encryption aggregation speed by 6,075 times while incurring a 29.2-fold increase in communication overhead.

4209AIMing for Explainability in GNNs

[openreview] [pdf]

Abstract As machine learning models become increasingly complex and are deployed in critical domains such as healthcare, finance, and autonomous systems, the need for effective explainability has grown. Graph Neural Networks (GNNs), which excel in processing graph-structured data, have seen significant advancements, but explainability for GNNs is still in its early stages. Existing approaches fall under two broad categories: post-hoc explainers, which are evaluated using ground truth explanations for synthetic data, or models based on prototypes or graph kernels that claim inherent interpretability and do not evaluate any actual measures. These evaluation practices fundamentally restrict the utility of any discussions regarding explainability for GNNs. We propose a unified and comprehensive framework for measuring and evaluating explainability in GNNs that extends beyond synthetic data and ground truths, while also allowing for further model development and refinement based on derived explanations. The framework involves measures of Accuracy, Instance-level explanations, and Model-level explanations (AIM), inspired by the generic Co-12 conceptual properties of explanations quality (Nauta et al., 2023). We apply this framework to a suite of existing models, deriving ways to extract explanations from them and to highlight their strengths and weaknesses. Furthermore, based on this analysis using AIM, we develop a new model called XGKN that demonstrates improved explainability while performing on par with existing models. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, offering more robust and practical solutions for understanding and interpreting complex models.

4210Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study over Open-ended Question Answering

[openreview] [pdf]

Abstract Recent works integrating Knowledge Graphs (KGs) have led to promising improvements in enhancing reasoning accuracy of Large Language Models (LLMs). However, current benchmarks mainly focus on closed tasks, leaving a gap in the assessment of more complex, real-world scenarios. This gap has also obscured the evaluation of KGs’ potential to mitigate the problem of hallucination in LLMs. To fill the gap, we introduce OKGQA, a new benchmark specifically designed to assess LLMs enhanced with KGs under open-ended, real-world question answering scenarios. OKGQA is designed to closely reflect the complexities of practical applications using questions from different types, and incorporates specific metrics to measure both the reduction in hallucinations and the enhancement in reasoning capabilities. To consider the scenario in which KGs may have varying levels of mistakes, we further propose another experiment setting OKGQA-P to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated. OKGQA aims to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on methods and future directions for leveraging KGs to reduce LLMs’ hallucination. We believe that this study can facilitate a more complete performance comparison and encourage continuous improvement in integrating KGs with LLMs. The code of this paper is released athttps://anonymous.4open.science/r/OKGQA-CBB0.

4211Semantic Loss Guided Data Efficient Supervised Fine Tuning for Safe Responses in LLMs

[openreview] [pdf]

Abstract Large Language Models (LLMs) generating unsafe responses to toxic prompts is a significant issue in their applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we aim to take this problem and overcome limitations of requiring significant high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound for EMD loss, enabling more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines, and we further examine the nuanced effects of over-alignment and potential degradation of language capabilities when using contrastive data.

4212Rethinking Data Selection at Scale: Random Selection is Almost All You Need

[openreview] [pdf]

Abstract Supervised fine-tuning (SFT) is crucial for aligning Large Language Models (LLMs) with human instructions. The primary goal during SFT is to select a small yet representative subset of training data from the larger pool, such that fine-tuning with this subset achieves results comparable to or even exceeding those obtained using the entire dataset. However, most existing data selection techniques are designed for small-scale data pools, which fail to meet the demands of real-world SFT scenarios. In this paper, we replicated several self-scoring methods—those that do not rely on external model assistance—on two million-scale datasets, and found that nearly all methods struggled to significantly outperform random selection when dealing with such large-scale data pools. Moreover, our comparisons suggest that, during SFT, diversity in data selection is more critical than simply focusing on high-quality data. We also analyzed the limitations of several current approaches, explaining why they perform poorly on large-scale datasets and why they are unsuitable for such contexts. Finally, we found that filtering data by token length offers a stable and efficient method for improving results. This approach, particularly when training on long-text data, proves highly beneficial for relatively weaker base models, such as Llama3.

4213Continual Learning via Continual Weighted Sparsity and Meta-Plasticity Scheduling

[openreview] [pdf]

Abstract Continual Learning (CL) is fundamentally challenged by the stability-plasticity dilemma: the trade-off between acquiring new information and maintaining past knowledge. To address the stability, many methods keep a replay buffer containing a small set of samples from prior tasks and employ parameter isolation strategies that allocate separate parameter subspaces for each task, reducing interference between tasks. To get more refined, task-specific groups, we adapt a dynamic sparse training technique and introduce a continual weight score function to guide the iterative pruning process over multiple rounds of training. We refer to this method as the continual weighted sparsity scheduler. Furthermore, with more incremental tasks introduced, the network inevitably becomes saturated, leading to a loss of plasticity, where the model’s adaptability decreases due to dormant or saturated neurons. To mitigate this, we draw inspiration from biological meta-plasticity mechanisms, and develop a meta-plasticity scheduler to dynamically adjust these task-specific groups’ learning rates based on the sensitive score function we designed, ensuring a balance between retaining old knowledge and acquiring new skills. The results of comparison on popular datasets demonstrate that our approach consistently outperforms existing state-of-the-art methods, confirming its effectiveness in managing the stability-plasticity trade-off.

4214Large Scale Video Continual Learning with Bootstrapped Compression

[openreview] [pdf]

Abstract Continual learning (CL) promises to allow neural networks to learn from continuous streams of inputs, instead of IID (independent and identically distributed) sampling, which requires random access to a full dataset. This would allow for much smaller storage requirements and self-sufficiency of deployed systems that cope with natural distribution shifts, similarly to biological learning. We focus on video CL employing a rehearsal-based approach, which reinforces past samples from a memory buffer. We posit that part of the reason why practical video CL is challenging is the high memory requirements of video, further exacerbated by long-videos and continual streams, which are at odds with the common rehearsal-buffer size constraints. To address this, we propose to use compressed vision, i.e. store video codes (embeddings) instead of raw inputs, and train a video classifier by IID sampling from this rolling buffer. Training a video compressor online (so not depending on any pre-trained networks) means that it is also subject to catastrophic forgetting. We propose a scheme to deal with this forgetting by refreshing video codes, which requires careful decompression with a previous version of the network and recompression with a new one. We expand current video CL benchmarks to large-scale settings, namely EpicKitchens-100 and Kinetics-700, with thousands of relatively long videos, and demonstrate empirically that our video CL method outperforms prior art with a significantly reduced memory footprint.

4215PEARL: Towards Permutation-Resilient LLMs

[openreview] [pdf]

Abstract The in-context learning (ICL) ability of large language models (LLMs) enables them to undertake challenging tasks using provided demonstrations. However, it is prone to instability: different orderings of demonstrations can significantly influence predictions, revealing LLMs’ limitations in processing combinatorial inputs. This paper shows that this vulnerability can be exploited to design a natural and completely imperceptible attack that achieves nearly 80% success rates on the SOTA open-source model, LLaMA, by simply permuting the demonstrations. In light of this, how to overcome the ordering sensitivity problem is an important issue for improving the performance of LLMs. However, current mitigation methods focus on post-processing and fail to enhance models’ inherent robustness to the vast space of possible input permutations. To overcome this issue, we propose a novel Permutation-resilient learning framework (PEARL) based on distributional robust optimization (DRO), which optimizes model performance against the worst case among all possible permutations. Specifically, PEARL consists of a hard permutation mining network (P-Net) and the LLM. The P-Net identifies the most challenging permutations by formulating the task as an optimal transport problem, which is solved using an entropy-constrained Sinkhorn algorithm. Through minimax optimization, the P-Net progressively generates harder samples to enhance the LLM’s worst-case performance. Experiments with synthetic data and instruction tuning tasks demonstrate that the proposed PEARL framework effectively mitigates permutation attacks and improves overall performance.

4216Log-Concave Sampling on Compact Supports: A Versatile Proximal Framework

[openreview] [pdf]

Abstract In this paper, we investigate the theoretical aspects of sampling from strongly log-concave distributions defined on convex and compact supports. We propose a general proximal framework that involves projecting onto the constrained set, which is highly flexible and supports various projection options. Specifically, we consider the cases of Euclidean and Gauge projections, with the latter having the advantage of being performed efficiently using a membership oracle. This framework can be seamlessly integrated with multiple sampling methods. Our analysis focuses on Langevin-type sampling algorithms within the context of constrained sampling. We provide nonasymptotic upper bounds on the W1W_1 and W2W_2 errors, offering a detailed comparison of the performance of these methods in constrained sampling.

4217Robust Learning in Bayesian Parallel Branching Graph Neural Networks: The Narrow Width Limit

[openreview] [pdf]

Abstract The infinite width limit of random neural networks is known to result in Neural Networks as Gaussian Process (NNGP), characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization. However, this work challenges this notion by investigating the narrow width limit of the Bayesian Parallel Branching Graph Neural Network (BPB-GNN), an architecture that resembles residual GCN. We demonstrate that when the width of a BPB-GNN is significantly smaller compared to the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, the performance of a BPB-GNN in the narrow width limit is generally superior or comparable to that achieved in the wide width limit in bias-limited scenarios. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters but generally reflective of the nature of the data. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.

4218Provably Reliable Conformal Prediction Sets in the Presence of Data Poisoning

[openreview] [pdf]

Abstract Conformal prediction provides model-agnostic and distribution-free uncertainty quantification through prediction sets that are guaranteed to include the ground truth with any user-specified probability. Yet, conformal prediction is not reliable under poisoning attacks where adversaries manipulate both training and calibration data, which can significantly alter prediction sets in practice. As a solution, we propose reliable prediction sets (RPS): the first efficient method for constructing conformal prediction sets with provable reliability guarantees under poisoning. To ensure reliability under training poisoning, we introduce smoothed score functions that reliably aggregate predictions of classifiers trained on distinct partitions of the training data. To ensure reliability under calibration poisoning, we construct multiple prediction sets, each calibrated on distinct subsets of the calibration data. We then aggregate them into a majority prediction set, which includes a class only if it appears in a majority of the individual sets. Both proposed aggregations mitigate the influence of datapoints in the training and calibration data on the final prediction set. We experimentally validate our approach on image classification tasks, achieving strong reliability while maintaining utility and preserving coverage on clean data. Overall, our approach represents an important step towards more trustworthy uncertainty quantification in the presence of data poisoning.

4219CofCA: A STEP-WISE Counterfactual Multi-hop QA benchmark

[openreview] [pdf]

Abstract While Large Language Models (LLMs) excel in question-answering (QA) tasks, their real reasoning abilities on multiple evidence retrieval and integration on Multi-hop QA tasks remain less explored. Firstly, LLMs sometimes generate answers that rely on internal memory rather than retrieving evidence and reasoning in the given context, which brings concerns about the evaluation quality of real reasoning abilities. Although previous counterfactual QA benchmarks can separate the internal memory of LLMs, they focus solely on final QA performance, which is insufficient for reporting LLMs’ real reasoning abilities. Because LLMs are expected to engage in intricate reasoning processes that involve evidence retrieval and answering a series of sub-questions from given passages. Moreover, current factual Multi-hop QA (MHQA) benchmarks are annotated on open-source corpora such as Wikipedia, although useful for multi-step reasoning evaluation, they show limitations due to the potential data contamination in LLMs’ pre-training stage. To address these issues, we introduce the Step-wise and Counterfactual benchmark (CofCA), a novel evaluation benchmark consisting of factual data and counterfactual data that reveals LLMs’ real reasoning abilities on multi-step reasoning and reasoning chain evaluation. Our experimental results reveal a significant performance gap of several LLMs between Wikipedia-based factual data and counterfactual data, deeming data contamination issues in existing benchmarks. Moreover, we observe that LLMs usually bypass the correct reasoning chain, showing an inflated multi-step reasoning performance. We believe that our CofCA benchmark will enhance and facilitate the evaluations of trustworthy LLMs.

4220TextSquare: Scaling up Text-Centric Visual Instruction Tuning

[openreview] [pdf]

Abstract Text-centric visual question answering (VQA) has made great strides with the development of Multimodal Large Language Models (MLLMs), yet open-source models still fall short of leading models like GPT4V and Gemini. A key contributing factor to this disparity is the absence of extensive, high-quality instruction tuning data. To this end, we introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M, generated by leveraging the versatile multimodal capabilities of closed-source MLLMs. The data construction process, termed Square, consists of four steps: Self-Questioning, Answering, Reasoning, and Evaluation. Our experiments with Square-10M led to three key findings: 1) Our model, TextSquare, considerably surpasses open-source previous state-of-the-art text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier models like GPT4V and Gemini on six out of ten text-centric benchmarks. 2) We demonstrate the importance of VQA reasoning data in offering comprehensive contextual insights for specific questions, which not only improves accuracy but also substantially mitigates hallucinations. Specifically, TextSquare scores an average of 75.1% across four general VQA and hallucination evaluation datasets, outperforming previous state-of-the-art models. 3) Notably, the phenomenon observed in scaling text-centric VQA datasets reveals a vivid pattern: an exponential increase of instruction tuning data volume is directly proportional to the improvement in model performance, thereby validating the necessity of the dataset scale and the high quality of Square-10M.

4221Robust Watermarking for Diffusion Models: A Unified Multi-Dimensional Recipe

[openreview] [pdf]

Abstract Diffusion models are known for the supreme capability to generate realistic images. However, ethical concerns, such as copyright protection and generation of inappropriate content, pose significant challenges for the practical deployment of diffusion models. Recent work has proposed a flurry of watermarking techniques that inject visually noteless patterns into generated images, offering a promising solution to these issues. While effective, the essential elements for watermarking and the interconnections among various methods are still chaos. In this paper, we dissect the design principles of state-of-the-art watermarking techniques and introduce a unified framework. We identify a set of dimensions that explain the manipulation enforced by watermarking methods, including the distribution of individual elements, the specification of watermark regions within each channel, and the choice of channels for watermark embedding. Moreover, under this framework we instantiate a new watermarking method to minimize impacts on the model performance from a distributional perspective. Through the empirical studies on regular text-to-image applications and the first systematic attempt on watermarking image-to-image diffusion models, we thoroughly verify the effectiveness of our proposed framework through comprehensive evaluations. On all the diffusion models, including Stable Diffusion, our approach induced from the proposed framework not only preserves image quality but also outperforms existing methods in robustness against a range of attacks.

4222Exploiting Structure in Offline Multi-Agent RL: The Benefits of Low Interaction Rank

[openreview] [pdf]

Abstract We study the problem of learning an approximate equilibrium in the offline multi-agent reinforcement learning (MARL) setting. We introduce a structural assumption---the interaction rank---and establish that functions with low interaction rank are significantly more robust to distribution shift compared to general ones. Leveraging this observation, we demonstrate that utilizing function classes with low interaction rank, when combined with regularization and no-regret learning, admits decentralized, computationally and statistically efficient learning in offline MARL. Our theoretical results are complemented by experiments that showcase the potential of critic architectures with low interaction rank in offline MARL, contrasting with commonly used single-agent value decomposition architectures.

4223Neural Topic Modeling with Large Language Models in the Loop

[openreview] [pdf]

Abstract Topic modeling is a fundamental task in natural language processing, allowing the discovery of latent thematic structures in text corpora. While Large Language Models (LLMs) have demonstrated promising capabilities in topic discovery, their direct application to topic modeling suffers from issues such as incomplete topic coverage, misalignment of topics, and inefficiency. To address these limitations, we propose LLM-ITL, a novel LLM-in-the-loop framework that integrates LLMs with many existing Neural Topic Models (NTMs). In LLM-ITL, global topics and document representations are learned through the NTM, while an LLM refines the topics via a confidence-weighted Optimal Transport (OT)-based alignment objective. This process enhances the interpretability and coherence of the learned topics, while maintaining the efficiency of NTMs. Extensive experiments demonstrate that LLM-ITL can help NTMs significantly improve their topic interpretability while maintaining the quality of document representation.

4224The Low-Rank Bottleneck in Attention

[openreview] [pdf]

Abstract Attention-based mechanisms are widely used in machine learning, most prominently in transformers. However, hyperparameters such as the rank of the attention matrices and the number of attention heads are scaled nearly the same way in all realizations of this architecture, without theoretical justification. In this paper, we prove that the rank can have a dramatic effect on the representational capacity of attention. This effect persists even when the number of heads and the parameter count are very large. Specifically, we present a simple and natural target function based on nearest neighbor search that can be represented using a single full-rank attention head for any context length, but that cannot be approximated by low-rank attention unless the number of heads is exponential in the embedding dimension, even for short context lengths. Moreover, we show that, for short context lengths, adding depth allows the target to be approximated by low-rank attention. For long contexts, we conjecture that full-rank attention is necessary. Finally, we present experiments with standard multilayer transformers that validate our theoretical findings.

4225Learning Geometric Reasoning Networks For Robot Task And Motion Planning

[openreview] [pdf]

Abstract Task and Motion Planning (TAMP) is a computationally challenging robotics problem due to the tight coupling of discrete symbolic planning and continuous geometric planning of robot motions. In particular, planning manipulation tasks in complex 3D environments leads to a large number of costly geometric planner queries to verify the feasibility of considered actions and plan their motions. To address this issue, we propose Geometric Reasoning Networks (GRN), a graph neural network (GNN)-based model for action and grasp feasibility prediction, designed to significantly reduce the dependency on the geometric planner. Moreover, we introduce two key interpretability mechanisms: inverse kinematics (IK) feasibility prediction and grasp obstruction (GO) estimation. These modules not only improve feasibility predictions accuracy, but also explain why certain actions or grasps are infeasible, thus allowing a more efficient search for a feasible solution. Through extensive experimental results, we show that our model outperforms state-of-the-art methods, while maintaining generalizability to more complex environments, diverse object shapes, multi-robot settings, and real-world robots.

4226EReLELA: Exploration in Reinforcement Learning via Emergent Language Abstractions

[openreview] [pdf]

Abstract The ability of AI agents to follow natural language (NL) instructions is important for Human-AI collaboration. Training Embodied AI agents for instruction-following can be done with Reinforcement Learning (RL), yet it poses many challenges. Among which is the exploitation versus exploration trade-off in RL. Previous works have shown that NL-based state abstractions can help address this challenge. However, NLs descriptions have limitations in that they are not always readily available and are expensive to collect. In order to address these limitations, we propose to use the Emergent Communication paradigm, where artificial agents learn an emergent language (EL) in an unsupervised fashion, via referential games. Thus, ELs constitute cheap and readily-available abstractions. In this paper, we investigate (i) how EL-based state abstractions compare to NL-based ones for RL in hard-exploration, procedurally-generated environments, and (ii) how properties of the referential games used to learn ELs impact the quality of the RL exploration and learning. We provide insights about the kind of state abstractions performed by NLs and ELs over RL state spaces, using our proposed Compactness Ambiguity Metric. Our results indicate that our proposed EL-guided agent, entitled EReLELA, achieves similar performance as its NL-based counterparts without its limitations. Our work shows that RL agents can leverage unsupervised EL abstractions to greatly improve their exploration skills in sparse reward settings, thus opening new research avenues between Embodied AI and Emergent Communication.

4227RankSHAP: Shapley Value Based Feature Attributions for Learning to Rank

[openreview] [pdf]

Abstract Numerous works propose post-hoc, model-agnostic explanations for learning to rank, focusing on ordering entities by their relevance to a query through feature attribution methods. However, these attributions often weakly correlate or contradict each other, confusing end users. We adopt an axiomatic game-theoretic approach, popular in the feature attribution community, to identify a set of fundamental axioms that every ranking-based feature attribution method should satisfy. We then introduce Rank-SHAP, extending classical Shapley values to ranking. We evaluate the RankSHAP framework through extensive experiments on two datasets, multiple ranking methods and evaluation metrics. Additionally, a user study confirms RankSHAP’s alignment with human intuition. We also perform an axiomatic analysis of existing rank attribution algorithms to determine their compliance with our proposed axioms. Ultimately, our aim is to equip practitioners with a set of axiomatically backed feature attribution methods for studying IR ranking models, that ensure generality as well as consistency.

4228Language Models Need Inductive Biases to Count Inductively

[openreview] [pdf]

Abstract Counting is a fundamental example of generalization, whether viewed through the mathematical lens of Peano’s axioms defining the natural numbers or the cognitive science literature for children learning to count. The argument holds for both cases that learning to count means learning to count infinitely. While few papers have tried to distill transformer “reasoning” to the simplest case of counting, investigating length generalization does occur throughout the literature. In the “train short, test long” paradigm of NLP, length refers to the training sentence length. In formal language recognition, length refers to the input sequence length, or the maximum stack size induced by a pushdown automata. In general problem solving, length refers to the number of hops in a deductive reasoning chain or the recursion depth. For all cases, counting is central to task success. And crucially, generalizing counting inductively is central to success on OOD instances. This work provides extensive empirical results on training language models to count. We experiment with architectures ranging from RNNs, Transformers, State-Space Models and RWKV. We present carefully-designed task formats, auxiliary tasks and positional embeddings to avoid limitations in generalization with OOD-position and OOD-vocabulary. We find that while traditional RNNs trivially achieve inductive counting, Transformers have to rely on positional embeddings to count out-of-domain. As counting is the basis for many arguments concerning the expressivity of Transformers, our finding calls for the community to reexamine the application scope of primitive functions defined in formal characterizations. Finally, modern RNNs also largely underperform traditional RNNs in generalizing counting inductively. We discuss how design choices that enable parallelized training of modern RNNs cause them to lose merits of a recurrent nature.

4229Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge

[openreview] [pdf]

Abstract Recent advances in Large Language Models (LLMs) have enabled the development of Video-LLMs, advancing multimodal learning by bridging video data with language tasks. However, current video understanding models struggle with processing long video sequences, supporting multi-turn dialogues, and adapting to real-world dynamic scenarios. To address these issues, we propose StreamChat, a training-free framework for streaming video reasoning and conversational interaction. StreamChat leverages a novel hierarchical memory system to efficiently process and compress video features over extended sequences, enabling real-time, multi-turn dialogue. Our framework incorporates a parallel system scheduling strategy that enhances processing speed and reduces latency, ensuring robust performance in real-world applications. Furthermore, we introduce StreamBench, a versatile benchmark that evaluates streaming video understanding across diverse media types and interactive scenarios, including multi-turn interactions and complex reasoning tasks. Extensive evaluations on StreamBench and other public benchmarks demonstrate that StreamChat significantly outperforms existing state-of-the-art models in terms of accuracy and response times, confirming its effectiveness for streaming video understanding.

4230pMoE: Prompting Diverse Experts Together Wins More in Visual Adaptation

[openreview] [pdf]

Abstract Parameter-efficient fine-tuning has demonstrated promising results across various visual adaptation tasks, such as classification and segmentation. Typically, prompt tuning techniques have harnessed knowledge from a single pre-trained model, whether from a general or a specialized medical domain. However, this approach typically overlooks the potential synergies that could arise from integrating diverse domain knowledge within the same tuning process. In this work, we propose a novel Mixture-of-Experts prompt tuning method called pMoE, which leverages the strengths of multiple expert domains through expert-specialized prompt tokens and the learnable dispatcher, effectively combining their expertise in a unified model framework. Our pMoE introduces expert-specific prompt tokens and utilizes a dynamic token dispatching mechanism at various prompt layers to optimize the contribution of each domain expert during the adaptation phase. By incorporating both domain knowledge from diverse experts, the proposed pMoE significantly enhances the model’s versatility and applicability to a broad spectrum of tasks. We conduct extensive experiments across 47 adaptation tasks, including both classification and segmentation in general and medical domains. The results demonstrate that our pMoE not only achieves superior performance with a large margin of improvements but also offers an optimal trade-off between computational efficiency and adaptation effectiveness compared to existing methods.

4231A Robust Method to Discover Causal or Anticausal Relation

[openreview] [pdf]

Abstract Understanding whether the data generative process follows causal or anticausal relations is important for many applications. Existing causal discovery methods struggle with high-dimensional perceptual data such as images. Moreover, they require well-labeled data, which may not be feasible due to measurement error. In this paper, we propose a robust method to detect whether the data generative process is causal or anticausal. To determine the causal or anticausal relation, we identify an asymmetric property: under the causal relation, the instance distribution does not contain information about the noisy class-posterior distribution. We also propose a practical method to verify this via a noise injection approach. Our method is robust to label errors and is designed to handle both large-scale and high-dimensional datasets effectively. Both theoretical analyses and empirical results on a variety of datasets demonstrate the effectiveness of our proposed method in determining the causal or anticausal direction of the data generative process.

4232Evaluating Perceptual Distances Models by Fitting Binomial Distributions to Two-Alternative Forced Choice Data

[openreview] [pdf]

Abstract The two-alternative forced choice (2AFC) experimental method is popular in the visual perception literature, where practitioners aim to understand how human observers perceive distances within triplets made of a reference image and two distorted versions. In the past, this had been conducted in controlled environments, with triplets sharing images, so it was possible to rank the perceived quality. This ranking would then be used to evaluate perceptual distance models against the experimental data. Recently, crowd-sourced perceptual datasets have emerged, with no images shared between triplets, making ranking infeasible. Evaluating perceptual distance models using this data reduces the judgements on a triplet to a binary decision, namely, whether the distance model agrees with the human decision - which is suboptimal and prone to misleading conclusions. Instead, we statistically model the underlying decision-making process during 2AFC experiments using a binomial distribution. Having enough empirical data, we estimate a smooth and consistent distribution of the judgements on the reference-distorted distance plane, according to each distance model. By applying maximum likelihood, we estimate the parameter of the local binomial distribution, and a global measurement of the expected log-likelihood of the measured responses. We calculate meaningful and well-founded metrics for the distance model, beyond the mere prediction accuracy as percentage agreement, even with variable numbers of judgements per triplet -- key advantages over both classical and neural network methods.

4233SYMPOL: Symbolic Tree-Based On-Policy Reinforcement Learning

[openreview] [pdf]

Abstract Reinforcement learning (RL) has seen significant success across various domains, but its adoption is often limited by the black-box nature of neural network policies, making them difficult to interpret. In contrast, symbolic policies allow representing decision-making strategies in a compact and interpretable way. However, learning symbolic policies directly within on-policy methods remains challenging. In this paper, we introduce SYMPOL, a novel method for SYMbolic tree-based on-POLicy RL. SYMPOL employs a tree-based model integrated with a policy gradient method, enabling the agent to learn and adapt its actions while maintaining a high level of interpretability. We evaluate SYMPOL on a set of benchmark RL tasks, demonstrating its superiority over alternative tree-based RL approaches in terms of performance and interpretability. To the best of our knowledge, this is the first method, that allows a gradient-based end-to-end learning of interpretable, axis-aligned decision trees on-policy. Therefore, SYMPOL can become the foundation for a new class of interpretable RL based on decision trees.

4234A Little Depth Goes a Long Way: the Expressive Power of Log-Depth Transformers

[openreview] [pdf]

Abstract Most analysis of transformer expressivity treats the depth (number of layers) of a model as a fixed constant, and analyzes the kinds of problems such models can solve across inputs of unbounded length. In practice, however, the context length of a trained transformer model is bounded. Thus, a more pragmatic question is:What kinds of computation can a transformer perform on inputs of bounded length?We formalize this by studying highly uniform transformers where the depth can grow minimally with context length. In this regime, we show that transformers with depth O(logC)O(\log C) can, in fact, compute solutions to two important problems for inputs bounded by some max context length CC, namelysimulating finite automata, which relates to the ability to track state, andgraph connectivity, which underlies multi-step reasoning. Notably, both of these problems have previously been proven to be asymptotically beyond the reach of fixed depth transformers under standard complexity conjectures, yet empirically transformer models can successfully track state and perform multi-hop reasoning on short contexts. Our novel analysis thus explains how transformer models may rely on depth to feasibly solve problems up to bounded context that they cannot solve over long contexts. It makes actionable suggestions for practitioners as to how to minimally scale the depth of a transformer to support reasoning over long contexts, and also argues for dynamically unrolling depth as a more effective way of adding compute compared to increasing model dimension or adding a short chain of thought.

4235Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought

[openreview] [pdf]

Abstract While chain-of-thought prompting (CoT) has the potential to improve the explainability of language model reasoning, it can systematically misrepresent the factors influencing models’ behavior--for example, rationalizing answers in line with a user’s opinion without mentioning this bias. To mitigate this biased reasoning problem, we introduce bias-augmented consistency training (BCT), an unsupervised fine-tuning scheme that trains models to give consistent reasoning across prompts with and without biasing features. We construct a suite testing nine forms of biased reasoning on seven question-answering tasks, and find that applying BCT to GPT-3.5-Turbo with one bias reduces the rate of biased reasoning by 86% on held-out tasks. Moreover, this model generalizes to other forms of bias, reducing biased reasoning on held-out biases by an average of 37%. As BCT generalizes to held-out biases and does not require gold labels, this method may hold promise for reducing biased reasoning from as-of-yet unknown biases and on tasks where ground truth reasoning is unavailable.

4236Oversmoothing as Loss of Sign: Towards Structural Balance in Graph Neural Networks

[openreview] [pdf]

Abstract Oversmoothing is a common phenomenon in a wide range of graph neural networks (GNNs), where node representation becomes homogeneous and thus model performance worsens as the number of layers increases. Various strategies have been proposed to combat oversmoothing, but they are based on different heuristics and lack a unified understanding of their inherent mechanisms. In this paper, we revisit the concept of signed graphs and show that a wide class of anti-oversmoothing techniques can be viewed as the propagation on corresponding signed graphs with both positive and negative edges. Leveraging the classic theory of signed graphs, we characterize the asymptotic behaviors of existing methods and reveal that they deviate from the ideal state of structural balance that provably prevents oversmoothing and improves node classification performance. Driven by this unified analysis and theoretical insights, we propose Structural Balanced Propagation (SBP) where we explicitly enhance the structural balance of the signed graph with the help of label and feature information. We theoretically and empirically prove that SBP can improve the structural balance to alleviate oversmoothing under certain conditions. Experiments on synthetic and real-world datasets demonstrate the effectiveness of our methods, highlighting the value of our signed graph framework.

4237Laplace-Transform-Filters render spectral Graph Neural Networks transferable

[openreview] [pdf]

Abstract We introduce a new point of view on transferability of graph neural networks based on the intrinsic notion of information diffusion within graphs. Transferability of graph neural networks is then considered between graphs that are similar from this novel perspective of information diffusion. After carefully analysing transferability of single filters, the transferability properties of entire networks are reduced to the transferability characteristics of the filter functions employed inside their convolutional blocks. A rigorous analysis establishes our main theoretical finding: Spectral convolutional networks are transferable if their filters arise as Laplace transforms of certain generalized functions. Example settings illustrate the developed theory and numerical experiments validate the theoretical findings in practice.

4238Increasing Both Batch Size and Learning Rate Accelerates Stochastic Gradient Descent

[openreview] [pdf]

Abstract The performance of mini-batch stochastic gradient descent (SGD) strongly depends on setting the batch size and learning rate to minimize the empirical loss in training the deep neural network. In this paper, we present theoretical analyses of mini-batch SGD with four schedulers: (i) constant batch size and decaying learning rate scheduler, (ii) increasing batch size and decaying learning rate scheduler, (iii) increasing batch size and increasing learning rate scheduler, and (iv) increasing batch size and warm-up decaying learning rate scheduler. We show that mini-batch SGD using scheduler (i) does not always minimize the expectation of the full gradient norm of the empirical loss, whereas it does using any of schedulers (ii), (iii), and (iv). Furthermore, schedulers (iii) and (iv) accelerate mini-batch SGD. The paper also provides numerical results of supporting analyses showing that using scheduler (iii) or (iv) minimizes the full gradient norm of the empirical loss faster than using scheduler (i) or (ii).

4239Bayesian Optimization via Continual Variational Last Layer Training

[openreview] [pdf]

Abstract Gaussian Processes (GPs) are widely seen as the state-of-the-art surrogate models for Bayesian optimization (BO) due to their ability to model uncertainty and their performance on tasks where correlations are easily captured (such as those defined by Euclidean metrics) and their ability to be efficiently updated online. However, the performance of GPs depends on the choice of kernel, and kernel selection for complex correlation structures is often difficult or must be made bespoke. While Bayesian neural networks are a promising direction for higher capacity surrogate models, they have so far seen limited use due to a combination of cost of use and poor performance. In this paper, we propose an approach which offers the strengths of both methods. We build on variational Bayesian last layers (VBLLs), which provide a simple and computationally lightweight approach to Bayesian uncertainty quantification in neural networks. We connect training of these models to exact conditioning in GPs, and propose an efficient online training algorithm that interleaves conditioning and optimization. Our findings suggest that VBLL networks significantly outperform GPs and other BNN architectures on tasks with complex input correlations, and match the performance of well-tuned GPs on established benchmark tasks.

4240Seeded LoRA: Collaborative Fine-Tuning Through Seed Initialization of Adapters

[openreview] [pdf]

Abstract Parameter-Efficient Fine-Tuning (PEFT) methods facilitate the cost-effective adaptation of pretrained language models to specific tasks and domains. These methods have enabled the open-source community to develop thousands of specialized models tailored to various domains and tasks. Collaborative Fine-Tuning (CoFT) is the paradigm that seeks to merge these specialized models into a single model -- often a routed Mixture-of-Expert (MoE) model -- to achieve better generalization across domains and tasks. However, current CoFT models require a post-merge fine-tuning stage to successfully combine existing models, making CoFT approaches inaccessible to users who lack fine-tuning expertise. In this work, we introduce Seeded LoRA, a novel CoFT approach that does not require post-merge fine-tuning thus enabling plug-and-play PEFT adapter merging. Seeded LoRA significantly outperforms LoRA and MoE LoRA (MoLoRA) approaches, improving by an average of 7 percentage points across a battery of 16 zero-shot tasks and we find that the main benefit from Seeded LoRA comes from mitigating task interference during finetuning. Seeded LoRA works by initializing a model before fine-tuning using a generic seed expert low-rank adapter which was finetuned on a small random subset of the finetuning data such that subsequent fine-tuning runs are initialized in the same optimization subspace. This process enables the integration of any combination of independently fine-tuned models through simple averaging of expert adapter outputs. We show that averaging, or routing with assigning equal probability weights to each expert, is equivalent to grouped convolution, explaining its effectiveness. Additionally, we study subtle routing failures in post-merge fine-tuning and highlight that Seeded LoRA can alleviate most routing failures, making it a suitable base method for future routed CoFT approaches.

4241READ-SQL: Reasoning Path Decomposer for Text-to-SQL

[openreview] [pdf]

Abstract Text-to-SQL is a longstanding task aimed at automatically converting natural language questions into SQL queries for database retrieval. Despite impressive advancements, particularly with Large Language Models (LLMs), existing methods still struggle with issues such as misinterpreted, omitted, or unwanted constraints. To address these challenges, we propose READ-SQL, a novel framework employing a \underline{re}asoning p\underline{a}th \underline{d}compos\underline{er}, \textbf{READ}ER, for text-to-SQL tasks. READER decomposes SQLs into clauses, sub-SQLs, and reasoning paths, supporting data preparation and confidence level determination in post-processing. READ-SQL comprises two main models: a Generator and a Corrector, both trained via LoRA for parameter efficiency. Based on READER’s decomposition, READ-SQL generates two types of augmented data using an LLM: question/SQL pairs and question/reason pairs. The Generator is trained on both original and augmented data to identify constraint changes and enhance reasoning. The Corrector is trained on data from READER’s post-processing, improving self-correction by refining high-confidence SQLs and addressing low-confidence elements. Extensive experiments show that READ-SQL significantly outperforms leading baselines, with READ-SQL-3B achieving 57.37% execution accuracy on BIRD’s dev set, surpassing several 7B-parameter models and setting a new state-of-the-art with fewer parameters. Additionally, READER and the Corrector show broad applicability when integrated with LLMs or other base models.

4242CMIMP: Effortlessly Achieving Diverse Population Training for Zero-Shot Coordination

[openreview] [pdf]

Abstract Zero-shot coordination has recently become a hot topic in reinforcement learning research recently. It focuses on the generalization ability of agents, requiring them to coordinate well with collaborators that are not seen before without any fine-tuning. Population-based training has been proven to provide good zero-shot coordination performance; nevertheless, existing algorithms exhibit inefficiency, as the training cost scales linearly with the population size. To address this issue, this paper proposes the Conditional Mutual Information Maximized Population (CMIMP), an efficient training framework comprising two key components: a meta-agent that efficiently realizes a population by selectively sharing parameters across agents, and a mutual information regularizer that guarantees population diversity. To empirically validate the effectiveness of CMIMP, this paper evaluates it along with representational frameworks in Hanabi and confirms its superiority.

4243SaMer: A Scenario-aware Multi-dimensional Evaluator for Large Language Models

[openreview] [pdf]

Abstract Evaluating the response quality of large language models (LLMs) for open-ended questions poses a significant challenge, especially given the subjectivity and multi-dimensionality of “quality” in natural language generation. Existing LLM evaluators often neglect that different scenarios require distinct evaluation criteria. In this work, we proposeSaMer, a scenario-aware multi-dimensional evaluator designed to provide both overall and fine-grained assessments of LLM-generated responses. Unlike fixed-dimension evaluation approaches, SaMer adapts to different scenarios by automatically identifying and prioritizing relevant evaluation dimensions tailored to the given query. To achieve this, we construct a large-scale fine-grained preference dataset spanning multiple real-world scenarios, each with distinct evaluation dimensions. We then leverage a text embedding model combined with three specialized heads to predict the appropriate evaluation dimensions and corresponding scores, as well as the respective weights that contribute to the overall score. The resulting model offers fine-grained and interpretable evaluations and shows robust adaptability across diverse scenarios. Extensive experiments on eight single rating and pairwise comparison datasets demonstrate that SaMer outperforms existing baselines in a variety of evaluation tasks, showcasing its robustness, versatility, and generalizability.

4244Air Quality Prediction with Physics-Informed Dual Neural ODEs in Open Systems

[openreview] [pdf]

Abstract Air pollution significantly threatens human health and ecosystems, necessitating effective air quality prediction to inform public policy. Traditional approaches are generally categorized into physics-based and data-driven models. Physics-based models usually struggle with high computational demands and closed-system assumptions, while data-driven models may overlook essential physical dynamics, confusing the capturing of spatiotemporal correlations. Although some physics-informed approaches combine the strengths of both models, they often face a mismatch between explicit physical equations and implicit learned representations. To address these challenges, we propose Air-DualODE, a novel physics-informed approach that integrates dual branches of Neural ODEs for air quality prediction. The first branch applies open-system physical equations to capture spatiotemporal dependencies for learning physics dynamics, while the second branch identifies the dependencies not addressed by the first in a fully data-driven way. These dual representations are temporally aligned and fused to enhance prediction accuracy. Our experimental results demonstrate that Air-DualODE achieves state-of-the-art performance in predicting pollutant concentrations across various spatial scales, thereby offering a promising solution for real-world air quality challenges.

4245Adaptively Private Next-Token Prediction of Large Language Models

[openreview] [pdf]

Abstract As Large Language Models (LLMs) proliferate, developing privacy safeguards for these models is crucial. One popular safeguard involves training LLMs in a differentially private manner. However, such solutions are shown to be computationally expensive and detrimental to the utility of these models. Since LLMs are deployed on the cloud and thus only accessible via an API, a Machine Learning as a Service (MLaaS) provider can protect its downstream data by privatizing the predictions during the decoding process. However, the practicality of such solutions still largely lags behind DP training methods. One recent promising approach, Private Mixing of Ensemble Distributions (PMixED), avoids additive noise by sampling from the output distributions of private LLMs mixed with the output distribution of a public model. Yet, PMixED must satisfy a fixed privacy level for a given number of queries, which is difficult for an analyst to estimate before inference and, hence, does not scale. To this end, we relax the requirements to a more practical setting by introducing Adaptive PMixED (AdaPMixED\texttt{AdaPMixED}), a private decoding framework based on PMixED that is adaptive to the private and public output distributions evaluated on a given input query. In this setting, we introduce a noisy screening mechanism that filters out queries with potentially expensive privacy loss, and a data-dependent analysis that exploits the divergence of the private and public output distributions in its privacy loss calculation. Our experimental evaluations demonstrate that our mechanism and analysis \textit{can reduce the privacy loss by 16\times} while preserving the utility over the original PMixED. Furthermore, performing 100K predictions with AdaPMixED\texttt{AdaPMixED} still achieves strong utility and a reasonable data-dependent privacy loss of ϵ=5.25\epsilon=5.25.

4246Independence Tests for Language Models

[openreview] [pdf]

Abstract We consider the following problem of model provenance: can a third party verify whether two language models are trained independently versus fine-tuned from one another given the weights of both models? We propose a family of statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained with independent randomness (e.g., independent random initialization). These p-values are valid regardless of the composition of either model’s training data, and we obtain them via a permutation test by simulating independent copies of each model and comparing various measures of similarity in the weights and activations of the original two models to these independent copies. We evaluate the power of these tests on pairs of 21 open-weight models (210 total pairs) and find they reliably identify all 69 pairs of fine-tuned models. Notably, our tests remain effective even after substantial fine-tuning; we can accurately detect dependence between Llama 2 and Llemma, even though the latter was fine-tuned on an 750B additional tokens (37.5% of the original Llama 2 training budget). Finally, we identify transformations of model weights that break the effectiveness of our tests without altering model outputs, and—motivated by the existence of these evasion attacks—we propose a mechanism for matching hidden activations between the MLP layers of two models that is robust to these transformations. Though we no longer obtain exact p-values from this mechanism, empirically we find it reliably distinguishes fine-tuned models and is even robust to completely retraining the MLP layers from scratch.

4247Language Models Are Good Tabular Learners

[openreview] [pdf]

Abstract Transformer-based language models have become the de facto standard in natural language processing. However, they underperform in the tabular data domain compared to traditional tree-based methods. We posit that current models fail to achieve the full potential of language models due to (i) heterogeneity of tabular data; and (2) challenges faced by the model in interpreting numerical values. Based on this hypothesis, we propose a method titled Tabular Domain Transformer (TDTransformer). TDTransformer has distinct embedding processes for different types of columns. The alignment layers for different types of columns transform column embeddings to a common embedding space. Besides, TDTransformer adapts piece-wise linear encoding for numerical values in transformer-based architectures. We examine the proposed method on 76 real-world tabular classification datasets from the standard OpenML benchmark. Extensive experiments indicate that TDTransformer significantly improves the state-of-the-art methods.

4248Causal Order: The Key to Leveraging Imperfect Experts in Causal Inference

[openreview] [pdf]

Abstract Large Language Models (LLMs) have recently been used as experts to infer causal graphs, often by repeatedly applying a pairwise prompt that asks about the causal relationship of each variable pair. However, such experts, including human domain experts, cannot distinguish between direct and indirect effects given a pairwise prompt. Therefore, instead of the graph, we propose that causal order be used as a more stable output interface for utilizing expert knowledge. When querying a perfect expert with a pairwise prompt, we show that the inferred graph can have significant errors whereas the causal order is always correct. In practice, however, LLMs are imperfect experts and we find that pairwise prompts lead to multiple cycles and do not yield a valid order. Hence, we propose a prompting strategy that introduces an auxiliary variable for every variable pair and instructs the LLM to avoid cycles within this triplet. We show, both theoretically and empirically, that such a triplet prompt leads to fewer cycles than the pairwise prompt. Across multiple real-world graphs, the triplet prompt yields a more accurate order using both LLMs and human annotators as experts. By querying the expert with different auxiliary variables for the same variable pair, it also increases robustness---triplet method with much smaller models such as Phi-3 and Llama-3 8B outperforms a pairwise prompt with GPT-4. For practical usage, we show how the estimated causal order from the triplet method can be used to reduce error in downstream discovery and effect inference tasks.

4249SeRA: Self-Reviewing and Alignment of LLMs using Implicit Reward Margins

[openreview] [pdf]

Abstract Direct alignment algorithms (DAAs), such as direct preference optimization (DPO), have become popular alternatives to Reinforcement Learning from Human Feedback (RLHF) due to their simplicity, efficiency, and stability. However, the preferences used by DAAs are usually collected before alignment training begins and remain unchanged (off-policy). This design leads to two problems where the policy model (1) picks up on spurious correlations in the dataset (as opposed to only learning alignment to human preferences), and (2) overfits to feedback on off-policy trajectories that have less likelihood of being generated by the updated policy model. To address these issues, we introduce Self-Reviewing and Alignment (SeRA), a cost-efficient and effective method that can be readily combined with existing DAAs. SeRA comprises of two components: (1) sample selection using implicit reward margin to alleviate over-optimization on such undesired features, and (2) preference bootstrapping using implicit rewards to augment preference data with updated policy models in a cost-efficient manner. Extensive experiments, including on instruction-following tasks, demonstrate the effectiveness and generality of SeRA in training LLMs with diverse offline preference datasets and and DAAs.

4250What to align in multimodal contrastive learning?

[openreview] [pdf]

Abstract Humans perceive the world through multisensory integration, blending the information of different modalities to adapt their behavior. Contrastive learning offers an appealing solution for multimodal self-supervised learning. Indeed, by considering each modality as a different view of the same entity, it learns to align features of different modalities in a shared representation space. However, this approach is intrinsically limited as it only learns shared or redundant information between modalities, while multimodal interactions can arise in other ways. In this work, we introduce CoMM, a Contrastive Multimodal learning strategy that enables the communication between modalities in a single multimodal space. Instead of imposing cross- or intra- modality constraints, we propose to align multimodal representations by maximizing the mutual information between augmented versions of these multimodal features. Our theoretical analysis shows that shared, synergistic and unique terms of information naturally emerge from this formulation, allowing us to estimate multimodal interactions beyond redundancy. We test CoMM both in a controlled and in a series of real-world settings: in the former, we demonstrate that CoMM effectively captures redundant, unique and synergistic information between modalities. In the latter, CoMM learns complex multimodal interactions and achieves state-of-the-art results on seven multimodal tasks.

4251Enhancing Cooperative Problem-Solving in Sparse-Reward Systems via Co-evolutionary Curriculum Learning

[openreview] [pdf]

Abstract Sparse reward environments consistently challenge reinforcement learning, as agents often need to finish tasks before receiving any feedback, leading to limited incentive signals. This issue becomes even more pronounced in multi-agent systems (MAS), where a single reward must be distributed among multiple agents over time, frequently resulting in suboptimal or inconsistent learning outcomes. To tackle this challenge, we introduce a novel approach called Collaborative Multi-dimensional Course Learning (CCL) for multi-agent cooperation scenarios. CCL features three key innovations: (1) It establishes an adaptive curriculum framework tailored for MAS, refining intermediate tasks to individual agents to ensure balanced strategy development. (2) A novel variant evolution algorithm creates more detailed intermediate tasks. (3) Co-evolution between agents and their environment is modeled to enhance training stability under sparse reward conditions. In evaluations across five tasks within multi-particle environments (MPE) and Hide and Seek (Hns), CCL demonstrated superior performance, surpassing existing benchmarks and excelling in sparse reward settings.

4252Can Large Language Models Help Experimental Design for Causal Discovery?

[openreview] [pdf]

Abstract Designing proper experiments and intervening targets is a longstanding problem in scientific or causal discovery. It is fundamentally impossible to identify the underlying causal structure merely based on the observational data. Obtaining interventional data, on the other hand, is crucial to causal discovery, yet it is usually expensive or time-consuming to obtain sufficient interventional data to facilitate causal discovery. Previous approaches usually leverage uncertainty or gradient signals to determine the intervention targets, and may suffer from the suboptimality. In this work, we investigate a different approach, whether we can leverage Large Language Models (LLMs) to assist with the intervention targeting in causal discovery by making use of the rich world knowledge about the experimental design in LLM. Specifically, we present Large Language Model Guided Intervention Targeting (LeGIT), a robust framework that effectively incorporates LLMs to assist with the intervention targeting in causal discovery. Surprisingly, across 4 different scales of realistic benchmarks, LeGIT significantly outperforms previous approaches. LeGIT opens up a new frontier for using LLMs in experimental design.

4253Self-Play Preference Optimization for Language Model Alignment

[openreview] [pdf]

Abstract Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate language model alignment. In this paper, we propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbedSelf-Play Preference Optimization(SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium. Additionally, we propose a new SPPO objective which is both strongly motivated by theory and is simple and effective in practice. In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard. Starting from a stronger base model Llama-3-8B-Instruct, we are able to achieve a length-controlled win rate of 38.77%. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger language models.

4254Asymmetric Factorized Bilinear Operation for Vision Transformer

[openreview] [pdf]

Abstract As a core component of Transformer-like deep architectures, a feed-forward network (FFN) for channel mixing is responsible for learning features of each token. Recent works show channel mixing can be enhanced by increasing computational burden or can be slimmed at the sacrifice of performance. Although some efforts have been made, existing works are still struggling to solve the paradox of performance and complexity trade-offs. In this paper, we propose an Asymmetric Factorized Bilinear Operation (AFBO) to replace FFN of vision transformer (ViT), which attempts to efficiently explore rich statistics of token features for achieving better performance and complexity trade-off. Specifically, our AFBO computes second-order statistics via a spatial-channel factorized bilinear operation for feature learning, which replaces a simple linear projection in FFN and enhances the feature learning ability of ViT by modeling second-order correlation among token features. Furthermore, our AFBO presents two structured-sparsity channel mapping strategies, namely Grouped Cross Channel Mapping (GCCM) and Overlapped Cycle Channel Mapping (OCCM). They decompose bilinear operation into grouped channel features by considering information interaction between groups, significantly reducing computational complexity while guaranteeing model performance. Finally, our AFBO is built with GCCM and OCCM in an asymmetric way, aiming to achieve a better trade-off. Note that our AFBO is model-agnostic, which can be flexibly integrated with existing ViTs. Experiments are conducted with twenty ViTs on various tasks, and the results show our AFBO is superior to its counterparts while improving existing ViTs in terms of generalization and robustness.

4255Information Bottleneck for Active Feature Acquisition

[openreview] [pdf]

Abstract Traditional supervised learning typically assumes that all features are available simultaneously during deployment. However, this assumption does not hold in many real-world scenarios, such as medicine, where information is acquired sequentially based on an evolving understanding of a specific patient’s condition. Active Feature Acquisition aims to address this problem by dynamically selecting which feature to measure based on the current observations, independently for each test instance. Current approaches either use Reinforcement Learning, which suffers from training difficulties; or greedily maximize the conditional mutual information of the label and unobserved features, which inherently makes myopic acquisitions. To address these shortcomings, we introduce a novel method using information bottleneck. Via stochastic encodings, we make acquisitions by reasoning about the features across many possible unobserved realizations in a regularized latent space. Extensive evaluation on a large range of synthetic and real datasets demonstrates that our approach reliably outperforms a diverse set of baselines.

4256Evaluating Model Robustness Against Unforeseen Adversarial Attacks

[openreview] [pdf]

Abstract When considering real-world adversarial settings, defenders are unlikely to have access to the full range of deployment-time adversaries, and adversaries are likely to use realistic adversarial distortions that will not be limited to small LpL_p-constrained perturbations. To narrow in on this discrepancy between research and reality we introduce ImageNet-UA, a new benchmark for evaluating model robustness against a wide range of unforeseen adversaries. We make use of our benchmark to identify holes in current popular adversarial defense techniques, highlighting a rich space of techniques which can improve unforeseen robustness. We hope the greater variety and realism of ImageNet-UA will make it a useful tool for those working on real-world worst-case robustness, enabling development of more robust defenses which can generalize beyond attacks seen during training.

4257Evolving Alignment via Asymmetric Self-Play

[openreview] [pdf]

Abstract Current RLHF approaches for aligning large language models (LLMs) typically assume a fixed prompt distribution, which is sub-optimal and limits the generalization capabilities for language models. To address this issue, we introduce a general framework that casts alignment as an asymmetric game between two players: (i) a creator, which strategically generates informative prompt distributions using reward signals, and (ii) a solver, which learns to produce preferred responses on prompts produced by the creator.This framework of Evolving Alignment via Asymmetric Self-Play (eva), results in a simple and efficient approach that can utilize any existing RLHF algorithm.evaachieves a new state of the art in widely adopted alignment benchmarks, without the need of any additional human crafted prompts, e.g., it can improve the win rate of finetuned gemma-2-9b-it on Arena-Hard from 51.6% to 60.1% with DPO, from 55.7% to 58.9% with SPPO, from 52.3% to 60.7% with SimPO, and from 54.8% to 60.3% with ORPO, surpassing its 27B version and matching Claude-3-opus. Finally, we showevais effective and robust under various ablation settings.We hopeevacan serve as a scalable methodology for the research community to build open-ended, robust, and self-improving language agents, that align with human values.

4258InnateCoder: Learning Programmatic Options with Foundation Models

[openreview] [pdf]

Abstract Outside of transfer learning settings, reinforcement learning agents start their learning process from a clean slate. As a result, such agents have to go through a slow process to learn even the most obvious skills required to solve a problem. In this paper, we present InnateCoder, a system that leverages human knowledge encoded in foundation models to provide programmatic policies that encode “innate skills” in the form of temporally extended actions, or options. In contrast to existing approaches to learning options, InnateCoder learns them from the general human knowledge encoded in foundation models in a zero-shot setting, and not from the knowledge the agent gains by interacting with the environment. Then, InnateCoder searches for a programmatic policy by combining the programs encoding these options into a larger and more complex program. We hypothesized that InnateCoder’s scheme of learning and using options could improve the sampling efficiency of current methods for synthesizing programmatic policies. We evaluated our hypothesis in MicroRTS and Karel the Robot, two challenging domains. Empirical results support our hypothesis, since they show that InnateCoder is more sample efficient than versions of the system that do not use options or learn the options from experience. The policies InnateCoder learns are competitive and often outperform current state-of-the-art agents in both domains.

4259How Reliable Is Human Feedback For Aligning Large Language Models?

[openreview] [pdf]

Abstract Most alignment research today focuses on designing new learning algorithms using datasets like Anthropic-HH, assuming human feedback data is inherently reliable. However, little attention has been given to the qualitative unreliability of human feedback and its impact on alignment. To address this gap, we conduct a comprehensive study and provide an in-depth analysis of human feedback data. We assess feedback reliability using a committee of gold reward models, revealing that over 25% of the dataset shows low or no agreement with these models, implying a high degree of unreliability. Through a qualitative analysis, we identify six key sources of unreliability, such as mis-labeling, subjective preferences, differing criteria and thresholds for helpfulness and harmlessness, etc. Lastly, to mitigate unreliability, we propose Source-Aware Cleaning, an automatic data-cleaning method guided by the insight of our qualitative analysis, to significantly improve data quality. Extensive experiments demonstrate that models trained on our cleaned dataset, HH-Clean, substantially outperform those trained on the original dataset. We release HH-Clean to support more reliable LLM alignment research in the future.

4260Temperature Optimization for Bayesian Deep Learning

[openreview] [pdf]

Abstract The Cold Posterior Effect (CPE) is a phenomenon in Bayesian Deep Learning (BDL), where tempering the posterior to a cold temperature often improves the predictive performance of the posterior predictive distribution (PPD). Although the term `CPE’ suggests colder temperatures are inherently better, the BDL community increasingly recognizes that this is not always the case. Despite this, there remains no systematic method for finding the optimal temperature beyond grid search. In this work, we propose a data-driven approach to select the temperature that maximizes test log-predictive density, treating the temperature as a model parameter and estimating it directly from the data. We empirically demonstrate that our method performs comparably to grid search, at a fraction of the cost, across both regression and classification tasks. Finally, we highlight the differing perspectives on CPE between the BDL and Generalized Bayes communities: while the former primarily focuses on predictive performance of the PPD, the latter emphasizes calibrated uncertainty and robustness to model misspecification; these distinct objectives lead to different temperature preferences.

4261Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models

[openreview] [pdf]

Abstract Recently, Multimodal Large Language Models (MLLMs) achieve remarkable performance in numerous zero-shot tasks due to their outstanding cross-modal interaction and comprehension abilities. However, MLLMs are found to still be vulnerable to human-imperceptible adversarial examples. In the exploration of security vulnerabilities in real-world scenarios, transferability, which can achieve cross-model impact, is considered the greatest threat posed by adversarial examples. However, there is currently no systematic research on the threat of cross-MLLMs adversarial transferability. Therefore, this paper as the first step to provide a comprehensive evaluation of the transferability of adversarial examples generated by various MLLMs. Furthermore, leveraging two key factors that influence transferability performance: 1) The strength of information diversity involved in the adversarial generation process; 2) Editing across vision-language modality information. We propose a boosting method called Typography Augment Transferability Method (TATM) to investigate the adversarial transferability performance across MLLMs further. Through extensive experimental validation, our TATM demonstrates exceptional performance in real-world applications of “Harmful Word Insertion” and “Important Information Protection.”

4262Less is More: Adaptive Coverage for Synthetic Training Data

[openreview] [pdf]

Abstract Synthetic training data generation with Large Language Models (LLMs) like Google’s Gemma and OpenAI’s GPT offer a promising solution to the challenge of obtaining large, labeled datasets for training classifiers, especially when rapid model deployment is critical, such as classifying emerging social media trends or combating new forms of online abuse tied to current events. While prior research has examined the comparability of synthetic data to human-labeled data, this study introduces a novel sampling algorithm based on the maximum coverage problem to select a representative subset from a synthetically generated dataset. Our results demonstrate that training a classifier on this contextually sampled subset achieves superior performance compared to training on the entire dataset. This ``less is more’’ approach not only improves accuracy but also reduces the volume of data required, leading to potentially more efficient training.

4263GIFT: Unlocking Full Potential of Labels in Distilled Dataset at Near-zero Cost

[openreview] [pdf]

Abstract Recent advancements in dataset distillation have demonstrated the significant benefits of employing soft labels generated by pre-trained teacher models. In this paper, we introduce a novel perspective by emphasizing the full utilization of labels. We first conduct a comprehensive comparison of various loss functions for soft label utilization in dataset distillation, revealing that the model trained on the synthetic dataset exhibits high sensitivity to the choice of loss function for soft label utilization. This finding highlights the necessity of a universal loss function for training models on synthetic datasets. Building on these insights, we introduce an extremely simple yet surprisingly effective plug-and-play approach, GIFT, which encompasses soft label refinement and a cosine similarity-based loss function to efficiently leverage full label information. Extensive experiments indicate that GIFT consistently enhances state-of-the-art dataset distillation methods across various dataset scales without incurring additional computational costs. Importantly, GIFT significantly enhances cross-optimizer generalization, an area previously overlooked. For instance, on ImageNet-1K with IPC = 10, GIFT enhances the state-of-the-art method RDED by 30.8% in cross-optimizer generalization.

4264Mechanism and emergence of stacked attention heads in multi-layer transformers

[openreview] [pdf]

Abstract In this paper, we introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers. The task has an adjustable difficulty that can further increase the required number of layers to any arbitrary value. We demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, we train several transformers on a minimal formulation. We find that successful learning occurs only under the presence of an implicit curriculum. We uncover the learned mechanisms by studying the attention maps in the trained transformers. We also study the training process, uncovering that attention heads always emerge in a specific sequence.

4265Language-Assisted Feature Transformation for Anomaly Detection

[openreview] [pdf]

Abstract This paper introduces LAFT, a novel feature transformation method designed to incorporate user knowledge and preferences into anomaly detection using natural language. Accurately modeling the boundary of normality is crucial for distinguishing abnormal data, but this is often challenging due to limited data or the presence of nuisance attributes. While unsupervised methods that rely solely on data without user guidance are common, they may fail to detect anomalies of specific interest. To address this limitation, we propose Language-Assisted Feature Transformation (LAFT), which leverages the shared image-text embedding space of vision-language models to transform visual features according to user-defined requirements. Combined with anomaly detection methods, LAFT effectively aligns visual features with user preferences, allowing anomalies of interest to be detected. Extensive experiments on both toy and real-world datasets validate the effectiveness of our method.

4266WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning

[openreview] [pdf]

Abstract Large language models (LLMs) have shown remarkable potential as autonomous agents, particularly in web-based tasks. However, existing LLM web agents face significant limitations: high-performing agents rely on expensive proprietary LLM APIs, while open LLMs lack the necessary decision-making capabilities. This paper introduces WebRL, a novel self-evolving online curriculum reinforcement learning framework designed to train high-performance web agents using open LLMs. Our approach addresses key challenges in this domain, including the scarcity of training tasks, sparse feedback signals, and policy distribution drift in online learning. WebRL incorporates a self-evolving curriculum that generates new tasks from unsuccessful attempts, a robust outcome-supervised reward model (ORM), and adaptive reinforcement learning strategies to ensure consistent improvement. We apply WebRL to transform Llama-3.1 models into proficient web agents, achieving remarkable results on the WebArena-Lite benchmark. Our Llama-3.1-8B agent improves from an initial 4.8% success rate to 42.4%, while the Llama-3.1-70B agent achieves a 47.3% success rate across five diverse websites. These results surpass the performance of GPT-4-Turbo (17.6%) by over 160% relatively and significantly outperform previous state-of-the-art web agents trained on open LLMs (AutoWebGLM, 18.2%). Our findings demonstrate WebRL’s effectiveness in bridging the gap between open and proprietary LLM-based web agents, paving the way for more accessible and powerful autonomous web interaction systems.

4267Intermediate Layer Classifiers for OOD generalization

[openreview] [pdf]

Abstract Deep classifiers are known to be sensitive to data distribution shifts, primarily due to their reliance on spurious correlations in training data. It has been suggested that these classifiers can still find useful features in the network’s last layer that hold up under such shifts. In this work, we question the use of last-layer representations for out-of-distribution (OOD) generalisation and explore the utility of intermediate layers. To this end, we introduce Intermediate Layer Classifiers (ILCs). We discover that intermediate layer representations frequently offer substantially better generalisation than those from the penultimate layer. In many cases, zero-shot OOD generalisation using earlier-layer representations approaches the few-shot performance of retraining on penultimate layer representations. This is confirmed across multiple datasets, architectures, and types of distribution shifts. Our analysis suggests that intermediate layers are less sensitive to distribution shifts compared to the penultimate layer. These findings highlight the importance of understanding how information is distributed across network layers and its role in OOD generalisation, while also pointing to the limits of penultimate layer representation utility.

4268A Diffusion-based Generative Approach for Model-free Finite-time Control of Complex Systems

[openreview] [pdf]

Abstract Complex systems with nonlinear dynamics pose significant challenges for finite-time optimal control, especially when accurate system models are unavailable. This paper introduces DIFOCON (DIffusion Finite-time Optimal CONtrol), a novel data-driven framework for finite-time optimal control that operates without prior knowledge of system parameters or dynamics. DIFOCON reformulates the control problem as a generative task, optimizing control signal trajectories to guide systems to target states within a finite time. Our approach utilizes a diffusion model with a dual-Unet architecture to capture nonlinear system dynamics and generate entire control sequences in a single step. Additionally, an inverse dynamics module is integrated to ensure that the generated control signals are appropriate for complex systems. To further enhance performance, we propose a retraining strategy that improves out-of-distribution generalization. Experiments on two nonlinear complex systems demonstrate DIFOCON’s superior performance, reducing target loss by over 26.9% and control energy by over 15.8% compared to baselines while achieving up to 4 times faster convergence in practical steering tasks. The implementation of this work can be found athttps://anonymous.4open.science/r/DIFOCON-C019/.

4269FedDES: A Discrete-Event Simulator For Large-Scale Federated Learning

[openreview] [pdf]

Abstract We introduce FedDES, a performance simulator for Federated Learning (FL) that leverages Discrete Event Simulation (DES) techniques to model key events—such as client updates, communication delays, and aggregation operations—as discrete occurrences in time. This approach accurately captures the runtime features of FL systems, providing a high-fidelity simulation environment that closely mirrors real-world deployments. FedDES incorporates all three known aggregation settings: Synchronous (e.g., FedAvg and FedProx), Asynchronous (e.g., FedAsync and FedFa), and Semi-Asynchronous (e.g., FedBuff and FedCompass). Designed to be framework-, dataset-, and model-agnostic, FedDES allows researchers and developers to explore various configurations without restrictions. Our evaluations involving over 1,000 clients with heterogeneous computation and communication characteristics demonstrate that FedDES accurately models event distribution and delivers performance estimates within 2% error of real-world measurements. While real-world workloads often take hours to evaluate, FedDES generates detailed, timestamped event logs in just few seconds. As a result, FedDES can significantly accelerate FL developing and debugging cycles, enabling developers to rapidly prototype and evaluate algorithms and system designs, bypassing the need for costly, time-consuming real-world deployments. It offers valuable performance insights—such as identifying bottlenecks, stragglers, fault-tolerance mechanisms, and edge-case scenarios—facilitating the optimization of FL systems for efficiency, scalability, and resilience.

4270Systematic Outliers in Large Language Models

[openreview] [pdf]

Abstract Outliers have been widely observed in Large Language Models (LLMs), significantly impacting model performance and posing challenges for model compression. Therefore, understanding the mechanisms by which these outliers affect the models is important. However, existing works either highlight outliers to guide specific algorithmic design or analyze isolated instances without providing a comprehensive understanding. In this work, we present the first systematic analysis of outliers in LLMs. We identify and categorize three types of outliers—activation outliers, weight outliers, and attention outliers—and discover inherent connections between their occurrences. By employing numerical computation and gradient optimization methods, we analyze the causes of these outliers. We summarize their roles within the models, demonstrating through experiments that they function as implicit context-aware scaling factors in the attention mechanism. As these outliers arise from systematic influences, we call them as \textit{systematic outliers}. Our study not only deepens our understanding of Transformer-based LLMs but also shows that structurally eliminating outliers can accelerate convergence and improve model compression, offering fresh insights for future model design. The code has been submitted as supplementary material and will be released upon acceptance to facilitate further research.

4271Generalization Bounds and Model Complexity for Kolmogorov–Arnold Networks

[openreview] [pdf]

Abstract Kolmogorov–Arnold Network (KAN) is a network structure recently proposed in Liu et al. (2024) that offers improved interpretability and a more parsimonious design in many science-oriented tasks compared to multi-layer perceptrons. This work provides a rigorous theoretical analysis of KAN by establishing generalization bounds for KAN equipped with activation functions that are either represented by linear combinations of basis functions or lying in a low-rank Reproducing Kernel Hilbert Space (RKHS). In the first case, the generalization bound accommodates various choices of basis functions in forming the activation functions in each layer of KAN and is adapted to different operator norms at each layer. For a particular choice of operator norms, the bound scales with the l1l_1 norm of the coefficient matrices and the Lipschitz constants for the activation functions, and it has no dependence on combinatorial parameters (e.g., number of nodes) outside of logarithmic factors. Moreover, our result does not require the boundedness assumption on the loss function and, hence, is applicable to a general class of regression-type loss functions. In the low-rank case, the generalization bound scales polynomially with the underlying ranks as well as the Lipschitz constants of the activation functions in each layer. These bounds are empirically investigated for KANs trained with stochastic gradient descent on simulated and real data sets. The numerical results demonstrate the practical relevance of these bounds.

4272RedHat: Towards Reducing Hallucination in Essay Critiques with Large Language Models

[openreview] [pdf]

Abstract Essay critiques refer to the textual assessment of an essay, serving as the basis for the scoring of the essay, and are crucial for the improvements of the essay. Essay critique generation has received increasing attention after the blooming of large language models (LLMs), which show promising potential in writing and critiquing essays. Automatic critique generation can streamline both instructors and reviewers as well as spur LLM advancement in long context generation characterized by essay writing. However, current LLMs suffer from hallucinations when generating essay critiques, which are still under-explored in the community. To facilitate research in reliable essay critique generation, we first define this task with a unified input-output format as well as clear judging criteria. To minimize hallucinations in critique generation, we introduce RedHat, a novel approach that embeds the key information from essays directly into the generation process through document-level question-answering, ensuring critiques stay firmly anchored to the original text. We collected a large-scale, high-quality essay critique dataset called EssayC, annotated by human experts over multiple LLM-generated critiques, from a campus undergraduate essay writing course. We experimented RedHat backboned by commercial and open-sourced LLMs. Results showed that critiques generated by RedHat are preferred by human experts over baseline in 20% of cases on EssayC in detailedness and informativeness, with a decrement of 10% on hallucinations in our judging criteria.

4273DynFrs: An Efficient Framework for Machine Unlearning in Random Forest

[openreview] [pdf]

Abstract Random Forests are widely recognized for establishing efficacy in classification and regression tasks, standing out in various domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy concerns, as personal and confidential data are involved. With increasing demand for the right to be forgotten, particularly under regulations such as GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, insufficient attention was paid to this topic, and existing approaches face difficulties in being applied to real-world scenarios. Addressing this gap, we propose the DynFrs framework designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. Dynfrs leverages subsampling method Occ(q) and a lazy tag strategy Lzy, and is still adaptable to any Random Forest variant. In essence, Occ(q) ensures that each sample in the training set occurs only in a proportion of trees so that the impact of deleting samples is limited, and Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unnecessary modifications on tree structures. In experiments, applying Dynfrs on Extremely Randomized Trees yields substantial improvements, achieving orders of magnitude faster unlearning performance and better predictive accuracy than existing machine unlearning methods for Random Forests.

4274Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1

[openreview] [pdf]

Abstract The ability to plan a course of action that achieves a desired state of affairs has long been considered a core competence of intelligent agents and has been an integral part of AI research since its inception. With the advent of large language models (LLMs), there has been considerable interest in the question of whether or not they possess such planning abilities, but--despite the slew of new private and open source LLMs since GPT3--progress has remained slow. OpenAI claims that their recent o1 (Strawberry) model has been specifically constructed and trained to escape the normal limitations of autoregressive LLMs--making it a new kind of model: a Large Reasoning Model (LRM). In this paper, we evaluate the planning capabilities of two LRMs (o1-preview and o1-mini) on both planning and scheduling benchmarks. We see that while o1 does seem to offer significant improvements over autoregressive LLMs, this comes at a steep inference cost, while still failing to provide any guarantees over what it generates. We also show that combining o1 models with external verifiers--in a so-called LRM-Modulo system--guarantees the correctness of the combined system’s output while further improving performance.

4275Budgeted Online Continual Learning by Adaptive Layer Freezing and Frequency-based Sampling

[openreview] [pdf]

Abstract The majority of online continual learning (CL) advocates single-epoch training and imposes restrictions on the size of replay memory. However, single-epoch training would incur a different amount of computations per CL algorithm, and the additional storage cost to store logit or model in addition to replay memory is largely ignored in calculating the storage budget. Arguing different computational and storage budgets hinder fair comparison among CL algorithms in practice, we propose to use floating point operations (FLOPs) and total memory size in Byte as a metric for computational and memory budgets, respectively, to compare and develop CL algorithms in the same “total resource budget”. To improve a CL method in a limited total budget, we propose adaptive layer freezing that does not update the layers for less informative batches to reduce computational costs with a negligible loss of accuracy. In addition, we propose a memory retrieval method that allows the model to learn the same amount of knowledge as using random retrieval in fewer iterations. Empirical validations on the CIFAR-10/100, CLEAR-10/100, and ImageNet-1K datasets demonstrate that the proposed approach outperforms the state-of-the-art methods within the same total budget.

4276SaTran: An efficient Transformer exploiting Spatiotemporal Redundancies for Satellite Image Time Series Representation Learning

[openreview] [pdf]

Abstract Earth observation applications like crop yield prediction, solar energy prediction, land cover classification, etc., need large size Satellite Image Time Series (SITS) leading to huge computational requirements. A couple of BERT-based models exist which work at pixel level unable to exploit spatial correlation among pixels and also require ground truth at pixel granularity during fine-tuning, rendering them infeasible for prediction tasks. The models based on Vision Transformer factorize spatial and time dimensions and first process images and then time series of image embeddings. However, in many cases, SITS require simultaneous analysis of both dimensions. We present a transformer, SaTran, which focuses on non-redundant patch tubes to overcome the limitations listed above. Transformers developed for RGB videos are found lacking when applied to SITS data characterized by the presence of patches with spatiotemporal redundancy persisting throughout the time series. SITS data also has patches where temporal redundancy lasts only for a few timestamps. The salient features of SaTran include: 1) an automatic patch tube selection mechanism which ignores spatiotemporally redundant patches; 2) exploitation of spatial correlation between pixels by the processing of patch tubes and handling of their temporal redundancy using tube masking; 3) two-fold handling of redundancy and distributed application of VideoMAE enables space and time efficient processing of large size SITS; and 4) learning end task agnostic representation of entire time series. Extensive experimentation shows that SaTran outperforms competing models and exhibit state-of-the-art performance for various earth observation applications. The code is available on (.. will be given after acceptance..).

4277Large Language Models Suffer From Their Own Output: An Analysis of the Self-Consuming Training Loop

[openreview] [pdf]

Abstract Large Language Models (LLM) are already widely used to generate content for a variety of online platforms. As we are not able to safely distinguish LLM-generated content from human-produced content, LLM-generated content is used to train the next generation of LLMs, giving rise to a self-consuming training loop. From the image generation domain we know that such a self-consuming training loop reduces both quality and diversity of images finally ending in a model collapse. However, it is unclear whether this alarming effect can also be observed for LLMs. Therefore, we present the first study investigating the self-consuming training loop for LLMs. Further, we propose a novel method based on logic expressions that allows us to unambiguously verify the correctness of LLM-generated content, which is difficult for natural language text. We find that the self-consuming training loop produces correct outputs, however, the output declines in its diversity depending on the proportion of the used generated data. Fresh data can slow down this decline, but not stop it. Given these concerning results, we encourage researchers to study methods to negate this process.

4278Moral Alignment for LLM Agents

[openreview] [pdf]

Abstract Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital.The prevailing practice in alignment often relies on human preference data (e.g., in RLHF or DPO), in which values are implicit and are essentially deduced from relative preferences over different model outputs. In this work, instead of relying on human feedback, we introduce the design of reward functions that explicitly encode core human values for Reinforcement Learning-based fine-tuning of foundation agent models. Specifically, we use intrinsic rewards for the moral alignment of LLM agents.We evaluate our approach using the traditional philosophical frameworks of Deontological Ethics and Utilitarianism, quantifying moral rewards for agents in terms of actions and consequences on the Iterated Prisoner’s Dilemma (IPD) environment. We also show how moral fine-tuning can be deployed to enable an agent to unlearn a previously developed selfish strategy. Finally, we find that certain moral strategies learned on the IPD game generalize to several other matrix game environments. In summary, we demonstrate that fine-tuning with intrinsic rewards is a promising general solution for aligning LLM agents to human values, and it might represent a more transparent and cost-effective alternative to currently predominant alignment techniques.

4279Do better language models have crisper vision?

[openreview] [pdf]

Abstract How well do text-only Large Language Models (LLMs) grasp the visual world? As LLMs are increasingly used in computer vision, addressing this question becomes both fundamental and pertinent. However, existing studies have primarily focused on limited scenarios, such as their ability to generate visual content or cluster multimodal data. To this end, we propose the Visual Text Representation Benchmark (ViTeRB) to isolate key properties that make language models well-aligned with the visual world. With this, we identify large-scale decoder-based LLMs as ideal candidates for representing text in vision-centric contexts, counter to the current practice of utilizing text encoders. Building on these findings, we propose ShareLock, an ultra-lightweight CLIP-like model. By leveraging precomputable frozen features from strong vision and language models, ShareLock achieves an impressive 51% accuracy on ImageNet despite utilizing just 563k image-caption pairs. Moreover, training requires only 1 GPU hour (or 10 hours including the precomputation of features) - orders of magnitude less than prior methods. Code will be released.

4280Social Learning: Towards Collaborative Learning with Large Language Models

[openreview] [pdf]

Abstract We introduce the framework of “social learning” in the context of large language models (LLMs), whereby models share knowledge with each other in a privacy-aware manner using natural language. We present and evaluate two approaches for knowledge transfer between LLMs. In the first scenario, we allow the model to generate abstract prompts aiming to teach the task. In our second approach, models transfer knowledge by generating synthetic examples. We evaluate these methods across diverse datasets and quantify memorization as a proxy for privacy loss. These techniques inspired by social learning yield promising results with low memorization of the original data. In particular, we show that performance using these methods is comparable to results with the use of original labels and prompts. Our work demonstrates the viability of social learning for LLMs, establishes baseline approaches and highlights several unexplored areas for future work.

4281Benchmarking Machine Learning Methods for Stock Prediction

[openreview] [pdf]

Abstract Machine learning has been widely applied to stock movement prediction. However, research in this field is often hindered by the lack of high-quality benchmark datasets and comprehensive evaluation methods. To address these challenges, we introduce \textit{BenchStock}, a benchmark that includes standardized datasets from the two largest stock markets (the U.S. and China) along with an evaluation method designed to facilitate a thorough examination of machine learning stock prediction methods. This benchmark covers a range of models, from traditional machine learning techniques to the latest deep learning approaches. Using BenchStock, we conducted large-scale experiments predicting individual stock returns over three decades in both markets to assess both short-term and long-term performance. To evaluate the impact of these predictions in actual market conditions, we constructed a portfolio based on the predictions and used a backtesting program to simulate its performance. The experiments revealed several key findings that have not been reported: 1) Most methods outperformed the S&P 500 in the U.S. market but experienced significant losses in the Chinese market. 2) Prediction accuracy of a method was not correlated with its portfolio return. 3) Advanced deep learning methods did not outperform traditional approaches. 4) The performance of the models was highly dependent on the testing period. These findings highlight the complexity of stock prediction and call for more in-depth machine learning research in this field.

4282ConcreTizer: Model Inversion Attack via Occupancy Classification and Dispersion Control for 3D Point Cloud Restoration

[openreview] [pdf]

Abstract The growing use of 3D point cloud data in autonomous vehicles (AVs) has raised serious privacy concerns, particularly due to the sensitive information that can be extracted from 3D data. While model inversion attacks have been widely studied in the context of 2D data, their application to 3D point clouds remains largely unexplored. To fill this gap, we present the first in-depth study of model inversion attacks aimed at restoring 3D point cloud scenes. Our analysis reveals the unique challenges, the inherent sparsity of 3D point clouds and the ambiguity between empty and non-empty voxels after voxelization, which are further exacerbated by the dispersion of non-empty voxels across feature extractor layers. To address these challenges, we introduce ConcreTizer, a simple yet effective model inversion attack designed specifically for 3D point cloud data. ConcreTizer incorporates Voxel Occupancy Classification to distinguish between empty and non-empty voxels and Dispersion-Controlled Supervision to mitigate non-empty voxel dispersion. Extensive experiments on widely used 3D feature extractors and benchmark datasets, such as KITTI and Waymo, demonstrate that ConcreTizer concretely restores the original 3D point cloud scene from disrupted 3D feature data. Our findings highlight both the vulnerability of 3D data to inversion attacks and the urgent need for robust defense strategies.

4283GFlowNets Need Automorphism Correction for Unbiased Graph Generation

[openreview] [pdf]

Abstract Generative Flow Networks (GFlowNets) are generative models capable of producing graphs. While GFlowNet theory guarantees that a fully trained model samples from an unnormalized target distribution, computing state transition probabilities remains challenging due to the presence of equivalent actions that lead to the same state. In this paper, we analyze the properties of equivalent actions in the context of graph generation tasks and propose efficient solutions to address this problem. Our theoretical analysis reveals that naive implementations, which ignore equivalent actions, introduce systematic bias in the sampling distribution for both atom-based and fragment-based graph generation. This bias is directly related to the number of symmetries in a graph, a factor that is particularly critical in applications such as drug discovery, where symmetry plays a key role in molecular structure and function. Experimental results demonstrate that a simple reward-scaling technique not only enables the generation of graphs that closely match the target distribution but also facilitates the sampling of diverse and high-reward samples.

4284Maximum Noise Level as Third Optimality Criterion in Black-box Optimization Problem

[openreview] [pdf]

Abstract This paper is devoted to the study (common in many applications) of the black-box optimization problem, where the black-box represents a gradient-free oracle f~p=f(x)+ξp\tilde{f}_p = f(x) + \xi_p providing the objective function value with some stochastic noise. Assuming that the objective function is μ\mu-strongly convex, and also not just LL-smooth, but has a higher order of smoothness (β2\beta \geq 2) we provide a novel optimization method:Zero-Order Accelerated Batched Stochastic Gradient Descent, whose theoretical analysis closes the question regarding the iteration complexity,achieving optimal estimates. Moreover, we provide a thorough analysis of the maximum noise level, and show under which condition the maximum noise level will take into account information about batch size BB as well as information about the smoothness order of the function β\beta. Finally, we show the importance of considering the maximum noise level Δ\Delta as a third optimality criterion along with the standard two on the example of a numerical experiment of interest to the machine learning community, where we compare with SOTA gradient-free algorithms.

4285Reinforcement Learning with Action Sequence for Data-Efficient Robot Learning

[openreview] [pdf]

Abstract Training reinforcement learning (RL) agents on robotic tasks typically requires a large number of training samples. This is because training data often consists of noisy trajectories, whether from exploration or human-collected demonstrations, making it difficult to learn value functions that understand the effect of taking each action. On the other hand, recent behavior-cloning (BC) approaches have shown that predicting a sequence of actions enables policies to effectively approximate noisy, multi-modal distributions of expert demonstrations. Can we use a similar idea for improving RL on robotic tasks? In this paper, we introduce a novel RL algorithm that learns a critic network that outputs Q-values over a sequence of actions. By explicitly training the value functions to learn the consequence of executing a series of current and future actions, our algorithm allows for learning useful value functions from noisy trajectories. We study our algorithm across various setups with sparse and dense rewards, and with or without demonstrations, spanning mobile bi-manual manipulation, whole-body control, and tabletop manipulation tasks from BiGym, HumanoidBench, and RLBench. We find that, by learning the critic network with action sequences, our algorithm outperforms various RL and BC baselines, in particular on challenging humanoid control tasks.

4286Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning—enabling a trained model to forget specific data—is crucial for addressing biased data and adhering to privacy regulations like the General Data Protection Regulation (GDPR)'s ``right to be forgotten." Recent works have paid little attention to privacy concerns, leaving the data intended for forgetting vulnerable to membership inference attacks. Moreover, they often come with high computational overhead. In this work, we propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data efficiently and in a privacy-preserving manner. Our method replaces the final-layer output probabilities of the neural network with pseudo-probabilities for the data to be forgotten. These pseudo-probabilities follow either a uniform distribution or align with the model’s overall distribution, enhancing privacy and reducing risk of membership inference attacks. Our optimization strategy further refines the predictive probability distributions and updates the model’s weights accordingly, ensuring effective forgetting with minimal impact on the model’s overall performance. Through comprehensive experiments on multiple benchmarks, our method achieves over 20% improvements in forgetting error compared to the state-of-the-art. Additionally, our method enhances privacy by preventing the forgotten set from being inferred to around random guesses.

4287EM-DARTS: Preventing Performance Collapse in Differentiable Architecture Search with The Edge Mutation Mechanism

[openreview] [pdf]

Abstract Differentiable Architecture Search (DARTS) relaxes the discrete search space into a continuous form, significantly improving architecture search efficiency through gradient-based optimization. However, DARTS often suffers from performance collapse, where the performance of discovered architectures degrades during the search process, and the final architectures tend to be dominated by excessive skip-connections. In this work, we analyzes how continuous relaxation impacts architecture optimization, identifying two main causes for performance collapse. First, the continuous relaxation framework introduces coupling between network weights and architecture parameters. This coupling leads to insufficient training of parametric operations, resulting in smaller architecture parameters for these operations. Second, DARTS’s unrolled estimation property leads to larger architecture parameters for skip-connections. To attack this issue, we propose Edge Mutation Differentiable Architecture Search (EM-DARTS), which mutates DARTS supernet edges during network weight updates. EM-DARTS reduces the impact of architecture parameters on parametric operations, allowing for better training of the parametric operations, thereby increasing their architecture parameters and preventing performance collapse. Theoretical results and experimental studies across diverse search spaces and datasets validate the effectiveness of the proposed method.

4288Dynamic Skill Adaptation for Large Language Models

[openreview] [pdf]

Abstract We present Dynamic Skill Adaptation (DSA), an adaptive and dynamic framework to adapt novel and complex skills to Large Language Models (LLMs). Compared with previous work which learns from human-curated and static data in random orders, we propose to first automatically generate and organize the training data by mimicking the learning pathways of human and then dynamically tailor the training data based on the training dynamics. Specifically, inspired by the learning structures and teaching strategies in the human education system, we first construct a skill graph by decomposing complex skills into sub-skills and arranging them based on their dependencies in human syllables. For every skill, we utilize LLMs to generate both textbook-like data which contains detailed descriptions of skills for pre-training and exercise-like data which targets at explicitly utilizing the skills to solve problems for instruction-tuning. Furthermore, during the instruction-tuning, we dynamically update the training data which down-weight easy-to-learn examples, generate more complex examples, and filter out data with errors. Experiments on large language models such as LLAMA and Mistral demonstrate the effectiveness of our proposed methods in adapting math reasoning skills and social study skills.

4289Using GNNs to Model Biased Crowdsourced Data for Urban Applications

[openreview] [pdf]

Abstract Graph neural networks (GNNs) are widely used to make predictions on graph-structured data in urban spatiotemporal forecasting applications, such as predicting infrastructure problems and weather events. In urban settings, nodes have a true latent state (e.g., street condition) that is sparsely observed (e.g., via government inspection ratings). We more frequently observe biased proxies for the latent state (e.g., via crowdsourced reports) that correlate with resident demographics. We introduce a GNN-based model that uses both unbiased rating data and biased reporting data to predict the true latent state. We show that our approach can both recover the latent state at each node and quantify the reporting biases. We apply our model to a case study of urban incidents using reporting data from New York City 311 complaints across 141 complaint types and rating data from government inspections. We show (i) that our model predicts more correlated ground truth latent states compared to prior work which trains models only on the biased reporting data, (ii) that our model’s inferred reporting biases capture known demographic biases, and (iii) that our model’s learned ratings capture correlations across locations and between complaint types. Especially in urban crowdsourcing applications, our analysis reveals a widely applicable approach for using GNNs and sparse ground truth data to estimate latent states.

4290Retrieval or Reasoning: The Roles of Graphs and Large Language Models in Efficient Knowledge-Graph-Based Retrieval-Augmented Generation

[openreview] [pdf]

Abstract Large Language Models (LLMs) demonstrate strong reasoning abilities but face limitations such as hallucinations and outdated knowledge. Knowledge Graph (KG)-based Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLM outputs in structured external knowledge from KGs. However, current KG-based RAG frameworks still struggle to optimize the trade-off between retrieval accuracy and efficiency in identifying a suitable amount of relevant graph information for the LLM to digest. We introduce SubgraphRAG, extending the KG-based RAG framework that retrieves subgraphs centered on query/topic entities and leverages LLMs for reasoning. Our approach innovatively integrates a lightweight multilayer perceptron (MLP) with a parallel triple-scoring mechanism for efficient subgraph retrieval while encoding directional structural distances to enhance retrieval accuracy. The size of retrieved subgraphs can be flexibly adjusted to match the query’s need and the downstream LLM’s reasoning capacity. This design strikes a balance between model complexity and reasoning power, enabling scalable and generalizable retrieval processes. Notably, based on our retrieved subgraphs, smaller models like Llama3.1-8B deliver competitive results with explainable reasoning, while larger models like GPT-4o achieve comparable or better state-of-the-art accuracy compared with previous baselines—all without fine-tuning. Extensive evaluations on the WebQSP and CWQ benchmarks highlight SubgraphRAG’s strengths in efficiency, accuracy, and reliability by reducing hallucinations and improving response grounding.

4291StyleGuide: Crafting visual style prompting with negative visual query guidance

[openreview] [pdf]

Abstract In the domain of text-to-image generation, diffusion models have emerged as powerful tools. Recently, studies on visual prompting, where images are used as prompts, have enabled more precise control over style and content. However, existing methods often suffer from content leakage, where undesired elements from the visual style prompt are transferred along with the intended style (content leakage). To address this issue, we 1) extends classifier-free guidance (CFG) to utilize swapping self-attention and propose 2)negative visual query guidance (NVQG) to reduce the transfer of unwanted contents. NVQG employ negative score by intentionally simulating content leakage scenarios which swaps queries instead of key and values of self-attention layers from visual style prompts. This simple yet effective method significantly reduces content leakage. Furthermore, we provide careful solutions for using a real image as a visual style prompts and for image-to-image (I2I) tasks. Through extensive evaluation across various styles and text prompts, our method demonstrates superiority over existing approaches, reflecting the style of the references and ensuring that resulting images match the text prompts.

4292MULTIMODAL GENERATIVE AI FOR STORY POINT ESTIMATION

[openreview] [pdf]

Abstract This research explores the application of Multimodal Generative AI to enhance story point estimation in Agile software development. By integrating text, image, and categorical data using advanced models like BERT, CNN, and XGBoost, our approach surpasses the limitations of traditional single-modal estimation methods. The results demonstrate good accuracy for simpler story points, while also highlighting challenges in more complex categories due to data imbalance. This study further explores the impact of categorical data, particularly severity, on the estimation process, emphasizing its influence on model performance. Our findings emphasize the transformative potential of multimodal data integration in refining AI-driven project management, paving the way for more precise, adaptable, and domain-specific AI capabilities. Additionally, this work outlines future directions for addressing data variability and enhancing the robustness of AI in Agile methodologies.

4293FMP-AE: A HYBRID APPROACH TO TIME SERIES ANOMALY DETECTION

[openreview] [pdf]

Abstract Unsupervised anomaly detection in time series presents significant challenges, particularly due to the lack of clear criteria and the prevalence of highly imbalanced data. Traditional statistical and machine learning methods often struggle with low recall rates and computational inefficiency. While deep learning techniques offer the advantage of automatic feature extraction, they are also affected by the issue of data imbalance. This paper introduces an integrated time series anomaly detection model, Feature map Matrix Profile with an AutoEncoder (FMP-AE), which combines matrix profile structures with deep learning techniques. The model leverages a one-dimensional convolutional neural network (1D-CNN) to extract features and compute the matrix profile. Then a novel Matrix Profile loss function is defined and integrated with the Autoencoder’s reconstruction loss for model training and anomaly detection. Experimental results on the UCR250 benchmark datasets highlight the model’s impressive performance, showing notable success across various metrics, including accuracy, precision, recall, F1-score, and AUC. These findings indicate that the hybrid FMP-AE model, significantly improves accuracy, robustness, and computational efficiency in anomaly detection tasks.

4294Learn-by-interact: A Data-Centric Framework For Self-Adaptive Agents in Realistic Environments

[openreview] [pdf]

Abstract Autonomous agents powered by large language models (LLMs) have the potential to enhance human capabilities, assisting with digital tasks from sending emails to performing data analysis. The abilities of existing LLMs at such tasks are often hindered by the lack of high-quality agent data from the corresponding environments they interact with. We propose LEARN-BY-INTERACT, a data-centric framework to adapt LLM agents to any given environments without human annotations. LEARN-BY-INTERACT synthesizes trajectories of agent-environment interactions based on documentations, and constructs instructions by summarizing or abstracting the interaction histories, a process called backward construction. We assess the quality of our synthetic data by using them in both training-based scenarios and training-free in-context learning (ICL), where we craft innovative retrieval approaches optimized for agents. Extensive experiments on SWE-bench, WebArena, OSWorld, and Spider2-V spanning across realistic coding, web, and desktop environments show the effectiveness of LEARN-BY-INTERACT in various downstream agentic tasks — baseline results are improved up to 11.1% for ICL with Claude-3.5 and 23.1% for training with Codestral-22B. We further demonstrate the critical role of backward construction, which provides up to 10.6% improvement for training. Our ablation studies demonstrate the efficiency provided by our synthesized data in ICL and the superiority of our retrieval pipeline over alternative approaches like conventional retrieval-augmented generation (RAG). We expect that LEARN-BY-INTERACT will serve as a foundation for agent data synthesis as LLMs are increasingly deployed at real-world environments.

4295Multi-modal graph neural networks for localized off-grid weather forecasting

[openreview] [pdf]

Abstract Urgent applications like wildfire management and renewable energy generation require precise, localized weather forecasts near the Earth’s surface. However, weather forecast products from machine learning or numerical weather models are currently generated on a global regular grid, on which a naive interpolation cannot accurately reflect fine-grained weather patterns close to the ground. In this work, we train a heterogeneous graph neural network (GNN) end-to-end to downscale gridded forecasts to off-grid locations of interest. This multi-modal GNN takes advantage of local historical weather observations (e.g., wind vector, temperature) to correct the gridded weather forecast at different lead times towards locally accurate forecasts. Each data modality is modeled as a different type of node in the graph. Using message passing, the node at the prediction location aggregates information from its heterogeneous neighbor nodes. Experiments using weather stations across the Northeastern United States show that our model outperforms a range of data-driven and non-data-driven off-grid forecasting methods. Our approach demonstrates how the gap between global large-scale weather models and locally accurate predictions can be bridged to inform localized decision-making.

4296Linear Attention Sequence Parallelism

[openreview] [pdf]

Abstract Sequence parallelism (SP) serves as a prevalent strategy to handle long sequences that exceed the memory limit of a single device. However, for linear sequence modeling methods like linear attention, existing SP approaches do not take advantage of their right-product-first feature, resulting in sub-optimal communication efficiency and usability. In this paper, we introduce Linear Attention Sequence Parallelism (LASP), an efficient SP approach designed for linear attention-based transformer models. Specifically, we design an efficient point-to-point ring-style communication mechanism to leverage the right-product kernel trick of linear attention, which sharply decreases the communication overhead, comparing with existing SP methods. We enhance the computation efficiency of LASP by performing kernel fusion and intermediate state caching, making the implementation of LASP hardware-friendly on GPUs. Furthermore, we meticulously ensure the compatibility of sequence-level LASP with all types of batch-level data parallel methods, which is vital for distributed training on large clusters with very-long sequences. We also discuss the generalization of LASP on other linear sequence modeling methods. Extensive experiments on linear attention-based models are conducted with varying sequence lengths from 2K to 4096K. LASP scales sequence length up to 4096K on 128 GPUs, which is 8×\times longer than existing SP methods.

4297Learning-Augmented Streaming Algorithms for Correlation Clustering

[openreview] [pdf]

Abstract We study streaming algorithms for Correlation Clustering. Given a complete graph as an arbitrary-order stream of edges, with each edge labelled as positive or negative, the goal is to partition the vertices into disjoint clusters, such that the number of disagreements is minimized. In this paper, we introduce the first learning-augmented streaming algorithms for the problem, achieving the first better-than-3-approximation in dynamic streams. Our algorithms draw inspiration from recent works of Cambus et al. (SODA’24), and Chakrabarty and Makarychev (NeurIPS’23). Our algorithms use the predictions of pairwise dissimilarities between vertices provided by a predictor and achieve an approximation ratio that is close to 2.06 under good prediction quality. Even if the prediction quality is poor, our algorithms cannot perform worse than the well known Pivot algorithm, which achieves a 3-approximation. Our algorithms are much simpler than the recent 1.847-approximation streaming algorithm by Cohen-Addad et al. (STOC’24) which appears to be challenging to implement and is restricted to insertion-only streams. Experimental results on synthetic and real-world datasets demonstrate the superiority of our proposed algorithms over their non-learning counterparts.

4298FuseChat: Knowledge Fusion of Chat Models

[openreview] [pdf]

Abstract While training large language models (LLMs) from scratch can indeed lead to models with distinct capabilities and strengths, it incurs substantial costs and may lead to redundancy in competencies. Knowledge fusion aims to integrate existing LLMs of diverse architectures and capabilities into a more potent LLM through lightweight continual training, thereby reducing the need for costly LLM development. In this work, we propose a new framework for the knowledge fusion of chat LLMs through two main stages, resulting in FuseChat. Firstly, we conduct pairwise knowledge fusion on source chat LLMs of varying structures and scales to create multiple target LLMs with identical structure and size via lightweight fine-tuning. During this process, a statistics-based token alignment approach is introduced as the cornerstone for fusing LLMs with different structures. Secondly, we merge these target LLMs within the parameter space, where we propose a novel method for determining the merging coefficients based on the magnitude of parameter updates before and after fine-tuning. We implement and validate FuseChat using six prominent chat LLMs with diverse architectures and scales, including OpenChat-3.5-7B, Starling-LM-7B-alpha, NH2-SOLAR-10.7B, InternLM2-Chat-20B, Mixtral-8x7B-Instruct, and Qwen-1.5-Chat-72B. Experimental results on two instruction-following benchmarks, AlpacaEval 2.0 and MT-Bench, demonstrate the superiority of FuseChat-7B over baselines of various sizes. Our model is even comparable to the larger Mixtral-8x7B-Instruct and approaches GPT-3.5-Turbo-1106 on MT-Bench.

4299Fast and Space-Efficient Fixed-Length Path Optimization

[openreview] [pdf]

Abstract Several optimization problems seek a path of predetermined length among network nodes that minimizes a cost function. Conventionally, such problems are tackled by dynamic programming (DP) applying a Bellman-type equation. A prominent example is Viterbi decoding, which returns the path in a Hidden Markov Model that best explains a series of observations, with applications from bioinformatics to communication systems and speech recognition. However, DP-based solutions (i) exhaustively explore a search space linear in both network size and path length in time quadratic in network size, without exploiting data characteristics, and (ii) require memory commensurate with that search space to reconstruct the optimal path. In this paper, we propose Isabella (Dijkstra-Bellman), a novel framework that finds optimal paths of predetermined length in time- and space-efficient fashion by a combination of best-first-search, depth-first-search, and divide-and-conquer strategies. The best-first-search component avoids the exhaustive exploration of the search space using a priority queue; the depth-first-search component keeps the size of that queue in check; and the divide-and-conquer component constructs the optimal path recursively and parsimoniously after determining its cost. We apply Isabella to Viterbi decoding, introducing algorithms that visit the most promising pathways first and control memory consumption. To emphasize the generality of Isabella, we also instantiate it with an algorithm for histogram construction. To our knowledge, no previous work addresses such problems in this manner. Our experimental evaluation shows our solutions to be highly time- and space-efficient compared to standard dynamic programming.

4300Towards Formally Verifying LLMs: Taming the Nonlinearity of the Transformer

[openreview] [pdf]

Abstract Large language models are increasingly used across various domains, which raises important safety concerns, particularly regarding adversarial attacks. While recent advancements in formal neural network verification have shown promising results, the complexity of transformers, the backbone of large language models, poses unique challenges for formal robustness verification. Traditional convex relaxation methods often result in large approximation errors due to the transformer’s parallel, nonlinear attention heads. In this work, we address these limitations by introducing a novel approach based on non-convex, set-based computing to preserve the nonlinear dependencies through a transformer. Our approach generalizes previous methods on robustness verification of transformers, and the desired precision is tunable at the cost of additional computation time with a single parameter.

4301Inference, Fast and Slow: Reinterpreting VAEs for OOD Detection

[openreview] [pdf]

Abstract lthough likelihood-based methods are theoretically appealing, deep generative models (DGMs) often produce unreliable likelihood estimates in practice, particu larly for out-of-distribution (OOD) detection. We reinterpret variational autoen coders (VAEs) through the lens of fast and slow weights. Our approach is guided by the proposed Likelihood Path (LPath) Principle, which extends the classical likelihood principle. A critical decision in our method is the selection of statistics for classical density estimation algorithms. The sweet spot should contain just enough information that’s sufficient for OOD detection but not too much to suffer from the curse of dimensionality. Our LPath principle achieves this by selecting the sufficient statistics that form the “path” toward the likelihood. We demonstrate that this likelihood path leads to SOTA OOD detection performance, even when the likelihood itself is unreliable.

4302ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model

[openreview] [pdf]

Abstract Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from sparse views is still an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction problem as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. Nevertheless, it is challenging to preserve 3D view consistency when directly generating video frames from pre-trained models. To address this issue, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by the condition, the video diffusion model then synthesizes video frames that are detail-preserved and exhibit a high degree of 3D consistency, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of ReconX over state-of-the-art methods in terms of quality and generalizability.

4303From Isolated Conversations to Hierachical Schemas: Dynamic Tree Memory Representation for LLMs

[openreview] [pdf]

Abstract Recent advancements in large language models have significantly improved their context windows, yet challenges in effective long-term memory management remain. We introduce MemTree, an algorithm that leverages a dynamic, tree-structured memory representation to optimize the organization, retrieval, and integration of information, akin to human cognitive schemas. MemTree organizes memory hierarchically, with each node encapsulating aggregated textual content, corresponding semantic embeddings, and varying abstraction levels across the tree’s depths. Our algorithm dynamically adapts this memory structure by computing and comparing semantic embeddings of new and existing information to enrich the model’s context-awareness. This approach allows MemTree to handle complex reasoning and extended interactions more effectively than traditional memory augmentation methods, which often rely on flat lookup tables. Evaluations on benchmarks for multi-turn dialogue understanding and document question answering show that MemTree significantly enhances performance in scenarios that demand structured memory management.

4304Graph Neural Networks Gone Hogwild

[openreview] [pdf]

Abstract Graph neural networks (GNNs) appear to be powerful tools to learn state representations for agents in distributed, decentralized multi-agent systems, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excludes these architectures from many potential applications where synchrony is difficult or impossible to enforce, e.g., robotic swarms or sensor networks. In this work we identify ‘‘implicitly-defined’’ GNNs as a class of architectures which is provably robust to asynchronous ‘‘hogwild’’ inference, adapting convergence guarantees from work in asynchronous and distributed optimization. We then propose a novel implicitly-defined GNN architecture, which we call an energy GNN. We show that this architecture outperforms other GNNs from this class on a variety of synthetic tasks inspired by multi-agent systems.

4305Safe Multi-agent Reinforcement Learning with Protection Motivation Theory

[openreview] [pdf]

Abstract A challenging problem for implementing multi-agent reinforcement learning (MARL) in real-world applications is ensuring the safety of cooperative strategies. According to the Protection Motivation Theory (PMT), threat appraisals result in negative emotions and elicit protective behaviors, which are instrumental for coping with security threats. Drawing inspiration from the PMT, we focus on two discrete emotions--fear and regret--to evaluate threat severity and facilitate multiple agents to learn protective behaviors. These can promote cooperative decision-making with fewer safety violations. Specifically, we propose two safety guarantee methods with PMT: fear for safety guarantee (F4SG) and regret for safety guarantee (R4SG), utilizing the active inference technique to model the emotions of fear and regret separately. The threat severity evaluated by these emotions influences the state value and the executed action respectively, which avoids the potential threat of visiting certain states or taking certain actions. Experimental results demonstrate that our proposed methods are safer and more efficient than state-of-the-art baselines on challenging tasks in safe MARL benchmarks.

4306Cracking the Collective Mind: Adversarial Manipulation in Multi-Agent Systems

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated significant capabilities across various domains such as healthcare, weather forecasting, finance, and law. These works have showcased the powerful abilities of individual LLMs. Recently, numerous studies have shown that coordinated multi-agent systems exhibit enhanced decision-making and reasoning capabilities through collaboration. However, since individual LLMs are susceptible to various adversarial attacks, a key vulnerability arises: Can an attacker manipulate the collective decision of such systems by accessing a single agent? To address this issue, we formulate it as a game with incomplete information, where agents lack full knowledge of adversarial strategies. We then propose a framework, M-Spoiler, which simulates a stubborn adversary in multi-agent debates during the training phase to tackle this problem. Through extensive experiments across various tasks, our findings confirm the risk of manipulation in multi-agent systems and demonstrate the effectiveness of our attack strategies. Additionally, we explore several defense mechanisms, revealing that our proposed attack method remains more potent than existing baselines, underscoring the need for further research on defensive strategies.

4307OmniChat: Enhancing Spoken Dialogue Systems with Scalable Synthetic Data for Diverse Scenarios

[openreview] [pdf]

Abstract With the rapid development of large language models, researchers have created increasingly advanced spoken dialogue systems that can naturally converse with humans. However, these systems still struggle to handle the full complexity of real-world conversations, including audio events, musical contexts, and emotional expressions, mainly because current dialogue datasets are constrained in both scale and scenario diversity. In this paper, we propose leveraging synthetic data to enhance the dialogue models across diverse scenarios. We introduceShareChatX, the first comprehensive, large-scale dataset for spoken dialogue that spans diverse scenarios. Based on this dataset, we introduceOmniChat, a multi-turn dialogue system with a heterogeneous feature fusion module, designed to optimize feature selection in different dialogue contexts. In addition, we explored critical aspects of training dialogue systems using synthetic data. Through comprehensive experimentation, we determined the ideal balance between synthetic and real data, achieving state-of-the-art results on the real-world dialogue dataset DailyTalk. We also highlight the crucial importance of synthetic data in tackling diverse, complex dialogue scenarios, especially those involving audio and music. For more details, please visit our demo page at \url{https://sharechatx.github.io/}.

4308Controllable Satellite-to-Street-View Synthesis with Precise Pose Alignment and Zero-Shot Environmental Control

[openreview] [pdf]

Abstract Generating street-view images from satellite imagery is a challenging task, particularly in maintaining accurate pose alignment and incorporating diverse environmental conditions. While diffusion models have shown promise in generative tasks, their ability to maintain strict pose alignment throughout the diffusion process is limited. In this paper, we propose a novel Iterative Homography Adjustment (IHA) scheme applied during the denoising process, which effectively addresses pose misalignment and ensures spatial consistency in the generated street-view images. Additionally, currently, available datasets for satellite-to-street-view generation are limited in their diversity of illumination and weather conditions, thereby restricting the generalizability of the generated outputs. To mitigate this, we introduce a text-guided illumination and weather-controlled sampling strategy that enables fine-grained control over the environmental factors. Extensive quantitative and qualitative evaluations demonstrate that our approach significantly improves pose accuracy and enhances the diversity and realism of generated street-view images, setting a new benchmark for satellite-to-street-view generation tasks.

4309On the Optimization and Generalization of Two-layer Transformers with Sign Gradient Descent

[openreview] [pdf]

Abstract The Adam optimizer is widely used for transformer optimization in practice, which makes understanding the underlying optimization mechanisms an important problem. However, due to the Adam’s complexity, theoretical analysis of how it optimizes transformers remains a challenging task. Fortunately, Sign Gradient Descent (SignGD) serves as an effective surrogate for Adam. Despite its simplicity, theoretical understanding of how SignGD optimizes transformers still lags behind. In this work, we study how SignGD optimizes a two-layer transformer -- consisting of a softmax attention layer with trainable query-key parameterization followed by a linear layer -- on a linearly separable noisy dataset. We identify four stages in the training dynamics, each exhibiting intriguing behaviors. Based on the training dynamics, we prove the fast convergence but poor generalization of the learned transformer on the noisy dataset. We also show that Adam behaves similarly to SignGD in terms of both optimization and generalization in this setting. Additionally, we find that the poor generalization of SignGD is not solely due to data noise, suggesting that both SignGD and Adam requires high-quality data for real-world tasks. Finally, experiments on synthetic and real-world datasets empirically support our theoretical results.

4310NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative

[openreview] [pdf]

Abstract Existing video captioning benchmarks and models lack coherent representations of causal-temporal narrative, which is sequences of events linked through cause and effect, unfolding over time and driven by characters or agents. This lack of narrative restricts models’ ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. To address this gap, we propose NarrativeBridge, an approach comprising of: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting, explicitly encoding cause-effect temporal relationships in video descriptions, evaluated automatically to ensure caption quality and relevance and validated through human evaluation; and (2) a dedicated Cause-Effect Network (CEN) architecture with separate encoders for capturing cause and effect dynamics independently, enabling effective learning and generation of captions with causal- temporal narrative. Extensive experiments demonstrate that CEN significantly outperforms state-of-the-art models, including fine-tuned vision-language models, and is more accurate in articulating the causal and temporal aspects of video content than the second best model (GIT): 17.88 and 17.44 CIDEr on the MSVD and MSR-VTT datasets, respectively. Cross-dataset evaluations further showcase CEN’s strong generalization capabilities. The proposed framework understands and generates nuanced text descriptions with intricate causal-temporal narrative structures present in videos, addressing a critical limitation in video captioning.

4311Self-Preference Bias in LLM-as-a-Judge

[openreview] [pdf]

Abstract Automated evaluation leveraging large language models (LLMs), commonly referred to as LLM evaluators or LLM-as-a-judge, has been widely used in measuring the performance of dialogue systems. However, the self-preference bias in LLMs has posed significant risks, including promoting specific styles or policies intrinsic to the LLMs. Despite the importance of this issue, there is a lack of established methods to measure the self-preference bias quantitatively, and its underlying causes are poorly understood. In this paper, we introduce a novel quantitative metric to measure the self-preference bias. Our experimental results demonstrate that GPT-4 exhibits a significant degree of self-preference bias. To explore the causes, we hypothesize that LLMs may favor outputs that are more familiar to them, as indicated by lower perplexity. We analyze the relationship between LLM evaluations and the perplexities of outputs. Our findings reveal that LLMs assign significantly higher evaluations to outputs with lower perplexity than human evaluators, regardless of whether the outputs were self-generated. This suggests that the essence of the bias lies in perplexity and that the self-preference bias occurs because the LLMs’ own outputs have lower perplexity.

4312Towards Understanding Multi-Round Large Language Model Reasoning: Approximability, Learnability and Generalizability

[openreview] [pdf]

Abstract Recent advancements in cognitive science and multi-round reasoning techniques for Large Language Models (LLMs) suggest that iterative thinking processes improve problem-solving performance in complex tasks. Inspired by this, approaches like Chain-of-Thought, debating, and self-refinement have been applied to auto-regressive LLMs, achieving significant successes in tasks such as mathematical reasoning, commonsense reasoning, and multi-hop question answering. Despite these successes, the theoretical basis for how multi-round reasoning enhances problem-solving abilities remains underexplored. In this work, we investigate the approximation, learnability, and generalization properties of multi-round auto-regressive models. We show that Transformers with finite context windows are universal approximators for steps of Turing-computable functions and can approximate any Turing-computable sequence-to-sequence function through multi-round reasoning. We extend PAC learning to sequence generation and demonstrate that multi-round generation is learnable even when the sequence length exceeds the model’s context window. Finally, we examine how generalization error propagates across rounds, and show how the aforementioned approaches can help constrain this error, ensuring outputs stay within an expectation boundary. This work sheds light on the systemic theoretical foundations of multi-round sequence learning and reasoning, emphasizing its role in inference complexity.

4313Think Beyond Size: Dynamic Prompting for More Effective Reasoning

[openreview] [pdf]

Abstract This paper presents Dynamic Prompting, a novel framework aimed at improving the reasoning capabilities of Large Language Models (LLMs). In contrast to conventional static prompting methods, Dynamic Prompting enables the adaptive modification of prompt sequences and step counts based on real-time task complexity and model performance. This dynamic adaptation facilitates more efficient problem-solving, particularly in smaller models, by reducing hallucinations and repetitive cycles. Our empirical evaluations demonstrate that Dynamic Prompting allows smaller LLMs to perform competitively with much larger models, thereby challenging the conventional emphasis on model size as the primary determinant of reasoning efficacy.

4314AERO: Softmax-Only LLMs for Efficient Private Inference

[openreview] [pdf]

Abstract The pervasiveness of proprietary language models has raised privacy concerns for users’ sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively higher communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis to understand the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and reducing FLOPs counts. For the {\em first time}, we propose a Softmax-only architecture with significantly fewer FLOPs tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to 4.23×\times communication and 1.94×\times latency reduction. We validate the effectiveness of AERO by benchmarking it against the state-of-the-art.

4315EarthquakeNPP: Benchmark Datasets for Earthquake Forecasting with Neural Point Processes

[openreview] [pdf]

Abstract Classical point process models, such as the epidemic-type aftershock sequence (ETAS) model, have been widely used for forecasting the event times and locations of earthquakes for decades. Recent advances have led to Neural Point Processes (NPPs), which promise greater flexibility and improvements over classical models. However, the currently-used benchmark dataset for NPPs does not represent an up-to-date challenge in the seismological community since it lacks a key earthquake sequence from the region and improperly splits training and testing data. Furthermore, initial earthquake forecast benchmarking lacks a comparison to state-of-the-art earthquake forecasting models typically used by the seismological community. To address these gaps, we introduce EarthquakeNPP: a collection of benchmark datasets to facilitate testing of NPPs on earthquake data, accompanied by a credible implementation of the ETAS model. The datasets cover a range of small to large target regions within California, dating from 1971 to 2021, and include different methodologies for dataset generation. In a benchmarking experiment, we compare three spatio-temporal NPPs against ETAS and find that none outperform ETAS in either spatial or temporal log-likelihood. These results indicate that current NPP implementations are not yet suitable for practical earthquake forecasting. However, EarthquakeNPP will serve as a platform for collaboration between the seismology and machine learning communities with the goal of improving earthquake predictability.

4316Reconstructing Training Data From Real-World Models Trained with Transfer Learning

[openreview] [pdf]

Abstract Current methods for reconstructing the training data from trained classifiers are restricted to very small models, limited training set sizes, and low-resolution images. Such restrictions hinder their applicability to real-world scenarios. In this paper, we present a novel approach enabling data reconstruction in realistic settings for models trained on high-resolution images. Our method adapts the reconstruction scheme of Haim et al. [2022] to real-world scenarios -- specifically, targeting models trained via transfer learning over image embeddings of large pre-trained models like DINO-ViT and CLIP. Our work employs data reconstruction in the embedding space rather than in the image space, showcasing its applicability beyond visual data. Moreover, we introduce a novel clustering-based method to identify good reconstructions from thousands of candidates. This significantly improves on previous works that relied on knowledge of the training set to identify good reconstructed images. Our findings shed light on a potential privacy risk for data leakage from models trained using transfer learning methods.

4317Deep Learning Aided Broadcast Codes With Feedback

[openreview] [pdf]

Abstract Deep learning aided codes have been shown to improve code performance in feedback codes in high noise regimes due to the ability to leverage non-linearity in code design. In the additive white Gaussian broadcast channel (AWGN-BC), the addition of feedback may allow the capacity region to extend far beyond the capacity region of the channel without feedback, enabling higher data rates. On the other hand, there are limited deep-learning aided implementations of broadcast codes. In this work, we extend two classes of deep-learning assisted feedback codes to the AWGN-BC channel; the first being an RNN-based architecture and the second being a lightweight MLP-based architecture. Both codes are trained using a global model, and then they are trained using a more realistic vertical federated learning based framework. We first show that in most cases, using an AWGN-BC code outperforms a linear-based concatenated scheme. Second, we show in some regimes, the lightweight architecture far exceeds the RNN-based code, but in especially unreliable conditions, the RNN-based code dominates. The results show the promise of deep-learning aided broadcast codes in unreliable channels, and future research directions are discussed.

4318Plots unlock time-series understanding in multimodal models

[openreview] [pdf]

Abstract While multimodal foundation models can now natively work with data beyond text, they remain underutilized in analyzing the considerable amounts of multi-dimensional time-series data in fields like healthcare, finance, and social sciences, representing a missed opportunity for richer, data-driven insights. This paper proposes a simple but effective method that leverages the existing vision encoders of these models to “see” time-series data via plots, avoiding the need for additional, potentially costly, model training. Our empirical evaluations show that this approach outperforms providing the raw time-series data as text, with the additional benefit that visual time-series representations demonstrate up to a 90% reduction in model API costs. We validate our hypothesis through synthetic data tasks of increasing complexity, progressing from simple functional form identification on clean data, to extracting trends from noisy scatter plots. To demonstrate generalizability from synthetic tasks with clear reasoning steps to more complex, real-world scenarios, we apply our approach to consumer health tasks – specifically fall detection, activity recognition, and readiness assessment – which involve heterogeneous, noisy data and multi-step reasoning. The overall success in plot performance over text performance (up to an 120% performance increase on zero-shot synthetic tasks, and up to 150% performance increase on real-world tasks), across both GPT and Gemini model families, highlights our approach’s potential for making the best use of the native capabilities of foundation models.

4319FairFedMed: Achieving Equity in Medical Federated Learning via FairLoRA

[openreview] [pdf]

Abstract Fairness remains a critical concern in healthcare, where unequal access to services and treatment outcomes can adversely affect patient health. While Federated Learning (FL) presents a collaborative and privacy-preserving approach to model training, ensuring fairness is challenging due to heterogeneous data across institutions, and current research primarily addresses non-medical applications. To fill this gap, we introduce FairFedMed, the first FL dataset specifically designed to study group fairness (i.e., demographics) in the medical field. It consists of paired 2D SLO funfus images and 3D OCT B-Scans from 15,165 glaucoma patients, along with six different demographic attributes. Existing state-of-the-art FL models may work well for natural images but often struggle with medical images due to their unique characteristics. Moreover, these models do not sufficiently address performance disparities across diverse demographic groups. To overcome these limitations, we propose FairLoRA, a novel fairness-aware FL framework based on singular value decomposition(SVD)-based low-rank approximation. FairLoRA incorporates customized singular value matrices for each demographic group and shares singular vector matrices across all demographic groups, ensuring both model equity and computational efficiency. Experimental results on the FairFedMed dataset demonstrate that FairLoRA not only achieves state-of-the-art performance in medical image classification but also significantly improves fairness across diverse populations. Our code and dataset can be accessible via the Github anonymous link:https://github.com/Anonymouse4Science/FairFedMed-FairLoRA.git

4320NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

[openreview] [pdf]

Abstract Decoder-only large language model (LLM)-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce the NV-Embed model, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility.For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For training algorithm, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize the hard-negative mining, synthetic data generation and existing public available datasets to boost the performance of embedding model. By combining these techniques, our NV-Embed- v1 model secured the No.1 position on the Massive Text Embedding Benchmark (MTEB) (as of May 24, 2024), across 56 embedding tasks. NV-Embed-v2 has reclaimed and maintained the top spot on MTEB since August 30, 2024, demonstrating the sustained effectiveness of the proposed methods over time. Additionally, it achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB.

4321Collaborative Discrete-Continuous Black-Box Prompt Learning for Language Models

[openreview] [pdf]

Abstract Large Scale Pre-Trained Language Models (PTMs) have demonstrated unprecedented capabilities across diverse natural language processing tasks. Adapting such models to downstream tasks is computationally intensive and time-consuming, particularly in black-box scenarios common in Language-Model-as-a-Service (LMaaS) environments, where model parameters and gradients are inaccessible. Recently, black-box prompt learning using zeroth-order gradients has emerged as a promising approach to address these challenges by optimizing learnable continuous prompts in embedding spaces, starting with \textit{randomly initialized discrete text prompts}. However, its reliance on randomly initialized discrete prompts limits adaptability to diverse downstream tasks or models. To address this limitation, this paper introduces ZO-PoG, a novel framework that optimizes prompts through a collaborative approach, combining Policy Gradient optimization for initial discrete text prompts and Zeroth-Order optimization for continuous prompts in embedding space. By optimizing collaboratively between discrete and continuous prompts, ZO-PoG maximizes adaptability to downstream tasks, achieving superior results without direct access to the model’s internal structures. Importantly, we establish the sub-linear convergence of ZO-PoG under mild assumptions. The experiments on different datasets demonstrate significant improvements in various tasks compared to the baselines. Our code is available at the following anonymous URL:https://anonymous.4open.science/r/ZO-PoG-12B4.

4322Invisibility Stickers Against LiDAR: Adversarial Attacks on Point Cloud Intensity for LiDAR-based Detection

[openreview] [pdf]

Abstract Point cloud detection is crucial in applications such as autonomous driving systems and robotics. These systems utilize onboard LiDAR sensors to capture input point clouds, consisting of numerous three-dimensional coordinate points and their corresponding intensity of laser reflection. Recent studies have proposed various adversarial schemes to highlight the vulnerability of point cloud detectors. However, these studies primarily focused on generating or perturbing the coordinate positions of input points and are hard to attack in the physical world, while largely overlooking the significance of their intensity. Through our exploration, we found that perturbing point cloud intensity poses significant security risks for point cloud object detectors. To the best of our knowledge, we are the first to attack on point cloud intensity and we propose an effective adversarial attack scheme, named I-ADV. Our method employs a voxel partition scheme to enhance physical implementation. To boost attack performance, we incorporate a gradient enhancement technique using 3D angle and distance features, along with an extremum-based gradient fusion strategy. Extensive experimental results demonstrate that by altering only point cloud intensity, our approach achieves state-of-the-art performance across detectors with various input representations, attaining attack success rates between 83.9% and 99.1%. Comprehensive ablation studies confirm the effectiveness and generality of the method’s components. Additionally, comparing different attack schemes underscores the advantages of our point cloud intensity attack method in both performance and real-world applicability.

4323Inductive or Deductive? Rethinking the Fundamental Reasoning Abilities of LLMs

[openreview] [pdf]

Abstract Reasoning encompasses two typical types: deductive reasoning and inductive reasoning. Despite extensive research into the reasoning capabilities of Large Language Models (LLMs), most studies have failed to rigorously differentiate between inductive and deductive reasoning, leading to a blending of the two. This raises an essential question: In LLM reasoning, which poses a greater challenge - deductive or inductive reasoning? While the deductive reasoning capabilities of LLMs, (i.e. their capacity to follow instructions in reasoning tasks), have received considerable attention, their abilities in true inductive reasoning remain largely unexplored due to the inseparability of the two types of reasoning in most of the tasks. To delve into the true inductive reasoning capabilities of LLMs, we propose a novel framework, SolverLearner. This framework enables LLMs to learn the underlying function (i.e., y=fw(x)y = f_w(x)), that maps input data points (x)(x) to their corresponding output values (y)(y), using only in-context examples. By focusing on inductive reasoning and separating it from LLM-based deductive reasoning, we can isolate and investigate inductive reasoning of LLMs in its pure form via SolverLearner. Our observations reveal that LLMs demonstrate remarkable inductive reasoning capabilities through SolverLearner, achieving near-perfect performance with ACC of 1 in most cases. Surprisingly, despite their strong inductive reasoning abilities, LLMs tend to relatively lack deductive reasoning capabilities, particularly in tasks involving ``counterfactual’’ reasoning.

4324General Preference Modeling with Preference Representations for Aligning Language Models

[openreview] [pdf]

Abstract Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. Although supervised pair preference models (PairPM) can express general preferences, their implementation is highly ad-hoc and cannot guarantee a consistent preference probability of compared pairs. Additionally, they impose high computational costs due to their quadratic query complexity when comparing multiple responses. In this paper, we introduce preference representation learning, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback. Experimental results show that our General Preference representation model (GPM) outperforms the BT reward model on the RewardBench benchmark with a margin of up to 5.6% and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0 and MT-Bench, following the language model post-training with GPO and our general preference model, reveal substantial performance improvements with margins up to 9.3%. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values.

4325Hybrid Spatial Representations for Species Distribution Modeling

[openreview] [pdf]

Abstract We address an important problem in ecology called Species Distribution Modeling (SDM), whose goal is to predict whether a species exists at a certain position on Earth. In particular, we tackle a challenging version of this task, where we learn from presence-only data in a community-sourced dataset, model a large number of species simultaneously, and do not use any additional environmental information. Previous work has used neural implicit representations to construct models that achieve promising results. However, implicit representations often generate predictions of limited spatial precision. We attribute this limitation to their inherently global formulation and inability to effectively capture local feature variations. This issue is especially pronounced with presence-only data and a large number of species. To address this, we propose a hybrid embedding scheme that combines both implicit and explicit embeddings. Specifically, the explicit embedding is implemented with a multiresolution hashgrid, enabling our models to better capture local information. Experiments demonstrate that our results exceed other works by a large margin on various standard benchmarks, and that the hybrid representation is better than both purely implicit and explicit ones. Qualitative visualizations and comprehensive ablation studies reveal that our hybrid representation successfully addresses the two main challenges. Our code is open-sourced athttps://anonymous.4open.science/r/HSR-SDM-7360.

4326Open Vocabulary Panoptic Segmentation With Retrieval Augmentation

[openreview] [pdf]

Abstract Given an input image and set of class names, panoptic segmentation aims to label each pixel in an image with class labels and instance labels. In comparison, Open Vocabulary Panoptic Segmentation aims to facilitate the segmentation of arbitrary classes according to user input. The challenge is that a panoptic segmentation system trained on a particular dataset typically does not generalize well to unseen classes beyond the training data. In this work, we propose a retrieval-augmented panoptic segmentation method that improves the performance of unseen classes. In particular, we construct a masked segment feature database using paired image-text data. At inference time, we use masked segment features from the input image as query keys to retrieve similar features and associated class labels from the database. Classification scores for the masked segment are assigned based on the similarity between query features and retrieved features. The retrieval-based classification scores are combined with CLIP-based scores to produce the final output. We incorporate our solution with a previous SOTA method (FC-CLIP). When trained on COCO, the proposed method demonstrates 30.9 PQ, 19.3 mAP, 44.0 mIoU on the ADE20k dataset, achieving +4.5 PQ, +2.5 mAP, +10.0 mIoU absolute improvement over the baseline.

4327Model-Free Offline Reinforcement Learning with Enhanced Robustness

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) has gained considerable attention for its ability to learn policies from pre-collected data without real-time interaction, which makes it particularly useful for high-risk applications. However, due to its reliance on offline datasets, existing works inevitably introduce assumptions to ensure effective learning, which, however, often lead to a trade-off between robustness to model mismatch and scalability to large environments. In this paper, we enhance both aspects with a novel double-pessimism principle, which conservatively estimates performance and accounts for both limited data and potential model mismatches, two major reasons for the previous trade-off. We then propose a universal, model-free algorithm to learn an optimal policy that is robust to potential environment mismatches, which enhances robustness in a scalable manner. Furthermore, we provide a sample complexity analysis of our algorithm when the mismatch is modeled by the lαl_\alpha-norm, which also theoretically demonstrates the efficiency of our method. Extensive experiments further demonstrate that our approach significantly improves robustness in a more scalable manner than existing methods.

4328Transforming Ocean Analysis: Learning 4D ocean field from in-situ observations via uncertainty-aware implicit representations

[openreview] [pdf]

Abstract A complete and accurate representation of Earth’s time-evolving ocean field is crucial for understanding global warming as well as climate dynamics. However, the sparsity of current in-situ ocean measurements presents a significant challenge in estimating values in largely unobserved regions. Traditional methods, such as objective interpolation (OI), struggle with accuracy due to their reliance on discrete grids and fixed spatial correlation structures. In this paper, we propose a novel approach to reconstruct 4D ocean fields only from raw observations using implicit neural representations (INRs). Our method improves field representations by leveraging neural networks to capture continuous, complex, and nonlinear patterns inherent in ocean data. To address uncertainties in ocean measurements and the limited availability of daily observations, we incorporate uncertainty estimates and a meta-learning strategy into existing INRs. These innovations enable our approach to provide daily, resolution-free ocean temperature reconstructions, a significant improvement over monthly averaged discrete fields. Experiments demonstrate the accuracy and adaptability of our method compared with approaches, establishing our method as a transformative solution for future ocean analysis and climate monitoring.

4329BadJudge: Backdoor Vulnerabilities of LLM-As-A-Judge

[openreview] [pdf]

Abstract Recently, LLMs are being used to evaluate free-form language generation, in a increasingly popular paradigm called LLM-as-a-Judge. While the ratings of these judges achieve SOTA correlation with human preferences on LLM generation, acquiring data to train these models is often community-driven and open-source, inadvertently creating opportunities for malicious actors to compromise the eval- uation pipeline. Current research predominantly focuses on de-biasing LLM evaluators, improving robustness to spurious correlations. However, they overlook potential threats from adversaries. This paper exposes a devastating attack on LLM evaluators: the backdoor, where an adversary inserts a predefined trigger-target pair into a model’s training set and activates it during test time to control the model’s decision. Results elucidate how 1 extra token in 1% of the evaluator training corpus can inflate the adversary model’s score by over 3 times. However, (malicious) human annotators typically lack access to the entire training dataset. As such, experiments evidence how score inflation severity correlates with data access. The most severe setting, achieves an inflated 4.9/5 rating, despite scoring 1.5/5 on legitimate evaluation. Experiments across 2 preference models (point-wise and pair-wise), 3 model families, and 3 triggers evince the generalizability of this attack. Disconcertingly, case studies on real-world systems indicate LlaMA-3.1-Guard, LMSYS Chatbot Arena, and list-wise reranking evaluators in RAG are all susceptible to attack. Moreover, defending evaluators presents a new challenge, with many exploitable components, e.g. score rubric. Likewise, falsely editing the input may shift scores, as LLM evaluation hinges upon both semantic and stylistic features, constraining the defense search space. Our results reinforce this, indicating that many canonical defense strategies, including ONION and BKI are ineffective. Fortunately, a straightforward defensive tool—the model merge—demonstrates exceptional efficacy, reducing the Attack Success Rate (ASR) by 93% on even the most severe levels of data access. As a pioneering work in this domain, we release our code and data to ensure reproducibility and to foster further research in this critical direction.

4330Scaling Laws for Pre-training Agents and World Models

[openreview] [pdf]

Abstract The performance of embodied agents has been shown to improve by increasing model parameters, dataset size, and compute. This has been demonstrated in domains from robotics to video games, when simple learning objectives on offline datasets (pre-training) are used to model an agent’s behavior (imitation learning) or their environment (world modeling). This paper characterizes the role of scale in these tasks more precisely. Going beyond the simple intuition that `bigger is better’, we show that the same types of power laws found in language modeling (e.g. between loss and optimal model size), also arise in world modeling and imitation learning. However, the coefficients of these laws are influenced by the tokenizer, task & architecture -- this has important implications on optimal sizing of models and data.

4331EBES: Easy Benchmarking for Event Sequences

[openreview] [pdf]

Abstract Event sequences, characterized by irregular sampling intervals and a mix of categorical and numerical features, are common data structures in various real-world domains such as healthcare, finance, and user interaction logs. Despite advances in temporal data modeling techniques, there is no standardized benchmarks for evaluating their performance on event sequences. This complicates result comparison across different papers due to varying evaluation protocols, potentially misleading progress in this field. We introduce EBES, a comprehensive benchmarking tool with standardized evaluation scenarios and protocols, focusing on regression and classification problems with sequence-level targets. Our library~\footnote{We attach an archive with the code. The code will be publicly available after the conference decision.} simplifies benchmarking, dataset addition, and method integration through a unified interface. It includes a novel synthetic dataset and provides preprocessed real-world datasets, including the largest publicly available banking dataset. Our results provide an in-depth analysis of datasets, identifying some as unsuitable for model comparison. We investigate the importance of modeling temporal and sequential components, as well as the robustness and scaling properties of the models. These findings highlight potential directions for future research. Our benchmark aim is to facilitate reproducible research, expediting progress and increasing real-world impacts.

4332Stochastic Semi-Gradient Descent for Learning Mean Field Games with Population-Aware Function Approximation

[openreview] [pdf]

Abstract Mean field games (MFGs) model interactions in large-population multi-agent systems through population distributions. Traditional learning methods for MFGs are based on fixed-point iteration (FPI), where policy updates and induced population distributions are computed separately and sequentially. However, FPI-type methods may suffer from inefficiency and instability due to potential oscillations caused by this forward-backward procedure. In this work, we propose a novel perspective that treats the policy and population as a unified parameter controlling the game dynamics. By applying stochastic parameter approximation to this unified parameter, we develop SemiSGD, a simple stochastic gradient descent (SGD)-type method, where an agent updates its policy and population estimates simultaneously and fully asynchronously. Building on this perspective, we further apply linear function approximation (LFA) to the unified parameter, resulting in the first population-aware LFA (PA-LFA) for learning MFGs on continuous state-action spaces. A comprehensive finite-time convergence analysis is provided for SemiSGD with PA-LFA, including its convergence to the equilibrium for linear MFGs—a class of MFGs with a linear structure concerning the population—under the standard contractivity condition, and to a neighborhood of the equilibrium under a more practical condition. We also characterize the approximation error for non-linear MFGs. We validate our theoretical findings with six experiments on three MFGs.

4333UniDetox: Universal Detoxification of Large Language Models via Dataset Distillation

[openreview] [pdf]

Abstract We present UniDetox, a universally applicable method designed to mitigate toxicity across various large language models (LLMs). Previous detoxification methods are typically model-specific, addressing only individual models or model families, and require careful hyperparameter tuning due to the trade-off between detoxification efficacy and language modeling performance. In contrast, UniDetox provides a detoxification technique that can be universally applied to a wide range of LLMs without the need for separate model-specific tuning. Specifically, we propose a novel and efficient dataset distillation technique for detoxification using contrastive decoding. This approach distills detoxifying representations in the form of synthetic text data, enabling universal detoxification of any LLM through fine-tuning with the distilled text. Our experiments demonstrate that the detoxifying text distilled from GPT-2 can effectively detoxify larger models, including OPT, Falcon, and LLaMA-2. Furthermore, UniDetox eliminates the need for separate hyperparameter tuning for each model, as a single hyperparameter configuration can be seamlessly applied across different models. Additionally, analysis of the detoxifying text reveals a reduction in politically biased content, providing insights into the attributes necessary for effective detoxification of LLMs.

[openreview] [pdf]

Abstract Recent advancements in neural theorem proving integrate large language models with tree search algorithms like Monte Carlo Tree Search (MCTS), where the language model suggests tactics and the tree search finds the complete proof path. However, many tactics proposed by the language model converge to semantically or strategically similar, reducing diversity and increasing search costs by expanding redundant proof paths. This issue exacerbates as computation scales and more tactics are explored per state. Furthermore, the trained value function suffers from false negatives, label imbalance, and domain gaps due to biased data construction. To address these challenges, we propose CARTS (diversified tactic CAlibration and bias-Resistant Tree Search), which balances tactic diversity and importance while calibrating model confidence. CARTS also introduce preference modeling and an adjustment term related to the ratio of valid tactics to improve the bias-resistance of the value function. Experimental results demonstrate that CARTS consistently outperforms previous methods achieving a pass@l rate of 49.6% on the miniF2F-test benchmark. Further analysis confirms that CARTS improves tactic diversity and leads to a more balanced tree search.

4335Co-Evolution Learning

[openreview] [pdf]

Abstract Generative and representation models, whether trained independently or evolved separately, require high-quality, diverse training data, imposing limitations on their advancement. Specifically, self-supervised learning, as a popular paradigm for representation learning, decreases the reliance on labeled data in representation models. However, it still necessitates large datasets, specialized data augmentation techniques, and tailored training strategies. While generative models have shown promise in generating diverse data, ensuring semantic consistency is still a challenge. This paper introduces a novel co-evolution framework (referred to as CORE) designed to address these challenges through the mutual enhancement of generative and representation models. Without incurring additional, unacceptable training overhead compared to independent training, the generative model utilizes semantic information from the representation model to enhance the quality and semantic consistency of generated data. Simultaneously, the representation model gains from the diverse data produced by the generative model, leading to richer and more generalized representations. By iteratively applying this co-evolution framework, both models can be continuously enhanced. Experiments demonstrate the effectiveness of the co-evolution framework across datasets of varying scales and resolutions. For example, implementing our framework in LDM can reduce the FID from 43.40 to 20.13 in unconditional generation tasks over the ImageNet-1K dataset. In more challenging scenarios, such as tasks with limited data, this framework significantly outperforms independent training of generative or representation model. Furthermore, employing the framework in a self-consuming loop effectively mitigates model collapse. Our code will be publicly released.

4336Sharp Generalization for Nonparametric Regression by Over-Parameterized Neural Networks: A Distribution-Free Analysis

[openreview] [pdf]

Abstract Sharp generalization bound for neural networks trained by gradient descent (GD) is of central interest in statistical learning theory and deep learning. In this paper, we consider nonparametric regression by an over-parameterized two-layer NN trained by GD. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of \cO(\eps_n^2), which is the same rate as that for kernel regression trained by GD with early stopping, where \eps_n is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and nn is the size of the training data. It is remarked that our result does not require distributional assumptions on the training data, in a strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. As a special case of our general result, when the eigenvalues of the associated NTK decay at a rate of λjjdd1\lambda_j \asymp j^{-\frac{d}{d-1}} for j1j \ge 1 which happens if the training data is distributed uniformly on the unit sphere in \RR^d, we immediately obtain the minimax optimal rate of \cO(n^{-\frac{d}{2d-1}}), which is the major results of several existing works in this direction. The neural network width in our general result is lower bounded by a function of only n,d,\eps_n, and such width does not depend on the minimum eigenvalue of the empirical NTK matrix whose lower bound usually requires additional assumptions on the training data. Our results are built upon two significant technical results which are of independent interest. First, uniform convergence to the NTK is established during the training process by GD, so that we can have a nice decomposition of the neural network function at any step of the GD into a function in the Reproducing Kernel Hilbert Space associated with the NTK and an error function with a small LL^{\infty}-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD.

4337Learning to Help in Multi-Class Settings

[openreview] [pdf]

Abstract Deploying complex machine learning models on resource-constrained devices is challenging due to limited computational power, memory, and model retrainability. To address these limitations, a hybrid system can be established by augmenting the local model with a server-side model, where samples are selectively deferred by arejectorand then sent to the server for processing. The hybrid system enables efficient use of computational resources while minimizing the overhead associated with server usage. The recently proposed Learning to Help (L2H) model proposed training a server model given a fixed local (client) model. This differs from the Learning to Defer (L2D) framework which trains the client for a fixed (expert) server. In both L2D and L2H, the training includes learning a rejector at the client to determine when to query the server. In this work, we extend the L2H model from binary to multi-class classification problems and demonstrate its applicability in a number of different scenarios of practical interest in which access to the server may be limited by cost, availability, or policy. We derive a stage-switching surrogate loss function that is differentiable, convex, and consistent with the Bayes rule corresponding to the 0-1 loss for the L2H model. Experiments show that our proposed methods offer an efficient and practical solution for multi-class classification in resource-constrained environments.

4338Innate-Values-driven Reinforcement Learning

[openreview] [pdf]

Abstract Innate values describe agents’ intrinsic motivations, which reflect their inherent interests and preferences for pursuing goals and drive them to develop diverse skills that satisfy their various needs. Traditional reinforcement learning (RL) is learning from interaction based on the environment’s feedback rewards. However, in real scenarios, the rewards are generated by agents’ innate value systems, which differ vastly from individuals based on their needs and requirements. In other words, considering the AI agent as a self-organizing system, developing its awareness through balancing internal and external utilities based on its needs in different tasks is a crucial problem for individuals learning to support others and integrate community with safety and harmony in the long term. To address this gap, we propose a new RL model termed innate-values-driven RL (IVRL) based on combined motivations’ models and expected utility theory to mimic its complex behaviors in the evolution through decision-making and learning. Then, we introduce two IVRL-based models: IV-DQN and IV-A2C. By comparing them with benchmark algorithms such as DQN, DDQN, A2C, and PPO in the Role-Playing Game (RPG) reinforcement learning test platform VIZDoom, we demonstrated that the IVRL-based models can help the agent rationally organize various needs, achieve better performance effectively.

4339Cut Your Losses in Large-Vocabulary Language Models

[openreview] [pdf]

Abstract As language models grow ever larger, so do their vocabularies. This has shifted the memory footprint of LLMs during training disproportionately to one single layer: the cross-entropy in the loss computation. Cross-entropy builds up a logit matrix with entries for each pair of input tokens and vocabulary items and, for small models, consumes an order of magnitude more memory than the rest of the LLM combined. We propose Cut Cross-Entropy (CCE), a method that computes the cross-entropy loss without materializing the logits for all tokens into global memory. Rather, CCE only computes the logit for the correct token and evaluates the log-sum-exp over all logits on the fly. We implement a custom kernel that performs the matrix multiplications and the log-sum-exp reduction over the vocabulary in flash memory, making global memory consumption for the cross-entropy computation negligible. This has a dramatic effect. Taking the Gemma 2 (2B) model as an example, CCE reduces the memory footprint of the loss computation from 24 GB to 1 MB, and the total training-time memory consumption of the classifier head from 28 GB to 1 GB. To improve the throughput of CCE, we leverage the inherent sparsity of softmax and propose to skip elements of the gradient computation that have a negligible (i.e. below numerical precision) contribution to the gradient. Experiments demonstrate that the dramatic reduction in memory consumption is accomplished without sacrificing training speed or convergence.

4340Delta - Contrastive Decoding Mitigates Text Hallucinations in Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing tasks. Still, they are prone to generating hallucinations—factually incorrect or fabricated content that can undermine their reliability, especially in high-stakes domains such as healthcare and legal advisory. In response to this challenge, we propose Delta, a novel inference-time approach that leverages contrastive decoding to mitigate hallucinations without requiring model retraining or additional training data. Delta works by randomly masking portions of the input prompt, then contrasting the original and masked output distribution generated by the model, effectively mitigating hallucinations through inference-only computations. Delta was evaluated across multiple benchmark datasets, including SQuAD v1.1 and v2, concerning 4 and 6 percent improvements. Delta demonstrated substantial advancement of 14.56 percent more extract match outcome with no definitive answers within the SQuAD version 2 benchmark. These findings suggest that Delta is particularly effective when hallucinations arise from contextual ambiguity. Delta presents a computationally efficient and scalable solution for reducing hallucinations in real-world LLM applications by focusing on inference-time enhancements.

4341Mutual Effort for Efficiency: A Similarity-based Token Pruning for Vision Transformers in Self-Supervised Learning

[openreview] [pdf]

Abstract Self-supervised learning (SSL) offers a compelling solution to the challenge of extensive labeled data requirements in traditional supervised learning. With the proven success of Vision Transformers (ViTs) in supervised tasks, there is increasing interest in adapting them for SSL frameworks. However, the high computational demands of SSL pose substantial challenges, particularly on resource-limited platforms like edge devices, despite its ability to achieve high accuracy without labeled data. Recent studies in supervised learning have shown that token pruning can reduce training costs by removing less informative tokens without compromising accuracy. However, SSL’s dual-branch encoders make traditional single-branch pruning strategies less effective, as they fail to account for the critical cross-branch similarity information, leading to reduced accuracy in SSL. To this end, we introduce SimPrune, a novel token pruning strategy designed for ViTs in SSL. SimPrune leverages cross-branch similarity information to efficiently prune tokens, retaining essential semantic information across dual branches. Additionally, we incorporate a difficulty-aware pruning strategy to further enhance SimPrune’s effectiveness. Experimental results show that our proposed approach effectively reduces training computation while maintaining accuracy. Specifically, our approach offers 24% savings in training costs compared to SSL baseline, without sacrificing accuracy.

4342Structure-aware Domain Knowledge Injection for Large Language Models

[openreview] [pdf]

Abstract This paper introduces a pioneering methodology, termed StructTuning, to efficiently transform foundation Large Language Models (LLMs) into domain specialists. It significantly reduces the training corpus requirement to a mere 0.3%, while achieving an impressive 50% of traditional knowledge injection performance. Our method is inspired by the educational processes of human students, particularly how structured domain knowledge from textbooks is assimilated and subsequently applied to tackle real-world challenges through specific exercises. Based on this, we propose a novel two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT phase, we automatically extract the domain knowledge taxonomy and reorganize the training corpora, enabling LLMs to effectively link textual segments to targeted knowledge points within the taxonomy. In the SSFT phase, we explicitly prompt models to elucidate the underlying knowledge structure in their outputs, leveraging the structured domain insight to address practical problems. Our ultimate method has undergone extensive evaluations across model architectures and scales, using closed-book question-answering tasks on LongBench and MMedBench datasets. Remarkably, our method demonstrates the potential of comparable improvement against the state-of-the-art MMedLM2 on MMedBench, while significantly reducing the training costs to 5%. This breakthrough paves the way for scaling up our StructTuning for stronger domain-specific LLMs with comprehensive data utilization. Code is available at this anonymous URL:https://anonymous.4open.science/r/StructTuning/.

4343A Foundation Model for Patient Behavior Monitoring and Suicide Detection

[openreview] [pdf]

Abstract Foundation models have achieved remarkable success across various domains, yet their adoption in healthcare remains limited, particularly in areas requiring the analysis of smaller and more complex datasets. While foundation models have made significant advances in medical imaging, genetic biomarkers, and time series from electronic health records, the potential for patient behavior monitoring through wearable devices remains underexplored. Wearable device datasets are inherently heterogeneous and multisource and often exhibit high rates of missing data, presenting unique challenges. Notably, missing patterns in these datasets are frequently not-at-random, and when adequately modeled, these patterns can reveal crucial insights into patient behavior. This paper introduces a novel foundation model based on a modified vector quantized variational autoencoder (VQ-VAE), specifically designed to process real-world data from wearable devices. Our model excels at reconstructing heterogeneous multisource time-series data and effectively models missing data patterns. We demonstrate that our pretrained model, trained on a broad cohort of psychiatric patients with diverse mental health issues, can perform downstream tasks without fine-tuning on a held-out cohort of suicidal patients. This is illustrated through the use of a change-point detection algorithm that identifies suicide attempts with high accuracy, matching or surpassing patient-specific methods, thereby highlighting the potential of VQ-VAE as a versatile tool for behavioral analysis in healthcare.

4344Generalization and Knowledge Transfer in Abstract Visual Reasoning Models

[openreview] [pdf]

Abstract We study generalization and knowledge reuse capabilities of deep neural networks in the domain of abstract visual reasoning (AVR), employing Raven’s Progressive Matrices (RPMs), a recognized benchmark task for assessing AVR abilities. Two knowledge transfer scenarios referring to the I-RAVEN dataset are investigated. Firstly, inspired by generalization assessment capabilities of the PGM dataset and popularity of I-RAVEN, we introduce Attributeless-I-RAVEN, a benchmark with 10 generalization regimes that allow to test generalization of abstract rules applied to held-out attributes. Secondly, we construct I-RAVEN-Mesh, a dataset that enriches RPMs with a novel component structure comprising line-based patterns, facilitating assessment of progressive knowledge acquisition in transfer learning setting. The developed benchmarks reveal shortcomings of the contemporary deep learning models, which we partly address with Pathways of Normalized Group Convolution (PoNG) model, a novel neural architecture for solving AVR tasks. PoNG excels in both presented challenges, as well as the standard I-RAVEN and PGM setups. Encouraged by these promising results, we further evaluate PoNG in another AVR task, visual analogy problem with both synthetic and real-world images, demonstrating its strength beyond PRMs.

4345Thread: A Logic-Based Data Organization Paradigm for How-To Question Answering with Retrieval Augmented Generation

[openreview] [pdf]

Abstract Recent advances in retrieval-augmented generation have significantly improved the performance of question-answering systems, particularly on factoid ‘5Ws’ questions. However, these systems still face substantial challenges when addressing ‘1H’ questions, specifically how-to questions, which are integral to decision-making processes and require dynamic, step-by-step answers. The key limitation lies in the prevalent data organization paradigm, chunk, which divides documents into fixed-size segments, and disrupts the logical coherence and connections within the context. To overcome this, in this paper, we propose Thread, a novel data organization paradigm aimed at enabling current systems to handle how-to questions more effectively. Specifically, we introduce a new knowledge granularity, termed ‘logic unit’, where documents are transformed into more structured and loosely interconnected logic units with large language models. Extensive experiments conducted across both open-domain and industrial settings demonstrate that Thread outperforms existing paradigms significantly, improving the success rate of handling how-to questions by 21% to 33%. Moreover, Thread exhibits high adaptability in processing various document formats, drastically reducing the candidate quantity in the knowledge base and minimizing the required information to one-fourth compared with chunk, optimizing both efficiency and effectiveness.

4346Mutual Information Preserving Neural Network Pruning

[openreview] [pdf]

Abstract Model pruning is attracting increasing interest because of its positive implications in terms of resource consumption and costs. A variety of methods have been developed in the past years. In particular, structured pruning techniques discern the importance of nodes in neural networks (NNs) and filters in convolutional neural networks (CNNs). Global versions of these rank all nodes in a network and select the top-kk, offering an advantage over local methods that rank nodes only within individual layers. By evaluating all nodes simultaneously, global techniques provide greater control over the network architecture, which improves performance. However, the ranking and selecting process carried out during global pruning can have several major drawbacks. First, the ranking is not updated in real time based on the pruning already performed, making it unable to account for inter-node interactions. Second, it is not uncommon for whole layers to be removed from a model, which leads to untrainable networks. Lastly, global pruning methods do not offer any guarantees regarding re-training. In order to address these issues, we introduce Mutual Information Preserving Pruning (MIPP). The fundamental principle of our method is to select nodes such that the mutual information (MI) between the activations of adjacent layers is maintained. We evaluate MIPP on an array of vision models and datasets, including a pre-trained ResNet50 on ImageNet, where we demonstrate MIPP’s ability to outperform state-of-the-art methods. The implementation of MIPP will be made available upon publication.

4347MOEfication by Experts as Masks

[openreview] [pdf]

Abstract In this work, we investigate how to sparsify a pre-trained dense large language model into a mixture-of-experts (MoE) architecture for faster inference. Our approach applies mask matrix to the activations for each expert, constrained by L0L_0 regularization to minimize the number of activated parameters. Starting with all parameters active, the model is progressively sparsified during training, ensuring minimal performance loss. This approach proves more efficient than one-shot sparsification techniques~\citep{zhang2022moefication}, which typically require significant resources for performance recovery. Moreover, our approach automatically identifies shared, token-specific, and inactive experts, allowing for more efficient allocation of computational resources. Through extensive experiments, we achieve up to 97% performance retention on downstream tasks with only 50% of the feed-forward parameters activated in dense models. Beyond enhancing inference efficiency, this strategy of sharing computational units among experts presents a valuable framework for designing more generalized and efficient MoE architectures, opening avenues for future advancements in expert-based models.

4348SCOPE: A Self-supervised framework for Improving Faithfulness in Conditional Text Generation

[openreview] [pdf]

Abstract Large Language Models (LLMs), when used for conditional text generation, often produce hallucinations, i.e., information that is unfaithful or not grounded in the input context. This issue arises in typical conditional text generation tasks, such as text summarization and data-to-text generation, where the goal is to produce fluent text based on contextual input. When fine-tuned on specific domains, LLMs struggle to provide faithful answers to a given context, often adding information or generating errors. One underlying cause of this issue is that LLMs rely on statistical patterns learned from their training data. This reliance can interfere with the model’s ability to stay faithful to a provided context, leading to the generation of ungrounded information. We build upon this observation and introduce a novel self-supervised method for generating a training set of unfaithful samples. We then refine the model using a training process that encourages the generation of grounded outputs over unfaithful ones, drawing on preference-based training. Our approach leads to significantly more grounded text generation, outperforming existing self-supervised techniques in faithfulness, as evaluated through automatic metrics, LLM-based assessments, and human evaluations.

4349Preference-Enhanced Instruction Tuning for Machine Translation

[openreview] [pdf]

Abstract Although Large Language Models (LLMs) like GPT-4 perform excellently in machine translation, their high costs and scalability make them unavailable in many scenarios. Recently, there has been increased effort to build smaller LLMs that can achieve comparable performance. However, while typical instruction tuning methods tend to directly mimic reference translations, leading to less meaningful results, recent preference optimization methods have shown improvements. Despite this, they still fail to effectively utilize crucial preference information during inference. In this paper, we introduce Preference-Enhanced Instruction Tuning (PEIT), a novel method that explicitly incorporates preferences into both the instruction fine-tuning and the inference phase. Our extensive experiments show that PEIT not only improves translation quality but also significantly outperforms state-of-the-art preference optimization methods and instruction tuning baselines on multiple language benchmarks.

4350RL, but don’t do anything I wouldn’t do

[openreview] [pdf]

Abstract In reinforcement learning, if the agent’s reward differs from the designers’ true utility, even only rarely, the state distribution resulting from the agent’s policy can be very bad, in theory and in practice. When RL policies would devolve into undesired behavior, a common countermeasure is KL regularization to a trusted policy (“Don’t do anything I wouldn’t do”). All current cutting-edge language models are RL agents that are KL-regularized to a “base policy” that is purely predictive. Unfortunately, we demonstrate that when this base policy is a Bayesian predictive model of a trusted policy, the KL constraint is no longer reliable for controlling the behavior of an advanced RL agent. We demonstrate this theoretically using algorithmic information theory, and while systems today are too weak to exhibit this theorized failure precisely, we RL-finetune a language model and find evidence that our formal results are plausibly relevant in practice. We also propose a theoretical alternative that avoids this problem by replacing the “Don’t do anything I wouldn’t do” principle with “Don’t do anything I mightn’t do”.

4351Interpretable Analysis and Reasoning Enhancement for LLMs via Cross-Generation Reasoning Trees

[openreview] [pdf]

Abstract Generating diverse reasoning paths by varying the context (such as demonstrations, prompts, instructions, etc) or sampling methods (such as top-k, top-p, beam-search, etc) and then selecting appropriate paths via majority voting or verifier-based strategies to enhance the reasoning capabilities of large language models (LLMs) is a commonly recognized approach. Although both different contexts and sampling techniques can generate diverse contents, using sampling methods alone does not significantly enhance the diversity of generations. Context variation, however, while fostering greater diversity in reasoning, can also introduce negative effects, which causes that switching contexts can not necessarily lead to proportional improvements in performance. Therefore, there is a need to investigate how context influences LLM generation and mitigate any adverse impacts. The primary challenge lies in the inability to conduct comparative studies once divergences occur in reasoning paths generated under different contexts. Specifically, once the predicted tokens at a given step differ, it becomes unclear whether subsequent tokens in the inference path are influenced by the context or the content already generated. In this paper, we propose a Cross-Generation Reasoning Tree (CGRT) algorithm for studying the impact of different contexts on LLM generation and enhancing LLMs’ reasoning performance. Experimental findings reveal that, beyond enhancing interpretability, CGRT integrates the positive effects of both context and sampling strategies more effectively than previous approaches, leading to more rational inference paths. Experiments conducted on Llama2, Llama3, and Qwen demonstrate that, when generating an equivalent number of diverse inference paths, those produced via the “reasoning tree” method exhibit higher accuracy.

4352What Matters in Hierarchical Search for Combinatorial Reasoning Problems?

[openreview] [pdf]

Abstract Combinatorial reasoning problems, particularly the notorious NP-hard tasks, remain a significant challenge for AI research. A common approach to addressing them combines search with learned heuristics. Recent methods in this domain utilize hierarchical planning, executing strategies based on subgoals. Our goal is to advance research in this area and establish a solid conceptual and empirical foundation. Specifically, we identify the following key obstacles, whose presence favors the choice of hierarchical search methods:hard-to-learn value functions,complex action spaces,presence of dead ends in the environment, ordata collected from diverse sources. Through in-depth empirical analysis, we establish that hierarchical search methods consistently outperform standard search methods across these dimensions, and we formulate insights for future research. On the practical side, we also propose a consistent evaluation methodology to enable meaningful comparisons between methods and to reassess the state-of-the-art algorithms.

4353A Riemannian Framework for Learning Reduced-order Lagrangian Dynamics

[openreview] [pdf]

Abstract By incorporating physical consistency as inductive bias, deep neural networks display increased generalization capabilities and data efficiency in learning nonlinear dynamic models. However, the complexity of these models generally increases with the system dimensionality, requiring larger datasets, more complex deep networks, and significant computational effort. We propose a novel geometric network architecture to learn physically-consistent reduced-order dynamic parameters that accurately describe the original high-dimensional system behavior. This is achieved by building on recent advances in model-order reduction and adopting a Riemannian perspective to jointly learn a structure-preserving latent space and the associated low-dimensional dynamics. Our approach enables accurate long-term predictions of the high-dimensional dynamics of rigid and deformable systems with increased data efficiency by inferring interpretable and physically plausible reduced Lagrangian models.

4354Self-supervised Privacy-preservation via Latent Anonymization for Generalizable Video Understanding

[openreview] [pdf]

Abstract The rapid advancements in large video models have unlocked new horizons in video understanding, enhancing applications in various domains such as surveillance, healthcare, and entertainment. However, these models often compromise individual privacy by inadvertently revealing sensitive private information such as skin color and gender. Existing privacy preservation methods are often limited in their scope and tailored to specific downstream tasks. Since current methods directly apply an anonymization function to the input pixel space, they demand extensive computational resources due to the retraining of the utility video model. To address these challenges, we propose a novel approach that shifts privacy-preserving anonymization from the input pixel space to the latent feature space, significantly reducing computational costs and enabling deployment in large foundational video models. Our method employs a self-supervised privacy budget in the latent space by minimizing the mutual information between static clip features. This approach notably allows, for the first time, supervision from downstream tasks such as anomaly detection and temporal action detection through collaborative co-training. Furthermore, we introduce a latent consistency loss to maintain the utility video model’s multitask generalization capabilities and prevent single task overfitting. Our extensive evaluations demonstrate a significant (\approx\textbf{29%}) reduction in privacy leakage while maintaining near peak (within \textbf{1%}) utility performance across various downstream tasks: Action Recognition (Kinetics400, UCF101, HMDB51), Temporal Action Detection (THUMOS14), and Anomaly Detection (UCF-Crime). Moreover, we propose new protocols for assessing gender bias in action recognition models, demonstrating that our method effectively mitigates such biases and promotes equitable video understanding.

4355CipherPrune: Efficient and Scalable Private Transformer Inference

[openreview] [pdf]

Abstract Private Transformer inference using cryptographic protocols offers promising solutions for privacy-preserving machine learning; however, it still faces significant runtime overhead (efficiency issues) and challenges in handling long-token inputs (scalability issues). We observe that the Transformer’s operational complexity scales quadratically with the number of input tokens, making it essential to reduce the input token length. Notably, each token varies in importance, and many inputs contain redundant tokens. Additionally, prior private inference methods that rely on high-degree polynomial approximations for non-linear activations are computationally expensive. Therefore, reducing the polynomial degree for less important tokens can significantly accelerate private inference. Building on these observations, we propose CipherPrune\textit{CipherPrune}, an efficient and scalable private inference framework that includes a secure encrypted token pruning protocol, a polynomial reduction protocol, and corresponding Transformer network optimizations. At the protocol level, encrypted token pruning adaptively removes unimportant tokens from encrypted inputs in a progressive, layer-wise manner. Additionally, encrypted polynomial reduction assigns lower-degree polynomials to less important tokens after pruning, enhancing efficiency without decryption. At the network level, we introduce protocol-aware network optimization via a gradient-based search to maximize pruning thresholds and polynomial reduction conditions while maintaining the desired accuracy. Our experiments demonstrate that CipherPrune reduces the execution overhead of private Transformer inference by approximately 6.1×6.1\times for 128-token inputs and 10.6×10.6\times for 512-token inputs, compared to previous methods, all without compromising accuracy.

4356Improving Cross-view Object Geo-localization: A Dual Attention Approach with Cross-view Interaction and Multi-Scale Spatial Features

[openreview] [pdf]

Abstract Cross-view object geo-localization has recently gained attention due to potential applications. Existing methods aim to capture spatial dependencies of query objects between different views through attention mechanisms to obtain spatial relationship feature maps, which are then used to predict object locations. Although promising, these approaches fail to effectively transfer information between views and do not further refine the spatial relationship feature maps. This results in the model erroneously focusing on irrelevant edge noise, thereby affecting localization performance. To address these limitations, we introduce aCross-view and Cross-attention Module (CVCAM), which performs multiple iterations of interaction between the two views, enabling continuous exchange and learning of contextual information about the query object from both perspectives. This facilitates a deeper understanding of cross-view relationships while suppressing the edge noise unrelated to the query object. Furthermore, we integrate aMulti-head Spatial Attention Module (MHSAM), which employs convolutional kernels of various sizes to extract multi-scale spatial features from the feature maps containing implicit correspondences, further enhancing the feature representation of the query object. Additionally, given the scarcity of datasets for cross-view object geo-localization, we created a new dataset calledG2Dfor the “Ground→Drone” localization task, enriching existing datasets and filling the gap in “Ground→Drone” localization task. Extensive experiments on the CVOGL and G2D datasets demonstrate that our proposed method achieves high localization accuracy, surpassing the current state-of-the-art.

4357OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling

[openreview] [pdf]

Abstract Large language models (LLMs) have exhibited their problem-solving abilities in mathematical reasoning. Solving realistic optimization (OPT) problems in application scenarios requires advanced and applied mathematics ability. However, current OPT benchmarks that merely solve linear programming are far from complex realistic situations. In this work, we proposeOptiBench, a benchmark forEnd-to-endoptimization problem-solving with human-readable inputs and outputs.OptiBenchcontains rich optimization problems, including linear and nonlinear programming with or without tabular data, which can comprehensively evaluate LLMs’ solving ability. In our benchmark, LLMs are required to call a code solver to provide precise numerical answers. Furthermore, to alleviate the data scarcity for optimization problems, and to bridge the gap between open-source LLMs on a small scale (e.g., Llama-3-8b) and closed-source LLMs (e.g., GPT-4), we further propose a data synthesis method namelyReSocratic. Unlike general data synthesis methods that proceed from questions to answers, \ReSocratic first incrementally synthesizes formatted optimization demonstration with mathematical formulations step by step and then back-translates the generated demonstrations into questions. Based on this, we synthesize theReSocratic-29kdataset. We further conduct supervised fine-tuning withReSocratic-29kon multiple open-source models. Experimental results show thatReSocratic-29ksignificantly improves the performance of open-source models.

4358Retrieval-based Zero-shot Crowd Counting

[openreview] [pdf]

Abstract Existing crowd-counting methods rely on the manual localization of each person in the image. While recent efforts have attempted to circumvent the annotation burden through vision-language models or crowd image generation, these approaches rely on pseudo-labels to perform crowd-counting. Simulated datasets provide an alternative to the annotation cost associated with real datasets. However, the use of large-scale simulated data often results in a distribution gap between real and simulated domains. To address the latter, we introduce knowledge retrieval inspired by knowledge-enhanced models in natural language processing. With knowledge retrieval, we extract simulated crowd images and their text descriptions to augment the image embeddings of real crowd images to improve zero-shot crowd-counting. Knowledge retrieval allows one to use a vast amount of non-parameterized knowledge during testing, enhancing a model’s inference capability. Our work is the first to actively incorporate text information to regress the crowd count in any supervised manner. Moreover, to address the domain gap, we propose a pre-training and retrieval mechanism that uses unlabeled real crowd images along with simulated data. We report state-of-the-art results for zero-shot counting on five public datasets, surpassing existing multi-model crowd-counting methods. The code will be made publicly available after the review process.

4359Low-Rank Interconnected Adaptation across Layers

[openreview] [pdf]

Abstract Low-rank adaptation (LoRA) is a powerful parameter-efficient fine-tuning method that utilizes low-rank projectors AA and BB to learn weight updates ΔW\Delta W for adaptation targets WW. Previous research has shown that LoRA is essentially a gradient compressor, performing random projections on the gradient using a fixed projection matrix A0A_0. However, this setup restricts the overall weight update to be low-rank, which limits the adaptation performance. In this paper, we propose low-rank interconnected adaptation across layers (Lily). Specifically, we employ a hierarchical framework where low-dimensional projectors (LPs) retained for downward projection at a particular level, while globally-shared high-dimensional projector (HP) experts perform upward projection across all levels of layers. Lily uniquely connects each LP to all HP experts, therefore the gradient projections are no longer dominated by fixed projection matrices, but rather by selective combinations of all the projectors, thereby breaking the low-rank constraint of LoRA. Furthermore, Lily’s cross-layer connections facilitate the capture of intricate information and dependencies across different layers, thereby enhancing the model’s representational capabilities. Experiments across various modalities, architectures, and model sizes underscore Lily’s great performance and efficiency.

4360DeFine: Enhancing LLM Decision-Making with Factor Profiles and Analogical Reasoning

[openreview] [pdf]

Abstract LLMs are ideal for decision-making due to their ability to reason over long contexts and identify critical factors. However, challenges arise when processing transcripts of spoken speech describing complex scenarios. These transcripts often contain ungrammatical or incomplete sentences, repetitions, hedging, and vagueness. For example, during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite significant uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a new framework that constructs probabilistic factor profiles from complex scenarios. DeFine then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in novel situations. Our framework separates the tasks of quantifying uncertainty in complex scenarios and incorporating it into LLM decision-making. This approach is particularly useful in fields such as medical consultations, negotiations, and political debates, where making decisions under uncertainty is vital.

4361Better than Your Teacher: LLM Agents that learn from Privileged AI Feedback

[openreview] [pdf]

Abstract While large language models (LLMs) show impressive decision-making abilities, current methods lack a mechanism for automatic self-improvement from errors during task execution. We propose LEAP, an iterative fine-tuning framework that continually improves LLM agents using feedback from AI expert teachers. Our key insight is to equip the expert teachers with a privileged state -- information available during training but hidden at test time. This allows even weak experts to provide precise guidance, significantly improving the student agent’s performance without access to privileged information at test time. We evaluate LEAP on diverse decision-making benchmarks, including text-based games, web navigation, and interactive coding. Our experiments show that LEAP (1) outperforms state-of-the-art baselines (2) enables weak student models (e.g., Llama3-8B) to exceed the performance of strong teacher models (GPT4-o), and (3) allows weak models to self-improve using privileged versions of themselves. We also provide a theoretical analysis showing that LEAP’s success hinges on balancing privileged information with the student’s realizability, which we empirically validate. Our code is provided as part of the supplementary material.

4362Truthfulness Without Supervision: Model Evaluation Using Peer Prediction

[openreview] [pdf]

Abstract Current evaluation methods for language models rely on supervision, but trusted supervision for difficult tasks is often unavailable, especially for superhuman models. In such cases, evaluation schemes based on imperfect supervision can be exploited, leading to deceptive results. However, underutilized in the context of model evaluation, a wealth of research from the mechanism design literature focuses on game-theoreticincentive compatibility- eliciting honest and informative answers without trusted supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation. It tells apart honest and informative answers from deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method’s effectiveness and resistance to deception, with both theoretical guarantees and comprehensive empirical validation on up to 405B-parameter models. In contrast to LLM-as-a-Judge methods which require strong and trusted judges, we discover an inverse scaling property in peer prediction, where, surprisingly, resistance to deception isstrengthenedas the capability gap between the jury and participantswidens, enabling reliable evaluation of superhuman models without trusted supervision. In particular, LLM-as-a-Judge evaluations become worse than random guesses when facing deceptive models 5-20×\times its size, while peer prediction thrives when such gaps are large, including in cases with over 100×\times size difference. Looking forward, we view this work as a step towards game-theoretic resistance to model deception in alignment and evaluation.

4363Shapley-Guided Utility Learning for Effective Graph Inference Data Valuation

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have demonstrated remarkable performance in various graph-based machine learning tasks, yet evaluating the importance of neighbors of testing nodes remains largely unexplored due to the challenge of assessing data importance without test labels. To address this gap, we propose Shapley-Guided Utility Learning (SGUL), a novel framework for graph inference data valuation. SGUL innovatively combines transferable data-specific and modelspecific features to approximate test accuracy without relying on ground truth labels. By incorporating Shapley values as a preprocessing step and using feature Shapley values as input, our method enables direct optimization of Shapley value prediction while reducing computational demands. SGUL overcomes key limitations of existing methods, including poor generalization to unseen test-time structures and indirect optimization. Experiments on diverse graph datasets demonstrate that SGUL consistently outperforms existing baselines in both inductive and transductive settings. SGUL offers an effective, efficient, and interpretable approach for quantifying the value of test-time neighbors.

4364Deep Temporal Deaggregation: Large-Scale Spatio-Temporal Generative Models

[openreview] [pdf]

Abstract Access to spatio-temporal trajectory data is essential for improving infrastructure, preventing the spread of disease and for building autonomous vehicles. However, it remains underutilized due to limited availability, as it cannot be shared publicly due privacy concerns or other sensitive attributes. Generative time-series models have shown promise in generating non-sensitive data, but show poor performance for large-scale and complex environments. In this paper we propose a spatio-temporal generative model for trajectories, TDDPM, which outperforms and scales substantially better than state-of-the-art. The focus is primarily on trajectories of peoples’ movement in cities. We propose a conditional distribution approach which unlock out-of-distribution generalization, such as to city-areas not trained on, from a spatial aggregate prior. We also show that data can be generated in a privacy-preserving manner using kk-anonymity. Further, we propose a new comprehensive benchmark across several standard datasets, and evaluation measures, considering key distribution properties.

4365ScaLES: Scalable Latent Exploration Score for Pre-Trained Generative Networks

[openreview] [pdf]

Abstract We develop Scalable Latent Exploration Score (ScaLES) to mitigate over-exploration in Latent Space Optimization (LSO), a popular method for solving black-box discrete optimization problems. LSO utilizes continuous optimization within the latent space of a Variational Autoencoder (VAE) and is known to be susceptible to over-exploration, which manifests in unrealistic solutions that reduce its practicality. ScaLES is an exact and theoretically motivated method leveraging the trained decoder’s approximation of the data distribution. ScaLES can be employed with any VAE decoder--including pretrained ones--without additional training, architectural changes, access to the training data or hyperparameters. Our evaluation across five LSO benchmark tasks and twenty-two VAE models demonstrates that ScaLES always enhances the quality of the solutions while maintaining high objective values, leading to improvements over existing solutions in most cases. We believe that new avenues to LSO will be opened by ScaLES’ ability to identify out of distribution areas, differentiability, and computational tractability.

4366Subspace Optimiztion for Large Language Models with Convergence Guarantees

[openreview] [pdf]

Abstract Subspace optimization algorithms, with GaLore (Zhao et al., 2024) as a representative method, have gained popularity for pre-training or fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we unexpectedly discover that GaLore does not always converge to the optimal solution and substantiate this finding with an explicit counterexample. We then investigate the conditions under which GaLore can achieve convergence, demonstrating that it does so either in deterministic scenarios or when using a sufficiently large mini-batch size. More significantly, we introduceGoLore(Gradient randomLow-rank projection), a novel variant of GaLore that provably converges in stochastic settings, even with standard batch sizes. Our convergence analysis can be readily extended to other sparse subspace optimization algorithms. Finally, we conduct numerical experiments to validate our theoretical results and empirically explore the proposed mechanisms.

4367Multi-Concept Editing Using Task Arithmetic

[openreview] [pdf]

Abstract Model owners often wish to introduce new capabilities into their trained models or remove undesired ones. Task Vectors (TVs) present a promising new approach to editing models after training, allowing simple and controllable addition of new capabilities to the model and the removal of undesired ones. But what happens when the model owner wants to change multiple capabilities?In this work, we study the interactions of task vectors in a multi-edit setting for image classifiers and diffusion models. We start by quantifying the overall model degradation induced by applying many specific TVs simultaneously. We show that the overall model performance degrades rapidly as the quantity of TV edits increases. Finally, we explore different ways to mitigate this degradation and present an adaptive method to select the most relevant TVs to apply to a diffusion model during inference. Our technique achieves a 94.6% ROC AUC in identifying the correct TV, enabling the effective integration of multiple TV edits while significantly mitigating quality degradation.

4368Causal Abstraction Finds Universal Representation of Race in Large Language Models

[openreview] [pdf]

Abstract While there is growing interest in the potential bias of large language models (LLMs), especially in high-stakes decision making, it remains an open question how LLMs mechanistically encode such bias. We use causal abstraction (Geiger et al., 2023) to study how models use the race information in two high-stakes decision settings: college admissions and hiring. We find that Alpaca 7B, Mistral 7B, and Gemma 2B check for an applicants’ race and apply different preferential or discriminatory decision boundaries. The race subspace found by distributed alignment search generalizes across different tasks with average interchange intervention accuracies from 78.09% to 88.64% across the three models. We also propose a novel RaceQA task, where the model is asked to guess an applicant’s race from the name in their profile, to further probe the mechanism of the bias. We show that patching in a different race representation changes the model’s perception of the applicant’s race 99.80% of the time for Alpaca and 98.20% of the time for Mistral. Overall, our work provides evidence for a universal mechanism of racial bias in LLMs’ decision-making.

4369Optimal Protocols for Continual Learning via Statistical Physics and Control Theory

[openreview] [pdf]

Abstract Artificial neural networks often struggle withcatastrophic forgettingwhen learning multiple tasks sequentially, as training on new tasks degrades the performance on previously learned tasks. Recent theoretical work has addressed this issue by analysing learning curves in synthetic frameworks under predefined training protocols. However, these protocols relied on heuristics and lacked a solid theoretical foundation assessing their optimality. In this paper, we fill this gap by combining exact equations for training dynamics, derived using statistical physics techniques, with optimal control methods. We apply this approach to teacher-student models for continual learning and multi-task problems, obtaining a theory for task-selection protocols maximising performance while minimising forgetting. Our theoretical analysis offers non-trivial yet interpretable strategies for mitigating catastrophic forgetting, shedding light on how optimal learning protocols modulate established effects, such as the influence of task similarity on forgetting. Finally, we validate our theoretical findings with experiments on real-world data.

4370Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes

[openreview] [pdf]

Abstract Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty. However, average-reward MDPs have remained largely unexplored in reinforcement learning (RL) settings, with the majority of RL-based efforts having been allocated to episodic and discounted MDPs. In this work, we study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning: a novel RL framework that can be used to effectively and efficiently solve various subtasks simultaneously in the average-reward setting. We introduce a family of RED learning algorithms for prediction and control, including proven-convergent algorithms for the tabular case. We then showcase the power of these algorithms by demonstrating how they can be used to learn a policy that optimizes, for the first time, the well-known conditional value-at-risk (CVaR) risk measure in a fully-online manner, without the use of an explicit bi-level optimization scheme or an augmented state-space.

4371Erasing Concept Combination from Text-to-Image Diffusion Model

[openreview] [pdf]

Abstract Advancements in the text-to-image diffusion model have raised security concerns due to their potential to generate images with inappropriate themes such as societal biases and copyright infringements. Current studies make a great process to prevent the model from generating images containing specific high-risk visual concepts. However, these methods neglect the issue that inappropriate themes may also arise from the combination of benign visual concepts. Considering that the same image theme might be represented via multiple different visual concept combinations, and the model’s generation performance of the corresponding individual visual concepts is distorted easily while processing the visual concept combination, effectively erasing such visual concept combinations from the diffusion model remains a formidable challenge. To this end, we formulate such challenge as the Concept Combination Erasing (CCE) problem and propose a Concept Graph-based high-level Feature Decoupling framework (CoGFD) to address CCE. CoGFD identifies and decomposes visual concept combinations with a consistent image theme from an LLM-induced concept logic graph, and erases these combinations through decoupling oc-occurrent high-level features. These techniques enable CoGFD to erase visual concept combinations of image content while enjoying a much less negative effect, compared to SOTA baselines, on the generative fidelity of related individual concepts. Extensive experiments on diverse visual concept combination scenarios verify the effectiveness of CoGFD.

4372Debiased Imbalanced Pseudo-Labeling for Generalized Category Discovery

[openreview] [pdf]

Abstract Generalized Category Discovery (GCD) is a challenging task that aims to recognize seen and novel categories within unlabeled data by leveraging labeled data. Designing a prototype classifier to identify unlabeled samples instead of relying on traditional time-consuming clustering is well recognized as a milestone in GCD.However, we discover there exists a bias in this classifier: some seen categories are mistakenly classified as novel ones, leading to imbalanced pseudo-labeling during classifier learning. Based on this finding, we identify the low discriminability between seen and novel prototypes as the key issue. To address this issue, we propose DebiasGCD, an effective debiasing method that integratesdynamic prototype debiasing(DPD) andlocal representation alignment(LRA). DPD dynamically maintains inter-prototype margins, encouraging the network to strengthen the learning of class-specific features and enhance prototype discrimination. Additionally, LRA promotes local representation learning, enabling DPD to capture subtle details that further refine the understanding of class-specific features. In this way, it successfully improves prototype discriminability and generates more reliable predictions for seen classes. Extensive experiments validate that our method effectively mitigates pseudo-labeling bias across all datasets, especially on fine-grained ones. For instance, it delivers a 10.7% boost on `Old’ classes in CUB. Our code is available at:https://anonymous.4open.science/r/DebiasGCD-34F0.

4373Learning Physical Simulation with Message Passing Transformer

[openreview] [pdf]

Abstract Machine learning methods for physical simulation have achieved significant success in recent years. We propose a new universal architecture based on Graph Neural Network, the Message Passing Transformer, which incorporates a Message Passing framework, employs an Encoder-Processor-Decoder structure, and applies Graph Fourier Loss as loss function for model optimization. To take advantage of the past message passing state information, we propose Hadamard-Product Attention to update the node attribute in the Processor, Hadamard-Product Attention is a variant of Dot-Product Attention that focuses on more fine-grained semantics and emphasizes on assigning attention weights over each feature dimension rather than each position in the sequence relative to others. We further introduce Graph Fourier Loss (GFL) to balance high-energy and low-energy components. To improve time performance, we precompute the graph’s Laplacian eigenvectors before the training process. Our architecture achieves significant accuracy improvements in long-term rollouts for both Lagrangian and Eulerian dynamical systems over current methods.

4374Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs

[openreview] [pdf]

Abstract Although safely enhanced Large Language Models (LLMs) have achieved remarkable success in tackling various complex tasks in a zero-shot manner, they remain susceptible to jailbreak attacks, particularly the unknown jailbreak attack. To enhance LLMs’ generalized defense capabilities, we propose a two-stage adversarial tuning framework, which generates adversarial prompts to explore worst-case scenarios by optimizing datasets containing pairs of adversarial prompts and their safe responses. In the first stage, we introduce the hierarchical meta-universal adversarial prompt learning to efficiently and effectively generate token-level adversarial prompts. In the second stage, we propose automatic adversarial prompt learning to iteratively construct out-of-distribution adversarial prompts, further enhancing LLM’s defense capabilities. We conducted comprehensive experiments on three widely used jailbreak datasets, comparing our framework with six defense baselines under five representative attack scenarios. \fan{ Specifically, for the computational efficiency of generating token-level adversarial prompts, we demonstrate both empirically and theoretically that our method achieves approximately a 15x speedup. Additionally, our methods exhibit superior defense performance against both known and unknown jailbreak attacks. Importantly, our adversarial tuning framework shows broad generalizability across various attack strategies and target LLMs (including the large 110B model), highlighting its potential as a transferable defense mechanism.

4375AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning

[openreview] [pdf]

Abstract Articulated object manipulation is a critical capability for robots to perform various tasks in real-world scenarios. Composed of multiple parts connected by joints, articulated objects are endowed with diverse functional mechanisms through complex relative motions. For example, a safe consists of a door, a handle, and a lock, where the door can only be opened when the latch is unlocked. The internal structure, such as the state of a lock or joint angle constraints, cannot be directly observed from visual observation. Consequently, successful manipulation of these objects requires adaptive adjustment based on trial and error rather than a one-time visual inference. However, previous datasets and simulation environments for articulated objects have primarily focused on simple manipulation mechanisms where the complete manipulation process can be inferred from the object’s appearance. To enhance the diversity and complexity of adaptive manipulation mechanisms, we build a novel articulated object manipulation environment and equip it with 9 categories of articulated objects. Based on the environment and objects, we further propose an adaptive demonstration collection pipeline and a 3D visual diffusion-based imitation learning that learns the adaptive manipulation policy. The effectiveness of our designs and proposed method are validated through both simulation and real-world experiments.

4376Revealing and Mitigating Over-Attention in Knowledge Editing

[openreview] [pdf]

Abstract Large Language Models~(LLMs) have demonstrated superior performance across a wide range of tasks, but they still exhibit undesirable errors due to incorrect knowledge learned from the training data. To avoid this, knowledge editing methods emerged to precisely edit the specific model knowledge via efficiently modifying a very small percentage of parameters. However, those methods can lead to the problem ofSpecificity Failure, where the existing knowledge and capabilities are severely degraded due to editing. Our preliminary indicates that Specificity Failure primarily stems from the model’s attention heads assigning excessive attention scores to entities related to the edited knowledge, thereby unduly focusing on specific snippets within the context, which we denote as theAttention Driftphenomenon. To mitigate such Attention Drift issue, we introduce a simple yet effective methodSelectiveAttentionDriftRestriction(SADR), which introduces an additional regularization term during the knowledge editing process to restrict changes in the attention weight distribution, thereby preventing undue focus on the edited entity. Experiments on five frequently-used strong LLMs demonstrate the effectiveness of our method, where SADR can significantly mitigate Specificity Failure in the predominant knowledge editing tasks.

4377Self-controller: Controlling LLMs with Multi-round Step-by-step Self-awareness

[openreview] [pdf]

Abstract The applications of large language models (LLMs) have been widely spread across all domains. However, the basic abilities such as the controllability of LLMs are still limited. To address this, we propose “Self-controller\textbf{Self-controller}”, a novel agentic framework bringing self-awareness into LLMs’ reasoning logic. The core idea of this work is to maintain states based on the LLM’s response, letting the LLM become self-aware of current status and think step by step in a multi-round chain-of-thought paradigm. Our experiment on the state of textual length has shown the controllability and effectiveness of the Self-controller. We further implement a binary search algorithm to accelerate the generation process based on the linearity and monotonicity of the textual length state. Another advantage of the Self-controller comes with DeepSeek’s Context Caching technology, which significantly saves computational token consumption when a cluster of conversations shares the same prefix of context. Theoretically, we prove that in this scenario the extra time complexity is O(clogn)O(c \log n). Results of the back-of-the-envelope estimation suggest that the token consumption of our method is no more than twice as much as that of the trivial single-round generation. Furthermore, our ablation study on word constraints demonstrates the Self-controller’s consistent controllability across all foundation models.

4378Hyperparameters in Continual Learning: A Reality Check

[openreview] [pdf]

Abstract Continual learning (CL) aims to train a model on a sequence of tasks (i.e., a CL scenario) while balancing the trade-off between plasticity (learning new tasks effectively) and stability (retaining prior knowledge). The dominantly adopted conventional evaluation protocol for CL algorithms selects the best hyperparameters within a given scenario and then evaluates the algorithms using these hyperparameters in the same scenario. However, this protocol has significant shortcomings: it overestimates the CL capacity of algorithms and relies on unrealistic hyperparameter tuning, which is not feasible for real-world applications. From the fundamental principles of evaluation in machine learning, we argue that the evaluation of CL algorithms should focus on assessing the generalizability of their CL capacity to unseen scenarios. Based on this, we propose a revised two-phase evaluation protocol consisting of a hyperparameter tuning phase and an evaluation phase. Both phases share the same scenario configuration (e.g., number of tasks) but are generated from different datasets. Hyperparameters of CL algorithms are tuned in the first phase and applied in the second phase to evaluate the algorithms. We apply this protocol to class-incremental learning, both with and without pretrained models. Across more than 8,000 experiments, our results show that most state-of-the-art algorithms fail to replicate their reported performance, highlighting that their CL capacity has been significantly overestimated in the conventional evaluation protocol.

4379Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos

[openreview] [pdf]

Abstract There has been growing sentiment recently that modern large multimodal models (LMMs) have addressed most of the key challenges related to short video comprehension. As a result, both academia and industry are gradually shifting their attention towards the more complex challenges posed by understanding long-form videos. However, is this really the case? Our studies indicate that LMMs still lack many fundamental reasoning capabilities even when dealing with short videos. We introduce Vinoground, a temporal counterfactual LMM evaluation benchmark encompassing 1000 short and natural video-caption pairs. We demonstrate that existing LMMs severely struggle to distinguish temporal differences between different actions and object transformations. For example, the best model GPT-4o only obtains \sim50% on our text and video scores, showing a large gap compared to the human baseline of \sim90%. All open-source multimodal models and CLIP-based models perform much worse, producing mostly random chance performance. Through this work, we shed light onto the fact that temporal reasoning in short videos is a problem yet to be fully solved. We will make our benchmark publicly available.

4380ToEdit: How to Synthesize Text Data to Avoid Model Collapse?

[openreview] [pdf]

Abstract We explore model collapse caused by synthetic data, where AI models trained on such data experience a gradual decline in performance. Our initial analysis examines language model pretraining on mixed human and synthetic data, highlighting performance degradation. Further statistical analysis reveals distributional shifts and an over-concentration of n-gram features caused by synthetic data. Inspired by these insights, we propose token-level editing on human data, to obtain semi-synthetic data instead of fully using model outputs. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conducted extensive experiments on pretraining, continual pretraining, and supervised fine-tuning of language models. The results validate our theoretical proof that token-level editing improves data quality and enhances model performance.

4381CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

[openreview] [pdf]

Abstract We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of six leading models across five scenarios (each with 337 use cases) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated “harmless” models making recommendations which should be recognised as obviously harmful to the user given the context provided. Key failure modes include improper weighing of conflicting preferences, sycophancy (prioritising user preferences above safety), a lack of attentiveness to critical user information in the context window, and inconsistent application of user-specific knowledge. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic ‘harmless and helpful’ reminder. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.

4382Aligning Large Language Models With Preference Privacy

[openreview] [pdf]

Abstract Alignment is a crucial part in the implementation pipeline of Large Language Models (LLMs) that utilizes human feedback to ensure that LLMs adhere to human values and societal norms. This introduces privacy threats associated with the identity and preferences of the labelers responsible for creating the human feedback data. Several recent works have explored using differential privacy (DP) as a notion to protect the privacy of human labeled data; primarily relying on DP-SGD based solutions, which privatize the gradients during fine-tuning and alignment. Human preferences, however are only associated with the labels of the (prompt, response) tuples; therefore DP-SGD based approaches can be superfluous, providing more privacy than necessary and can degrade model utility. In this work, we focus on the problem of aligning LLMs with preference level privacy, which only preserve the privacy of preferences provided by humans. We build and expand upon the concept of label DP for this problem, and present a series of increasingly sophisticated, yet practical privacy preserving mechanisms for alignment. Specifically, starting from a standard randomized response (RR) mechanism which randomly flips human preferences, and it’s corresponding \textit{unbiased} RR mechanism (which ensures an unbiased loss during alignment), we propose a new mechanism, PROPS (PROgressively Private Self-alignment). PROPS works in multiple stages as follows: in each stage, the privately trained and partially aligned model from the previous stage to act as a labeler for the training data for the next stage and combine it with RR which is repeated across multiple stages. Motivation for PROPS comes from the following critical observations: a) learning to label correct preferences might be an easier problem than generating responsible content; b) progressively combining RR with partially aligned models for labeling preferences significantly reduces the amount of necessary perturbation needed for privacy and also shows the potential of possibly reducing the number of human labeled preference samples. We present proof-of-concept experiments that demonstrate the feasibility and effectiveness of our proposed approach and show that preference privacy based alignment can still attain a comparable utility to their non-privately aligned counterparts.

4383FCVL: Fourier Cross-View Learning for Generalizable 3D Object Detection in Bird’s Eye View

[openreview] [pdf]

Abstract Improving the generalization of Birds’ Eye View (BEV) detection models is essential for safe driving in real world. In this paper, we consider a realistic yet more challenging scenario, which aims to improve the generalization with single source data for training, as collecting multiple source data is time-consuming and labor intensive in autonomous driving. To achieve this, we rethink the task from a frequency perspective and exploit the cross-view consistency between adjacent perspectives. We propose the Fourier Cross-View Learning (FCVL) framework including Fourier Hierarchical Augmentation (FHiAug), an augmentation strategy in frequency domain to boost domain diversity and Fourier Cross-View Semantic Consistency Loss to facilitate the model to learn more domain-invariant features. Furthermore, we provide theoretical guarantees via augmentation graph theory. To the best of our knowledge, this is the first study to explore generalizable 3D Object Detection in BEV with single source data, and extensive experiments on various testing domains have demonstrated that our approach achieves the best performance on various test domains with single source data.

4384A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

[openreview] [pdf]

Abstract Training and serving long-context large language models (LLMs) incurs substantial overhead. To address this, two critical steps are often required: a pretrained LLM typically undergoes a separate stage for context length extension by training on long-context data, followed by architectural modifications to reduce the overhead of KV cache during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (initial ones), and blockwise sparse attention (strided token blocks) are well-suited for building efficient long-context models, primarily due to their GPU-friendly memory access patterns, enabling efficiency gains not just theoretically but in practice as well. (2) It is essential for the model to have direct access to all tokens. A hybrid architecture with 1/3 full attention layers and 2/3 efficient ones achieves a balanced trade-off between efficiency and long-context performance. (3) Lightweight training on 5B long-context data is sufficient to extend the hybrid model’s context length from 4K to 128K.We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its effectiveness across different scales. During training with 128K-long contexts, LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%, compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding speedup. Compared to baselines that apply KV-cache reduction techniques to full-attention long-context LLMs, LongGen achieves substantially stronger performance not only on the Needle-in-a-Haystack retrieval task, but also on more challenging long-context reasoning tasks, including BABILong and RULER.

4385Can Transformers Reason Logically? A Study in SAT Solving

[openreview] [pdf]

Abstract We theoretically and empirically study the logical reasoning capabilities of LLMs in the context of the Boolean satisfiability (SAT) problem. First, we construct a decoder-only Transformer that can solve SAT using backtracking and deduction via Chain-of-Thought (CoT). We prove its correctness by showing trace equivalence to the well-known DPLL SAT-solving algorithm. Second, to support the implementation of this abstract construction, we design a compiler PARAT that takes as input a procedural specification and outputs a transformer model implementing this specification. Third, rather than programming a transformer to reason, we evaluate empirically whether it can be trained to do so by learning directly from algorithmic traces (``reasoning paths’') of the DPLL algorithm.

4386Decoding Generalization from Memorization in Deep Neural Networks

[openreview] [pdf]

Abstract Overparameterized Deep Neural Networks that generalize well have been key to the dramatic success of Deep Learning in recent years. The reasons for their remarkable ability to generalize are not well understood yet. It has also been known that deep networks possess the ability to memorize training data, as evidenced by perfect or high training accuracies on models trained with corrupted data that have class labels shuffled to varying degrees. Concomitantly, such models are known to generalize poorly, i.e. they suffer from poor test accuracies, due to which it is thought that the act of memorizing substantially degrades the ability to generalize. It has, however, been unclear why the poor generalization that accompanies such memorization, comes about. One possibility is that in the process of training with corrupted data, the layers of the network irretrievably re-organize their representations in a manner that makes generalization difficult. The other possibility is that the network retains significant ability to generalize, but the trained network somehow “chooses” to readout in a manner that is detrimental to generalization. Here, we provide evidence for the latter possibility by demonstrating, empirically, that such models possess information in their representations for substantially improved generalization, even in the face of memorization. Furthermore, such generalization abilities can be easily decoded from the internals of the trained model, and we build a technique to do so from the outputs of specific layers of the network. In particular, we show the following: (1) For models trained using standard methods & datasets with corrupted training data, while the model has poor test accuracy, we can build a simple classifier with dramatically better test accuracy that uses only the model’s hidden layer outputs obtained for the (corrupted) training set. (2) For the aforementioned models, if the true training class labels are known post hoc, i.e. after the model is trained, we can build a simple classifier, with significantly better generalization performance than in (1). This is true, in many cases, even for models where training class labels are shuffled with equal probability. This demonstrates that the layers of the network maintain representations in a manner that is amenable to straightforward generalization to a degree not previously recognized. (3) On the other hand, we asked if a model trained on the true training labels similarly retained the capability to memorize easily. Adapting our technique to this setting, we find that in a few cases, we can extract a high degree of memorization. The same classifier sometimes exhibits high test accuracy (on the true test labels), which further supports the idea that generalization can co-exist with memorization. Together, these results suggest a more nuanced view of the interplay of generalization with memorization in Deep Learning and suggest the need for further experiments and theory to better understand this phenomenon.

4387Memory Efficient Transformer Adapter for Dense Predictions

[openreview] [pdf]

Abstract While current Vision Transformer (ViT) adapter methods have shown promising accuracy, their inference speed is implicitly hindered by inefficient memory access operations, e.g., standard normalization and frequent reshaping. In this work, we propose META, a simple and fast ViT adapter that can improve the model’s memory efficiency and decrease memory time consumption by reducing the inefficient memory access operations. Our method features a memory-efficient adapter block that enables the common sharing of layer normalization between the self-attention and feed-forward network layers, thereby reducing the model’s reliance on normalization operations. Within the proposed block, the cross-shaped self-attention is employed to reduce the model’s frequent reshaping operations. Moreover, we augment the adapter block with a lightweight convolutional branch that can enhance local inductive biases, particularly beneficial for the dense prediction tasks, e.g., object detection, instance segmentation, and semantic segmentation. The adapter block is finally formulated in a cascaded manner to compute diverse head features, thereby enriching the variety of feature representations. Empirically, extensive evaluations on multiple representative datasets validate that META substantially enhances the predicted quality, while achieving a new state-of-the-art accuracy-efficiency trade-off. Theoretically, we demonstrate that META exhibits superior generalization capability and stronger adaptability.

4388A Unified Framework for Hierarchical Diffusion via Simplicial Complexes

[openreview] [pdf]

Abstract In this paper, we propose a unified framework for hierarchical diffusion via simplicial complexes (HDSC), which enables adaptive diffusion across different levels of simplicial complexes, including nodes, edges, and triangles. To ensure the accuracy and consistency of information transmission during the diffusion process, we investigate topological consistency constraints, achieving efficient coupling between structures at various levels. Additionally, by introducing a time-dependent topological memory mechanism, we further enhance the smoothness and coherence of global information flow, enabling features at different levels to diffuse cooperatively throughout the entire graph structure. Experimental results demonstrate that HDSC exhibits significant performance advantages over traditional methods. Furthermore, as the complexity and dimensionality of the graph increase, HDSC continues to maintain its superiority, effectively avoiding the phenomenon of node feature homogenization.

4389Multistep Consistency Models

[openreview] [pdf]

Abstract Diffusion models are relatively easy to train but require many steps to generate samples. Consistency models are far more difficult to train, but generate samples in a single step.In this paper we propose Multistep Consistency Models: A unification between Consistency Models (Song et al., 2023) and TRACT (Berthelotet al., 2023) that can interpolate between a consistency model and a diffusion model: a trade-off between sampling speed and sampling quality. Specifically, a 1-step consistency model is a conventional consistency model whereas a \infty-step consistency model is a diffusion model.Multistep Consistency Models work really well in practice. By increasing the sample budget from a single step to 2-8 steps, we can train models more easily that generate higher quality samples, while retaining much of the sampling speed benefits. Notable results are 1.4 FID on Imagenet 64 in 8 sampling steps and 2.1 FID on Imagenet128 in 8 sampling steps with consistency distillation, using simple losses without adversarial training. We also show that our method scales to a text-to-image diffusion model, generating samples that are close to the quality of the original model.

4390Private Stochastic Optimization for Achieving Second-Order Stationary Points

[openreview] [pdf]

Abstract This paper addresses the challenge of achieving second-order stationary points (SOSP) in differentially private stochastic non-convex optimization. We identify two key limitations in the state-of-the-art: (i) inaccurate error rates caused by the omission of gradient variance in saddle point escape analysis, resulting in inappropriate parameter choices and overly optimistic performance estimates, and (ii) inefficiencies in private SOSP selection via the AboveThreshold algorithm, particularly in distributed learning settings, where perturbing and sharing Hessian matrices introduces significant additional noise. To overcome these challenges, we revisit perturbed stochastic gradient descent (SGD) with Gaussian noise and propose a new framework that leverages general gradient oracles. This framework introduces a novel criterion based on model drift distance, ensuring provable saddle point escape and efficient convergence to approximate local minima with low iteration complexity. Using an adaptive SPIDER as the gradient oracle, we establish a new DP algorithm that corrects existing error rates. Furthermore, we extend our approach to a distributed adaptive SPIDER, applying our framework to distributed learning scenarios and providing the first theoretical results on achieving SOSP under differential privacy in distributed environments with heterogeneous data. Finally, we analyze the limitations of the AboveThreshold algorithm for private model selection in distributed learning and show that as model dimensions increase, the selection process introduces additional errors, further demonstrating the superiority of our proposed framework.

4391EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition

[openreview] [pdf]

Abstract The task of Visual Place Recognition (VPR) is to predict the location of a query image from a database of geo-tagged images. Recent studies in VPR have highlighted the significant advantage of employing pre-trained foundation models like DINOv2 for the VPR task. However, these models are often deemed inadequate for VPR without further fine-tuning on VPR-specific data. In this paper, we present an effective approach to harness the potential of a foundation model for VPR. We show that features extracted from self-attention layers can act as a powerful re-ranker for VPR, even in a zero-shot setting. Our method not only outperforms previous zero-shot approaches but also introduces results competitive with several supervised methods. We then show that a single-stage approach utilizing internal ViT layers for pooling can produce global features that achieve state-of-the-art performance, with impressive feature compactness down to 128D. Moreover, integrating our local foundation features for re-ranking further widens this performance gap. Our method also demonstrates exceptional robustness and generalization, setting new state-of-the-art performance, while handling challenging conditions such as occlusion, day-night transitions, and seasonal variations.

4392NuwaTS: a Foundation Model Mending Every Incomplete Time Series

[openreview] [pdf]

Abstract Time series imputation is critical for many real-world applications and has been widely studied. However, existing models often require specialized designs tailored to specific missing patterns, variables, or domains which limits their generalizability. In addition, current evaluation frameworks primarily focus on domain-specific tasks and often rely on time-wise train/validation/test data splits, which fail to rigorously assess a model’s ability to generalize across unseen variables or domains. In this paper, we present \textbf{NuwaTS}, a novel framework that repurposes Pre-trained Language Models (PLMs) for general time series imputation. Once trained, NuwaTS can be applied to impute missing data across any domain. We introduce specialized embeddings for each sub-series patch, capturing information about the patch, its missing data patterns, and its statistical characteristics. By combining contrastive learning with the imputation task, we train PLMs to create a versatile, one-for-all imputation model. Additionally, we employ a plug-and-play fine-tuning approach, enabling efficient adaptation to domain-specific tasks with minimal adjustments. To evaluate cross-variable and cross-domain generalization, we propose a new benchmarking protocol that partitions the datasets along the variable dimension. Experimental results on over seventeen million time series from diverse domains demonstrate that NuwaTS outperforms state-of-the-art domain-specific models across various datasets under the proposed benchmarking protocol. Furthermore, we show that NuwaTS generalizes to other time series tasks, such as forecasting.

4393PruneFuse: Efficient Data Selection via Weight Pruning and Network Fusion

[openreview] [pdf]

Abstract Efficient data selection is crucial for enhancing the training efficiency of deep neural networks and minimizing annotation requirements. Traditional methods often face high computational costs, limiting their scalability and practical use. We introduce PruneFuse, a novel strategy that leverages pruned networks for data selection and later fuses them with the original network to optimize training. PruneFuse operates in two stages: First, it applies structured pruning to create a smaller pruned network that, due to its structural coherence with the original network, is well-suited for the data selection task. This small network is then trained and selects the most informative samples from the dataset. Second, the trained pruned network is seamlessly fused with the original network. This integration leverages the insights gained during the training of the pruned network to facilitate the learning process of the fused network while leaving room for the network to discover more robust solutions. Extensive experimentation on various datasets demonstrates that PruneFuse significantly reduces computational costs for data selection, achieves better performance than baselines, and accelerates the overall training process.

4394QMP: Q-switch Mixture of Policies for Multi-Task Behavior Sharing

[openreview] [pdf]

Abstract Multi-task reinforcement learning (MTRL) aims to learn several tasks simultaneously for better sample efficiency than learning them separately. Traditional methods achieve this by sharing parameters or relabeling data between tasks. In this work, we introduce a new framework for sharing behavioral policies across tasks, which can be used in addition to existing MTRL methods. The key idea is to improve each task’s off-policy data collection by employing behaviors from other task policies. Selectively sharing helpful behaviors acquired in one task to collect training data for another task can lead to higher-quality trajectories, leading to more sample-efficient MTRL. Thus, we introduce a simple and principled framework called Q-switch mixture of policies (QMP) that selectively shares behavior between different task policies by using the task’s Q-function to evaluate and select useful shareable behaviors. We theoretically analyze how QMP improves the sample efficiency of the underlying RL algorithm. Our experiments show that QMP’s behavioral policy sharing provides complementary gains over many popular MTRL algorithms and outperforms alternative ways to share behaviors in various manipulation, locomotion, and navigation environments. Videos are available athttps://sites.google.com/view/qmp-mtrl.

4395InstructRAG: Instructing Retrieval-Augmented Generation via Self-Synthesized Rationales

[openreview] [pdf]

Abstract Retrieval-augmented generation (RAG) has shown promising potential to enhance the accuracy and factuality of language models (LMs). However, imperfect retrievers or noisy corpora can introduce misleading or even erroneous information to the retrieved contents, posing a significant challenge to the generation quality. Existing RAG methods typically address this challenge by directly predicting final answers despite potentially noisy inputs, resulting in an implicit denoising process that is difficult to interpret and verify. On the other hand, the acquisition of explicit denoising supervision is often costly, involving significant human efforts. In this work, we propose InstructRAG, where LMs explicitly learn the denoising process through self-synthesized rationales --- First, we instruct the LM to explain how the ground-truth answer is derived from retrieved documents. Then, these rationales can be used either as demonstrations for in-context learning of explicit denoising or as supervised fine-tuning data to train the model. Compared to standard RAG approaches, InstructRAG requires no additional supervision, allows for easier verification of the predicted answers, and effectively improves generation accuracy. Experiments show InstructRAG consistently outperforms existing RAG methods in both training-free and trainable scenarios, achieving a relative improvement of 8.3% over the best baseline method on average across five knowledge-intensive benchmarks. Extensive analysis indicates that InstructRAG scales well with increased numbers of retrieved documents and consistently exhibits robust denoising ability even in out-of-domain datasets, demonstrating strong generalizability.

4396LLM4Solver: Large Language Model for Efficient Algorithm Design of Combinatorial Optimization Solver

[openreview] [pdf]

Abstract The optimization of algorithms in exact combinatorial optimization (CO) solver plays a fundamental role in operations research. However, due to the extensive requirements on domain knowledge and the large search space for algorithm design, the refinement on these algorithms remains highly challenging for both manual and learning-based paradigms. To tackle this problem, we propose a novel machine learning framework---large language model for exact combinatorial optimization solver (LLM4Solver)---to efficiently\textit{efficiently} design high-quality algorithms of the CO solvers. The core idea is that, instead of searching in the high-dimensional and discrete symbolic space from scratch, we can utilize the prior knowledge learned from large language models to directly search in the space of programming languages. Specifically, we first use a pre-trained LLM as the generator for high-quality algorithms. Then, to efficiently explore the discrete and non-gradient algorithm space, we employ a derivative-free evolutionary framework as the algorithm optimizer. Experiments on extensive benchmarks show that the algorithms learned by LLM4Solver significantly\textit{significantly} outperform all the state-of-the-art (SOTA) human-designed and learning-based policies (on GPU) in terms of the solution quality, the solving efficiency, and the cross-benchmark generalization ability. The appealing features of LLM4Solver include 1) the high training efficiency to outperform SOTA methods within ten iterations, and 2) the high cross-benchmark generalization ability on heterogeneous MIPLIB 2017. LLM4Solver shows the encouraging potential to efficiently design algorithms for the next generation of modern CO solvers.

4397As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss

[openreview] [pdf]

Abstract Direct Preference Optimization (DPO) has emerged as a more computationally efficient alternative to Reinforcement Learning from Human Feedback (RLHF) with Proximal Policy Optimization (PPO), eliminating the need for reward models and online sampling. Despite these benefits, DPO and its variants remain sensitive to hyper-parameters and prone to instability, particularly on mathematical datasets. We argue that these issues arise from the unidirectional likelihood-derivative negative feedback inherent in the log-likelihood loss function. To address this, we propose a novel LLM alignment loss that establishes a stable Bidirectional Negative Feedback (BNF) during optimization. Our proposed BNF loss eliminates the need for pairwise contrastive losses and does not require any extra tunable hyper-parameters or pairwise preference data, streamlining the alignment pipeline to be as simple as supervised fine-tuning. We conduct extensive experiments across two challenging QA benchmarks and four reasoning benchmarks. The experimental results show that BNF achieves comparable performance to the best methods on QA benchmarks, while its performance decrease on the four reasoning benchmarks is significantly lower compared to the best methods, thus striking a better balance between value alignment and reasoning ability. In addition, we further validate the performance of BNF on non-pairwise datasets, and conduct in-depth analysis of log-likelihood and logit shifts across different preference optimization methods. We will release all the source code, checkpoints, and datasets on GitHub.

4398MOUCHI: Mitigating Over-forgetting in Unlearning Copyrighted Information

[openreview] [pdf]

Abstract Large language models are trained on massive internet datasets, which may inadvertently memorize illegal copyrighted content, making its inclusion unavoidable. Unlearning is a potential solution to remove such content. However, existing unlearning methods often suffer fromover-forgetting, where the process unintentionally erases knowledge similar to the copyrighted content that falls under fair use and should be preserved. To address this issue, we proposeMOUCHI, a novel unlearning framework that introduces the concept ofderivative knowledge, a subset of information derived from copyrighted content that must be retained during unlearning. MOUCHI first generates derivative knowledge and then incorporates a derivative loss function into the unlearning process to mitigate over-forgetting in unlearning copyrighted content. Due to its plug-and-play nature, MOUCHI can be effortlessly integrated into existing unlearning methods. Experimental results show that MOUCHI reduces unintended knowledge loss, improving performance byup to 145%compared to baseline methods when evaluated on the derivative set.

4399MOUCHI: Mitigating Over-forgetting in Unlearning Copyrighted Information

[openreview] [pdf]

Abstract No absctract

4400Towards Realistic Hyperparameter Optimization in Continual Learning

[openreview] [pdf]

Abstract In continual learning (CL)—where a learner trains on a stream of data—standard hyperparameter optimisation (HPO) cannot be applied, as a learner does not have access to all of the data at the same time. This has prompted the development of CL-specific HPO frameworks. The most popular way to tune hyperparameters in CL is to repeatedly train over the whole data stream with different hyperparameter settings. However, thisend-of-trainingHPO is unrealistic as in practice a learner can only see the stream once. Hence, there is an open question:what HPO framework should a practitioner use for a CL problem in reality?This paper answers this question by comparing several realistic HPO frameworks. We find that none of the HPO frameworks considered, including end-of-training HPO, perform consistently better than the rest on popular CL benchmarks. We therefore arrive at a twofold conclusion: a) on the popular CL benchmarks examined, a CL practitioner should select the HPO framework based on other factors, for example compute efficiency and b) to be able to discriminate between HPO frameworks there is a need to move beyond the current most commonly used CL benchmarks.

4401ROLoRA: Rank Optimization for Low-Rank Adaptation under Memory Constraints

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) has emerged as a prominent technique for fine-tuning large language models (LLMs) with limited computational resources. However, by injecting low-rank adapters with a rank identical across all layers, standard LoRA overlooks the varying importance of the weight matrices, often leading to suboptimal performance. Therefore, discovering an optimal rank configuration that efficiently utilizes limited training resources remains an open question. Existing solutions typically compromises computational constraints for performance gains, limiting their practical usage in resource-constrained scenarios. To address these issues, in this paper, we propose a novel method named ROLoRA to efficiently discover an effective rank configuration for low-rank adaptation, while strictly adhering to a constrained computational budget during training. In particular, our method iteratively prunes saturated adapters and expands under-fitted ones to increase their capacity until they converge to a highly optimized configuration. Our approach is delicately designed within the Frank-Wolfe algorithmic framework, which offers potential theoretical guarantees. Experimentally, we demonstrate that ROLoRA outperforms standard LoRA on common natural language processing tasks, including the GLUE and SQuAD benchmarks. Additionally, we provide a comprehensive analysis to explain why ROLoRA surpasses competing state-of-the-arts.

4402Deciphering and Enhancing Commonsense Reasoning in LLMs from the Perspective of Intrinsic Factual Knowledge Retrieval

[openreview] [pdf]

Abstract Commonsense reasoning in large language models (LLMs) bridges the gap to physical world, thus allowing them to think and behave more like humans. Previous research has shown that LLMs acquire the underlying factual knowledge from extensive training corpora and store it within their parameters. However, how LLMs apply this knowledge during the inference phase remains unclear. This lack of transparency makes it difficult to determine whether shortcomings in LLMs are due to a lack of factual knowledge or insufficient reasoning capabilities. In this work, we aim to decipher the commonsense reasoning process into human-understandable steps. By interpreting the hidden states in different transformer layers and token positions, we uncover a specific mechanism by which LLMs execute reasoning. Our extensive experiments indicate: 1) both attention head and multi-layer perceptron (MLP) contribute to the generation of factual knowledge from different perspective. 2) The process of commonsense reasoning in LLMs involves a clear sequence of knowledge augmentation, knowledge retrieval and answer generation, akin to retrieval-augmented generation. Building on these findings, we have discovered that LLMs often contain relevant facutal knowledge but fail to retrieve the correct knowledge at top. To address this issure, we selectively fine-tuned the key heads and MLPs, resulting in notably improvements in reasoning performance in both in-domain and out-of-domain settings.

4403ResidualViT for Efficient Zero-Shot Natural Language Temporal Video Grounding

[openreview] [pdf]

Abstract The goal of this work is to efficiently compute frame-level features from videos for the Zero-Shot Natural Language Temporal Video Grounding (NLTVG) task. The contributions of this work are three-fold. First, we introduce a novel vision transformer (ViT) architecture, dubbed ResidualViT, that capitalizes on the large temporal redundancies in video. Our architecture incorporates (i) learnable residual connections that ensure temporal consistency across consecutive frames and (ii) a token reduction module for enhancing processing speed by selectively discarding temporally redundant information. Second, we describe a lightweight distillation strategy that enables learning parameters of ResidualViT from existing frame encoders without additional manual annotation. Finally, we validate the effectiveness of our approach across three diverse datasets, demonstrating significant reductions in computational cost (up to 60%) and improvements in inference speed (up to 2.5x faster), all while observing marginal accuracy reduction with respect to the teacher model.

4404Runtime Learning Machine

[openreview] [pdf]

Abstract This paper proposes the Runtime Learning Machine for safety-critical autonomous systems. The learning machine has three interactive components: a high-performance (HP)-Student, a high-assurance (HA)-Teacher, and a Coordinator. The HP-Student is a high-performance but not fully verified Phy-DRL (physics-regulated deep reinforcement learning) agent that performs safe runtime learning in real plants, using real-time sensor data from real-time physical environments. On the other hand, HA-Teacher is a verified but simplified design, focusing on safety-critical functions. As a complementary, HA-Teacher’s novelty lies in real-time patch for two missions: i) correcting unsafe learning of HP-Student, and ii) backing up safety. The Coordinator manages the interaction between HP-Student and HA-Teacher. Powered by the three interactive components, the runtime learning machine notably features i) assuring lifetime safety (i.e., safety guarantee in any runtime learning stage, regardless of HP-Student’s success), ii) tolerating unknown unknowns, iii) addressing Sim2Real gap, and iv) automatic hierarchy learning (i.e., safety-first learning, and then high-performance learning). Experimental results involving a cart-pole system and a real quadruped robot, as well as comparisons with state-of-the-art safe DRL, fault-tolerant DRL, and approaches for addressing Sim2Real gap, demonstrate the machine’s effectiveness and unique features.

4405Empirical Guidelines for Deploying LLMs onto Resource-constrained Edge Devices

[openreview] [pdf]

Abstract The scaling laws have become the de facto guidelines for designing large language models (LLMs), but they were studied under the assumption of unlimited computing resources for both training and inference. As LLMs are increasingly used as personalized intelligent assistants, their customization (i.e., learning through fine-tuning) and deployment onto resource-constrained edge devices will become more and more prevalent. An urgent but open question is how a resource-constrained computing environment would affect the design choices for a personalized LLM. We study this problem empirically in this work. In particular, we consider the tradeoffs among a number of key design factors and their intertwined impacts on learning efficiency and accuracy. The factors include the learning methods for LLM customization, the amount of personalized data used for learning customization, the types and sizes of LLMs, the compression methods of LLMs, the amount of time afforded to learn, and the difficulty levels of the target use cases. Through extensive experimentation and benchmarking, we draw a number of surprisingly insightful guidelines for deploying LLMs onto resource-constrained devices. For example, an optimal choice between parameter learning and RAG may vary depending on the difficulty of the downstream task, the longer fine-tuning time does not necessarily help the model, and a compressed LLM may be a better choice than an uncompressed LLM to learn from limited personalized data.

4406Multi-Model Induced Source-free Video Domain Adaptation

[openreview] [pdf]

Abstract Existing Source-free Video Domain Adaptation (SFVDA) aims to learn a target video model for an unlabeled target domain by transferring knowledge from a labeled source domain using a single pre-trained source video model. In this paper, we explore a new SFVDA setting where multiple source domains exist, each offering a library of source models with different architectures. This setting offers both opportunities and challenges: while the presence of multiple source models enriches the pool of transferable knowledge, it also increases the risk of negative transfer due to inappropriate source knowledge. To tackle these challenges, we introduce the Multiple-Source-Video-Model Aggregation Framework (MSVMA), comprising two key modules. The first module, termed Multi-level Instance Transferability Calibration (MITC), enhances existing uncertainty-based transferability estimation metrics by incorporating scale information from both group and dataset levels. This integration facilitates accurate transferability estimation at the instance level across diverse models. The second module, termed Instance-level Multi Video Model Aggregation (IMVMA), leverages the calculated instance-level transferability to guide a path generation network. This network produces instance-specific weights for unsupervised aggregation of source models. Empirical results from three video domain adaptation datasets demonstrate the state-of-the-art performance of our MSVMA framework.

4407Efficient Imitation under Misspecification

[openreview] [pdf]

Abstract Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert’s policy lies within the learner’s policy class (i.e. the learner can perfectly imitate the expert in all states). However, in practice, perfect imitation of the expert perfectly is often impossible due to differences in state information and action space expressiveness (e.g. morphological differences between humans and humanoid robots.) In this paper, we consider the more generalmisspecifiedsetting, where no assumptions are made about the expert policy’s realizability. We introduce a novel structural condition,reward-agnostic policy completeness, and prove that it is sufficient for interactive IL algorithms to efficiently avoid the quadratically compounding errors that stymie offline approaches like behavioral cloning. We address an additional practical constraint---the case of limited expert data---and propose a principled method for using sub-optimal data to further improve the sample-efficiency of interactive IL algorithms. Finally, we corroborate our theory with experiments on a suite of continuous control tasks.

4408COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences

[openreview] [pdf]

Abstract Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is insufficient to capture the full range of general human preferences. To achieve robust alignment with general preferences, we model the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees a 50% win rate against any competing policy. However, previous algorithms for finding the Nash policy either diverge or converge to a Nash policy in a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm,CovergentMetaAlignment Algorithm (COMAL), for language model alignment with general preferences, inspired by convergent algorithms in game theory. Theoretically, we prove that our meta-algorithm converges to an exact Nash policy. Additionally, our meta-algorithm is simple and can be integrated with many existing methods designed for RLHF and preference optimization with minimal changes. Experimental results demonstrate the effectiveness of the proposed framework when combined with existing preference policy optimization methods.

4409Breach By A Thousand Leaks: Unsafe Information Leakage in ‘Safe’ AI Responses

[openreview] [pdf]

Abstract Vulnerability of Frontier language models to misuse and jailbreaks has prompted the development of safety measures like filters and alignment training in an effort to ensure safety through robustness to adversarially crafted prompts. We assert that robustness is fundamentally insufficient for ensuring safety goals, and current defenses and evaluation methods fail to account for risks of dual-intent queries and their composition for malicious goals. To quantify these risks, we introduce a new safety evaluation framework based on \textit{impermissible information leakage} of model outputs and demonstrate how our proposed question-decomposition attack can extract dangerous knowledge from a censored LLM more effectively than traditional jailbreaking. Underlying our proposed evaluation method is a novel information-theoretic threat model of \textit{inferential adversaries}, distinguished from \textit{security adversaries}, such as jailbreaks, in that success is measured by inferring impermissible knowledge from victim outputs as opposed to forcing explicitly impermissible outputs from the victim. Through our information-theoretic framework, we show that to ensure safety against inferential adversaries, defense mechanisms must ensure \textit{information censorship}, bounding the leakage of impermissible information. However, we prove that such defenses inevitably incur a safety-utility trade-off.

4410Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models

[openreview] [pdf]

Abstract Beyond high-fidelity image synthesis, diffusion models have recently exhibited promising results in dense visual perception tasks. However, most existing work treats diffusion models as a standalone component for perception tasks, employing them either solely for off-the-shelf data augmentation or as mere feature extractors. In contrast to these isolated and thus sub-optimal efforts, we introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, that can simultaneously handle both multi-modal data generation and dense visual perception, through a unique exploitation of the diffusion-denoising process. Within this framework, we further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set. Importantly, Diff-2-in-1 optimizes the utilization of the created diverse and faithful data by leveraging a novel self-improving learning mechanism. Comprehensive experimental evaluations validate the effectiveness of our framework, showcasing consistent performance improvements across various discriminative backbones and high-quality multi-modal data generation characterized by both realism and usefulness.

4411Overcoming Slow Decision Frequencies in Continuous Control: Model-Based Sequence Reinforcement Learning for Model-Free Control

[openreview] [pdf]

Abstract Reinforcement learning (RL) is rapidly reaching and surpassing human-level control capabilities. However, state-of-the-art RL algorithms often require timesteps and reaction times significantly faster than human capabilities, which is impractical in real-world settings and typically necessitates specialized hardware. Such speeds are difficult to achieve in the real world and often requires specialized hardware. We introduce Sequence Reinforcement Learning (SRL), an RL algorithm designed to produce a sequence of actions for a given input state, enabling effective control at lower decision frequencies. SRL addresses the challenges of learning action sequences by employing both a model and an actor-critic architecture operating at different temporal scales. We propose a “temporal recall” mechanism, where the critic uses the model to estimate intermediate states between primitive actions, providing a learning signal for each individual action within the sequence. Once training is complete, the actor can generate action sequences independently of the model, achieving model-free control at a slower frequency. We evaluate SRL on a suite of continuous control tasks, demonstrating that it achieves performance comparable to state-of-the-art algorithms while significantly reducing actor sample complexity. To better assess performance across varying decision frequencies, we introduce the Frequency-Averaged Score (FAS) metric. Our results show that SRL significantly outperforms traditional RL algorithms in terms of FAS, making it particularly suitable for applications requiring variable decision frequencies. Additionally, we compare SRL with model-based online planning, showing that SRL achieves superior FAS while leveraging the same model during training that online planners use for planning. Lastly, we highlight the biological relevance of SRL, showing that it replicates the “action chunking” behavior observed in the basal ganglia, offering insights into brain-inspired control mechanisms.

4412MQuAKE-Remastered: Multi-Hop Knowledge Editing Can Only Be Advanced with Reliable Evaluations

[openreview] [pdf]

Abstract Large language models (LLMs) can give out erroneous answers to factually rooted questions either as a result of undesired training outcomes or simply because the world has moved on after a certain knowledge cutoff date. Under such scenarios, knowledge editing often comes to the rescue by delivering efficient patches for such erroneous answers without significantly altering the rests, where many editing methods have seen reasonable success when the editing targets are simple and direct (e.g., “what club does Lionel Messi currently play for?”).However, knowledge fragments like this are often deeply intertwined in the real world, making effectively propagating the editing effect to non-directly related questions a practical challenge (e.g., “who is the offspring of the owner of the club that Messi currently plays for?”). Prior arts have coined this task as multi-hop knowledge editing with the most popular dataset being MQuAKE, serving as the sole evaluation benchmark for many later proposed editing methods due to the expensive nature of making knowledge editing datasets at scale.In this work, we reveal thatup to 33% or 76% of MQuAKE’s questions and ground truth labels are, in fact, corrupted in various fashions due to some unintentional clerical or procedural oversights.Our work provides a detailed audit of MQuAKE’s error pattern and a comprehensive fix without sacrificing its dataset capacity. Additionally, we benchmarked almost all proposed \mquake{}-evaluated editing methods on our post-fix dataset, \mquaker{}. It is our observation that many methods try to overfit the original \mquake{} by exploiting some data-specific properties of \mquake{}. We provide a guideline on how to faithfully approach such datasets and show that a simple, minimally invasive approach can bring excellent editing performance without such exploitation. Please refer to the supplemental material for assets.

4413Pre-Memorization Train Accuracy Reliably Predicts Generalization in LLM Reasoning

[openreview] [pdf]

Abstract When large language models (LLMs) are finetuned on reasoning tasks, they can either reduce their training loss by developing problem-solving abilities, or by simply memorizing target traces in the training data. Our work aims to better understand how this learning process shapes a model’s ability to generalize. We observe that, while LLMs often perfectly memorize most target solution traces by the end of training, their predictions at intermediate checkpoints can provide valuable insights into their behavior at test time. Concretely, we introduce the concept of pre-memorization train accuracy: the accuracy of model samples for training queries prior to exactly reproducing reasoning traces in the training data. We find that the average pre-memorization train accuracy of the model is strongly predictive of its test performance, with coefficients of determination around or exceeding 0.9 across various models (Llama3-8B, Gemma2-9B), datasets (GSM8k, MATH), and training setups. Beyond this aggregate statistic, we find that the pre-memorization train accuracy of individual examples can predict the model’s sensitivity to input perturbations for those examples, allowing us to identify examples for which the model fails to learn robust solutions. A natural application of this insight is in data curation. We find that prioritizing the collection of examples with low pre-memorization accuracy leads to 1.5-2x data efficiency compared to i.i.d. data scaling, and outperforms other standard data curation techniques.

4414AD-H: Autonomous Driving with Hierarchical Agents

[openreview] [pdf]

Abstract Due to the impressive capabilities of multimodal large language models (MLLMs), recent works have focused on employing MLLM-based agents for autonomous driving in large-scale and dynamic environments. However, prevalent approaches often directly use MLLMs to translate high-level instructions into low-level vehicle control signals. This approach deviates from the inherent language generation paradigm of MLLMs and fails to fully harness their emergent capabilities. As a result, the generalizability of these methods is limited by the autonomous driving datasets used during fine-tuning. To tackle this challenge, we propose AD-H, a hierarchical framework that enables two agents (the MLLM planner and the controller) to collaborate. The MLLM planner perceives environmental information and high-level instructions to generate mid-level, fine-grained driving commands, which the controller then executes as actions. This compositional paradigm liberates the MLLM from low-level control signal decoding, thus fully leveraging its high-level perception, reasoning, and planning capabilities. Furthermore, the fine-grained commands provided by the MLLM planner enable the controller to perform actions more effectively. To train AD-H, we build a new autonomous driving dataset with hierarchical action annotations encompassing multiple levels of instructions and driving commands. Comprehensive closed-loop evaluations demonstrate several key advantages of our proposed AD-H system. First, AD-H can notably outperform state-of-the-art methods in achieving exceptional driving performance, even exhibiting self-correction capabilities during vehicle operation, a scenario not encountered in the training dataset. Second, AD-H demonstrates superior generalization under long-horizon instructions and novel environmental conditions, significantly surpassing current state-of-the-art methods.

4415Retrieval Instead of Fine-tuning: A Retrieval-based Parameter Ensemble for Zero-shot Learning

[openreview] [pdf]

Abstract Foundation models have become a cornerstone in deep learning, with techniques like Low-Rank Adaptation (LoRA) offering efficient fine-tuning of large models. Similarly, methods such as Retrieval-Augmented Generation (RAG), which leverage vectorized databases, have further improved model performance by grounding outputs in external information. While these approaches have demonstrated notable success, they often require extensive training or labeled data, which can limit their adaptability in resource-constrained environments. To address these challenges, we introduce Retrieval-based Parameter Ensemble (RPE), a new method that creates a vectorized database of LoRAs, enabling efficient retrieval and application of model adaptations to new tasks. RPE minimizes the need for extensive training and eliminates the requirement for labeled data, making it particularly effective for zero-shot learning. Additionally, RPE is well-suited for privacy-sensitive domains like healthcare, as it modifies model parameters without accessing raw data. When applied to tasks such as medical report generation and image segmentation, RPE not only proved effective but also surpassed supervised fine-tuning methods in certain cases, highlighting its potential to enhance both computational efficiency and privacy in deep learning applications.

4416Getting Free Bits Back from Rotational Symmetries in LLMs

[openreview] [pdf]

Abstract Current methods for compressing neural network weights, such as decomposition, pruning, quantization, and channel simulation, often overlook the inherent symmetries within these networks and thus waste bits on encoding redundant information. In this paper, we propose a format based on bits-back coding for storing rotationally symmetric Transformer weights more efficiently than the usual array layout at the same floating-point precision. We evaluate our method on Large Language Models (LLMs) pruned by SliceGPT (Ashkboos et al., 2024) and achieve a 3-5% reduction in total bit usage for free across different model sizes and architectures without impacting model performance within a certain numerical precision.

4417Out-Of-Context and Out-Of-Scope: Subliminal Priming for Large Language Models

[openreview] [pdf]

Abstract Subliminal priming in humans describes the influencing of behaviour via stimuli they are unaware of. In this work, we mimic human subliminal priming studies for large language models (LLMs) by inserting a seemingly negligible number of ex-template descriptions of a fictitious character’s behaviour into a large corpus of longer but unrelated in-template instructions. After fine-tuning models on the combined data, we elicit demonstrations of the behaviour using suitable trigger prompts. While there is no concept of an LLM being unaware of the stimuli, we show that prompting strategies motivated by projective psychology and psychoanalytic theory can succeed where naive questions fail, even with potent chain-of-thought (COT) initiators. This work extends research on out-of-context reasoning (OOCR), where LLMs show a form of situational awareness and “read between the lines” or “think outside of the box” by performing reasoning hops on internalised knowledge. Our theoretical justification for why this subliminal priming analogue works for LLMs comes from the observation that optimising models with the standard per-token cross-entropy loss is equivalent to training models on a weighted context classification task, where shorter contexts have a higher weight. Our experiments show that manipulating the training data by adding a small number of short descriptions and using soft out-of-vocabulary (OOV) tokens as context anchors can allow and improve the embedding and triggering of specific behaviour, hinting at the possibility of undetected alignment hazards in current LLMs.

4418Unlearn and Burn: Adversarial Machine Unlearning Requests Destroy Model Accuracy

[openreview] [pdf]

Abstract Machine unlearning algorithms, designed for selective removal of training data from models, have emerged as a promising approach to growing privacy concerns. In this work, we expose a critical yet underexplored vulnerability in the deployment of unlearning systems: the assumption that the data requested for removal is always part of the original training set. We present a threat model where an attacker can degrade model accuracy by submitting adversarial unlearning requests for data not present in the training set. We propose white-box and black-box attack algorithms and evaluate them through a case study on image classification tasks using the CIFAR-10 and ImageNet datasets, targeting a family of widely used unlearning methods. Our results show extremely poor test accuracy following the attack—3.6% on CIFAR-10 and 0.4% on ImageNet for white-box attacks, and 8.5% on CIFAR-10 and 1.3% on ImageNet for black-box attacks. Additionally, we evaluate various verification mechanisms to detect the legitimacy of unlearning requests and reveal the challenges in verification, as most of the mechanisms fail to detect stealthy attacks without severely impairing their ability to process valid requests. These findings underscore the urgent need for research on more robust request verification methods and unlearning protocols, should the deployment of machine unlearning systems become more relevant in the future.

4419Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) is necessary for specific downstream tasks, but classic first-order optimizer entails prohibitive GPU memory because of the back propagation. Recent works such as MeZO have turned to zeroth-order optimizers for fine-tuning, which reduce substantial memory by using two forward passes. However, heterogeneous curvatures across different parameter dimensions in LLMs often cause model convergence instability or even failure. In this work, we propose HiZOO, a diagonal Hessian informed Zeroth-Order Optimizer , which is the first work to leverage the diagonal Hessian to enhance ZOO for fine-tuning LLMs. We provide theoretical proof for HiZOO and visualize the optimization trajectories on test functions to illustrate how it improves convergence in handling heterogeneous curvatures. Extensive experiments on various models (RoBERTa, OPT, Phi-2 and LLama3, with 350M\sim66B parameters) indicate that HiZOO significantly reduces training steps and enhances model accuracy, while keeping the memory advantage of ZOO. For example, on SST2 task HiZOO achieves 8×8\times speedup and better accuracy over MeZO across different models. We also propose HiZOO-L, which reduces the Hessian memory cost to 10% of the MeZO, while maintaining almost same performance. Compared with ZO-Adam, HiZOO-L achieves a 4.3% improvement, just using 50% of the GPU memory. Code is available athttps://anonymous.4open.science/r/HiZOO-27F8.

4420Diffusion Feedback Helps CLIP See Better

[openreview] [pdf]

Abstract Contrastive Language-Image Pre-training (CLIP), which excels at abstracting open-world representations across domains and modalities, has become a foundation for a variety of vision and multimodal tasks. However, recent studies reveal that CLIP has severe visual shortcomings, such as which can hardly distinguish orientation, quantity, color, structure, etc. These visual shortcomings also limit the perception capabilities of multimodal large language models (MLLMs) built on CLIP. The main reason could be that the image-text pairs used to train CLIP are inherently biased, due to the lack of the distinctiveness of the text and the diversity of images. In this work, we present a simple post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process. We introduce DIVA, which uses the DIffusion model as a Visual Assistant for CLIP. Specifically, DIVA leverages generative feedback from text-to-image diffusion models to optimize CLIP representations, with only images (without corresponding text). We demonstrate that DIVA improves CLIP’s performance on the challenging MMVP-VLM benchmark which assesses fine-grained visual abilities to a large extent (e.g., 3-7%), and enhances the performance of MLLMs and vision models on multimodal understanding and segmentation tasks. Extensive evaluation on 29 image classification and retrieval benchmarks confirms that our framework preserves CLIP’s strong zero-shot capabilities. The code will be publicly available soon.

4421Aligning Large Language Models with Domain Adaptation

[openreview] [pdf]

Abstract Aligning large language models (LLMs) has emerged as a critical challenge in the age of generative AI: LLMs must be appropriately aligned with human values and preferences in order to be helpful and harmless. In many real world cases, however, large amounts of preference data are not available on important tasks, limiting the effectiveness of resulting reward models. In some cases, data from a similar task is available, and unlabeled data on the target task is available or can be generated by an LLM. In other cases, clean data may be available to train an LLM for real-world use on noisy data, small amounts of labeled data on the target task may be available, or data may be available on an easier task. In this work, we demonstrate that domain adaptation can effectively use different types of data, by transferring supervision and human values across tasks with similar data distributions, strengthening resistance to noisy data, improving few-shot generalization ability, and even transfer from easy to hard tasks, in the form of short to long generalization. Specifically, we propose Data Efficient Alignment for Language (DEAL), using domain adaptation to effectively perform cross-task alignment in scenarios where labeled target data is not available. We evaluate our method for reward model training on a variety of benchmarks and demonstrate that our method can meaningfully improve performance on target tasks by utilizing data on related tasks or low amounts of data. Furthermore, we offer analysis on the inner mechanism of domain adaptation and the alignment of embedding distributions.

4422Semi-Supervised Neural Network Model For Quadratic Multiparametric Programming

[openreview] [pdf]

Abstract Neural Networks (NN) with ReLU activation functions have been used as surrogate models for multiparametric quadratic problems (mp-QP) for a wide range of engineering applications. Researchers have suggested leveraging the piecewise affine property of deep NN models to solve mp-QP with linear constraints, which also exhibit piecewise affine behaviour. However, traditional deep NN applications to mp-QP fall short of providing optimal and feasible predictions, even when trained with large datasets. This study introduces a semi-supervised NN (SSNN) architecture that directly represents the mathematical structure of the global solution function. In contrast to generic NN training approaches, the proposed SSNN method derives a large proportion of model weights directly from the physical characteristics of the system, producing solutions with higher accuracy despite training on significantly smaller data sets. Since many energy management problems are formulated as QP, the proposed approach has been applied in energy systems to demonstrate proof of concept. Model performance in terms of solution accuracy and speed of the predictions was compared against a commercial solver and a generic NN model based on classical training. Results show KKT sufficient conditions for SSNN consistently outperform generic NN architectures with classical training using far less data. A similar performance advantage is shown using extreme, out-of-training distribution test data. Given its advantages of speed and reliability, the SSNN model can quickly produce optimal and feasible solutions within a second for millions of input parameters sampled from a distribution of stochastic demands and renewable generator dispatches, which can be used for simulations and long term planning.

4423Long-Context Linear System Identification

[openreview] [pdf]

Abstract This paper addresses the problem of long-context linear system identification, where the state xtx_t of the system at time tt depends linearly on previous states xsx_s over a fixed context window of length pp. We establish a sample complexity bound that matches thei.i.d.parametric rate, up to logarithmic factors for a broad class of systems, extending previous work that considered only first-order dependencies. Our findings reveal a ``learning-without-mixing’’ phenomenon, indicating that learning long-context linear autoregressive models is not hindered by slow mixing properties potentially associated with extended context windows. Additionally, we extend these results to(i)shared low-rank feature representations, where rank-regularized estimators improve rates with respect to dimensionality, and(ii)misspecified context lengths in strictly stable systems, where shorter contexts offer statistical advantages.

4424Delving into Temperature Scaling for Adaptive Conformal Prediction

[openreview] [pdf]

Abstract Conformal prediction, as an emerging uncertainty qualification technique, constructs prediction sets that are guaranteed to contain the true label with pre-defined probability. Previous works often employ temperature scaling to calibrate the classifier, assuming that confidence calibration can benefit conformal prediction. In this work, we empirically show that current confidence calibration methods (e.g., temperature scaling) normally lead to larger prediction sets in adaptive conformal prediction. Theoretically, we prove that a prediction with higher confidence could result in a smaller prediction set on expectation. Inspired by the analysis, we propose \textbf{Conformal Temperature Scaling} (ConfTS), a variant of temperature scaling that aims to improve the efficiency of adaptive conformal prediction. Specifically, ConfTS optimizes the temperature value by minimizing the gap between the threshold and the non-conformity score of the ground truth for a held-out validation dataset. In this way, the temperature value obtained would lead to an optimal set with high efficiency without violating the coverage. Experiments demonstrate that our method can effectively enhance adaptive conformal prediction methods in both efficiency and conditional coverage, reducing the average size of APS and RAPS by approximately 50% on ImageNet with error rate α=0.1\alpha=0.1.

4425Looking Inward: Language Models Can Learn About Themselves by Introspection

[openreview] [pdf]

Abstract Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g. thoughts and feelings) that are not accessible to external observers. Can LLMs introspect? If they can, this would show that LLMs can acquire knowledge not contained in or inferable from training data. We investigate a form of introspection in which LLMs predict properties of their own behavior in hypothetical situations. If a model M1 can introspect, it should outperform a different model M2 in predicting M1’s behavior---even if M2 is trained on M1’s ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models, we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Further experiments and ablations provide additional evidence. Our results show that LLMs can offer reliable self-information independent of external data in certain domains. By demonstrating this, we pave the way for further work on introspection in more practical domains, which would have significant implications for model transparency and explainability.

4426Solving Normalized Cut Problem with Constrained Action Space

[openreview] [pdf]

Abstract We address the problem of Normalized Cut (NC) in weighted graphs where the shape of the partitions follow an apriori pattern, namely they must approximately be shaped like rings and wedges on a planar graph. Classical methods like spectral clustering and METIS do not have a provision to specify such constraints and neither do newer methods that combine GNNs and Reinforcement Learning as they are based on initialization from classical methods. The key insight that underpins our approach, Wedge and Ring Transformers (WRT), is based on representing a graph using polar coordinates and then using a multi-head transformer with a PPO objective to optimize the non-differential NC objective. To the best of our knowledge, WRT is the first method to explicitly constrain the shape of NC and opens up possibility of providing a principled approach for fine-grained shape-controlled generation of graph partitions. On the theoretical front we provide new Cheeger inequalities that connect the spectral properties of a graph with algebraic properties that capture the shape of the partitions. Comparisons with adaptations of strong baselines attest to the strength of WRT.

4427Measuring similarity between embedding spaces using induced neighborhood graphs

[openreview] [pdf]

Abstract Deep Learning techniques have excelled at generating embedding spaces that capture semantic similarities between items. Often these representations are paired, enabling experiments with analogies (pairs within the same domain) and cross-modality (pairs across domains). These experiments are based on specific assumptions about the geometry of embedding spaces, which allow finding paired items by extrapolating the positional relationships between embedding pairs in the training dataset, allowing for tasks such as finding new analogies, and multimodal zero-shot classification. In this work, we propose a metric to evaluate the similarity between paired item representations. Our proposal is built from the structural similarity between the nearest-neighbors induced graphs of each representation, and can be configured to compare spaces based on different distance metrics and on different neighborhood sizes. We demonstrate that our proposal can be used to identify similar structures at different scales, which is hard to achieve with kernel methods such as Centered Kernel Alignment (CKA). We further illustrate our method with two case studies: an analogy task using GloVe embeddings, and zero-shot classification in the CIFAR-100 dataset using CLIP embeddings. Our results show that accuracy in both analogy and zero-shot classification tasks correlates with the embedding similarity. These findings can help explain performance differences in these tasks, and may lead to improved design of paired-embedding models in the future.

4428Shallow Diffuse: Robust and Invisible Watermarking through Low-Dimensional Subspaces in Diffusion Models

[openreview] [pdf]

Abstract The widespread use of AI-generated content from diffusion models has raised significant concerns regarding misinformation and copyright infringement. Watermarking is a crucial technique for identifying these AI-generated images and preventing their misuse. In this paper, we introduceShallow Diffuse, a new watermarking technique that embeds robust and invisible watermarks into diffusion model outputs. Unlike existing approaches that integrate watermarking throughout the entire diffusion sampling process,Shallow Diffusedecouples these steps by leveraging the presence of a low-dimensional subspace in the image generation process. This method ensures that a substantial portion of the watermark lies in the null space of this subspace, effectively separating it from the image generation process. Our theoretical and empirical analyses show that this decoupling strategy greatly enhances the consistency of data generation and the detectability of the watermark. Extensive experiments further validate that ourShallow Diffuseoutperforms existing watermarking methods in terms of robustness and consistency.

4429Can Mamba Always Enjoy the “Free Lunch”?

[openreview] [pdf]

Abstract Transformers have been the cornerstone of current Large Language Models (LLMs); however, its linear growth in overhead during inference with respect to sequence length poses challenges for modeling long sequences. In this context, Mamba has gradually attracted attention due to its constant-level size during inference and existing empirical results have shown that it can perform comparably to Transformers in sequence modeling while offering significant savings. However, one may ask that, can Mamba always enjoy the ``free lunch"? In this paper, we focus on analyzing the expressive ability of Mamba from a theoretical standpoint. First, inspired by the connection between Mamba and linear attention, we investigate potential shortcomings of the Mamba when performing the COPY operation. Our results indicate that Mamba with constant size may encounter bottlenecks when handling COPY, while it can achieve perfect performance when the size scales linearly with sequence length. Based on this observation, we analyze Mamba’s ability to tackle DP problems when equipped with Chain of Thought (CoT). Our findings suggest that to solve arbitrary DP problems, the total cost of Mamba is comparable to standard and efficient Transformers. However, similar to efficient Transformers, when facing DP problems with favorable properties such as locality, Mamba can provide savings in overhead. Our results contribute to a deeper understanding of Mamba.

4430Learning Harmonized Representations for Speculative Sampling

[openreview] [pdf]

Abstract Speculative sampling is a promising approach to accelerate the decoding stage for Large Language Models (LLMs). Recent advancements that leverage target LLM’s contextual information, such as hidden states and KV cache, have shown significant practical improvements. However, these approaches suffer from inconsistent context between training and decoding. We also observe another discrepancy between the training and decoding objectives in existing speculative sampling methods. In this work, we propose a solution named HArmonized Speculative Sampling (HASS) that learns harmonized representations to address these issues. HASS accelerates the decoding stage without adding inference overhead through harmonized objective distillation and harmonized context alignment. Experiments on four LLaMA models demonstrate that HASS achieves 2.81x-4.05x wall-clock time speedup ratio averaging across three datasets, surpassing EAGLE-2 by 8%-20%. The code is available athttps://github.com/HArmonizedSS/HASS.

4431Stream-level flow matching from a Bayesian decision theoretic perspective

[openreview] [pdf]

Abstract Flow matching (FM) is a family of training algorithms for fitting continuous normalizing flows (CNFs). A standard approach to FM, called conditional flow matching (CFM), exploits the fact that the marginal vector field of a CNF can be learned by fitting least-square regression to the so-called conditional vector field specified given one or both ends of the flow path. We show that viewing CFM training from a Bayesian decision theoretic perspective on parameter estimation opens the door to generalizations of CFM algorithms. We propose one such extension by introducing a CFM algorithm based on defining conditional probability paths given what we refer to as "streams’', instances of latent stochastic paths that connect pairs of noise and observed data. Further, we advocates the modeling of these latent streams using Gaussian processes (GPs). The unique distributional properties of GPs, and in particular the fact that the velocities of a GP is still a GP, allows drawing samples from the resulting stream-augmented conditional probability path without simulating the actual streams, and hence the “simulation-free” nature of CFM training is preserved. We show that this generalization of the CFM can substantially reduce the variance in the estimated marginal vector field at a moderate computational cost, thereby improving the quality of the generated samples under common metrics. Additionally, we show that adopting the GP on the streams allows for flexibly linking multiple related training data points (e.g., time series) and incorporating additional prior information. We empirically validate our claim through both simulations and applications to two hand-written image datasets.

4432AdaptiveQ-Network: On-the-fly Target Selection for Deep Reinforcement Learning

[openreview] [pdf]

Abstract Deep Reinforcement Learning (RL) is well known for being highly sensitive to hyperparameters, requiring practitioners substantial efforts to optimize them for the problem at hand. This also limits the applicability of RL in real-world scenarios. In recent years, the field of automated Reinforcement Learning (AutoRL) has grown in popularity by trying to address this issue. However, these approaches typically hinge on additional samples to select well-performing hyperparameters, hindering sample-efficiency and practicality. Furthermore, most AutoRL methods are heavily based on already existing AutoML methods, which were originally developed neglecting the additional challenges inherent to RL due to its non-stationarities. In this work, we propose a new approach for AutoRL, called Adaptive QQ-Network (AdaQN), that is tailored to RL to take into account the non-stationarity of the optimization procedure without requiring additional samples. AdaQN learns several QQ-functions, each one trained with different hyperparameters, which are updated online using the QQ-function with the smallest approximation error as a shared target. Our selection scheme simultaneously handles different hyperparameters while coping with the non-stationarity induced by the RL optimization procedure and being orthogonal to any critic-based RL algorithm. We demonstrate that AdaQN is theoretically sound and empirically validate it in MuJoCo control problems and Atari 2600 games, showing benefits in sample-efficiency, overall performance, robustness to stochasticity and training stability.

4433tBen: Benchmarking and Testing the Rule-Based Temporal Logic Reasoning Ability of Large Language Models with DatalogMTL

[openreview] [pdf]

Abstract Large language models (LLMs) are increasingly adopted for a variety of tasks, including multi-hop question answering, knowledge probing, and symbolic commonsense reasoning. While LLMs have advanced the state-of-the-art in these areas, their ability to explicitly solve rule-based temporal logic reasoning problems—a complex cognitive process involving the understanding, representation, and manipulation of temporal information such as events, their durations, and relationships—remains unexplored. To enhance understanding of LLM performance in this common task widely explored in the traditional symbolic AI field, we have developed a new set of synthetic benchmarks for rule-based temporal logic reasoning tBen. Our tBen benchmarks are built within the context of DatalogMTL, a powerful knowledge representation language for reasoning about the properties of systems that evolve over time, in which we provide flexible configurations for customizing temporal rules and task complexity.We evaluated the close-sourced GPT-4o and the open-sourced Llama-3 using three common prompting settings—zero-shot\textit{zero-shot}, few-shot\textit{few-shot}, and zero-shot-CoT\textit{zero-shot-CoT}—on our synthetic benchmarks. Our key findings are as follows: (i) Without generating the reasoning process (chain-of-thought), even advanced LLMs like GPT-4o exhibited nearly random performance on these rule-based temporal logic reasoning tasks. However, with chain-of-thought prompting, LLMs demonstrated preliminary temporal logical reasoning abilities; (ii) Both GPT-4o and Llama-3 were unable to solve temporal logical reasoning problems involving recursion, indicating a lack of advanced complex reasoning capabilities in understanding symbolic representations involving time; (iii) There is significant room for improvement in leveraging large language models to address problems widely explored in the traditional logic-based AI domain. Prompts and datasets are available in the appendix, and a datasheet for tBen is also provided.

4434QuantBench: Benchmarking AI Modeling for Quantitative Investment

[openreview] [pdf]

Abstract The field of artificial intelligence (AI) in quantitative investment has seen significant advancements, yet it lacks a standardized benchmark aligned with industry practices. This gap hinders research progress and limits the practical application of academic innovations. We present QuantBench, an industrial-grade benchmark platform designed to address this critical need. QuantBench offers three key strengths: (1) standardization that aligns with quantitative investment industry practices, (2) flexibility to integrate various AI algorithms, and (3) full-pipeline coverage of the entire quantitative investment process. Our empirical studies using QuantBench reveal some critical research directions, including the need for continual learning to address distribution shifts, improved methods for modeling relational financial data, and more robust approaches to mitigate overfitting in low signal-to-noise environments. By providing a common ground for evaluation and fostering collaboration between researchers and practitioners, QuantBench aims to accelerate progress in AI for quantitative investment, similar to the impact of benchmark platforms in computer vision and natural language processing.

4435How Much Can RAG Help the Reasoning of LLM?

[openreview] [pdf]

Abstract Retrieval-Augmented Generation (RAG) has gained significant popularity in modern Large Language Models (LLMs) due to its effectiveness in introducing new knowledge and reducing hallucinations. However, the deep understanding of RAG remains limited, how does RAG help the reasoning process and can RAG help improve the reasoning capability remains question. While external documents are typically considered as a method to incorporate domain-specific information, they also contain intermediate reasoning results related to the query, this suggests that documents could enhance the reasoning capability of LLMs, which has not been previously explored. In this paper, we investigate this issue in depth and find that while RAG can assist with reasoning, the help is limited. If we conceptualize the reasoning process as a tree with fixed depth, then RAG struggles to assist LLMs in performing deeper reasoning. Additionally, the information in the documents requires preprocessing to filter out noise. We demonstrate that this preprocessing is difficult to achieve simply fine-tuning of the LLM, it often necessitates numerous additional transformer layers to solve the problem. To simplify the problem, we propose DPrompt tuning, which effectively resolves the issue within just limited transformer layers, leading to improved performance.

4436Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

[openreview] [pdf]

Abstract Residual connections and normalization layers have become standard design choices for graph neural networks (GNNs), and were proposed as solutions to the mitigate the oversmoothing problem in GNNs. However, how exactly these methods help alleviate the oversmoothing problem from a theoretical perspective is not well understood. In this work, we provide a formal and precise characterization of (linearized) GNNs with residual connections and normalization layers. We establish that (a) for residual connections, the incorporation of the initial features at each layer can prevent the signal from becoming too smooth, and determines the subspace of possible node representations; (b) batch normalization prevents a complete collapse of the output embedding space to a one-dimensional subspace through the individual rescaling of each column of the feature matrix. This results in the convergence of node representations to the top-k eigenspace of the message-passing operator; (c) moreover, we show that the centering step of a normalization layer — which can be understood as a projection — alters the graph signal in message-passing in such a way that relevant information can become harder to extract. Building on the last theoretical insight, we introduce GraphNormv2, a novel and principled normalization layer. GraphNormv2 features a learnable centering step designed to preserve the integrity of the original graph signal. Experimental results corroborate the effectiveness of our method, demonstrating improved performance across various GNN architectures and tasks.

4437Decoupling the Class Label and the Target Concept in Machine Unlearning

[openreview] [pdf]

Abstract Machine unlearning as an emerging research topic for data regulations, aims to adjust a trained model to approximate a retrained one that excludes a portion of training data. Previous studies showed that class-wise unlearning is effective in forgetting the knowledge of target data, either through gradient ascent on the forgetting data or fine-tuning with the remaining data. However, while these methods are useful, they are insufficient as the class label and the target concept are often considered to coincide. In this work, we expand the scope by considering the label domain mismatch and investigate three problems beyond the conventionalall matchedforgetting, e.g.,target mismatch,model mismatch, anddata mismatchforgetting. We systematically analyze the new challenges in restrictively forgetting the target concept and also reveal crucial forgetting dynamics in the representation level to realize these tasks. Based on that, we propose a general framework, namely,TARget-aware Forgetting(TARF). It enables the additional tasks to actively forget the target concept while maintaining the rest part, by simultaneously conducting annealed gradient ascent on the forgetting data and selected gradient descent on the hard-to-affect remaining data. Empirically, various experiments under our newly introduced settings are conducted to demonstrate the effectiveness of our TARF.

4438Curse of Instructions: Large Language Models Cannot Follow Multiple Instructions at Once

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated impressive performance across various natural language processing (NLP) tasks owing to the strong capability of following instructions. To further accelerate the integration of LLMs into our society, it is essential to have LLMs follow many instructions as accurately as humans do. This study reveals that LLMs unexpectedly struggle to follow all instructions simultaneously as the number of instructions increases. First, to validate our claim, we introduce ManyIFEval, a large-scale benchmark dataset comprising task prompts with up to ten objectively verifiable instructions. Second, we conduct experiments based on ManyIFEval with GPT-4o, Claude-3.5, Gemini-1.5, Gemma2, and Llama3.1, demonstrating that as the instruction count rises, the models’ ability to follow individual instruction deteriorates gradually but constantly. As a result, the models’ ability to follow all the instructions significantly drops: the success rate of all the instructions is precisely explained by the success rate of individual instructions to the power of total number of instructions. We refer to it as the ``curse of instructions’'. Third, to remove the curse without retraining models, we propose an inference-time strategy that enhances performance through iterative self-refinement. We demonstrate that instruction-level chain-of-thought reasoning significantly improves their capability to detect and correct instruction-following errors. Notably, our method has improved the success rate of following ten instructions by GPT-4o from 15% to 31% and Claude 3.5 Sonnet from 44% to 58%. We also show that precision is more important than recall in feedback: just telling LLMs that they are not following all the instructions also improves self-refinement success. Our findings highlight a fundamental limitation of instruction-following ability and suggest a future direction for building trustworthy LLMs that can coexist with human society.

4439MOGIC: METADATA-INFUSED ORACLE GUIDANCE FOR IMPROVED EXTREME CLASSIFICATION

[openreview] [pdf]

Abstract While retrieval-augmented classification and generation models significantly benefit from the early-stage fusion of high-quality text-based auxiliary metadata, often called memory, they suffer from high inference latency and poor robustness to noise. In classifications tasks, particularly the extreme classification (XC) setting, where low latency is critical, existing methods incorporate metadata for context enrichment via an XC-based retriever and obtain the encoder representations of the relevant memory items and perform late-stage fusion to achieve low latency. With an aim of achieving higher accuracy within low latency constraints, in this paper, we propose MOGIC, an approach for metadata-infused oracle guidance for XC tasks. In particular, we train an early-fusion oracle classifier with access to both query- and label-side ground-truth metadata in the textual form. The oracle is subsequently used to guide the training of any existing memory-based XC disciple model via regularization. The MOGIC algorithm, when applied to memory-based XC disciple models such as OAK, improves precision@1 by 1-2% and propensity-scored precision@1 by 2-3% on four standard datasets, at no additional inference-time costs to the disciple. We also show the feasibility of applying the MOGIC algorithm to improve the performance of state-of-the-art memory-free XC approaches such as NGAME or DEXA, demonstrating that the MOGIC algorithm can be used atop any existing XC-based approach in a plug-and-play manner. Finally, we also show the robustness of the MOGIC method to missing and noisy metadata settings.

4440Teaching Human Behavior Improves Content Understanding Abilities Of LLMs

[openreview] [pdf]

Abstract Communication is defined as ``Who says what to whom with what effect.‘’ A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Even after carrying signals about the message, the behavior data is often ignored while training large language models. We show that training LLMs on receiver behavior can actually help improve their content-understanding abilities. Specifically, we show that training LLMs to predict the receiver behavior of likes and comments improves the LLM’s performance on a wide variety of downstream content understanding tasks. We show this performance increase over 46 video and image understanding tasks over 26 benchmark datasets across both 0-shot and fine-tuning settings, outperforming many supervised baselines. Moreover, since receiver behavior, such as likes and comments, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially free-lunch. We release the receiver behavior cleaned comments and likes of 750k images and videos collected from multiple platforms along with our instruction-tuning data.

4441In-context Time Series Predictor

[openreview] [pdf]

Abstract Recent Transformer-based large language models (LLMs) demonstrate in-context learning ability to perform various functions based solely on the provided context, without updating model parameters. To fully utilize the in-context capabilities in time series forecasting (TSF) problems, unlike previous Transformer-based or LLM-based time series forecasting methods, we reformulate “time series forecasting tasks” as input tokens by constructing a series of (lookback, future) pairs within the tokens. This method aligns more closely with the inherent in-context mechanisms and is more parameter-efficient without the need of using pre-trained LLM parameters. Furthermore, it addresses issues such as overfitting in existing Transformer-based TSF models, consistently achieving better performance across full-data, few-shot, and zero-shot settings compared to previous architectures.

4442ConML: A Universal Meta-Learning Framework with Task-Level Contrastive Learning

[openreview] [pdf]

Abstract Meta-learning enables learning systems to adapt quickly to new tasks, similar to humans. To emulate this human-like rapid learning and enhance alignment and discrimination abilities, we propose ConML, a universal meta-learning framework that can be applied to various meta-learning algorithms without relying on specific model architectures nor target models. The core of ConML is task-level contrastive learning, which extends contrastive learning from the representation space in unsupervised learning to the model space in meta-learning. By leveraging task identity as an additional supervision signal during meta-training, we contrast the outputs of the meta-learner in the model space, minimizing inner-task distance (between models trained on different subsets of the same task) and maximizing inter-task distance (between models from different tasks). We demonstrate that ConML integrates seamlessly with optimization-based, metric-based, and amortization-based meta-learning algorithms, as well as in-context learning, resulting in performance improvements across diverse few-shot learning tasks.

4443Graph GOSPA Similarity Function for Gaussian Process Regression on Graphs

[openreview] [pdf]

Abstract In this paper, we propose a similarity function between graphs based on a mathematically principled metric for graphs of different sizes: the graph generalised optimal subpattern assignment (GOSPA) metric. The similarity function is based on an optimal assignment between nodes and has an interpretable meaning in terms of similarity for node attribute error, number of unassigned nodes, and number of edge mismatches. The proposed similarity function is computable in polynomial time. We also propose its use in Gaussian processes (GPs) for graphs to predict molecular properties. Experimental results show the benefits of the proposed GP model compared to other GP baselines.

4444Integrating Expertise of Software Engineering Agents

[openreview] [pdf]

Abstract Large language model (LLM) agents have shown great potential in solving real-world software engineering (SWE) problems. The most advanced open-source SWE agent can resolve over 27% of real GitHub issues in SWE-Bench Lite. However, these sophisticated agent frameworks exhibit varying strengths, excelling in certain tasks while underperforming in others. To fully harness the diversity of these agents, we propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. DEI functions as a meta-module atop existing SWE agent frameworks, managing agent collectives for enhanced problem-solving. Experimental results show that a DEI-guided committee of agents is able to surpass the best individual agent’s performance by a large margin. For instance, a group of open-source SWE agents, with a maximum individual resolve rate of 27.3% on SWE-Bench Lite, can achieve a 34.3% resolve rate with DEI, making a 25% improvement and beating most closed-source solutions. Our best-performing group excels with a 55% resolve rate, securing the highest ranking on SWE-Bench Lite. Our findings contribute to the growing body of research on collaborative AI systems and their potential to solve complex software engineering challenges.

4445Improving Adaptive Moment Optimization via Preconditioner Diagonalization

[openreview] [pdf]

Abstract Modern adaptive optimization methods, such as Adam and its variants, have emerged as the most widely used tools in deep learning over recent years. These algorithms offer automatic mechanisms for dynamically adjusting the update step based on estimates of gradient statistics. Compared to traditional algorithms like Stochastic Gradient Descent, these adaptive methods are typically more robust to model scale and hyperparameter tuning. However, the gradient statistics employed by these methods often do not leverage sufficient gradient covariance information, leading to suboptimal updates in certain directions of the parameter space and potentially slower convergence. In this work, we keep track of such covariance statistics in the form of a structured preconditioner matrix. Unlike other works, our approach does not apply direct approximations to estimate this matrix. We instead implement an invertible transformation that maps the preconditioner matrix into a new space where it becomes approximately diagonal. This enables a diagonal approximation of the preconditioner matrix in the transformed space, offering several computational advantages. Empirical results show that our approach can substantially enhance the convergence speed of the modern adaptive optimizers. Notably, for large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam. Additionally, our method can be integrated with memory-efficient optimizers like Adafactor to manage computational overhead.

4446Relax and Merge: A Simple Yet Effective Framework for Solving Fairk-Means andk-sparse Wasserstein Barycenter Problems

[openreview] [pdf]

Abstract The fairness of clustering algorithms has gained widespread attention across various areas, including machine learning, In this paper, we study fair kk-means clustering in Euclidean space. Given a dataset comprising several groups, the fairness constraint requires that each cluster should contain a proportion of points from each group within specified lower and upper bounds. Due to these fairness constraints, determining the optimal locations of kk centers is a quite challenging task. We propose a novel ``Relax and Merge’’ framework that returns a (1+4ρ+O(ϵ))(1+4\rho + O(\epsilon))-approximate solution, where ρ\rho is the approximate ratio of an off-the-shelf vanilla kk-means algorithm and O(ϵ)O(\epsilon) can be an arbitrarily small positive number. If equipped with a PTAS of kk-means, our solution can achieve an approximation ratio of (5+O(ϵ))(5+O(\epsilon)) with only a slight violation of the fairness constraints, which improves the current state-of-the-art approximation guarantee. Furthermore, using our framework, we can also obtain a (1+4ρ+O(ϵ))(1+4\rho +O(\epsilon))-approximate solution for the kk-sparse Wasserstein Barycenter problem, which is a fundamental optimization problem in the field of optimal transport, and a (2+6ρ)(2+6\rho)-approximate solution for the strictly fair kk-means clustering with no violation, both of which are better than the current state-of-the-art methods. In addition, the empirical results demonstrate that our proposed algorithm can significantly outperform baseline approaches in terms of clustering cost.

4447Worldcraft

[openreview] [pdf]

Abstract We present Worldcraft, a hybrid implicit method for generating vast, interactive 3D worlds at unprecedented scale and speed by modeling them as exchangeable sequences of latent 3D objects. In contrast to existing methods that produce limited scenes, Worldcraft’s novel approach constructs expansive environments comprising thousands of elements, extending to over a million objects in seconds, on a single GPU. The resulting created worlds are defined in terms of possessing certain essential properties: Object Individuality, Collective Semantics, and Expandability. To achieve this with both speed and scale, we conceptualize world generation as a set generation problem, introducing three key technical innovations: (i) Hierarchical and Exchangeable Sequence Modeling ensures Object Individuality while capturing Collective Semantics; (ii) Hybrid Implicit Generation Method enables rapid creation of vast worlds, supporting both Scale and Expandability; and (iii) Multi-level Indexing Functions allow efficient manipulation across scales, reinforcing Collective Semantics and enabling on-demand generation for Speed and Expandability. We demonstrate Worldcraft’s capabilities using Minecraft as a test-bed, generating complex, interactive environments that users can explore. However, this approach is applicable to any suitable platform, potentially revolutionizing various applications in 3D environment generation.

4448EmpathyRobot: A Dataset and Benchmark for Empathetic Task Planning of Robotic Agent

[openreview] [pdf]

Abstract Empathy is a fundamental instinct and essential need for humans, as they both demonstrate empathetic actions toward others and receive empathetic support. As robots become increasingly integrated into daily life, it is essential to explore whether they can provide human-like empathetic support. Although existing emotion agents have explored how to understand humans’ empathetic needs, they lack to further enable robots to generate empathy-oriented task planning, neglecting the evaluation of empathetic behaviors. To address this gap, we introduce \textbf{EmpathyRobot}, the first dataset specifically designed to benchmark and enhance the empathetic actions of agents across diverse scenarios. This dataset contains 10,000 samples based on human feedback, encompassing information from various modalities and corresponding empathetic task planning sequences, including navigation and manipulation. Agents are required to perform actions based on their understanding of both the visual scene and human emotions. To systematically evaluate the performance of existing agents on the EmpathyRobot dataset, we conduct comprehensive experiments to test the most capable models. Our findings reveal that generating accurate empathetic actions remains a significant challenge. Meanwhile, we finetune an \ac{llm} on our benchmark, demonstrating that it can effectively be used to enhance the empathetic behavior of robot agents. By establishing a standard benchmark for evaluating empathetic actions, we aim to drive advancements in the study and pursue of empathetic behaviors in robot agents. We will release our code and dataset.

4449Directed Structural Adaptation to Overcome Statistical Conflicts and Enable Continual Learning

[openreview] [pdf]

Abstract Adaptive networks today rely on overparameterized fixed topologies that cannot break through the statistical conflicts they encounter in the data they are exposed to, and are prone to “catastrophic forgetting” as the network attempts to reuse the existing structures to learn new task. We propose a structural adaptation method, DIRAD, that can complexify as needed and in a directed manner without being limited by statistical conflicts within a dataset. We then extend this method and present the PREVAL framework, designed to prevent “catastrophic forgetting” in continual learning by detection of new data and assigning encountered data to suitable models adapted to process them, without needing task labels anywhere in the workflow. We show the reliability of the DIRAD in growing a network with high performance and orders-of-magnitude simpler than fixed topology networks; and demonstrate the proof-of-concept operation of PREVAL, in which continual adaptation to new tasks is observed while being able to detect and discern previously-encountered tasks.

4450Unlocking Exocentric Video-Language Data for Egocentric Video Representation Learning

[openreview] [pdf]

Abstract We present EMBED (Egocentric Models Built with Exocentric Data), a framework designed to mine video-language data from exocentric sources for egocentric video representation learning. Large-scale exocentric data covers diverse activities with significant potential for egocentric learning, but inherent disparities between egocentric and exocentric data pose challenges in utilizing one view for the other seamlessly. In this study, we propose leveraging hand-object interactions and language narratives as cues to incorporate exocentric data into egocentric training. Specifically, we focus on identifying specific video clips that emphasize hand-object interactions and pairing them with action-focused language narrations. By applying our framework to exocentric datasets such as HowTo100M, we construct datasets thar are effective for egocentric video-language pretraining. Our extensive evaluations reveal that EMBED achieves state-of-the-art performance across various egocentric downstream tasks, including a 4.7% absolute improvement in multi-instance retrieval on the Epic-Kitchens-100 benchmark and a 6.2% improvement in classification on the EGTEA benchmark in zero-shot settings. Furthermore, EMBED enables egocentric video-language models to perform competitively in exocentric tasks. Finally, we showcase EMBED’s application across various exocentric datasets, exhibiting strong generalization capabilities when applied to different exocentric datasets.

4451Efficient Distribution Matching of Representations via Noise-Injected Deep InfoMax

[openreview] [pdf]

Abstract Deep InfoMax (DIM) is a well-established method for self-supervised representation learning (SSRL) based on maximization of the mutual information between the input and the output of a deep neural network encoder. Despite the DIM and contrastive SSRL in general being well-explored, the task of learning representations conforming to a specific distribution (i.e., distribution matching, DM) is still under-addressed. Motivated by the importance of DM to several downstream tasks (including generative modeling, disentanglement, outliers detection and other), we enhance DIM to enable automatic matching of learned representations to a selected prior distribution. To achieve this, we propose injecting an independent noise into the normalized outputs of the encoder, while keeping the same InfoMax training objective. We show that such modification allows for learning uniformly and normally distributed representations, as well as representations of other absolutely continuous distributions. Our approach is tested on various downstream tasks. The results indicate a moderate trade-off between the performance on the downstream tasks and quality of DM.

4452Since Faithfulness Fails: The Performance Limits of Neural Causal Discovery

[openreview] [pdf]

Abstract Neural causal discovery methods have recently improved in terms of scalability and computational efficiency. However, there are still opportunities for improving their accuracy in uncovering causal structures. We argue that the key obstacle in unlocking this potential is the faithfulness assumption, commonly used by contemporary neural approaches. We show that this assumption, which is often not satisfied in real-world or synthetic datasets, limits the effectiveness of existing methods. We evaluate the impact of faithfulness violations both qualitatively and quantitatively and provide a unified evaluation framework to facilitate further research.

4453When Graph Neural Networks Meet Dynamic Mode Decomposition

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have emerged as fundamental tools for a wide range of prediction tasks on graph-structured data. Recent studies have drawn analogies between GNN feature propagation and diffusion processes, which can be interpreted as dynamical systems. In this paper, we delve deeper into this perspective by connecting the dynamics in GNNs to modern Koopman theory and its numerical method, Dynamic Mode Decomposition (DMD). We illustrate how DMD can estimate a low-rank, finite-dimensional linear operator based on multiple states of the system, effectively approximating potential nonlinear interactions between nodes in the graph. This approach allows us to capture complex dynamics within the graph accurately and efficiently. We theoretically establish a connection between the DMD-estimated operator and the original dynamic operator between system states. Building upon this foundation, we introduce a family of DMD-GNN models that effectively leverage the low-rank eigenfunctions provided by the DMD algorithm. We further discuss the potential of enhancing our approach by incorporating domain-specific constraints such as symmetry into the DMD computation, allowing the corresponding GNN models to respect known physical properties of the underlying system. Our work paves the path for applying advanced dynamical system analysis tools via GNNs. We validate our approach through extensive experiments on various learning tasks, including directed graphs, large-scale graphs, long-range interactions, and spatial-temporal graphs. We also empirically verify that our proposed models can serve as powerful encoders for link prediction tasks. The results demonstrate that our DMD-enhanced GNNs achieve state-of-the-art performance, highlighting the effectiveness of integrating DMD into GNN frameworks.

4454Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models

[openreview] [pdf]

Abstract Hallucinations in large language models are a widespread problem, yet the mechanisms behind whether models will hallucinate are poorly understood, limiting our ability to solve this problem. Using sparse autoencoders as an interpretability tool, we discover that a key part of these mechanisms is entity recognition, where the model detects if an entity is one it can recall facts about. Sparse autoencoders uncover meaningful directions in the representation space, these detect whether the model recognizes an entity, e.g. detecting it doesn’t know about an athlete or a movie. This shows that models can have self-knowledge: internal representations about their own capabilities. These directions are causally relevant: capable of steering the model to refuse to answer questions about known entities, or to hallucinate attributes of unknown entities when it would otherwise refuse. We demonstrate that despite the sparse autoencoders being trained on the base model, these directions have a causal effect on the chat model’s refusal behavior, suggesting that chat finetuning has repurposed this existing mechanism. Furthermore, we provide an initial exploration into the mechanistic role of these directions in the model, finding that they disrupt the attention of downstream heads that typically move entity attributes to the final token.

4455Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy

[openreview] [pdf]

Abstract Strong Differential Privacy (DP) and Optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both properties at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients/bounded data heterogeneity, or they have strong optimization guarantees but do not have DP ones. To address this gap in the literature, we propose and analyze a new method called Clip21-SGDM based on a novel combination of clipping, heavy-ball momentum, and Error Feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGDM has optimal convergence rate and also optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and training of neural networks highlight the superiority of Clip21-SGDM over baselines in terms of the optimization performance for a given DP-budget.

4456ShuffleNorm: A Better Normalization for Semi-supervised Learning

[openreview] [pdf]

Abstract We identify critical challenges with normalisation layers commonly used in fully supervised learning when applied to semi-supervised settings. Specifically, batch normalisation (BN) can experience severe performance degradation when labelled and unlabelled data have mismatched label distributions, due to biased statistical estimation. This results in unstable gradients, hindering the model’s ability to converge effectively. While group/layer normalisation (GN/LN) avoids these issues, it lacks the stochastic regularisation provided by BN, leading to weaker generalisation. Poor generalisation, in turn, produces low-quality pseudo-labels, exacerbating confirmation bias. To address these limitations, we propose novel normalisation techniques termed Shuffle Layer normalisation and Shuffle Group normalisation (SLN/SGN) that introduce controllable randomness into LN/GN without increasing model parameters, thus making semi-supervised learning more robust and effective. Through experiments across diverse datasets, including image, text, and audio modalities, we demonstrate that SLN/SGN significantly enhances the performance of state-of-the-art semi-supervised learning algorithms.

4457Exploring The Loss Landscape Of Regularized Neural Networks Via Convex Duality

[openreview] [pdf]

Abstract We discuss several aspects of the loss landscape of regularized neural networks: the structure of stationary points, connectivity of optimal solutions, path with non-increasing loss to arbitrary global optimum, and the nonuniqueness of optimal solutions, by casting the problem into an equivalent convex problem and considering its dual. Starting from two-layer neural networks with scalar output, we first characterize the solution set of the convex problem using its dual and further characterize all stationary points. With the characterization, we show that the topology of the global optima goes through a phase transition as the width of the network changes, and construct counterexamples where the problem may have a continuum of optimal solutions. Finally, we show that the solution set characterization and connectivity results can be extended to different architectures, including two layer vector-valued neural networks and parallel three-layer neural networks.

4458SpaceGNN: Multi-Space Graph Neural Network for Node Anomaly Detection with Extremely Limited Labels

[openreview] [pdf]

Abstract Node Anomaly Detection (NAD) has gained significant attention in the deep learning community due to its diverse applications in real-world scenarios. Existing NAD methods primarily embed graphs within a single Euclidean space, while overlooking the potential of non-Euclidean spaces. Besides, to address the prevalent issue of limited supervision in real NAD tasks, previous methods tend to leverage synthetic data to collect auxiliary information, which is not an effective solution as shown in our experiments. To overcome these challenges, we introduce a novel SpaceGNN model designed for NAD tasks with extremely limited labels. Specifically, we provide deeper insights into a task-relevant framework by empirically analyzing the benefits of different spaces for node representations, based on which, we design a Learnable Space Projection function that effectively encodes nodes into suitable spaces. Besides, we introduce the concept of weighted homogeneity, which we empirically and theoretically validate as an effective coefficient during information propagation. This concept inspires the design of the Distance Aware Propagation module. Furthermore, we propose the Multiple Space Ensemble module, which extracts comprehensive information for NAD under conditions of extremely limited supervision. Our findings indicate that this module is more beneficial than data augmentation techniques for NAD. Extensive experiments conducted on 9 real datasets confirm the superiority of SpaceGNN, which outperforms the best rival by an average of 8.55% in AUC and 4.31% in F1 scores.

4459Multi-Epoch Learning with Data Augmentation for Deep Click-Through Rate Prediction

[openreview] [pdf]

Abstract This paper investigates the one-epoch overfitting phenomenon in Click-Through Rate (CTR) models, where performance notably declines at the start of the second epoch. Despite extensive research, the efficacy of multi-epoch training over the conventional one-epoch approach remains unclear. As a result, all potential rewards from multi-epoch training can hardly be obtained. We identify the overfitting of the embedding layer instead of the Multi-Layer Perceptron (MLP) layers, as the primary issue. To address this, we introduce a novel Multi-Epoch learning with Data Augmentation (MEDA) framework. We design algorithms for both non-incremental and incremental learning scenarios in the industry. MEDA minimizes overfitting by reducing the dependency of the embedding layer on trained data, and achieves data augmentation through training the MLP with varied embedding spaces. MEDA’s effectiveness is established on our finding that pre-trained MLP layers can adapt to new embedding spaces and enhance model performances. This adaptability highlights the importance of the relative relationships among embeddings over their absolute positions. We conduct extensive experiments on several public and business datasets, and the effectiveness of data augmentation and superiority over conventional single-epoch training are consistently demonstrated for both non-incremental and incremental learning scenarios. To our knowledge, MEDA represents the first universally reliable multi-epoch training strategy tailored for deep CTR prediction models. We provide theoretical analyses of the reason behind the effectiveness of MEDA. Finally, MEDA has exhibited significant benefits in a real-world incremental-learning online advertising system.

4460Gradient-Free Analytical Fisher Information of Diffused Distributions

[openreview] [pdf]

Abstract Diffusion models (DMs) have demonstrated powerful distributional modeling capabilities through matching the first-order score of diffused distributions. Recent advancements have explored incorporating the second-order Fisher information, defined as the negative Hessian of log-density, into various downstream tasks and theoretical analysis of DMs. However, current practices often overlook the inherent structure of diffused distributions, accessing Fisher information via applying auto-differentiation to the learned score network. This approach, while straightforward, leaves theoretical properties unexplored and is time-consuming. In this paper, we derive the analytical formulation of Fisher information (AFI) by applying consecutive differentials to the diffused distributions. As a result, AFI takes a gradient-free form of a weighted sum (or integral) of outer-products of the score and initial data. Based on this formulation, we propose two algorithmic variants of AFI for distinct scenarios. When evaluating the AFI’s trace, we introduce a parameterized network to learn the trace. When AFI is applied as a linear operator, we present a training-free method that simplifies it into several inner-product calculations. Furthermore, we provide theoretical guarantees for both algorithms regarding convergence analysis and approximation error bounds. Additionally, we leverage AFI to establish the first general theorem for the optimal transport property of the diffusion ODE deduced map. Experiments in likelihood evaluation and adjoint optimization demonstrate the superior accuracy and reduced time-cost of the proposed algorithms.

4461BA-LoRA: Bias-Alleviating Low-Rank Adaptation to Mitigate Catastrophic Inheritance in Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable proficiency across various natural language processing (NLP) tasks. However, adapting LLMs to downstream applications requires computationally intensive and memory-demanding fine-tuning procedures. To alleviate these burdens, parameter-efficient fine-tuning (PEFT) techniques have emerged as a promising approach to tailor LLMs with minimal computational overhead. While PEFT methods offer substantial advantages, they do not fully address the pervasive issue of bias propagation from pre-training data. This work introduces Bias-Alleviating Low-Rank Adaptation (BA-LoRA), a novel PEFT method designed to counteract bias inheritance. BA-LoRA incorporates three distinct regularization terms: (1) a consistency regularizer, (2) a diversity regularizer, and (3) a singular value decomposition regularizer. These regularizers aim to enhance the models’ consistency, diversity, and generalization capabilities during fine-tuning. We conduct extensive experiments on natural language understanding (NLU) and natural language generation (NLG) tasks using prominent LLMs such as LLaMA, Mistral, and Gemma. The results demonstrate that BA-LoRA outperforms LoRA and its state-of-the-art variants. Moreover, our method effectively mitigates the adverse effects of pre-training bias, leading to more reliable and robust model outputs.

4462Conjuring Semantic Similarity

[openreview] [pdf]

Abstract The semantic similarity between sample expressions measures the distance between their latent ‘meaning’. Such meanings are themselves typically represented by textual expressions, often insufficient to differentiate concepts at fine granularity. We propose a novel approach whereby the semantic similarity among textual expressions is based {\em not} on other expressions they can be rephrased as, but rather based on the imagery they evoke. While this is not possible with humans, generative models allow us to easily visualize and compare generated images, or their distribution, evoked by a textual prompt. Therefore, we characterize the semantic similarity between two textual expressions simply as the distance between image distributions they induce, or ‘conjure.’ We show that by choosing the Jensen-Shannon divergence between the reverse-time diffusion stochastic differential equations (SDEs) induced by each textual expression, this can be directly computed via Monte-Carlo sampling. Our method contributes a novel perspective on semantic similarity that not only aligns with human-annotated scores, but also opens up new avenues for the evaluation of text-conditioned generative models while offering better interpretability of their learnt representations.

4463Diffusion Models Need Visual Priors for Image Generation

[openreview] [pdf]

Abstract Conventional class-guided diffusion models generally succeed in generating images with correct semantic content, but often struggle with texture details. This limitation stems from the usage of class priors, which only provide coarse and limited conditional information. To address this issue, we propose Diffusion on Diffusion (DoD), an innovative multi-stage generation framework that first extracts visual priors from previously generated samples, then provides rich guidance for the diffusion model leveraging visual priors from the early stages of diffusion sampling. Specifically, we introduce a latent embedding module that employs a compression-reconstruction approach to discard redundant detail information from the conditional samples in each stage, retaining only the semantic information for guidance. We evaluate DoD on the popular ImageNet-256×256256 \times 256 dataset, reducing 7×\times training cost compared to SiT and DiT with even better performance in terms of the FID-50K score. Our largest model DoD-XL achieves an FID-50K score of 1.83 with only 1 million training steps, which surpasses other state-of-the-art methods without bells and whistles during inference.

4464GITAR: GENERALIZED IRREGULAR TIME SERIES REGRESSION VIA MASKING AND RECONSTRUCTION PRETRAINING

[openreview] [pdf]

Abstract Multivariate time series regression, encompassing forecasting and interpolation, is crucial for numerous real-world applications, particularly in healthcare, climate science, ecology, and others. While recent work has focused on improving modeling for time series regression, two main limitations persist. First, the prevalence of irregularly sampled time series with missing values poses significant challenges. For instance, healthcare applications often involve predicting future or missing observations from irregular data to enable continuous patient monitoring and timely intervention. As current approaches mainly rely on the assumptions of regular time series such as strong periodicity, when applied to irregular ones they exhibit performance degradation. Second, while some state-of-the-art methods (SOTA) do model irregularity and perform regression tasks on irregular data, they are often trained in a fully supervised manner. This limits their ability to generalize easily to different domains (e.g., training and testing datasets with different numbers of variables). To address these challenges, we propose GITaR, a Generalized Irregular Time Series Regression model via masking and Reconstruction pertaining mechanism, aiming to capture the inherent irregularity in time series and learn robust, generalizable representations without supervision for downstream regression tasks. Comprehensive experiments on common real-world regression tasks in healthcare, human activity recognition, and climate science underline the superior performance of GITaR compared to state-of-the-art methods. Our results highlight our model’s unique capability to generalize across different domains, demonstrating the potential for broad applicability in various fields requiring accurate temporal prediction and interpolation.

4465Learning to Adapt Frozen CLIP for Few-Shot Test-Time Domain Adaptation

[openreview] [pdf]

Abstract Few-shot Test-Time Domain Adaptation focuses on adapting a model at test time to a specific domain using only a few unlabeled examples, addressing domain shift. Prior methods leverage CLIP’s strong out-of-distribution (OOD) abilities by generating domain-specific prompts to guide its generalized, frozen features. However, since downstream datasets are not explicitly seen by CLIP, solely depending on the feature space knowledge is constrained by CLIP’s prior knowledge. Notably, when using a less robust backbone like ViT-B/16, performance significantly drops on challenging real-world benchmarks. Departing from the state-of-the-art of inheriting the intrinsic OOD capability of CLIP, this work introduces learning directly on the input space to complement the dataset-specific knowledge for frozen CLIP. Specifically, an independent side branch is attached in parallel with CLIP and enforced to learn exclusive knowledge via revert attention. To better capture the dataset-specific label semantics for downstream adaptation, we propose to enhance the inter-dispersion among text features via greedy text ensemble and refinement. The text and visual features are then progressively fused in a domain-aware manner by a generated domain prompt to adapt toward a specific domain. Extensive experiments show our method’s superiority on 5 large-scale benchmarks (WILDS and DomainNet), notably improving over smaller networks like ViT-B/16 with gains of \textbf{+5.1} in F1 for iWildCam and \textbf{+3.1%} in WC Acc for FMoW.

4466FedDFQ : Personalized Federated Learning Based On Data Feature Quantification

[openreview] [pdf]

Abstract Personalized federated learning is widely used for heterogeneous data distributions across clients. However, existing methods are difficult to measure and utilize these heterogeneities accurately. To address this issue, in this paper, we propose a novel and efficient method named FedDFQ which uses a customized Data Identity Extraction Module (DIEM) to dynamically generate metric proxies that quantify data heterogeneity across different local clients in a privacy-friendly manner. The metric proxies are used to assess the contributions of global parameter aggregation and personalized gradient backpropagation for each local client. In addition, we design a plug-and-play Automatic Gradient Accumulation Module (AGAM) that regularizes personalized classification layers with re-balanced gradients. We provide theoretical explanations and experimental results that validate the effectiveness of the proposed FedDFQ. With comprehensive comparisons to existing state-of-the-art approaches, FedDFQ outperforms them on two benchmark datasets in different heterogeneous scenarios.

4467AI2TALE: An Innovative Information Theory-based Approach for Learning to Localize Phishing Attacks

[openreview] [pdf]

Abstract Phishing attacks remain a significant challenge for detection, explanation, and defense, despite over a decade of research on both technical and non-technical solutions. AI-based phishing detection methods are among the most effective approaches for defeating phishing attacks, providing predictions on the vulnerability label (i.e., phishing or benign) of data. However, they often lack intrinsic explainability, failing to identify the specific information that triggers the classification. To this end, we propose an innovative deep learning-based approach for email (the most common phishing way) phishing attack localization. Our method aims to not only predict the vulnerability label of the email data but also provide the capability to automatically learn and figure out the most important and phishing-relevant information (i.e., sentences) in the phishing email data, offering useful and concise explanations for the identified vulnerability.The extensive experiments on seven diverse real-world email datasets demonstrate the capability and effectiveness of our method in selecting crucial information, enabling accurate detection and offering useful and concise explanations (via the most important and phishing-relevant information triggering the classification) for the vulnerability of phishing emails. Notably, our approach outperforms state-of-the-art baselines by 1.5% to 3.5% on average in Label-Accuracy and Cognitive-True-Positive metrics under a weakly supervised setting, where only vulnerability labels are used without requiring ground truth phishing information.

4468SHIKI: Self-Supervised Heuristic for Improving MLPs’ Knowledge by Integrating GNNs

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are widely recognized as leading architectures for addressing classification problems involving graphical data. In this thesis, we formally define the challenge of effectively constructing edges within a dataset and training a GNN over this graph and introduce SHIKI - a novel method to tackle this task. We provide a comprehensive theoretical analysis demonstrating how graph convolutions can improve expected performance by leveraging edges. Our study focuses on the node classification problem within a non-linearly separable Gaussian mixture model, combined with a stochastic block model, and we visually demonstrate its applicability to real-world datasets. Specifically, we show that a single graph convolution in the second layer can reduce the expected loss when applying a heuristic for edge creation. We validate our findings through extensive experiments on both synthetic and real-world datasets, including those related to the entity matching problem and textual review classification. For the synthetic data, we conduct experiments based on the dataset’s difficulty and various hyperparameters in our method, drawing connections between the two. Additionally, we perform an ablation study by systematically removing components of our method and testing the resulting degraded approach, which highlights the necessity of our full method. We employ several GNN architectures in the experiments, including GCN, GraphSAGE, and GAT.

4469Neural Spacetimes for DAG Representation Learning

[openreview] [pdf]

Abstract We propose a class of trainable deep learning-based geometries called Neural SpaceTimes (NSTs), which can universally represent nodes in weighted Directed Acyclic Graphs (DAGs) as events in a spacetime manifold. While most works in the literature focus on undirected graph representation learning or causality embedding separately, our differentiable geometry can encode both graph edge weights in its spatial dimensions and causality in the form of edge directionality in its temporal dimensions. We use a product manifold that combines a quasi-metric (for space) and a partial order (for time). NSTs are implemented as three neural networks trained in an end-to-end manner: an embedding network, which learns to optimize the location of nodes as events in the spacetime manifold, and two other networks that optimize the space and time geometries in parallel, which we call a neural (quasi-)metric and a neural partial order, respectively. The latter two networks leverage recent ideas at the intersection of fractal geometry and deep learning to shape the geometry of the representation space in a data-driven fashion, unlike other works in the literature that use fixed spacetime manifolds such as Minkowski space or De Sitter space to embed DAGs. Our main theoretical guarantee is a universal embedding theorem, showing that any kk-point DAG can be embedded into an NST with 1+O(log(k))1+\mathcal{O}(\log(k)) distortion while exactly preserving its causal structure. The total number of parameters defining the NST is sub-cubic in kk and linear in the width of the DAG. If the DAG has a planar Hasse diagram, this is improved to O(log(k)+2)\mathcal{O}(\log(k) + 2) spatial and 2 temporal dimensions. We validate our framework computationally with synthetic weighted DAGs and real-world network embeddings; in both cases, the NSTs achieve lower embedding distortions than their counterparts using fixed spacetime geometries.

4470GUNet: A Graph Convolutional Network United Diffusion Model for Stable and Diversity Pose Generation

[openreview] [pdf]

Abstract Pose skeleton images are an important reference in pose-controllable image generation. In order to enrich the source of skeleton images, recent works have investigated the generation of pose skeletons based on natural language. These methods are based on GANs. However, it remains challenging to perform diverse, structurally correct and aesthetically pleasing human pose skeleton generation with various textual inputs. To address this problem, we propose a framework with GUNet as the main model, PoseDiffusion. It is the first generative framework based on a diffusion model and also contains a series of variants fine-tuned based on a stable diffusion model. PoseDiffusion demonstrates several desired properties that outperform existing methods. 1) Correct Skeletons. GUNet, a denoising model of PoseDiffusion, is designed to incorporate graphical convolutional neural networks. It is able to learn the spatial relationships of the human skeleton by introducing skeletal information during the training process. 2) Diversity. We decouple the key points of the skeleton and characterise them separately, and use cross-attention to introduce textual conditions. Experimental results show that PoseDiffusion outperforms existing SoTA algorithms in terms of stability and diversity of text-driven pose skeleton generation. Qualitative analyses further demonstrate its superiority for controllable generation in Stable Diffusion.

4471How to Weight Multitask Finetuning? Fast Previews via Model Merging

[openreview] [pdf]

Abstract When finetuning multiple tasks altogether, it is important to carefully weigh them to get a good performance, but searching for good weights can be difficult and costly. Here, we propose to aid the search with fast previews to quickly get a rough idea of different reweighting options. We use model merging to create previews by simply reusing and averaging parameters of models trained on each task separately (no retraining required). To improve the quality of previews, we propose a Bayesian approach to design new merging strategies by using more flexible posteriors. We validate our findings on vision and natural-language transformers. Our work shows the benefits of model merging via Bayes to improve multitask finetuning.

4472Few-shot Text Adversarial Attack for Black-box Multi-task Learning

[openreview] [pdf]

Abstract Current multi-task adversarial text attacks rely on white-box access to shared in- ternal features and assume a homogeneous multi-task learning framework. As a result, these attacks are less effective against practical scenarios involving black- box feedback APIs and heterogeneous multi-task learning. To bridge this gap, we introduce Cluster and Ensemble Mutil-task Text Adversarial Attack (CEMA), an effective black-box attack that exploits the transferability of adversarial texts. Specifically, we initially employ cluster-oriented substitute model training, as a plug-and-play framework, to simplify complex multi-task scenarios into more manageable text classification attacks and train the substitute model. Next, we generate multiple adversarial candidate examples by applying various adversarial text classification methods. Finally, we select the adversarial example that attacks the most substitute models as the final attack output. CEMA is evaluated on two primary multi-task objectives: text classification and translation. In the classifica- tion task, CEMA achieves attack success rates that exceed 60% while reducing the total number of queries to 100. For the text translation task, the BLEU scores of both victim texts and adversarial examples decrease to below 0.36 with 100 queries even including the commercial translation APIs, such as Baidu Translate and Ali Translate. Additionally, we derive the theoretical lower bound for CEMA’s success rate, demonstrating that a successful attack increases with the number of candidate substitute models.

4473MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

[openreview] [pdf]

Abstract We introduce MarDini, a new family of video diffusion models that integrate the advantages of masked auto-regression (MAR) into a unified diffusion model (DM) framework. Here, MAR handles temporal planning, while DM focuses on spatial generation in an asymmetric network design: i) a MAR-based planning model containing most of the parameters generates planning signals for each masked frame using low-resolution input; ii) a lightweight generation model uses these signals to produce high-resolution frames via diffusion de-noising. MarDini’s MAR enables video generation conditioned on any number of masked frames at any frame positions: a single model can handle video interpolation (e.g., masking middle frames), image-to-video generation (e.g., masking from the second frame onward), and video expansion (e.g., masking half the frames). The efficient design allocates most of the computational resources to the low-resolution planning model, making computationally expensive but important spatio-temporal attention feasible at scale. MarDini sets a new state-of-the-art for video interpolation; meanwhile, within few inference steps, it efficiently generates videos on par with those of much more expensive advanced image-to-video models.

4474BingoGuard: LLM Content Moderation Tools with Risk Levels

[openreview] [pdf]

Abstract Malicious content generated by large language models (LLMs) can pose varying degrees of harm. Although existing LLM-based moderators can detect harmful content, they struggle to assess risk levels and may miss lower-risk outputs. Accurate risk assessment allows platforms with different safety thresholds to tailor content filtering and rejection. In this paper, we introduce per-topic severity rubrics for 11 harmful topics and build BingoGuard, an LLM-based moderation system designed to predict both binary safety labels and severity levels. To address the lack of annotations on levels of severity, we propose a scalable generate-then-filter framework that first generates responses across different severity levels and then filters out low-quality responses. Using this framework, we create BingoGuardTrain, a training dataset with 54,897 examples covering a variety of topics, response severity, styles, and BingoGuardTest, a test set with 988 examples explicitly labeled based on our severity rubrics that enables fine-grained analysis on model behaviors on different severity levels. Our BingoGuard-8B, trained on BingoGuardTrain, achieves the state-of-the-art performance on several moderation benchmarks, including WildGuardTest and HarmBench, as well as BingoGuardTest, outperforming best public models, WildGuard, by 4.3%. Our analysis demonstrates that incorporating severity levels into training significantly enhances detection performance and enables the model to effectively gauge the severity of harmful responses. Warning: this paper includes red-teaming examples that may be harmful in nature.

4475Generalization Guarantees for Representation Learning via Data-Dependent Gaussian Mixture Priors

[openreview] [pdf]

Abstract We establish in-expectation and tail bounds on the generalization error of representation learning type algorithms. The bounds are in terms of the relative entropy between the distribution of the representations extracted from the training and "test’’ datasets and a data-dependent symmetric prior, i.e., the Minimum Description Length (MDL) of the latent variables for the training and test datasets. Our bounds are shown to reflect the "structure’’ and "simplicity’’ of the encoder and significantly improve upon the few existing ones for the studied model. We then use our in-expectation bound to devise a suitable data-dependent regularizer; and we investigate thoroughly the important question of the selection of the prior. We propose a systematic approach to simultaneously learning a date-dependent Gaussian mixture prior and using it as a regularizer. Interestingly, we show that a weighted attention mechanism emerges naturally in this procedure. Our experiments show that our approach outperforms the now popular Variational Information Bottleneck (VIB) method as well as the recent Category-Dependent VIB (CDVIB).

4476Differentially Private Steering for Large Language Model Alignment

[openreview] [pdf]

Abstract Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the \textit{\underline{P}rivate \underline{S}teering for LLM \underline{A}lignment (PSA)} algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa and Qwen). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our \textit{PSA} algorithm compared to several existing non-private techniques.

4477Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

[openreview] [pdf]

Abstract Agents powered by large language models have shown remarkable abilities in solving complex tasks. However, most agent systems remain reactive, limiting their effectiveness in scenarios requiring foresight and autonomous decision-making. In this paper, we tackle the challenge of developing proactive agents capable of anticipating and initiating tasks without explicit human instructions. We propose a novel data-driven approach for this problem. Firstly, we collect real-world human activities to generate proactive task predictions. These predictions are then labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment and serves as an automatic evaluator of the proactiveness of LLM agents. Building on this, we develop a comprehensive data generation pipeline to create a diverse dataset, ProactiveBench, containing 6,790 events. Finally, we demonstrate that fine-tuning models with the proposed ProactiveBench can significantly elicit the proactiveness of LLM agents. Experimental results show that our fine-tuned model achieves an F1-Score of 66.47% in proactively offering assistance, outperforming all open-source and close-source models. These results highlight the potential of our method in creating more proactive and effective agent systems, paving the way for future advancements in human-agent collaboration.

4478Mimicking Human Intuition: Cognitive Belief-Driven Q-Learning

[openreview] [pdf]

Abstract Reinforcement learning encounters challenges in various environments related to robustness and explainability. Traditional Q-learning algorithms cannot effectively make decisions and utilize the historical learning experience. To overcome these limitations, we propose Cognitive Belief-Driven Q-Learning (CBDQ), which integrates subjective belief modeling into the Q-learning framework, enhancing decision-making accuracy by endowing agents with human-like learning and reasoning capabilities. Drawing inspiration from cognitive science, our method maintains a subjective belief distribution over the expectation of actions, leveraging a cluster-based subjective belief model that enables agents to reason about the potential probability associated with each decision. CBDQ effectively mitigates overestimated phenomena and optimizes decision-making policies by integrating historical experiences with current contextual information, mimicking the dynamics of human decision-making. We evaluate the proposed method on discrete control benchmark tasks in various complicate environments. The results demonstrate that CBDQ exhibits stronger adaptability, robustness, and human-like characteristics in handling these environments, outperforming other baselines. We hope this work will give researchers a fresh perspective on understanding and explaining Q-learning.

4479Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

[openreview] [pdf]

Abstract Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principle approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS’s effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance and maintaining the fluency in natural language generation tasks.

4480Trivialized Momentum Facilitates Diffusion Generative Modeling on Lie Groups

[openreview] [pdf]

Abstract The generative modeling of data on manifold is an important task, for which diffusion models in flat spaces typically need nontrivial adaptations. This article demonstrates how a technique called `trivialization’ can transfer the effectiveness of diffusion models in Euclidean spaces to Lie groups. In particular, an auxiliary momentum variable was algorithmically introduced to help transport the position variable between data distribution and a fixed, easy-to-sample distribution. Normally, this would incur further difficulty for manifold data because momentum lives in a space that changes with the position. However, our trivialization technique creates a new momentum variable that stays in a simple fixed vector space\textbf{fixed vector space}. This design, together with a manifold preserving integrator, simplifies implementation and avoids inaccuracies created by approximations such as projections to tangent space and manifold, which were typically used in prior work, hence facilitating generation with high-fidelity and efficiency. The resulting method achieves state-of-the-art performance on protein and RNA torsion angle generation and sophisticated torus datasets. We also, arguably for the first time, tackle the generation of data on high-dimensional Special Orthogonal and Unitary groups, the latter essential for quantum problems.

4481RankMatch: A Novel Approach to Semi-Supervised Label Distribution Learning Leveraging Inter-label Correlations

[openreview] [pdf]

Abstract This paper introduces RankMatch, an innovative approach for Semi-Supervised Label Distribution Learning (SSLDL). Addressing the challenge of limited labeled data, RankMatch effectively utilizes a small number of labeled examples in conjunction with a larger quantity of unlabeled data, reducing the need for extensive manual labeling in Deep Neural Network (DNN) applications. Specifically, RankMatch introduces an ensemble learning-inspired averaging strategy that creates a pseudo-label distribution from multiple weakly augmented images. This not only stabilizes predictions but also enhances the model’s robustness. Beyond this, RankMatch integrates a pairwise relevance ranking (PRR) loss, capturing the complex inter-label correlations and ensuring that the predicted label distributions align with the ground truth. We establish a theoretical generalization bound for RankMatch, and through extensive experiments, demonstrate its superiority in performance against existing SSLDL methods.

4482MetaOOD: Automatic Selection of OOD Detection Models

[openreview] [pdf]

Abstract How can we automatically select an out-of-distribution (OOD) detection model for various underlying tasks? This is crucial for maintaining the reliability of open-world applications by identifying data distribution shifts, particularly in critical domains such as online transactions, autonomous driving, and real-time patient diagnosis. Despite the availability of numerous OOD detection methods, the challenge of selecting an optimal model for diverse tasks remains largely underexplored, especially in scenarios lacking ground truth labels. In this work, we introduce MetaOOD, the first zero-shot, unsupervised framework that utilizes meta-learning to automatically select an OOD detection model. As a meta-learning approach, MetaOOD leverages historical performance data of existing methods across various benchmark OOD datasets, enabling the effective selection of a suitable model for new datasets without the need for labeled data at the test time. To quantify task similarities more accurately, we introduce language model-based embeddings that capture the distinctive OOD characteristics of both datasets and detection models. Through extensive experimentation with 24 unique test dataset pairs to choose from among 11 OOD detection models, we demonstrate that MetaOOD significantly outperforms existing methods and only brings marginal time overhead. Our results, validated by Wilcoxon statistical tests, show that MetaOOD surpasses a diverse group of 11 baselines, including established OOD detectors and advanced unsupervised selection methods.

4483I2AM: Interpreting Image-to-Image Latent Diffusion Models via Bi-Attribution Maps

[openreview] [pdf]

Abstract Large-scale diffusion models have made significant advances in image generation, particularly through cross-attention mechanisms. While cross-attention has been well-studied in text-to-image tasks, their interpretability in image-to-image (I2I) diffusion models remains underexplored. This paper introduces Image-to-Image Attribution Maps (I2AM)(\textbf{I}^2\textbf{AM}), a method that enhances the interpretability of I2I models by visualizing bidirectional attribution maps, from the reference image to the generated image and vice versa. I2AM\text{I}^2\text{AM} aggregates cross-attention scores across time steps, attention heads, and layers, offering insights into how critical features are transferred between images. We demonstrate the effectiveness of I2AM\text{I}^2\text{AM} across object detection, inpainting, and super-resolution tasks. Our results demonstrate that I2AM\text{I}^2\text{AM} successfully identifies key regions responsible for generating the output, even in complex scenes. Additionally, we introduce the Inpainting Mask Attention Consistency Score (IMACS) as a novel evaluation metric to assess the alignment between attribution maps and inpainting masks, which correlates strongly with existing performance metrics. Through extensive experiments, we show that I2AM\text{I}^2\text{AM} enables model debugging and refinement, providing practical tools for improving I2I model’s performance and interpretability.

4484Instruction Following without Instruction Tuning

[openreview] [pdf]

Abstract Instruction tuning commonly means finetuning a language model on instruction- response pairs. We discover two forms of adaptation (tuning) that are deficient compared to instruction tuning, yet still yield instruction following; we call this implicit instruction tuning. We first find that instruction-response pairs are not necessary: training solely on responses, without any corresponding instructions, yields instruction following. This suggests pretrained models have an instruction-response mapping which is revealed by teaching the model the desired distribution of re- sponses. However, we then find it’s not necessary to teach the desired distribution of responses: instruction-response training on narrow-domain data like poetry still leads to broad instruction-following behavior like recipe generation. In particular, when instructions are very different from those in the narrow finetuning domain, models’ responses do not adhere to the style of the finetuning domain. To begin to explain implicit instruction tuning, we hypothesize that very simple changes to a language model’s distribution yield instruction following. We support this by hand-writing a rule-based language model which yields instruction following in a product-of-experts with a pretrained model. The rules are to slowly increase the probability of ending the sequence, penalize repetition, and uniformly change 15 words’ probabilities. In summary, adaptations made without being designed to yield instruction following can do so implicitly.

4485Sports-Traj: A Unified Trajectory Generation Model for Multi-Agent Movement in Sports

[openreview] [pdf]

Abstract Understanding multi-agent movement is critical across various fields. The conventional approaches typically focus on separate tasks such as trajectory prediction, imputation, or spatial-temporal recovery. Considering the unique formulation and constraint of each task, most existing methods are tailored for only one, limiting the ability to handle multiple tasks simultaneously, which is a common requirement in real-world scenarios. Another limitation is that widely used public datasets mainly focus on pedestrian movements with casual, loosely connected patterns, where interactions between individuals are not always present, especially at a long distance, making them less representative of more structured environments. To overcome these limitations, we propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs, adaptable to diverse scenarios in the domain of sports games. Specifically, we introduce a Ghost Spatial Masking (GSM) module, embedded within a Transformer encoder, for spatial feature extraction. We further extend recent State Space Models (SSMs), known as the Mamba model, into a Bidirectional Temporal Mamba (BTM) to better capture temporal dependencies. Additionally, we incorporate a Bidirectional Temporal Scaled (BTS) module to thoroughly scan trajectories while preserving temporal missing relationships. Furthermore, we curate and benchmark three practical sports datasets, \textbf{\textit{Basketball-U}}, \textbf{\textit{Football-U}}, and \textbf{\textit{Soccer-U}}, for evaluation. Extensive experiments demonstrate the superior performance of our model. We hope that our work can advance the understanding of human movement in real-world applications, particularly in sports. Our datasets, code, and model weights are available at~\href{https://anonymous.4open.science/r/UniTraj-ICLR25/README.md}{link}.

4486SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding

[openreview] [pdf]

Abstract Despite the significant advancements of Large Vision-Language Models (LVLMs) on established benchmarks, there remains a notable gap in suitable evaluation regarding their applicability in the emerging domain of long-context streaming video understanding. Current benchmarks for video understanding typically emphasize isolated single-instance text inputs and fail to evaluate the capacity to sustain temporal reasoning throughout the entire duration of video streams. To address these limitations, we introduce SVBench, a pioneering benchmark with temporal multi-turn question-answering chains specifically designed to thoroughly assess the capabilities of streaming video understanding of current LVLMs. We design a semi-automated annotation pipeline to obtain 49,979 Question-Answer (QA) pairs of 1,353 streaming videos, which includes generating QA chains that represent a series of consecutive multi-turn dialogues over video segments and constructing temporal linkages between successive QA chains. Our experimental results, obtained from 14 models in dialogue and streaming evaluations, reveal that while the closed-source GPT-4o outperforms others, most open-source LVLMs struggle with long-context streaming video understanding. We also construct a StreamingChat model, which significantly outperforms open-source LVLMs on our SVBench and achieves comparable performance on diverse vision-language benchmarks. We expect SVBench to advance the research of streaming video understanding by providing a comprehensive and in-depth analysis of current LVLMs. Our benchmark and model can be accessed athttps://anonymous.4open.science/r/SVBench-356F.

4487On Memorization of Large Language Models in Logical Reasoning

[openreview] [pdf]

Abstract Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. This contrasting behavior is puzzling when it comes to understanding the mechanisms behind LLMs’ reasoning capabilities. One hypothesis is that the increasingly high and nearly saturated performance on common reasoning benchmarks could be due to the memorization of similar problems. In this paper, we systematically investigate this hypothesis with a quantitative measurement of memorization in reasoning tasks, using a dynamically generated logical reasoning benchmark based on Knights and Knaves (K&K) puzzles. We found that LLMs could interpolate the training puzzles (achieving near-perfect accuracy) after fine-tuning, yet fail when those puzzles are slightly perturbed, suggesting that the models heavily rely on memorization to solve those training puzzles. On the other hand, we show that while fine-tuning leads to heavy memorization, it also consistently improves generalization performance. In-depth analyses with perturbation tests, cross difficulty-level transferability, probing model internals, and fine-tuning with wrong answers suggest that the LLMs learn to reason on K&K puzzles despite training data memorization. This phenomenon indicates that LLMs exhibit a complex interplay between memorization and genuine reasoning abilities. Finally, our analysis with per-sample memorization score sheds light on how LLMs switch between reasoning and memorization in solving logical puzzles.

4488Active Evaluation Acquisition for Efficient LLM Benchmarking

[openreview] [pdf]

Abstract As large language models (LLMs) become increasingly versatile, numerous large scale benchmarks have been developed to thoroughly assess their capabilities. These benchmarks typically consist of diverse datasets and prompts to evaluate different aspects of LLM performance. However, comprehensive evaluations on hundreds or thousands of prompts incur tremendous costs in terms of computation, money, and time. In this work, we investigate strategies to improve evaluation efficiency by selecting a subset of examples from each benchmark using a learned policy. Our approach models the dependencies across test examples, allowing accurate prediction of the evaluation outcomes for the remaining examples based on the outcomes of the selected ones. Consequently, we only need to acquire the actual evaluation outcomes for the selected subset. We rigorously explore various subset selection policies and introduce a novel RL-based policy that leverages the captured dependencies. Empirical results demonstrate that our approach significantly reduces the number of evaluation prompts required while maintaining accurate performance estimates compared to previous methods.

4489RLSF: Reinforcement Learning via Symbolic Feedback

[openreview] [pdf]

Abstract Reinforcement Learning with Human Feedback (RLHF) is considered a standard approach to fine-tuning Large Language Models (LLMs). However, such methods often face limitations such as unsound black-box reward models, difficulties in collecting human preference data, and the reliance on sparse scalar rewards. These methods often fall short when applied to tasks that require complex domain-specific understanding.To address these challenges, we propose a new fine-tuning paradigm we refer to as Reinforcement Learning via Symbolic Feedback (RLSF), which aims to improve domain-specific understanding of LLMs more effectively than traditional reward signals. In the RLSF setting, the LLM being fine-tuned is considered an RL agent, while the environment is allowed access to reasoning or domain knowledge tools (e.g., solvers, provers, algebra systems, or knowledge bases). Crucially, in RLSF, these reasoning tools can provide feedback to the LLMs via poly-sized certificates (e.g., proofs), that characterize errors in the LLM-generated object with respect to some correctness specification. As a bonus, our RLSF approach does not require the reasoning systems we use to be differentiable. The ability of RLSF-based fine-tuning to leverage certificate-generating symbolic tools enables sound fine-grained (token-level) reward signals to LLMs, and thus addresses the limitations of traditional reward models mentioned above.Via extensive evaluations, we show that our RLSF-based fine-tuning of LLMs outperforms traditional approaches on five different applications (that have some associated logical or domain constraints), namely, program synthesis from natural language pseudo-code to programming language (+31.43% in functional correctness for Google’s CodeGemma-2b compared to supervised fine-tuning, +17.01% in functional correctness compared to GPT-3.5 -- 100×\boldsymbol\times larger), three chemistry tasks (+5.5% exact match for molecule generation, +19.4% exact match for forward synthesis, +33.7% exact match for retrosynthesis, using Meta’s Galactica-1.3b, compared to GPT-4 -- 1000×\boldsymbol\times larger), and solving the Game of 24 (+25% success rate using Meta’s Llama2-7b compared to traditional methods, and +7% success rate compared to GPT-3.5 -- 25×\boldsymbol\times larger). A takeaway is that fine-tuning via RLSF enables relatively smaller LLMs to significantly outperform closed-source models that are orders of magnitude larger (e.g., GPT-4).

4490LongMamba: Enhancing Mamba’s Long-Context Capabilities via Training-Free Receptive Field Enlargement

[openreview] [pdf]

Abstract Mamba models have emerged as an efficient alternative to Transformer models for language modeling tasks, offering linear complexity as context length increases. However, despite their efficiency in handling long contexts, recent studies have demonstrated that Mamba models underperform in understanding extended contexts compared to Transformer models. To address this significant shortfall, we propose ``LongMamba", a training-free technique that significantly enhances the long-context capabilities of Mamba models. Our approach builds upon the discovery that hidden state channels in Mamba models—categorized into \textit{local} and \textit{global channels} based on their receptive field lengths—exhibit distinct functionalities. Specifically, the \textit{global channels} struggle to adaptively extend their effective receptive fields when input lengths far exceed their training sequence length due to exponential decay in their hidden states. We hypothesize this exponential decay is the root cause of Mamba models’ limited performance in extended contexts. LongMamba counters this by effectively expanding the \textit{global channels}’ receptive fields to fully encompass the input sequence length, thus enabling them to capture global information more effectively. Through extensive benchmarking across synthetic and real-world long-context scenarios, LongMamba sets a new standard for state-of-the-art performance in Mamba-based long-context tasks, significantly extending the operational range of Mamba models without requiring additional fine-tuning. All code and models will be released upon acceptance.

4491Bridging the Gap between Variational Inference and Stochastic Gradient MCMC in Function Space

[openreview] [pdf]

Abstract Traditional parameter-space posterior inference for Bayesian neural networks faces several challenges, such as the difficulty in specifying meaningful prior, the potential pathologies in deep models and the intractability for multi-modal posterior. To address these issues, functional variational inference (fVI) and functional Markov Chain Monte Carlo (fMCMC) are two recently emerged Bayesian inference schemes that perform posterior inference directly in function space by incorporating more informative functional priors. Similar to their parameter-space counterparts, fVI and fMCMC have their own strengths and weaknesses. For instance, fVI is computationally efficient but imposes strong distributional assumptions, while fMCMC is asymptotically exact but suffers from slow mixing in high dimensions. To inherit the complementary benefits of both schemes, this work proposes a novel hybrid inference method for functional posterior inference. Specifically, it combines fVI and fMCMC successively by an elaborate linking mechanism to form an alternating approximation process. We also provide theoretical justification for the soundness of such a hybrid inference through the lens of Wasserstein gradient flows in the function space. We evaluate our method on several benchmark tasks and observe improvements in both predictive accuracy and uncertainty quantification compared to parameter/function-space VI and MCMC.

4492Tailoring Mixup to Data for Calibration

[openreview] [pdf]

Abstract Among all data augmentation techniques proposed so far, linear interpolation of training samples, also called Mixup, has found to be effective for a large panel of applications. Along with improved performance, Mixup is also a good technique for improving calibration and predictive confidence. However, mixing data carelessly can lead to manifold mismatch, i.e., synthetic data lying outside original class manifolds, which can deteriorate calibration of confidence. In this work, we show that the likelihood of manifold mismatch increases with the distance between data to mix. To this end, we propose to dynamically change the underlying distributions of interpolation coefficients depending on the similarity between samples to mix, and define a flexible framework to do so without losing in diversity. We provide extensive experiments for classification and regression tasks, showing that our proposed method improves performance and calibration of models, while being much more efficient.

4493LogicJitter: Let LLMs play Logic Games and they will Detect Misinformation

[openreview] [pdf]

Abstract In the face of the growing challenge of online information overload, the ability to accurately differentiate between genuine information and misinformation has become increasingly critical both from an individual and societal point of view. Current methodologies for misinformation detection predominantly rely on supervised approaches, which depend heavily on large labeled datasets. However, these datasets are not only costly and time-consuming to produce, but they are also susceptible to issues such as labeling bias, time leakage, the inherent subjectivity of the task, and domain-specific limitations. In this paper, we aim to overcome the aforementioned challenges by proposing a novel and cost-effective strategy to enhance the logical reasoning capabilities of Large Language Models (LLMs), thereby improving their ability to detect misinformation. Our approach, termed LogicJitter, employs a data augmentation technique during fine-tuning that generates both correct and incorrect statements within rule-based logic games. Moreover, these games are designed to counteract well-known human cognitive biases and logical fallacies. Hence, the primary contributions of this work include demonstrating the effectiveness of logical reasoning pre-training on LLMs and providing an open-source PyTorch package for the automatic generation of correct and incorrect logic-based training data.

4494IMPaCT GNN: Imposing invariance with Message Passing in Chronological split Temporal Graphs

[openreview] [pdf]

Abstract This paper addresses domain adaptation challenges in graph data resulting from chronological splits. In a transductive graph learning setting, where each node is associated with a timestamp, we focus on the task of Semi-Supervised Node Classification (SSNC), aiming to classify recent nodes using labels of past nodes. Temporal dependencies in node connections create domain shifts, causing significant performance degradation when applying models trained on historical data into recent data. Given the practical relevance of this scenario, addressing domain adaptation in chronological split data is crucial, yet underexplored. We propose Imposing invariance with Message Passing in Chronological split Temporal Graphs (\IMPaCT), a method that imposes invariant properties based on realistic assumptions derived from temporal graph structures. Unlike traditional domain adaptation approaches which rely on unverifiable assumptions, \IMPaCT explicitly accounts for the characteristics of chronological splits. The \IMPaCT is further supported by rigorous mathematical analysis, including a derivation of an upper bound of the generalization error. Experimentally, \IMPaCT achieves a 3.8% performance improvement over current SOTA method on the ogbn-mag graph dataset. Additionally, we introduce the Temporal Stochastic Block Model (TSBM), which replicates temporal graphs under varying conditions, demonstrating the applicability of our methods to general spatial GNNs.

4495WebCanvas: Benchmarking Web Agents in Online Environments

[openreview] [pdf]

Abstract For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and maintenance pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source a baseline agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.

4496A Score-Based Density Formula, with Applications in Diffusion Generative Models

[openreview] [pdf]

Abstract Score-based generative models (SGMs) have revolutionized the field of generative modeling, achieving unprecedented success in generating realistic and diverse content. Despite empirical advances, the theoretical basis for why optimizing the evidence lower bound (ELBO) on the log-likelihood is effective for training diffusion generative models, such as DDPMs, remains largely unexplored. In this paper, we address this question by establishing a density formula for a continuous-time diffusion process, which can be viewed as the continuous-time limit of the forward process in an SGM. This formula reveals the connection between the target density and the score function associated with each step of the forward process. Building on this, we demonstrate that the minimizer of the optimization objective for training DDPMs nearly coincides with that of the true objective, providing a theoretical foundation for optimizing DDPMs using the ELBO. Furthermore, we offer new insights into the role of score-matching regularization in training GANs, the use of ELBO in diffusion classifiers, and the recently proposed diffusion loss.

4497Adjusting Pretrained Backbones for Performativity

[openreview] [pdf]

Abstract With the widespread deployment of deep learning models, they influence their environment in various ways. The induced distribution shifts can lead to unexpected performance degradation in deployed models. Existing methods to anticipate performativity typically incorporate information about the deployed model into the feature vector when predicting future outcomes. While enjoying appealing theoretical properties, modifying the input dimension of the prediction task is often not practical. To address this, we propose a novel technique to adjust pretrained backbones for performativity in a modular way, achieving better sample efficiency and enabling the reuse of existing deep learning assets. Focusing on performative label shift, the key idea is to train a shallow adapter module to perform a \emph{Bayes-optimal} label shift correction to the backbone’s logits given a sufficient statistic of the model to be deployed. As such, our framework decouples the construction of input-specific feature embeddings from the mechanism governing performativity. Motivated by dynamic benchmarking as a use-case, we evaluate our approach under adversarial sampling, for vision and language tasks. We show how it leads to smaller loss along the retraining trajectory and enables us to effectively select among candidate models to anticipate performance degradations. More broadly, our work provides a first baseline for addressing performativity in deep learning.

4498PrivacyRestore: Privacy-Preserving Inference in Large Language Models via Privacy Removal and Restoration

[openreview] [pdf]

Abstract The widespread usage of online Large Language Models (LLMs) inference services has raised significant privacy concerns about the potential exposure of private information in user inputs to malicious eavesdroppers. Existing privacy protection methods for LLMs suffer from either insufficient privacy protection, performance degradation, or large inference time overhead. To address these limitations, we propose PrivacyRestore, a plug-and-play method to protect the privacy of user inputs during LLM inference. The server first trains restoration vectors for each privacy span and then release to clients. Privacy span is defined as a contiguous sequence of tokens within a text that contain private information. The client then aggregate restoration vectors of all privacy spans in the input into a single meta restoration vector which is later sent to the server side along with the input without privacy spans.The private information is restored via activation steering during inference. Furthermore, we prove that PrivacyRestore inherently prevents the linear growth of the privacy budget.We create three datasets, covering medical and legal domains, to evaluate the effectiveness of privacy preserving methods. The experimental results show that PrivacyRestore effectively protects private information and maintain acceptable levels of performance and inference overhead.

4499Evolving Virtual World with Delta-Engine

[openreview] [pdf]

Abstract In this paper, we focus on the \emph{virtual world}, a cyberspace that people can live in. In a sense, any video game can be regarded as a virtual world. However, the fundamental difference between them is the evolving nature, which means the real world is constantly changed by humans’ behavior, while existing virtual worlds are strictly defined by the back-end engine and cannot be changed by users’ behavior. For this, we propose a special world engine called \textbf{\emph{Delta-Engine}} to drive this virtual world. Δ\Delta associates the world’s evolution with the engine’s scaling. It consists of a base engine and a neural proxy. The base engine programs the prototype of the virtual world; given a trigger, the neural proxy generates new snippets on the base engine through \emph{incremental prediction}. This paper presents a full-stack introduction to the delta-engine. The key feature of the delta-engine is its scalability to user-generated content. Technically, this is supported by dual aspects of algorithm and data. To do this, we leverage the retrieval technique to enhance the connection of the neural proxy and the base engine, and propose the human-LLM collaborative design to align the neural proxy with novel and interesting data.

4500It’s Not a Modality Gap: Characterizing and Addressing the Contrastive Gap

[openreview] [pdf]

Abstract Learning jointly from images and texts using contrastive pre-training has emerged as an effective method to train large-scale models with a strong grasp of semantic image concepts. For instance, CLIP, pre-trained on a large corpus of web data, excels in tasks like zero-shot image classification, object detection, geolocalization, and more. These contrastive models embed input images and texts into a shared representational space. Recently, it was claimed that models like CLIP show amodality gap, where image and text embeddings occupy disjoint areas in the representational space. Previous studies attribute this gap to factors like data artifacts (mismatched pairs), model architecture artifacts (the cone effect), and the nature of the loss landscape (getting stuck in local minima). We demonstrate that, even after accounting for these factors, and even when using thesame modality, the contrastive loss actuallycreatesa gap during training. As a result, we propose renaming this phenomenon thecontrastive gap. We show that the contrastive gap is exacerbated by training with small batch sizes in high-dimensional spaces, causing embeddings of each modality to occupy small disjoint portions of the latent space. Our experiments show that minimizing the contrastive gap via the addition of uniformity and alignment terms optimizes the representational space and conveys better performance on downstream tasks such as zero-shot image classification and multi-modal arithmetic.

4501Tensor-Var: Variational Data Assimilation in Tensor Product Feature Space

[openreview] [pdf]

Abstract Variational data assimilation estimates the dynamical system states by minimizing cost function that fits the numerical models with observational data. The widely used method, four-dimensional variational assimilation (4D-Var), has two primary limitations: (1) computationally demanding for complex nonlinear systems; and (2) relying on state-observation mappings, which are often impractical. Recently, deep learning (DL) has been used as a more expressive class of efficient model approximators to address these challenges. However, integrating such models into 4D-Var remains challenging due to their inherent nonlinearities and the lack of theoretical guarantees for consistency in assimilation results. In this paper, we propose \textit{Tensor-Var} to address these challenges using kernel Conditional Mean Embedding (CME). Tensor-Var characterizes system dynamics and state-observation mappings as linear operators in a feature space, enabling a more efficient linear 4D-Var framework. Our method seamlessly integrates CME with 4D-Var, offering theoretical guarantees of consistent assimilation results between the original and feature space. To improve CME scalability, we use deep kernel features that map data into a finite-dimensional feature space, utilizing the expressiveness of deep learning. Experiments on chaotic systems and global weather forecasting demonstrate that Tensor-Var outperforms operational and DL hybrid methods 4D-Var baselines in terms of accuracy while achieving efficiency comparable to the static 3D-Var method.

4502Mitigating Memorization in Language Models

[openreview] [pdf]

Abstract Language models (LMs) can “memorize” information, i.e., encode training data in their weights in such a way that inference-time queries can lead to verbatim regurgitation of that data. This ability to extract training data can be problematic, for example, when data are private or sensitive. In this work, we investigate methods to mitigate memorization: three regularizer-based, three fine-tuning-based, and eleven machine unlearning-based methods, with five of the latter being new methods that we introduce. We also introduce TinyMem, a suite of small, computationally-efficient LMs for the rapid development and evaluation of memorization-mitigation methods. We demonstrate that the mitigation methods that we develop using TinyMem can successfully be applied to production-grade LMs, and we determine via experiment that: regularizer-based mitigation methods are slow and ineffective at curbing memorization; fine-tuning-based methods are effective at curbing memorization, but overly expensive, especially for retaining higher accuracies; and unlearning-based methods are faster and more effective, allowing for the precise localization and removal of memorized information from LM weights prior to inference. We show, in particular, that our proposed unlearning method BalancedSubnet outperforms other mitigation methods at removing memorized information while preserving performance on target tasks.

4503Improving Model Alignment Through Collective Intelligence of Open-Source Models

[openreview] [pdf]

Abstract Building helpful and harmless large language models (LLMs) requires effective model alignment approach based on human instructions and feedback; this necessitates high-quality human-labeled data. Constructing such datasets is often expensive and not scalable, and may face potential bottleneck on diversity. To address these challenges, we introduce Mixture-of-Agent Alignment (MoAA), an effective approach that leverages the collective strengths of various language models to provide high-quality data for model alignment. By employing MoAA, we enhance both supervised fine-tuning (SFT) and preference optimization, leading to improved performance compared to using a single model alone, including the state-of-ther-art commercial model. This approach leads to an intriguing direction of model alignment through an scalable and diverse instruction data recipe based on open-sourced models.

4504Forget the Data and Fine-Tuning! Just Fold the Network to Compress

[openreview] [pdf]

Abstract We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers, significantly reducing the model size without the need for fine-tuning or access to training data. Unlike existing methods, model folding preserves data statistics during compression by leveraging kk-means clustering, and using novel data-free techniques to prevent variance collapse or explosion. Our theoretical framework and experiments across standard benchmarks, including ResNet18 and LLaMA-7B, demonstrate that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods, especially at high sparsity levels. This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.

4505PROVABLY EFFICIENT FEDERATED ACTIVE MULTI-TASK REPRESENTATION LEARNING

[openreview] [pdf]

Abstract Multi-task learning is an emerging machine learning paradigm that integrates data from multiple sources, harnessing task similarities to enhance overall model performance. The application of multi-task learning to real-world settings is hindered due to data scarcity, along with challenges related to scalability and computational resources. To address this challenge, we develop a fast and sample-efficient approach for multi-task active learning when the amount of data from source tasks and target tasks is limited. By leveraging the techniques from active learning, we propose an adaptive sampling-based alternating projected gradient descent (GD) and minimization algorithm that iteratively estimates the relevance of each source task to the target task and samples from each source task based on the estimated relevance. We present the convergence guarantee of our algorithm and the sample complexity of our approach. We evaluated the effectiveness of our algorithm using numerical experiments and compared it empirically against four benchmark algorithms using synthetic and real datasets.

4506Rethinking Neural Multi-Objective Combinatorial Optimization via Neat Weight Embedding

[openreview] [pdf]

Abstract Recent decomposition-based neural multi-objective combinatorial optimization (MOCO) methods struggle to achieve desirable performance. Even equipped with complex learning techniques, they often suffer from significant optimality gaps in weight-specific subproblems. To address this challenge, we propose a neat weight embedding method to learn weight-specific representations, which captures weight-instance interaction for the subproblems and was overlooked by most current methods. We demonstrate the potentials of our method in two instantiations. First, we introduce a succinct addition model to learn weight-specific node embeddings, which surpassed most existing neural methods. Second, we design an enhanced conditional attention model to simultaneously learn the weight embedding and node embeddings, which yielded new state-of-the-art performance. Experimental results on classic MOCO problems verified the superiority of our method. Remarkably, our method also exhibits favorable generalization performance across problem sizes, even outperforming the neural method specialized for boosting size generalization.

4507A Single Swallow Does Not Make a Summer: Understanding Semantic Structures in Embedding Spaces

[openreview] [pdf]

Abstract Embedding spaces encapsulate rich information from deep learning models, with vector distances reflecting the semantic similarity between textual elements. However, their abstract nature and the computational complexity of analyzing them remain significant challenges. To address these, we introduce the concept of Semantic Field Subspace, a novel mapping that links embedding spaces with the underlying semantics. We propose \textsf{SAFARI}, a novel algorithm for \textsf{S}em\textsf{A}ntic \textsf{F}ield subsp\textsf{A}ce dete\textsf{R}m\textsf{I}nation, which leverages hierarchical clustering to discover hierarchical semantic structures, using Semantic Shifts to capture semantic changes as clusters merge, allowing for the identification of meaningful subspaces. To improve scalability, we extend Weyl’s Theorem, enabling an efficient approximation of Semantic Shifts that significantly reduces computational costs. Extensive evaluations on five real-world datasets demonstrate the effectiveness of \textsf{SAFARI} in uncovering interpretable and hierarchical semantic structures. Additionally, our approximation method achieves a 15\sim30×\times speedup while maintaining minimal errors (less than 0.01), making it practical for large-scale applications. The source code is available at \url{https://anonymous.4open.science/r/Safari-C803/}.

4508Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects

[openreview] [pdf]

Abstract In recent years, SignSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that SignSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of SignSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of SignSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.

4509Physics-Informed Self-Guided Diffusion Model for High-Fidelity Simulations

[openreview] [pdf]

Abstract Machine learning (ML) models are increasingly explored in fluid dynamics as a promising way to generate high-fidelity computational fluid dynamics data more efficiently. A common strategy is to use low-fidelity data as computational-efficient inputs, and employ ML techniques to reconstruct high-fidelity flow fields. However, existing work typically assumes that low-fidelity data is artificially downsampled from high-fidelity sources, which limits model performance. In real-world applications, low-fidelity data is generated directly by numerical solvers with a lower initial state resolution, resulting in large deviations from high-fidelity data. To address this gap, we propose PG-Diff, a novel diffusion model for reconstructing high-fidelity flow fields, where both low- and high-fidelity data are generated from numerical solvers. Our experiments reveal that state-of-the-art models struggle to recover fine-grained high-fidelity details when using solver-generated low-fidelity inputs, due to distribution shift. To overcome this challenge, we introduce an \textit{Importance Weight} strategy during training as self-guidance and a training-free \textit{Residual Correction} method during inference as physical inductive bias, guiding the diffusion model toward higher-quality reconstructions. Experiments on four 2D turbulent flow datasets demonstrate the effectiveness of our proposed method.

4510Multi-Session Client-Centered Treatment Outcome Evaluation in Psychotherapy

[openreview] [pdf]

Abstract In psychotherapy, therapeutic outcome assessment, or treatment outcome evaluation, is essential for enhancing mental health care by systematically evaluating therapeutic processes and outcomes. Existing large language model approaches often focus on therapist-centered, single-session evaluations, neglecting the client’s subjective experience and longitudinal progress across multiple sessions. To address these limitations, we propose IPAEval, a client-Informed Psychological Assessment-based Evaluation framework that automates treatment outcome evaluations from the client’s perspective using clinical interviews. IPAEval integrates cross-session client-contextual assessment and session-focused client-dynamics assessment to provide a comprehensive understanding of therapeutic progress. Experiments on our newly developed TheraPhase dataset demonstrate that IPAEval effectively tracks symptom severity and treatment outcomes over multiple sessions, outperforming previous single-session models and validating the benefits of items-aware reasoning mechanisms.

4511Zero-Shot Generalization of GNNs over Distinct Attribute Domains

[openreview] [pdf]

Abstract Inductive GNNs are able to generalize across graphs with the same set of node attributes. However, zero-shot generalization across attributed graphs with disparate node attribute domains remains a fundamental challenge in graph machine learning. Existing methods are unable to effectively make use of node attributes when transferring to unseen attribute domains, frequently performing no better than models that ignore attributes entirely. This limitation stems from the fact that models trained on one set of attributes (e.g., biographical data in social networks) fail to capture relational dependencies that extend to new attributes in unseen test graphs (e.g., TV and movies preferences). Here, we introduce STAGE, a method that learns representations ofstatistical dependenciesbetween attributes rather than the attribute values themselves, which can then be applied to completely unseen test-time attributes, generalizing by identifying analogous dependencies between features in test. STAGE leverages the theoretical link between maximal invariants and measures of statistical dependencies, enabling it to provably generalize to unseen feature domains for a family of domain shifts. Our empirical results show that when STAGE is pretrained on multiple graph datasets with unrelated feature spaces (distinct feature types and dimensions) and evaluated zero-shot on graphs with yet new feature types and dimensions, it achieves a relative improvement in Hits@1 between 40% to 103% for link prediction, and an 11% improvement in node classification against state-of-the-art baselines.

4512Variational Best-of-N Alignment

[openreview] [pdf]

Abstract Best-of-N (BoN) is a popular and effective algorithm for aligning language models to human preferences. The algorithm works as follows: at inference time, N samples are drawn from the language model, and the sample with the highest reward, as judged by a reward model, is returned as the output. Despite its effectiveness, BoN is computationally expensive; it reduces sampling throughput by a factor of N. To make BoN more efficient at inference time, one strategy is to fine-tune the language model to mimic what BoN does during inference. To achieve this, we derive the distribution induced by the BoN algorithm. We then propose to fine-tune the language model to minimize backward KL divergence to the BoN distribution. Our approach is analogous to mean-field variational inference and, thus, we term it variational BoN (vBoN). To the extent this fine-tuning is successful and we end up with a good approximation, we have reduced the inference cost by a factor of N. Our experiments on controlled generation and summarization tasks show that BoN is the most effective alignment method, and our variational approximation to BoN achieves the closest performance to BoN and surpasses models fine-tuned using the standard KL-constrained RL objective. In the controlled generation task, vBoN appears more frequently on the Pareto frontier of reward and KL divergence compared to other alignment methods. In the summarization task, vBoN achieves high reward values across various sampling temperatures.

4513Secure FLOATING - Scalable Federated Learning Framework for Real-time Trust in Mobility Data using Secure Multi-Party Computation and Blockchain

[openreview] [pdf]

Abstract The safety of Connected and Autonomous Vehicles (CAVs), Micro-mobility devices (e-scooter, e-bikes) and smartphone users rely on trusting the trajectory data they generate for navigation around each other. There is a need for real-time verification of mobility data from these devices without compromising privacy as malicious data used for navigation could be deadly, specially for vulnerable road users. In this paper, we propose Secure-FLOATING, a scalable framework leveraging federated learning and blockchain for nearby nodes to coordinate and learn to trust mobility data from nearby devices and store this information via consensus on a tamper-proof distributed ledger. We employ lightweight Secure Multi-party computation (SMPC) with reduced messages exchanges to preserve privacy of the users and ensure data validation in real-time. Secure-FLOATING is evaluated using realistic trajectories for up to 8,000 nodes (vehicles, micro-mobility devices and pedestrians) in New York City, and it shows to achieve lower delays and overhead, thereby accurately validating each others’ mobility data in a scalable manner, with up to 75% successful endorsement for as high as 50% attacker penetration.

4514Harnessing Diversity for Important Data Selection in Pretraining Large Language Models

[openreview] [pdf]

Abstract Data selection is of great significance in pretraining large language models, given the variation in quality within the large-scale available training corpora. To achieve this, researchers are currently investigating the use of data influence to measure the importance of data instances, i.e.,i.e., a high influence score indicates that incorporating this instance to the training set is likely to enhance the model performance. Consequently, they select the top-kk instances with the highest scores. However, this approach has several limitations. (1) Calculating the accurate influence of all available data is time-consuming. (2) The selected data instances are not diverse enough, which may hinder the pretrained model’s ability to generalize effectively to various downstream tasks. In this paper, we introduce Quad\texttt{Quad}, a data selection approach that considers both quality and diversity by using data influence to achieve state-of-the-art pretraining results. To compute the influence (i.e.,i.e., the quality) more accurately and efficiently, we incorporate the attention layers to capture more semantic details, which can be accelerated through the Kronecker product. For the diversity, Quad\texttt{Quad} clusters the dataset into similar data instances within each cluster and diverse instances across different clusters. For each cluster, if we opt to select data from it, we take some samples to evaluate the influence to prevent processing all instances. Overall, we favor clusters with highly influential instances (ensuring high quality) or clusters that have been selected less frequently (ensuring diversity), thereby well balancing between quality and diversity. Experiments on Slimpajama demonstrate that Quad\texttt{Quad} significantly outperforms other data selection methods with a low FLOPs consumption. Further analysis also validates the effectiveness of our influence calculation.

4515Low-Budget Simulation-Based Inference with Bayesian Neural Networks

[openreview] [pdf]

Abstract Simulation-based inference methods have been shown to be inaccurate in the data-poor regime, when training simulations are limited or expensive. Under these circumstances, the inference network is particularly prone to overfitting, and using it without accounting for the computational uncertainty arising from the lack of identifiability of the network weights can lead to unreliable results. To address this issue, we propose using Bayesian neural networks in low-budget simulation-based inference, thereby explicitly accounting for the computational uncertainty of the posterior approximation. We design a family of Bayesian neural network priors that are tailored for inference and show that they lead to well-calibrated posteriors on tested benchmarks, even when as few as O(10)O(10) simulations are available. This opens up the possibility of performing reliable simulation-based inference using very expensive simulators, as we demonstrate on a problem from the field of cosmology where single simulations are computationally expensive. We show that Bayesian neural networks produce informative and well-calibrated posterior estimates with only a few hundred simulations.

4516Hybrid Contrastive Transformer for Visual Tracking

[openreview] [pdf]

Abstract Visual object tracking is a research hotspot in the field of computer vision, and has been widely applied in video surveillance, human-computer interaction, unmanned driving and other fields. At present, the object trackers based on Transformer have good performance, but they still face the challenge of confusing target and background in the feature extraction process. To address this issue, we propose a Hybrid Contrastive Transformer Tracker (HCTrack) in this paper, which combines contrastive learning to improve the ability of distinguishing the target and the background in video. Furthermore, a hybrid feature interaction module is presented to realize multi-level information exchange between the features of template and search regions and capture the target-related semantic information of the search frames comprehensively. Additionally, we design a redundant information pruning module to adaptively eliminate the redundant backgrounds according to the global scene information, thereby reducing the interference of the background to the target feature. HCTrack achieves superior tracking accuracy on the GOT-10k and TrackingNet datasets compared to other state-of-the-art trackers, while maintaining fast inference speed, as the contrastive learning is only implemented during training model.

4517Text as parameter: interactive prompt optimisation for large language models

[openreview] [pdf]

Abstract Large language models (LLMs) can handle a variety of tasks conditioned on natural language instructions. While fine-tuning improves task-specific performance, adjusting the model weights of LLMs requires a huge amount of computational resources, and it is impractical for real-time updates. Alternatively, prompting allows LLMs to adapt to a broad range of tasks without the need for computationally intensive gradient-based optimisation. However, crafting effective prompts remains a challenge, to the extent that it is even unclear if expert in-domain knowledge is what is needed or experience in writing prompts or something else. Approaches like meta-prompting and self-feedback seek to alleviate this burden, but they rely primarily on a numerical feedback signal, leaving the potential of textual feedback unexplored. These methods also typically require numerous interactions with the environment to gather sufficient context, leading to significant computational overhead.In this work, we propose a novel framework that takes a prompted large language model as an optimiser and treats the text-based prompt itself as a parameter. By interacting with the environment to collect feedback, our proposed method constructs the updated textual prompt. Our experimental results demonstrate that this method not only achieves superior performance but also automatically incorporates domain-specific knowledge, establishing a scientifically motivated, practical and efficient approach to prompting for future research.

4518Singular Value Adaptation for Parameter-Efficient Fine Tuning

[openreview] [pdf]

Abstract Parameter-Efficient Fine-Tuning (PEFT) has become a crucial approach in handling the growing complexity of large models and vast datasets across multiple fields such as Computer Vision or Natural Language Processing. Among the most promising of these methods are Low-Rank Adaptation (LoRA) and its derivatives, which fine-tune a pre-trained weight matrix W\mathbf{W} by introducing a low-rank update matrix ΔW\mathbf{\Delta W}. While these approaches have demonstrated strong empirical performance, they remain largely heuristic, with little theoretical grounding to explain their behavior or guide the design of ΔW\mathbf{\Delta W} for different objectives. This lack of theoretical insight limits our understanding of when these methods are most effective and how they can be systematically improved. In this paper, we propose a theoretical framework for analyzing and designing LoRA-based methods, with a focus on the formulation of ΔW\mathbf{\Delta W}. By establishing a deeper understanding of the interplay between W\mathbf{W} and ΔW\mathbf{\Delta W}, we aim to enable more efficient and targeted fine-tuning strategies, opening the door to novel variants that strike an optimal balance between performance and efficiency. Our proposed method - \textbf{Si}ngular \textbf{V}alue \textbf{A}daptation - uses insights from our theoretical framework to incorporate inductive biases on the formulation of ΔW\mathbf{\Delta W}, leading to a PEFT method that is up to 50×\times more parameter efficient that LoRA, while achieving comparable or better performance across various vision and language tasks.

4519A Tight Convergence Analysis of Inexact Stochastic Proximal Point Algorithm for Stochastic Composite Optimization Problems

[openreview] [pdf]

Abstract The \textbf{i}nexact \textbf{s}tochastic \textbf{p}roximal \textbf{p}oint \textbf{a}lgorithm (isPPA) is popular for solving stochastic composite optimization problems with many applications in machine learning. While the convergence theory of the (inexact) PPA has been well established, the known convergence guarantees of isPPA require restrictive assumptions. In this paper, we establish the stability and almost sure convergence of isPPA under mild assumptions, where smoothness and (restrictive) strong convexity of the objective function are not required. Imposing a local Lipschitz condition on component functions and a quadratic growth condition on the objective function, we establish last-iterate iteration complexity bounds of isPPA regarding the distance to the solution set and the Karush–Kuhn–Tucker (KKT) residual. Moreover, we show that the established iteration complexity bounds are tight up to a constant by explicitly analyzing the bounds for the regularized Fr’echet mean problem. We further validate the established convergence guarantees of isPPA by numerical experiments.

4520Taming Gradient Oversmoothing and Expansion in Graph Neural Networks

[openreview] [pdf]

Abstract Oversmoothing has been claimed as a primary bottleneck for multi-layered graph neural networks (GNNs). Multiple analyses have examined how and why oversmoothing occurs. However, none of the prior work addressed how optimization is performed under the oversmoothing regime. In this work, we show the presence of gradient oversmoothing\textit{gradient oversmoothing} preventing optimization during training. We further analyze that GNNs with residual connections, a well-known solution to help gradient flow in deep architecture, introduce gradient expansion\textit{gradient expansion}, a phenomenon of the gradient explosion in diverse directions. Therefore, adding residual connections cannot be a solution for making a GNN deep. Our analysis reveals that constraining the Lipschitz bound of each layer can neutralize the gradient expansion. To this end, we provide a simple yet effective normalization method to prevent the gradient expansion. An empirical study shows that the residual GNNs with hundreds of layers can be efficiently trained with the proposed normalization without compromising performance. Additional studies show that the empirical observations corroborate our theoretical analysis.

4521BLIPEE: Fast and Robust BLIP with Adversarially Trained Early Exits

[openreview] [pdf]

Abstract In recent years, Vision-Language Models (VLMs) have shown remarkable performance improvements in vision-language tasks. However, their large size poses challenges for real-world applications where inference latency is a concern. To tackle this issue, we propose employing Early Exit (EE) strategies in VLM. However, training exit classifiers in VLMs is challenging, particularly with limited labeled training data. To address this, we introduce BLIPEE, an adversarial training approach within a GAN-based framework. Here, each exit consists of a transformer layer and a classifier, and the transformer layer is adversarially trained to produce feature representations similar to the final layer, while a feature classifier serves as the discriminator. Our method focuses on performing input-adaptive inference that mitigates the overthinking issue and increases inference speed. Experimental results demonstrate the effectiveness of our approach in enhancing accuracy and model robustness by mitigating overthinking and the phenomenon of mid-crisis that we highlight. The anonymized source code is available athttps://anonymous.4open.science/status/BLIPEE-3ED3.

4522Turning Up the Heat: Min-p Sampling for Creative and Coherent LLM Outputs

[openreview] [pdf]

Abstract Large Language Models (LLMs) generate text by sampling the next token from a probability distribution over the vocabulary at each decoding step. However, popular sampling methods like top-p (nucleus sampling) often struggle to balance quality and diversity, especially at higher temperatures, leading to incoherent or repetitive outputs. To address this challenge, we propose min-p sampling, a dynamic truncation method that adjusts the sampling threshold based on the model’s confidence by scaling according to the top token’s probability. We conduct extensive experiments on benchmarks including GPQA, GSM8K, and AlpacaEval Creative Writing, demonstrating that min-p sampling improves both the quality and diversity of generated text, particularly at high temperatures. Moreover, human evaluations reveal a clear preference for min-p sampling in terms of both text quality and diversity. Min-p sampling has been adopted by multiple open-source LLM implementations, highlighting its practical utility and potential impact.

4523TRACE: Temporal Grounding Video LLM via Causal Event Modeling

[openreview] [pdf]

Abstract Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To effectively handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend in employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos, which restricts their effectiveness in tackling VTG tasks. To address this issue, this paper first formally introduces causal event modeling framework, which represents videos as sequences of events, and predict the current event using previous events, video inputs, and textural instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. The TRACE processes visual frames, timestamps, salient scores, and text as distinct tasks, employing various encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework’s formulation. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code will be made publicly available.

4524Pursuing Feature Separation based on Neural Collapse for Out-of-Distribution Detection

[openreview] [pdf]

Abstract In the open world, detecting out-of-distribution (OOD) data, whose labels are disjoint with those of in-distribution (ID) samples, is important for reliable deep neural networks (DNNs). To achieve better detection performance, one type of approach proposes to fine-tune the model with auxiliary OOD datasets to amplify the difference between ID and OOD data through a separation loss defined on model outputs. However, none of these studies consider enlarging the feature disparity, which should be more effective compared to outputs. The main difficulty lies in the diversity of OOD samples, which makes it hard to describe their feature distribution, let alone design losses to separate them from ID features. In this paper, we neatly fence off the problem based on an aggregation property of ID features named Neural Collapse (NC). NC means that the penultimate features of ID samples within a class are nearly identical to the last layer weight of the corresponding class. Based on this property, we propose a simple but effective loss called Separation Loss, which binds the features of OOD data in a subspace orthogonal to the principal subspace of ID features formed by NC. In this way, the features of ID and OOD samples are separated by different dimensions. By optimizing the feature separation loss rather than purely enlarging output differences, our detection achieves SOTA performance on CIFAR10, CIFAR100 and ImageNet benchmarks without any additional data augmentation or sampling, demonstrating the importance of feature separation in OOD detection. The code will be published.

4525DO GENERATIVE MODELS LEARN RARE GENERATIVE FACTORS?

[openreview] [pdf]

Abstract Generative models are becoming a promising tool in AI alongside discriminative learning. Several models have been proposed to learn in an unsupervised fashion the corresponding generative factors, namely the latent variables critical for capturing the full spectrum of data variability. Diffusion Models (DMs), Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) are of particular interest due to their impressive ability to generate highly realistic data. Through a systematic empirical study, this paper delves into the intricate challenge of how DMs, GANs and VAEs internalize and replicate rare generative factors. Our findings reveal a pronounced tendency towards the memorization of these factors. We study the reasons for this memorization and demonstrate that strategies such as spectral decoupling can mitigate this issue to a certain extent.

4526Assessing Episodic Memory in LLMs with Sequence Order Recall Tasks

[openreview] [pdf]

Abstract Current LLM benchmarks focus on evaluating models’ memory of facts and semantic relations, primarily assessing semantic aspects of long-term memory. However, in humans, long-term memory also includes episodic memory, which links memories to their contexts, such as the time and place they occurred. The ability to contextualize memories is crucial for many cognitive tasks and everyday functions. This form of memory has not been evaluated in LLMs with existing benchmarks. To address the gap in evaluating memory in LLMs, we introduce Sequence Order Recall Tasks (SORT), which we adapt from tasks used to study episodic memory in cognitive psychology. SORT requires LLMs to recall the correct order of text segments, and provides a general framework that is both easily extendable and does not require any additional annotations. We present an initial evaluation dataset, Book-SORT, comprising 36k pairs of segments extracted from 9 books recently added to the public domain. Based on a human experiment with 155 participants, we show that humans can recall sequence order based on long-term memory of a book. We find that models can perform the task with high accuracy when relevant text is given in-context during the SORT evaluation. However, when presented with the book text only during training, LLMs’ performance on SORT falls short. By allowing to evaluate more aspects of memory, we believe that SORT will aid in the emerging development of memory-augmented models.

4527Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

[openreview] [pdf]

Abstract Diffusion models have exhibited exciting capabilities in generating images and are also very promising for video creation. However, the inference speed of diffusion models is limited by the slow sampling process, restricting its use cases. The sequential denoising steps required for generating a single sample could take tens or hundreds of iterations and thus have become a significant bottleneck. This limitation is more salient for applications that are interactive in nature or require small latency. To address this challenge, we propose Partially Conditioned Patch Parallelism (PCPP) to accelerate the inference of high-resolution diffusion models. Using the fact that the difference between the images in adjacent diffusion steps is nearly zero, Patch Parallelism (PP) leverages multiple GPUs communicating asynchronously to compute patches of an image in multiple computing devices based on the entire image (all patches) in the previous diffusion step. PCPP develops PP to reduce computation in inference by conditioning only on parts of the neighboring patches in each diffusion step, which also decreases communication among computing devices. As a result, PCPP decreases the communication cost by around 70% compared to DistriFusion (the state of the art implementation of PP) and achieves 2.368.02×2.36\sim 8.02\times inference speed-up using 484\sim 8 GPUs compared to 2.326.71×2.32\sim 6.71\times achieved by DistriFusion depending on the computing device configuration and resolution of generation at the cost of a possible decrease in image quality. PCPP demonstrates the potential to strike a favorable trade-off, enabling high-quality image generation with substantially reduced latency.

4528Managing Diffuse Risks in the Safe Deployment of Untrusted Large Language Models

[openreview] [pdf]

Abstract As large language models (LLMs) grow more powerful, they also become more difficult to trust. They could be either aligned with human intentions, or exhibit “subversive misalignment” -- introducing subtle errors that bypass safety checks. Although individual errors may not immediately cause harm, each increases the risk of an eventual safety failure. With this uncertainty, model deployment often grapples with the tradeoff between ensuring safety and harnessing the capabilities of untrusted models. In this work, we introduce the ``Diffuse Risk Management’’ problem, aiming to balance the average-case safety and usefulness in the deployment of untrusted models over a large sequence of tasks. We approach this problem by developing a two-level framework: the single-task level (micro-protocol) and the whole-scenario level (macro-protocol). At the single-task level, we develop various \textit{micro}-protocols that use a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. At the whole-scenario level, we find an optimal \textit{macro}-protocol that uses an adaptive estimate of the untrusted model’s risk to choose between micro-protocols. To evaluate the robustness of our method, we follow \textit{control evaluations} in a code generation testbed, which involves a red team attempting to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. Experiment results show that our approach retains 99.6% usefulness of the untrusted model while ensuring near-perfect safety, significantly outperforming existing deployment methods. Our approach also demonstrates robustness when the trusted and untrusted models have a large capability gap. Our findings demonstrate the promise of managing diffuse risks in the deployment of increasingly capable but untrusted LLMs.

4529Clipping Improves Adam and AdaGrad when the Noise Is Heavy-Tailed

[openreview] [pdf]

Abstract Methods with adaptive stepsizes, such as AdaGrad and Adam, are essential for training modern Deep Learning models, especially Large Language Models. Typically, the noise in the stochastic gradients is heavy-tailed for the later ones. Gradient clipping provably helps to achieve good high-probability convergence for such noises. However, despite the similarity between AdaGrad/Adam and Clip-SGD, the current understanding of the high-probability convergence of AdaGrad/Adam-type methods is limited in this case. In this work, we prove that AdaGrad/Adam (and their delayed version) can have provably bad high-probability convergence if the noise is heavy-tailed. We also show that gradient clipping fixes this issue, i.e., we derive new high-probability convergence bounds with polylogarithmic dependence on the confidence level for AdaGrad and Adam with clipping and with/without delay for smooth convex/non-convex stochastic optimization with heavy-tailed noise. Our empirical evaluations highlight the superiority of clipped versions of AdaGrad/Adam in handling the heavy-tailed noise.

4530Partial Channel Dependence with Channel Masks for Time Series Foundation Models

[openreview] [pdf]

Abstract Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. However, previous efforts have primarily focused on designing model architectures to address explicit heterogeneity among datasets such as various numbers of channels, while often overlooking implicit heterogeneity such as varying dependencies between channels. In this work, we introduce the concept of partial channel dependence (PCD), which enables a more sophisticated adjustment of channel dependencies based on dataset-specific information. To achieve PCD, we propose a channel mask that captures the relationships between channels within a dataset using two key components: 1) a correlation matrix that encodes relative dependencies between channels, and 2) domain parameters that learn the absolute dependencies specific to each dataset, refining the correlation matrix. We validate the effectiveness of PCD across four tasks in TS including forecasting, classification, imputation, and anomaly detection, under diverse settings, including few-shot and zero-shot scenarios with both TS foundation models and single-task models.

4531ReHub: Linear Complexity Graph Transformers with Adaptive Hub-Spoke Reassignment

[openreview] [pdf]

Abstract We present ReHub, a novel graph transformer architecture that achieves linear complexity through an efficient reassignment technique between nodes and virtual nodes. Graph transformers have become increasingly important in graph learning for their ability to utilize long-range node communication explicitly, addressing limitations such as oversmoothing and oversquashing found in message-passing graph networks. However, their dense attention mechanism scales quadratically with the number of nodes, limiting their applicability to large-scale graphs. ReHub draws inspiration from the airline industry’s hub-and-spoke model, where flights are assigned to optimize operational efficiency. In our approach, graph nodes (spokes) are dynamically reassigned to a fixed number of virtual nodes (hubs) at each model layer. Recent work, Neural Atoms (Li et al., 2024), has demonstrated impressive and consistent improvements over GNN baselines by utilizing such virtual nodes; their findings suggest that the number of hubs strongly influences performance. However, increasing the number of hubs typically raises complexity, requiring a trade-off to maintain linear complexity. Our key insight is that each node only needs to interact with a small subset of hubs to achieve linear complexity, even when the total number of hubs is large. To leverage all hubs without incurring additional computational costs, we propose a simple yet effective adaptive reassignment technique based on hub-hub similarity scores, eliminating the need for expensive node-hub computations. Our experiments on long-range graph benchmarks indicate a consistent improvement in results over the base method, Neural Atoms, while maintaining a linear complexity instead of O(n3/2)O(n^{3/2}). Remarkably, our sparse model achieves performance on par with its non-sparse counterpart. Furthermore, ReHub outperforms competitive baselines and consistently ranks among the top performers across various benchmarks.

4532MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

[openreview] [pdf]

Abstract Effective evaluation of Multimodal Large Language Models (MLLMs) is essential for understanding their capabilities and limitations. In this paper, we introduce MIA-Bench, a benchmark designed to assess MLLMs’ ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate and contextually appropriate responses. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning and direct preference optimization to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

4533Strategy-centric Synthesis: Connecting Billions of Image-Text Pairs to High-Quality Visual Instruction Data

[openreview] [pdf]

Abstract Vision-Language Models (VLMs) have demonstrated remarkable generalization across tasks by aligning visual and linguistic representations. High-quality visual instruction data is critical for enhancing the performance of Vision-Language Models. However, current visual instruction tuning datasets, which are primarily derived from past visual tasks, have several limitations. For instance, the range of question types is often restricted and closely tied to the original visual tasks. Furthermore, image diversity is limited, as images collected for various specialized vision tasks clearly fail to adequately represent real-world user queries. Additionally, previous instruction datasets tend to lack complexity, focusing on single tasks like captioning or OCR, which makes it challenging to train models for more complex, multi-skill scenarios. To address these limitations, we propose a novel paradigms called strategy-centric synthesis: automatically synthesizing high-quality instruction data from large-scale image-text pairs. First, we employ an efficient heuristic method to select high-quality, complex images from DataComp-1B image-text pairs. Carefully crafted prompts and these images are fed to VLMs to extract high-quality query strategies and generate corresponding image descriptions. These descriptions are subsequently used to retrieve images aligned with specific questioning strategies. Finally, the retrieved images and their matching strategies are used to synthesize high-quality instructional data. Our experiments indicate that with continued instruction fine-tuning via LoRA on only 3,000 newly synthesized data samples, 0.45% of the LLAVA-1.5 instruction tuning dataset, the model significantly outperforms the original LLAVA-1.5-7B across multiple benchmarks, thereby demonstrating the effectiveness of our approach.

4534Improving classifier decision boundaries and interpretability using nearest neighbors

[openreview] [pdf]

Abstract Neural networks are not learning optimal decision boundaries. We show that decision boundaries are situated in areas of low training data density. They are impacted by few training samples which can easily lead to overfitting. We provide a simple algorithm performing a weighted average of the prediction of a sample and its nearest neighbors’ (computed in latent space) leading to minor favorable outcomes for a variety of important measures for neural networks. In our evaluation, we employ various self-trained and (state-of-the-art) pre-trained convolutional neural networks to show that our approach improves (i) resistance to label noise, (ii) robustness against adversarial attacks, (iii) classification accuracy, and yields novel means for (iv) interpretability. Our interpretability analysis is of independent interest to the XAI community, as it is applicable to any network. While improvements are not necessarily large in all four areas, our approach is conceptually simple, i.e., improvements come without any modification to network architecture, training procedure or dataset. Furthermore, our approach is in stark contrast to prior works that often require trade-offs among the four objectives combined with architectural adaptations or provide valuable, but non-actionable insights. Finally, we provide a theoretical analysis.

4535SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to 1000× larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across a range of model scales. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is submitted.

4536Beyond Directed Acyclic Computation Graph with Cyclic Neural Network

[openreview] [pdf]

Abstract This paper investigates a fundamental yet overlooked design principle of artificial neural networks (ANN): We do not need to build ANNs layer-by-layer sequentially to guarantee the Directed Acyclic Graph (DAG) property. Inspired by biological intelligence, where neurons form a complex, graph-structured network, we introduce the transformative Cyclic Neural Networks (Cyclic NN). It emulates biological neural systems’ flexible and dynamic graph nature, allowing neuron connections in any graph-like structure, including cycles. This offers greater flexibility compared to the DAG structure of current ANNs. We further develop the Graph Over Multi-layer Perceptron, the first detailed model based on this new design paradigm. We experimentally validate the advantages of Cyclic NN on widely tested datasets in most generalized cases, demonstrating its superiority over current layer-by-layer DAG neural networks. With the support of Cyclic NN, the Forward-Forward training algorithm also firstly outperforms the current Back-Propagation algorithm. This research illustrates a transformative ANN design paradigm, a significant departure from current ANN designs, potentially leading to more biologically similar ANNs.

4537What’s the Move? Hybrid Imitation Learning via Salient Points

[openreview] [pdf]

Abstract While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduceSPHINX:SalientPoint-basedHybridImitatioNand eXecution, a flexible IL policy that leverages multimodal observations (point clouds and wrist images), along with a hybrid action space of low-frequency, sparse waypoints and high-frequency, dense end effector movements. Given 3D point cloud observations, SPHINX learns to infer task-relevant points within a point cloud, orsalient points, which support spatial generalization by focusing on semantically meaningful features. These salient points serve as anchor points to predict waypoints for long-range movement, such as reaching target poses in free-space. Once near a salient point, SPHINX learns to switch to predicting dense end-effector movements given close-up wrist images for precise phases of a task. By exploiting the strengths of different input modalities and action representations for different manipulation phases, SPHINX tackles complex tasks in a sample-efficient, generalizable manner. Our method achieves86.7%success across 4 real-world and 2 simulated tasks, outperforming the next best state-of-the-art IL baseline by41.1%on average across440real world trials. SPHINX additionally generalizes to novel viewpoints, visual distractors, spatial arrangements, and execution speeds with a1.7xspeedup over the most competitive baseline. Our website (http://sphinx-il.github.io) provides open-sourced code for data collection, training, and evaluation, along with supplementary videos.

4538Problem-Parameter Free Federated Learning

[openreview] [pdf]

Abstract Federated learning (FL) has garnered significant attention from academia and industry in recent years due to its advantages in data privacy, scalability, and communication efficiency. However, current FL algorithms face a critical limitation: their performance heavily depends on meticulously tuned hyperparameters, particularly the learning rate or stepsize. This manual tuning process is challenging in federated settings due to data heterogeneity and limited accessibility of local datasets. Consequently, the reliance on problem-specific parameters hinders the widespread adoption of FL and potentially compromises its performance in dynamic or diverse environments. To address this issue, we introduce PAdaMFed, a novel algorithm for nonconvex FL that carefully combines adaptive stepsize and momentum techniques. PAdaMFed offers two key advantages: 1) it operates autonomously without relying on problem-specific parameters, making it, to our knowledge, the first FL algorithm to achieve such problem-parameter-agnostic adaptation; and 2) it manages data heterogeneity and partial participation without requiring heterogeneity bounds. Despite these benefits, PAdaMFed provides several strong theoretical guarantees: 1) It achieves state-of-the-art convergence rates with a sample complexity of O(ϵ4)\mathcal{O}(\epsilon^{-4}) and communication complexity of O(ϵ3)\mathcal{O}(\epsilon^{-3}), even using constant learning rates; 2) these complexities can be improved to the best-known O(ϵ3)\mathcal{O}(\epsilon^{-3}) for sampling and O(ϵ2)\mathcal{O}(\epsilon^{-2}) for communication when incorporating variance reduction; 3) it exhibits linear speedup with respect to the number of local update steps and participating clients at each global round. These attributes make PAdaMFed highly scalable and adaptable for various real-world FL applications. Extensive empirical evidence on both image classification and sentiment analysis tasks validates the efficacy of our approaches.

4539IDEA: Enhancing the Rule Learning Ability of Large Language Model Agent through Induction, Deduction, and Abduction

[openreview] [pdf]

Abstract While large language models (LLMs) have been thoroughly evaluated for deductive and inductive reasoning, their proficiency in abductive reasoning and holistic rule learning in interactive environments remains less explored. We introduce RULEARN, a novel benchmark specifically designed to assess the rule-learning abilities of LLM agents in interactive settings. In RULEARN, agents strategically interact with simulated environments to gather observations, discern patterns, and solve complex problems. To enhance the rule-learning capabilities for LLM agents, we propose IDEA, a novel reasoning framework that integrates the process of Induction, Deduction, and Abduction. The IDEA agent generates initial hypotheses from limited observations through abduction, devises plans to validate these hypotheses or leverages them to solve problems via deduction, and refines previous hypotheses using patterns identified from new observations through induction, dynamically establishing and applying rules that mimic human rule-learning behaviors. Our evaluation of the IDEA framework, which involves five representative LLMs, demonstrates significant improvements over the baseline. Furthermore, within this framework, our comparison with 50 human participants reveals notable discrepancies in rule-learning behaviors. LLM agents tend to generate plausible initial hypotheses but struggle to refine them through interaction. Conversely, humans, despite sometimes overlooking initial details, excel at incorporating feedback and continuously improving their hypotheses. We believe our benchmark, RULEARN, will serve as a valuable and challenging resource, and that the IDEA framework will provide crucial insights for the development of LLM agents capable of human-like rule learning in real-world scenarios. We will release our code and data upon acceptance of the paper.

4540TrajGPT: Irregular Time-Series Representation Learning for Health Trajectory Analysis

[openreview] [pdf]

Abstract In many domains, such as healthcare, time-series data is often irregularly sampled with varying intervals between observations. This poses challenges for classical time-series models that require equally spaced data. To address this, we propose a novel time-series Transformer calledTrajectory Generative Pre-trained Transformer (TrajGPT). TrajGPT employs a novel Selective Recurrent Attention (SRA) mechanism, which utilizes a data-dependent decay to adaptively filter out irrelevant past information based on contexts. By interpreting TrajGPT as discretized ordinary differential equations (ODEs), it effectively captures the underlying continuous dynamics and enables time-specific inference for forecasting arbitrary target timesteps. Experimental results demonstrate that TrajGPT excels in trajectory forecasting, drug usage prediction, and phenotype classification without requiring task-specific fine-tuning. By evolving the learned continuous dynamics, TrajGPT can interpolate and extrapolate disease risk trajectories from partially-observed time series. The visualization of predicted health trajectories shows that TrajGPT forecasts unseen diseases based on the history of clinically relevant phenotypes (i.e., contexts).

4541ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom

[openreview] [pdf]

Abstract Large vision-language models (LVLMs) have witnessed significant progress on visual understanding tasks. However, they often prioritize language knowledge over image information on visual reasoning tasks, incurring performance degradation. To tackle this issue, we first identify the drawbacks of existing solutions (i.e., insufficient and irrelevant visual descriptions, and limited multi-modal capacities). We then decompose visual reasoning process into two stages: visual perception (i.e., eyesight) and textual reasoning (i.e., wisdom), and introduce a novel visual reasoning framework named ProReason. This framework features multi-run proactive perception and decoupled vision-reasoning capabilities. Briefly, given a multi-modal question, ProReason iterates proactive information collection and reasoning until the answer can be concluded with necessary and sufficient visual descriptions. Notably, the disassociation of capabilities allows seamless integration of existing large language models (LLMs) to compensate for the reasoning deficits of LVLMs. Our extensive experiments demonstrate that ProReason outperforms both existing multi-step reasoning frameworks and passive peer methods on a wide range of benchmarks for both open-source and closed-source models. In addition, with the assistance of LLMs, ProReason achieves a performance improvement of up to 15% on MMMU benchmark. Our insights into existing solutions and the decoupled perspective for feasible integration of LLMs illuminate future research on visual reasoning techniques, especially LLM-assisted ones.

4542Scaling Wearable Foundation Models

[openreview] [pdf]

Abstract Wearable sensors have become ubiquitous thanks to a variety of health tracking features. The resulting continuous and longitudinal measurements from everyday life generate large volumes of data; however, making sense of these observations for scientific and actionable insights is non-trivial. Inspired by the empirical success of generative modeling, where large neural networks learn powerful representations from vast amounts of text, image, video, or audio data, we investigate the scaling properties of sensor foundation models across compute, data, and model size. Using a dataset of up to 40 million hours of in-situ heart rate, heart rate variability, electrodermal activity, accelerometer, skin temperature, and altimeter per-minute data from over 165,000 people, we create LSM, a multimodal foundation model built on the largest wearable-signals dataset with the most extensive range of sensor modalities to date. Our results establish the scaling laws of LSMs for tasks such as imputation, interpolation and extrapolation, both across time and sensor modalities. Moreover, we highlight how LSMs enables sample-efficient downstream learning for tasks like exercise and activity recognition.

4543Concept Pinpoint Eraser for Text-to-image Diffusion Models via Residual Attention Gate

[openreview] [pdf]

Abstract Remarkable progress in text-to-image diffusion models has brought a major concern about potentially generating images on inappropriate or trademarked concepts. Concept erasing has been investigated with the goals of deleting target concepts in diffusion models while preserving other concepts with minimal distortion. To achieve these goals, recent concept erasing methods usually fine-tune the cross-attention layers of diffusion models. In this work, we first show that merely updating the cross-attention layers in diffusion models, which is mathematically equivalent to adding linear modules to weights, may not be able to preserve diverse remaining concepts. Then, we propose a novel framework, dubbed Concept Pinpoint Eraser (CPE), by adding nonlinear Residual Attention Gates (ResAGs) that selectively erase (or cut) target concepts while safeguarding remaining concepts from broad distributions by employing an attention anchoring loss to prevent the forgetting. Moreover, we adversarially train CPE with ResAG and learnable text embeddings in an iterative manner to maximize erasing performance and enhance robustness against adversarial attacks. Extensive experiments on the erasure of celebrities, artistic styles, and explicit contents demonstrated that the proposed CPE outperforms prior arts by keeping diverse remaining concepts while deleting the target concepts with robustness against attack prompts.

4544Methods for Convex(L0,L1)-Smooth Optimization: Clipping, Acceleration, and Adaptivity

[openreview] [pdf]

Abstract Due to the non-smoothness of optimization problems in Machine Learning, generalized smoothness assumptions have gained much attention in recent years. One of the most popular assumptions of this type is (L0,L1)(L_0, L_1)-smoothness (Zhang et al., 2020). In this paper, we focus on the class of (strongly) convex (L0,L1)(L_0, L_1)-smooth functions and derive new convergence guarantees for several existing methods. In particular, we derive improved convergence rates for Gradient Descent with (Smoothed) Gradient Clipping and for Gradient Descent with Polyak Stepsizes. In contrast to the existing results, our rates do not rely on the standard smoothness assumption and do not suffer from the exponential dependency from the initial distance to the solution. We also extend these results to the stochastic case under the over-parameterization assumption, propose a new accelerated method for convex (L0,L1)(L_0, L_1)-smooth optimization, and derive new convergence rates for Adaptive Gradient Descent (Malitsky and Mishchenko, 2020).

4545Enhancing Multi-Modal Reasoning Over Time-Series and Natural Language Data

[openreview] [pdf]

Abstract Time-series analysis is critical in many industries such as healthcare, finance and energy sectors, where understanding time-series trends alongside contextual information is essential for informed decision making. However, current time-series models are limited in their ability to perform reasoning that involves both time-series data and textual information. In this work we address this gap by introducing Chat-TS, a large language model (LLM) designed specifically for reasoning over time-series and textual data. Unlike traditional time-series models Chat-TS integrates time-series tokens into the LLM vocabulary, enhancing its reasoning ability over both text and time-series modalities without compromising its core natural language capabilities. To support the development and validation of Chat-TS we contribute three new datasets: the TS Instruct Training Dataset which pairs diverse time-series data with relevant text instructions and responses for instruction tuning, the TS Instruct question and answer (QA) benchmark, a set of nearly 4000 multiple-choice questions designed to evaluate multi-modal reasoning and the TS Instruct Qualitative Benchmark which provides a smaller subset of QA, math and decision making questions for LLM evaluation. Our training strategy preserves the inherent reasoning capabilities of the LLM while augmenting it with time-series reasoning capabilities. Evaluation results show that Chat-TS achieves state-of-the-art performance in multi-modal reasoning tasks, maintaining strong natural language proficiency while advancing time-series reasoning. All models, datasets, and code will be made publicly available [link].

4546Reliable and Diverse Evaluation of LLM Medical Knowledge Mastery

[openreview] [pdf]

Abstract Mastering medical knowledge is crucial for medical-specific LLMs. However, despite the existence of medical benchmarks like MedQA, a unified framework that fully leverages existing knowledge bases to evaluate LLMs’ mastery of medical knowledge is still lacking. In the study, we propose a novel framework PretexEval that dynamically generates reliable and diverse test samples to evaluate LLMs for any given medical knowledge base. We notice that test samples produced directly from knowledge bases by templates or LLMs may introduce factual errors and also lack diversity. To address these issues, we introduce a novel schema into our proposed evaluation framework that employs predicate equivalence transformations to produce a series of variants for any given medical knowledge point. Finally, these produced predicate variants are converted into textual language, resulting in a series of reliable and diverse test samples to evaluate whether LLMs fully master the given medical factual knowledge point. Here, we use our proposed framework to systematically investigate the mastery of medical factual knowledge of 12 well-known LLMs, based on two knowledge bases that are crucial for clinical diagnosis and treatment. The evaluation results illustrate that current LLMs still exhibit significant deficiencies in fully mastering medical knowledge, despite achieving considerable success on some famous public benchmarks. These new findings provide valuable insights for developing medical-specific LLMs, highlighting that current LLMs urgently need to strengthen their comprehensive and in-depth mastery of medical knowledge before being applied to real-world medical scenarios.

4547Variational Inference with Unnormalized Priors

[openreview] [pdf]

Abstract Variational inference typically assumes normalized priors, limiting the expressiveness of generative models like Variational Autoencoders (VAEs). In this work, we propose a novel approach by replacing the prior 𝑝(𝑧) with an unnormalized energy-based distribution exp(−𝐸(𝑧))/𝑍, where 𝐸(𝑧) is the energy function and 𝑍 is the partition function. This leads to a variational lower bound that allows for two key innovations: (1) the incorporation of more powerful, flexible priors into the VAE framework, resulting in improved likelihood estimates and enhanced generative performance, and (2) the ability to train energy-based models (EBMs) without the need for computationally expensive Markov chain sampling, requiring only a small 𝑛 > 1 importance samples from the posterior distribution. Our approach bridges VAEs and EBMs, providing a scalable and efficient framework for leveraging unnormalized priors in probabilistic models.

4548Discovering Physics Laws of Dynamical Systems via Invariant Function Learning

[openreview] [pdf]

Abstract We consider learning underlying laws of dynamical systems governed by ordinary differential equations (ODE). A key challenge is how to discover intrinsic dynamics across multiple environments while circumventing environment-specific mechanisms. Unlike prior work, we tackle more complex environments where changes extend beyond function coefficients to entirely different function forms. For example, we demonstrate the discovery of ideal pendulum’s natural motion αsinθt\alpha \sin{\theta_t} by observing pendulum dynamics in different environments, such as the damped environment αsin(θt)ρωt\alpha \sin(\theta_t) - \rho \omega_t and powered environment αsin(θt)+ρωtωt\alpha \sin(\theta_t) + \rho \frac{\omega_t}{\left|\omega_t\right|}. Here, we formulate this problem as an invariant function learninginvariant\ function\ learning task and propose a new method, known as D\mathbf{D}isentanglement of I\mathbf{I}nvariant F\mathbf{F}unctions (DIF), that is grounded in causal analysis. We propose a causal graph and design an encoder-decoder hypernetwork that explicitly disentangles invariant functions from environment-specific dynamics. The discovery of invariant functions is guaranteed by our information-based principle that enforces the independence between extracted invariant functions and environments. Quantitative comparisons with meta-learning and invariant learning baselines on three ODE systems demonstrate the effectiveness and efficiency of our method. Furthermore, symbolic regression explanation results highlight the ability of our framework to uncover intrinsic laws.

4549UniTST: Effectively Modeling Inter-Series and Intra-Series Dependencies for Multivariate Time Series Forecasting

[openreview] [pdf]

Abstract Transformer-based models have emerged as powerful tools for multivariate time series forecasting (MTSF). However, existing Transformer models often fall short of capturing both intricate dependencies across variate and temporal dimensions in MTS data. Some recent models are proposed to separately capture variate and temporal dependencies through either two sequential or parallel attention mechanisms. However, these methods cannot directly and explicitly learn the intricate inter-series and intra-series dependencies. In this work, we first demonstrate that these dependencies are very important as they usually exist in real-world data. To directly model these dependencies, we propose a transformer-based model UniTST containing a unified attention mechanism on the flattened patch tokens. Additionally, we add a dispatcher module which reduces the complexity and makes the model feasible for a potentially large number of variates. Although our proposed model employs a simple architecture, it offers compelling performance as shown in our extensive experiments on several datasets for time series forecasting.

4550Towards Scalable Semantic Representation for Recommendation

[openreview] [pdf]

Abstract With recent advances in large language models (LLMs), there has been emerging numbers of research in developing Semantic IDs based on LLMs to enhance the performance of recommendation systems. However, the dimension of these embeddings needs to match that of the ID embedding in recommendation, which is usually much smaller than the original length. Such dimension compression results in inevitable losses in discriminability and dimension robustness of the LLM embeddings, which motivates us to scale up the semantic representation. In this paper, we propose Mixture-of-Codes, which first constructs multiple independent codebooks for LLM representation in the indexing stage, and then utilizes the Semantic Representation along with a fusion module for the downstream recommendation stage. Extensive analysis and experiments demonstrate that our method achieves superior discriminability and dimension robustness scalability, leading to the best scale-up performance in recommendations.

4551GOAL: A Generalist Combinatorial Optimization Agent Learning

[openreview] [pdf]

Abstract Machine Learning-based heuristics have recently shown impressive performance in solving a variety of hard combinatorial optimization problems (COPs). However they generally rely on a separate neural model, specialized and trained for each single problem. Any variation of a problem requires adjustment of its model and re-training from scratch. In this paper, we propose GOAL (for Generalist combinatorial Optimization Agent Learning), a generalist model capable of efficiently solving multiple COPs and which can be fine-tuned to solve new COPs. GOAL consists of a single backbone plus light-weight problem-specific adapters for input and output processing. The backbone is based on a new form of mixed-attention blocks which allows to handle problems defined on graphs with arbitrary combinations of node, edge and instance-level features. Additionally, problems which involve heterogeneous types of nodes or edges are handled through a novel multi-type transformer architecture, where the attention blocks are duplicated to attend the meaningful combinations of types while relying on the same shared parameters. We train GOAL on a set of routing, scheduling and classic graph problems and show that it is only slightly inferior to the specialized baselines while being the first multi-task model that solves a wide range of COPs. Finally we showcase the strong transfer learning capacity of GOAL by fine-tuning it on several new problems. Our code is available athttps://anonymous.4open.science/r/GOAL-10/.

4552Compute Optimal Inference and Provable Amortisation Gap in Sparse Autoencoders

[openreview] [pdf]

Abstract A recent line of work has shown promise in using sparse autoencoders (SAEs) to uncover interpretable features in neural network representations. However, the simple linear-nonlinear encoding mechanism in SAEs limits their ability to perform accurate sparse inference. In this paper, we investigate sparse inference and learning in SAEs through the lens of sparse coding. Specifically, we show that SAEs perform amortised sparse inference with a computationally restricted encoder and, using compressed sensing theory, we prove that this mapping is inherently insufficient for accurate sparse inference, even in solvable cases. Building on this theory, we empirically explore conditions where more sophisticated sparse inference methods outperform traditional SAE encoders. Our key contribution is the decoupling of the encoding and decoding processes, which allows for a comparison of various sparse encoding strategies. We evaluate these strategies on two dimensions: alignment with true underlying sparse features and correct inference of sparse codes, while also accounting for computational costs during training and inference. Our results reveal that substantial performance gains can be achieved with minimal increases in compute cost. We demonstrate that this generalises to SAEs applied to large language models (LLMs), where advanced encoders achieve similar interpretability. This work opens new avenues for understanding neural network representations and offers important implications for improving the tools we use to analyse the activations of large language models.

4553Swift Hydra: Self-Reinforcing Generative Framework for Anomaly Detection with Multiple Mamba Models

[openreview] [pdf]

Abstract Despite a plethora of anomaly detection models developed over the years, their ability to generalize to unseen anomalies remains an issue, particularly in critical systems. This paper aims to address this challenge by introducing Swift Hydra, a new framework for training an anomaly detection method based on generative AI and reinforcement learning (RL). Through featuring an RL policy that operates on the latent variables of a generative model, the framework synthesizes novel and diverse anomaly samples that are capable of bypassing a detection model. These generated synthetic samples are, in turn, used to augment the detection model, further improving its ability to handle challenging anomalies. Swift Hydra also incorporates Mamba models structured as a Mixture of Experts (MoE) to enable scalable adaptation of the number of Mamba experts based on data complexity, effectively capturing diverse feature distributions without increasing the model’s inference time. Empirical evaluations on ADBench benchmark demonstrate that Swift Hydra outperforms other state-of-the-art anomaly detection models while maintaining a relatively short inference time. From these results, our research highlights a new and auspicious paradigm of integrating RL and generative AI for advancing anomaly detection.

4554Unearthing Large Scale Domain-Specific Knowledge from Public Corpora

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable potential in various tasks, however, there remains a significant lack of open-source models and data for specific domains. Previous work has primarily focused on manually specifying resources and collecting high-quality data for specific domains, which is extremely time-consuming and labor-intensive. To address this limitation, we introduce large models into the data collection pipeline to guide the generation of domain-specific information and retrieve relevant data from Common Crawl (CC), a large public corpus. We refer to this approach as Retrieve-from-CC. It not only collects data related to domain-specific knowledge but also mines the data containing potential reasoning procedures from the public corpus. By applying this method, we have collected a knowledge domain-related dataset named Retrieve-Pile, which covers four main domains, including the sciences, humanities, and other categories. Through the analysis of Retrieve-Pile, Retrieve-from-CC can effectively retrieve relevant data from the covered knowledge domains and significantly improve the performance in tests of mathematical and knowledge-related reasoning abilities.

4555Zero-shot Object-level Out-of-distribution Detection with Context-aware Inpainting

[openreview] [pdf]

Abstract Detecting when an object detector predicts wrongly, for example, misrecognizing an out-of-distribution (ODD) unseen object as a seen one, is crucial to ensure the model’s trustworthiness. Modern object detectors are known to be overly confident, making it hard to rely solely on their responses to detect error cases. We therefore investigate the use of an auxiliary model for the rescue. Specifically, we leverage an off-the-shelf text-to-image generative model (e.g., Stable Diffusion), whose training objective is different from discriminative models. We surmise such a discrepancy would allow us to use their inconsistency as an error indicator. Concretely, given a detected object box and the predicted class label, we perform class-conditioned inpainting on the box-removed image. When the predicted object label is incorrect, the inpainted image is doomed to deviate from the original one, making the reconstruction error an effective recognition error indicator, especially on misclassified OOD samples. Extensive experiments demonstrate that our approach consistently outperforms prior zero-shot and non-zero-shot OOD detection approaches.

4556Episodic Memories Generation and Evaluation Benchmark for Large Language Models

[openreview] [pdf]

Abstract Episodic memory -- the ability to recall specific events grounded in time and space -- is a cornerstone of human cognition, enabling not only coherent storytelling, but also planning and decision-making. Despite their remarkable capabilities, Large Language Models (LLMs) lack a robust mechanism for episodic memory: we argue that integrating episodic memory capabilities into LLM is essential for advancing AI towards human-like cognition, increasing their potential to reason consistently and ground their output in real-world episodic events, hence avoiding confabulations. To address this challenge, we introduce a comprehensive framework to model and evaluate LLM episodic memory capabilities. Drawing inspiration from cognitive science, we develop a structured approach to represent episodic events, encapsulating temporal and spatial contexts, involved entities, and detailed descriptions. We synthesize a unique episodic memory benchmark, free from contamination, and release open source code and datasets to assess LLM performance across various recall and episodic reasoning tasks. Our evaluation of state-of-the-art models, including GPT-4 and Claude variants, in addition to the recent o1-mini, reveals that even the most advanced LLMs struggle with episodic memory tasks, particularly when dealing with multiple related events or complex spatio-temporal relationships --- even in contexts as short as 10k-100k tokens.

4557Structured-Initialization Learning

[openreview] [pdf]

Abstract The emergence of large language models (LLMs) has revolutionized natural language processing, but their development and deployment face significant challenges in computational resources and environmental sustainability. Traditional self-supervised learning (SSL) paradigms requiring extensive computational infrastructure and exhibiting slow convergence rates, leading to increased energy consumption and longer training durations. While existing model fine-tuning techniques such as Low-Rank Adaptation (LoRA) are resource-intensive and fail to facilitate swift knowledge updates when integrating a mount of new data in model version iteration. To mitigate these challenges, we introduce Sail, a novel method for accelerating the training of neural network models by leveraging knowledge from (publicly available) pre-trained models. Our approach comprises two key components: (1) a parameter transformation technique that adjusts the dimensions of pre-trained model parameters to match the target architecture, and (2) a proximal parameter integration and retraining strategy that efficiently combines transformed parameters to initialize new models. We formalize the concept of Proximal Parameter and provide theoretical guarantees for its convergence advantages. Our approach achieves substantial reductions in training time and computational resources while maintaining or improving model performance on downstream tasks. These results indicate that Sail provides a promising direction for the more efficient and accessible development of the deep learning community. Our code will be made publicly available.

4558Contrastive Learning from Synthetic Audio Doppelgängers

[openreview] [pdf]

Abstract Learning robust audio representations currently demands extensive datasets of real-world sound recordings. By applying artificial transformations to these recordings, models can learn to recognize similarities despite subtle variations through techniques like contrastive learning. However, these transformations are only approximations of the true diversity found in real-world sounds, which are generated by complex interactions of physical processes, from vocal cord vibrations to the resonance of musical instruments. We propose a solution to both the data scale and transformation limitations, leveraging synthetic audio. By randomly perturbing the parameters of a sound synthesizer, we generate audio doppelgängers—synthetic positive pairs with causally manipulated variations in timbre, pitch, and temporal envelopes. These variations, difficult to achieve through augmentations of existing audio, provide a rich source of contrastive information. Despite the shift to randomly generated synthetic data, our method produces strong representations, outperforming real data on several standard audio classification tasks. Notably, our approach is lightweight, requires no data storage, and has only a single hyperparameter, which we extensively analyze. We offer this method as a complement to existing strategies for contrastive learning in audio, using synthesized sounds to reduce the data burden on practitioners.

4559Ada-K Routing: Boosting the Efficiency of MoE-based LLMs

[openreview] [pdf]

Abstract In the era of Large Language Models (LLMs), Mixture-of-Experts (MoE) architectures offer a promising approach to managing computational costs while scaling up model parameters. Conventional MoE-based LLMs typically employ static Top-K routing, which activates a fixed and equal number of experts for each token regardless of their significance within the context. In this paper, we propose a novel Ada-K routing strategy that dynamically adjusts the number of activated experts for each token, thereby improving the balance between computational efficiency and model performance. Specifically, our strategy incorporates learnable and lightweight allocator modules that decide customized expert resource allocation tailored to the contextual needs for each token. These allocators are designed to be fully pluggable, making it broadly applicable across all mainstream MoE-based LLMs. We leverage the Proximal Policy Optimization (PPO) algorithm to facilitate an end-to-end learning process for this non-differentiable decision-making framework. Extensive evaluations on four popular baseline models demonstrate that our Ada-K routing method significantly outperforms conventional Top-K routing. Compared to Top-K, our method achieves over 25% reduction in FLOPs and more than 20% inference speedup while still improving performance across various benchmarks. Moreover, the training of Ada-K is highly efficient. Even for Mixtral-8x22B, a MoE-based LLM with more than 140B parameters, the training time is limited to 8 hours. Detailed analysis shows that harder tasks, middle layers, and content words tend to activate more experts, providing valuable insights for future adaptive MoE system designs. Both the training code and model checkpoints will be publicly available.

4560DSI: Faster Inference of Large Language Models via Speculation Parallelism

[openreview] [pdf]

Abstract Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces Distributed Speculative Inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require fast and accurate drafters, which are often unavailable in practice. We identify a gap where SI can be slower than non-SI given slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI—given any drafters. DSI introduces a novel type of task parallelism called Speculation Parallelism (SP), which orchestrates target and drafter instances to overlap in time, creating a new foundational tradeoff between computational resources and latency. DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic single-node settings where DSI is 1.29-1.92x faster than SI.

4561Incremental Learning with Task-Specific Adapters

[openreview] [pdf]

Abstract Incremental learning aims to continuously acquire new knowledge while retaining previously learned information. The existing literature primarily focuses on enhancing model stability to prevent catastrophic forgetting of earlier tasks, often overlooking the challenges posed by inter-task differences, which we argue are the primary cause of catastrophic forgetting. In this paper, we propose a network design consisting of two blocks: one for modeling invariant features shared across all tasks, and another for capturing task-specific information. Specifically, we repurpose adapters, originally introduced for parameter-efficient fine-tuning, as feature modifiers to capture task-specific information, while the backbone network learns invariant features. Our approach can be integrated with existing methods such as elastic weight consolidation (EWC) and learning without forgetting (LwF). Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate that our adapter-based methods consistently outperform non-adapter counterparts across various learning scenarios, including different task orders and data scales. This study provides an effective solution to the stability-plasticity dilemma in incremental learning.

4562Computing Optimal Regularizers for Online Linear Optimization

[openreview] [pdf]

Abstract Follow-the-Regularized-Leader (FTRL) algorithms are a popular class of learning algorithms for online linear optimization (OLO) that guarantee sub-linear regret, but the choice of regularizer can significantly impact dimension-dependent factors in the regret bound. We present an algorithm that takes as input convex and symmetric action sets and loss sets for a specific OLO instance, and outputs a regularizer such that running FTRL with this regularizer guarantees regret within a universal constant factor of the best possible regret bound. In particular, for any choice of (convex, symmetric) action set and loss set we prove that there exists an instantiation of FTRL which achieves regret within a constant factor of the best possible learning algorithm, strengthening the universality result of Srebro et al., 2011.Our algorithm requires preprocessing time and space exponential in the dimension dd of the OLO instance, but can be run efficiently online assuming a membership and linear optimization oracle for the action and loss sets, respectively (and is fully polynomial time for the case of constant dimension dd). We complement this with a lower bound showing that even deciding whether a given regularizer is α\alpha-strongly-convex with respect to a given norm is NP-hard.

4563Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as “lost in the middle”, a phenomenon that is especially pronounced in long-context scenarios, which indicates the placement of the key information in different positions of a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling this positional hidden states. Experiments on the NaturalQuestions Multi-document QA, KV retrieval, LongBench and timeline reorder tasks, using various models including RoPE models, context window-extended models, and Alibi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states.

4564DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agent

[openreview] [pdf]

Abstract On-device control agents, especially on mobile devices, are responsible for operating mobile devices to fulfill users’ requests, enabling seamless and intuitive interactions. Integrating Multimodal Large Language Models (MLLMs) into these agents enhances their ability to understand and execute complex commands, thereby improving user experience. However, fine-tuning MLLMs for on-device control presents significant challenges due to limited data availability and inefficient online training processes. This paper introduces DistRL, a novel framework designed to enhance the efficiency of online RL fine-tuning for mobile device control agents. DistRL employs centralized training and decentralized data acquisition to ensure efficient fine-tuning in the context of dynamic online interactions. Additionally, the framework is backed by our tailor-made RL algorithm, which effectively balances exploration with the prioritized utilization of collected data to ensure stable and robust training. Our experiments show that, on average, DistRL delivers a 3×\times improvement in training efficiency and enables training data collection 2.4×\times faster than the leading synchronous multi-machine methods. Notably, after training, DistRL achieves a 20% relative improvement in success rate compared to state-of-the-art methods on general Android tasks from an open benchmark, significantly outperforming existing approaches while maintaining the same training time. These results validate DistRL as a scalable and efficient solution, offering substantial improvements in both training efficiency and agent performance for real-world, in-the-wild device control tasks.

4565Soft-TransFormers for Continual Learning

[openreview] [pdf]

Abstract Inspired by Well-initialized Lottery Ticket Hypothesis (WLTH), which provides suboptimal fine-tuning solutions, we propose a novel fully fine-tuned continual learning (CL) method referred to as Soft-TransFormers (Soft-TF). Soft-TF sequentially learns and selects an optimal soft-network or subnetwork for each task. During sequential training in CL, Soft-TF jointly optimizes the weights of sparse layers to obtain task-adaptive soft (real-valued) networks or subnetworks (binary masks), while keeping the well-pre-trained layer parameters frozen. In inference, the identified task-adaptive network of Soft-TF masks the parameters of the pre-trained network, mapping to an optimal solution for each task and minimizing Catastrophic Forgetting (CF) - the soft-masking preserves the knowledge of the pre-trained network. Extensive experiments on Vision Transformer (ViT) and CLIP demonstrate the effectiveness of Soft-TF, achieving state-of-the-art performance across various CL scenarios, including Class-Incremental Learning (CIL) and Task-Incremental Learning (TIL), supported by convergence theory.

4566Loss Landscape of Shallow ReLU-like Neural Networks: Stationary Points, Saddle Escaping, and Network Embedding

[openreview] [pdf]

Abstract In this paper, we investigate the loss landscape of one-hidden-layer neural networks with ReLU-like activation functions trained with the empirical squared loss. As the activation function is non-differentiable, it is so far unclear how to completely characterize the stationary points. We deduce the conditions for stationarity that apply to both non-differentiable and differentiable areas of the landscape. Additionally, we show that, if a stationary point does not contain “escape neurons”, which are defined with first-order conditions, then it must be a local minimum. Moreover, for the scalar-output case, the presence of an escape neuron guarantees that the stationary point is not a local minimum. Our results refine the description of the saddle-to-saddle training process starting from infinitesimally small (vanishing) initialization for shallow ReLU-like networks. By precluding the existence of the saddle escaping types that previous works did not rule out, we advance one step closer to a complete picture of the entire dynamics. Moreover, we are also able to fully discuss how network embedding, which is to instantiate a narrower network within a wider network, reshapes the stationary points.

4567A biologically-plausible alternative to backpropagation using pseudoinverse feedback

[openreview] [pdf]

Abstract Despite its successes in both practical machine learning and neural modelling, the backpropagation algorithm has long been considered biologically implausible (Crick, 1989). Previous solutions to this biological implausibility have proposed the existence of a separate, error feedback network, in which error at the final layer may be propagated backwards to earlier layers in a manner similar to back- propagation. However, biological evidence suggests that feedback connections in the cortex may function more similarly to an autoencoder, rather than being exclusively used as error feedback (Marino, 2020). Here, we attempt to unify these two paradigms by showing how autoencoder-like, inverse feedback connections may be used to minimize error throughout a feedforward neural network. Furthermore, we show that in the MNIST and CIFAR-10 classification tasks, an asymptotic error comparable to backpropagation can be achieved in fewer iterations than comparable biologically-plausible algorithms, such as Random Target Propagation (Lillicrap et al., 2014). Our proposed mechanism, Reciprocal Feedback, consists of two contributions: first we show how a modification of the Recirculation algorithm (Hinton E. & McClelland, 1988) is capable of learning the Moore-Penrose pseudoinverse of a pair of network weights. Then, we will show how, using the Hildebrandt-Graves Theorem (Hildebrandt & Graves, 1927), locally-learned pseudoinverse feedback connections may be used to facilitate an alternative optimization method to traditional gradient descent - while alleviating the need to compute the weight transpose.

4568Compositional Risk Minimization

[openreview] [pdf]

Abstract In this work, we tackle a challenging and extreme form of subpopulation shift, which is termed compositional shift. Under compositional shifts, some combinations of attributes are totally absent from the training distribution but present in the test distribution. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirms the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

4569Causally Motivated Sycophancy Mitigation for Large Language Models

[openreview] [pdf]

Abstract Incorporating user preferences into large language models (LLMs) can enhance the personalization and reliability of model outputs and facilitate the application of LLMs to real-world scenarios. However, leveraging user preferences can be a double-edged sword. Recent studies have found that improper utilization can incur sycophancy, where LLMs prioritize alignment with user preferences over the correctness of their outputs. To address sycophancy in LLMs, we analyze and model the problem through the lens of structured causal models (SCMs). We attribute sycophancy to LLMs’ reliance on spurious correlations between user preferences and model outputs in this paper. Based on the proposed SCMs, we develop a novel framework to mitigate sycophancy in LLMs by exploiting a significant causal signature. Specifically, we eliminate the spurious correlations embedded in the intermediate layers of LLMs through head reweighting, and then calibrate the intra-head knowledge along the causal representation direction. Extensive experiments are conducted across diverse language tasks, and the empirical results demonstrate the superiority of our method over state-of-the-art competitors in mitigating sycophancy in LLMs.

4570A Drop-In Solution for On-the-Fly Adaptation of Speculative Decoding in Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) are cutting-edge generative AI models built on transformer architecture, which tend to be highly memory-intensive when performing real-time inference. Various strategies have been developed to enhance the end-to-end inference speed for LLMs, one of which is speculative decoding. This technique involves running a smaller LLM (draft model) for inference over a defined window size, denoted as γ\gamma, while simultaneously being validated by the larger LLM (target model). Choosing the optimal γ\gamma value and the draft model is essential for unlocking the potential of speculative decoding. But it is difficult to do due to the complicated influence from various factors, including the nature of the task, the hardware in use, and the combination of the large and small models. This paper introduceson-the-fly adaption of speculative decoding, a solution that dynamically adapts the choices to maximize the efficiency of speculative decoding for LLM inferences. As a drop-in solution, it needs no offline benchmarking or training. Experiments show that the solution can lead to 3.55-16.48% speed improvement over the standard speculative decoding, and 1.2-3.4×\times over the default LLMs.

4571PEAR: Primitive Enabled Adaptive Relabeling for Boosting Hierarchical Reinforcement Learning

[openreview] [pdf]

Abstract Hierarchical reinforcement learning (HRL) has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-phase approach where we first perform adaptive relabeling on a few expert demonstrations to generate efficient subgoal supervision, and then jointly optimize HRL agents by employing reinforcement learning (RL) and imitation learning (IL). We perform theoretical analysis to bound the sub-optimality of our approach and derive a joint optimization framework using RL and IL. Since PEAR utilizes only a handful of expert demonstrations and considers minimal limiting assumptions on the task structure, it can be easily integrated with typical off-policy \RL algorithms to produce a practical HRL approach. We perform extensive experiments on challenging environments and show that PEAR is able to outperform various hierarchical and non-hierarchical baselines and achieve upto 80% success rates in complex sparse robotic control tasks where other baselines typically fail to show significant progress. We also perform ablations to thoroughly analyze the importance of our various design choices. Finally, we perform real world robotic experiments on complex tasks and demonstrate that PEAR consistently outperforms the baselines.

4572Compress then Serve: Serving Thousands of LoRA Adapters with Little Overhead

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) with low-rank adaptations (LoRAs) has become common practice, often yielding numerous copies of the same LLM differing only in their LoRA updates. This paradigm presents challenges for systems that serve real-time responses to queries that each involve a different LoRA. Prior works optimize the design of such systems but still require continuous loading and offloading of LoRAs, as it is infeasible to store thousands of LoRAs in GPU memory. To mitigate this issue, we investigate the efficacy of model compression when serving LoRAs. We propose a method for joint compression of LoRAs into a shared basis paired with LoRA-specific scaling matrices. We extend our algorithm to learn clusters of LoRAs that are more amenable to joint compression, allowing it to scale gracefully to large LoRA collections. Our experiments with up to 500 LoRAs demonstrate that compressed LoRAs preserve performance while offering major throughput gains in realistic serving scenarios with over a thousand LoRAs, maintaining 80% of the throughput of serving a \emph{single} LoRA.

4573Learning Imbalanced Data with Beneficial Label Noise

[openreview] [pdf]

Abstract Data imbalance and label noise are common factors hindering the classifier’s performance. Data-level approaches to addressing imbalanced learning usuallyinvolve resampling by adding or removing samples, which often results in information loss or generative errors. Building upon theoretical studies of the impact of imbalance ratio on decision boundaries across various evaluation metrics in binary classification, it is uncovered that introducing appropriate label noise can alter the biased decision boundaries and thus enhance the performance of classifiers in imbalanced learning. In this paper, we introduce the Label-Noise-based Re-balancing (LNR) approach to solve both binary and multi-class imbalanced classifications by employing a novel design of asymmetric label noise model. In contrast to other data-level methods, our approach is easy to implement and alleviates the issues of informative loss and generative errors. We validated the superiority of this method on synthetic and real-world datasets. More importantly, our LNR approach can integrate seamlessly with any classifiers and other algorithm-level methods for imbalanced learning. Overall, our work opens up a new avenue for addressing imbalanced learning, highlighting the potential advantages of balancing data through beneficial label noise.

4574Efficient Heuristics Generation for Solving Combinatorial Optimization Problems Using Large Language Models

[openreview] [pdf]

Abstract Recent studies exploited Large Language Models (LLMs) to autonomously generate heuristics for solving Combinatorial Optimization Problems (COPs), by prompting LLMs to first provide search directions and then derive heuristics accordingly. However, the absence of task-specific knowledge in prompts often leads LLMs to provide unspecific search directions, obstructing the derivation of well-performing heuristics. Moreover, evaluating the derived heuristics remains resource-intensive, especially for those semantically equivalent ones, often requiring unnecessary resource expenditure. To enable LLMs to provide specific search directions, we propose the Hercules algorithm, which leverages our designed Core Abstraction Prompting (CAP) method to abstract the core components from elite heuristics and incorporate them as prior knowledge in prompts. We theoretically prove the effectiveness of CAP in reducing unspecificity and provide empirical results in this work. To reduce the required computing resources for evaluating the derived heuristics, we propose few-shot Performance Prediction Prompting (PPP), a first-of-its-kind method for the Heuristic Generation (HG) task. PPP leverages LLMs to predict the fitness values of newly derived heuristics by analyzing their semantic similarity to previously evaluated ones. We further develop two tailored mechanisms for PPP to enhance predictive accuracy and determine unreliable predictions, respectively. The use of PPP makes Hercules more resource-efficient and we name this variant Hercules-P. Extensive experiments across various HG tasks, COPs, and LLMs demonstrate that Hercules outperforms the state-of-the-art LLM-based HG algorithms, while Hercules-P excels at minimizing computing resources. In addition, we illustrate the effectiveness of CAP, PPP, and the other proposed mechanisms by conducting relevant ablation studies.

4575Enhancing LLM’s interpretability for time series via multi-level aligned embeddings

[openreview] [pdf]

Abstract The adaptation of large language models (LLMs) to time series forecasting poses unique challenges, as time series data is continuous in nature, while LLMs operate on discrete tokens. Despite the success of LLMs in natural language processing (NLP) and other structured domains, aligning time series data with language-based representations while maintaining both predictive accuracy and interpretability remains a significant hurdle. Existing methods have attempted to reprogram time series data into text-based forms, but these often fall short in delivering meaningful, interpretable results. In this paper, we propose a multi-text alignment framework for time series forecasting using LLMs that not only improves prediction accuracy but also enhances the interpretability of time series representations. Our method decomposes time series into trend, seasonality, and residual components, which are then reprogrammed into component-specific text representations. We introduce a multi-level alignment mechanism, where component-specific embeddings are aligned with pre-trained word tokens, enabling more interpretable forecasts. Experiments on multiple datasets demonstrate that our method outperforms state-of-the-art models in accuracy while providing good interpretability.

4576STRIDE: A Tool-Assisted LLM Agent Framework for Strategic and Interactive Decision-Making

[openreview] [pdf]

Abstract Large Language Models (LLMs) have revolutionized natural language processing, showing remarkable linguistic proficiency and reasoning capabilities. However, their application in strategic multi-agent decision-making environments is hampered by significant limitations including poor mathematical reasoning, difficulty in following instructions, and a tendency to generate incorrect information. These deficiencies hinder their performance in strategic and interactive tasks that demand adherence to nuanced game rules, long-term planning, exploration in unknown environments, and anticipation of opponents’ moves. To overcome these obstacles, this paper presents a novel LLM agent framework equipped with memory and specialized tools to enhance their strategic decision-making capabilities. We deploy the tools in a number of economically important environments, in particular bilateral bargaining and multi-agent and dynamic mechanism design. We employ quantitative metrics to assess the framework’s performance in various strategic decision-making problems. Our findings establish that our enhanced framework significantly improves the strategic decision-making capability of LLMs. While we highlight the inherent limitations of current LLM models, we demonstrate the improvements through targeted enhancements, suggesting a promising direction for future developments in LLM applications for interactive environments.

4577Multi-modality Expansion and Retention for LLMs through Parameter Merging and Decoupling

[openreview] [pdf]

Abstract Extensive fine-tuning of the synthesis between multimodal encoders and Large Language Models (LLMs) on modality-specific data can expand the modalities that LLM can handle, leading to the formation of Multimodal Large Language Models (MLLMs). However, this paradigm to expanding modalities heavily relies on initiating fine-tuning from scratch with new multimodal data, which is both resource-intensive and inflexible. In this paper, we propose MMER (Multi-modality Expansion and Retention)\textit{MMER (Multi-modality Expansion and Retention)}, a novel training-free\textit{training-free} approach that reuses and composes existing MLLMs to facilitate effective multimodal expansion while retaining the original performance of each MLLM. In particular, MMER maintains the multimodal encoders of the MLLMs while merging their LLM parameters. By comparing the original LLM parameters with the merged ones, MMER can create binary masks that enable an approximate separation of the LLM parameters for each modality. This process allows the decoupled parameters to independently process modality-specific inputs, thereby reducing parameter conflicts and maintaining the fidelity of the original MLLMs. Additionally, MMER integrates strategies to prevent catastrophic forgetting by employing a similar approach to separately decouple the parameters fine-tuned on new tasks from the original parameters. Experiments on three multimodal tasks and fourteen dual-modal tasks show significant improvements over recent baselines, demonstrating that MMER can effectively expand multimodal capabilities of LLMs while retaining 99.6% of the original performance. Further experiments in both single-task and cross modalities multi-task scenarios reveal that MMER significantly mitigates catastrophic forgetting.

4578Linear Representations of Political Perspective Emerge in Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated the ability to simulate responses aligned with human subjective perspectives, such as liberal or conservative ideologies in American politics. Our study reveals that LLMs achieve this by learning a ``geometry of perspective’’ that linearly represents subjective perspectives in the activation space, where similar simulated perspectives are represented closer to each other. Specifically, we probe the hidden layers of open, transformer-based LLMs (\texttt{Llama-2-7b-chat, Mistral-7b-instruct, Vicuna-7b}) when prompted to generate texts under the ideological perspective of distinct politicians. We find a set of attention heads that represent U.S. ideological slant, which is primarily located in the middle layers known to encode high-level concepts and tasks. The activation of these attention heads, when prompted about U.S.~politicians and media outlets, linearly correlates with existing measures of their ideological slant. We use this activation to detect the ideological slant implicitly adopted by an LLM as it is generating each token. We further show that by intervening on these attention heads, we can tune LLM output to any position along the linear dimension from a liberal to conservative ideological perspective. Our research shows that political ideology serves as a fundamental dimension of LLM representations, and present an interpretability method to identify, monitor, and control the subjective perspective used to generate text. Code:https://osf.io/us9yx/?view_only=cf0fdcdb123e4d6bb7d10a64be5c1a09

4579Tournament Evaluation of Large Language Models

[openreview] [pdf]

Abstract For several decades, the standard approach to evaluating a learned model has been to compute a numerical loss that summarizes the quality of the model based on a previously unseen test set. Two models for the same task can then be compared by looking at their scores on this set. However, recent experience with large language models (LLMs) has shown that comparing summary statistics of two broadly-capable models may not provide a reliable predictor of performance on real-world tasks. This has led to a growing use of crowd-sourced human feedback directly comparing outputs from pairs of models. While helpful, this approach requires a process that involves significant time and human effort, limiting the number of models that can be thoroughly evaluated. To address the need for a scalable method of comparing modern LLMs, we present a novel approach to evaluation via tournament-style model competitions that are constructed automatically from pre-existing benchmarks. We use these automatically-constructed tournaments to compute ratings for a range of models on a diverse set of tasks that use automated scoring via both multiple-choice and free-form text generation. We compare four prominent rating systems: Elo, Glicko, TrueSkill\texttrademark, and the Bradley-Terry model, and find that automatically-constructed tournaments provide reliable information about the relative performance of LLMs while using only a fraction of the amount of data required by current benchmark-based evaluation methods. We discuss implications for model evaluations and propose future directions for large-scale LLM comparisons.

4580EDM2+: Exploring Efficient Diffusion Model Architectures for Visual Generation

[openreview] [pdf]

Abstract The training and sampling of diffusion models have been exhaustively elucidated in prior art. Instead, the underlying network architecture design remains on a shaky empirical footing. Furthermore, in accordance with the recent trend of scaling law, large-scale models make inroads into generative vision tasks. However, running such large diffusion models incurs a sizeable computational burden, rendering it desiderata to optimize calculations and efficiently allocate resources. To bridge these gaps, we navigate the design landscape of efficient U-Net based diffusion models, stemming from the prestigious EDM2. Our exploration route is organized along two key axes, layer placement and module interconnection. We systematically study fundamental design choices and uncover several intriguing insights for superior efficacy and efficiency. These findings culminate in our redesigned architecture, EDM2+, that reduces the computational complexity of the baseline EDM2 by 2×2\times without compromising the generation quality. Extensive experiments and comparative analyses highlight the effectiveness of our proposed network architecture, which achieves the state-of-the-art FID on the hallmark ImageNet benchmark. Code will be released upon acceptance.

4581DODA: Diffusion for Object-detection Domain Adaptation in Agriculture

[openreview] [pdf]

Abstract Object detection has wide applications in agriculture, but the trained models often struggle to generalize across diverse agricultural environments. To address this challenge, we propose DODA (D\underline{D}iffusion for O\underline{O}bject-detection D\underline{D}omain Adaptation in A\underline{A}griculture), a unified framework that leverages diffusion models to generate high-quality, domain-specific detection data for multiple agricultural scenarios. DODA incorporates external domain embeddings and an improved layout-to-image (L2I) approach, allowing it to generate high-quality detection data for new domains without additional training. We demonstrate DODA’s effectiveness on the Global Wheat Head Detection dataset, where fine-tuning detectors on DODA-generated data yields significant improvements across multiple domains (maximum +15.6 AP). DODA provides a simple yet powerful approach to adapt object detectors to diverse agricultural scenarios, lowering barriers for more growers to use detection in their personalized environments.

4582When Attention Sink Emerges in Language Models: An Empirical View

[openreview] [pdf]

Abstract Language Models (LMs) assign significant attention to the first token, even if it is not semantically important, which is known asattention sink. This phenomenon has been widely adopted in applications such as streaming/long context generation, KV cache optimization, inference acceleration, model quantization, and others. Despite its widespread use, a deep understanding of attention sink in LMs is still lacking. In this work, we first demonstrate that attention sinks exist universally in LMs with various inputs, even in small models. Furthermore, attention sink is observed to emerge during the LM pre-training, motivating us to investigate howoptimization,data distribution,loss function, andmodel architecturein LM pre-training influence its emergence. We highlight that attention sink emerges after effective optimization on sufficient training data. The sink position is highly correlated with the loss function and data distribution. Most importantly, we find that attention sink acts more like key biases,storing extra attention scores, which could be non-informative and not contribute to the value computation. We also observe that this phenomenon (at least partially) stems from tokens’ inner dependence on attention scores as a result of softmax normalization. After relaxing such dependence by replacing softmax attention with other attention operations, such as sigmoid attention without normalization, attention sinks do not emerge in LMs up to 1B parameters.

4583U-shaped and Inverted-U Scaling behind\Emergent Abilities of Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) have been shown to exhibit emergent abilities in some downstream tasks, where performance seems to stagnate at first and then improve sharply and unpredictably with scale beyond a threshold. By dividing questions in the datasets according to difficulty level by average performance, we observe U-shaped scaling for hard questions, and inverted-U scaling followed by steady improvement for easy questions. Moreover, the emergence threshold roughly coincides with the point at which performance on easy questions reverts from inverse scaling to standard scaling. Capitalizing on the observable though opposing scaling trend on easy and hard questions, we propose a simple yet effective pipeline, called Slice-and-Sandwich, to predict both the emergence threshold and model performance beyond the threshold.

4584Noise-conditioned Energy-based Annealed Rewards (NEAR): A Generative Framework for Imitation Learning from Observation

[openreview] [pdf]

Abstract This paper introduces a new imitation learning framework based on energy-based generative models capable of learning complex, physics-dependent, robot motion policies through state-only expert motion trajectories. Our algorithm, called Noise-conditioned Energy-based Annealed Rewards (NEAR), constructs several perturbed versions of the expert’s motion data distribution and learns smooth, and well-defined representations of the data distribution’s energy function using denoising score matching. We propose to use these learnt energy functions as reward functions to learn imitation policies via reinforcement learning. We also present a strategy to gradually switch between the learnt energy functions, ensuring that the learnt rewards are always well-defined in the manifold of policy-generated samples. We evaluate our algorithm on complex humanoid tasks such as locomotion and martial arts and compare it with state-only adversarial imitation learning algorithms like Adversarial Motion Priors (AMP). Our framework sidesteps the optimisation challenges of adversarial imitation learning techniques and produces results comparable to AMP in several quantitative metrics across multiple imitation settings.

4585Critic-CoT: Boosting the reasoning abilities of large language model via Chain-of-Thought Critic

[openreview] [pdf]

Abstract Self-critic has become a crucial mechanism for enhancing the reasoning performance of LLMs. However, current approaches mainly involve basic prompts for intuitive instance-level feedback, which resembles System-1 processes and limits the reasoning capabilities. Moreover, there is a lack of in-depth investigations into the relationship between LLM’s ability to criticize and its task-solving performance. To address these issues, we propose Critic-CoT, a novel framework that pushes LLMs toward System-2-like critic capability. Through a step-wise CoT reasoning paradigm and the automatic construction of distant-supervision data without human annotation, Critic-CoT enables LLMs to engage in slow, analytic self-critique and refinement, thereby improving their reasoning abilities. Experiments on GSM8K and MATH demonstrate that our enhanced model significantly boosts task-solving performance by filtering out invalid solutions or iterative refinement. Furthermore, we investigate the intrinsic correlation between critique and task-solving abilities within LLMs, discovering that these abilities can mutually reinforce each other rather than conflict.

4586Lower-level Duality Based Penalty Methods for Hyperparameter Optimization

[openreview] [pdf]

Abstract Hyperparameter optimization (HO) is essential in machine learning and can be structured as a bilevel optimization. However, many existing algorithms designed for addressing nonsmooth lower-level problems involve solving sequential subproblems with high complexity. To tackle this challenge, we introduce penalty methods for solving HO based on strong duality between the lower level problem and its dual. We illustrate that the penalized problem closely approximates the optimal solutions of the original HO under certain conditions. In many real applications, the penalized problem is a weakly-convex objective with proximal-friendly constraints. Furthermore, we develop two fully first-order algorithms to solve the penalized problems. Theoretically, we prove the convergence of the proposed algorithms. We demonstrate the efficiency and superiority of our method across numerical experiments.

4587When Selection meets Intervention: Additional Complexities in Causal Discovery

[openreview] [pdf]

Abstract We address the common yet often-overlooked selection bias in interventional studies, where subjects are selectively enrolled into experiments. For instance, participants in a drug trial are usually patients of the relevant disease; A/B tests on mobile applications target existing users only, and gene perturbation studies typically focus on specific cell types, such as cancer cells. Ignoring this bias leads to incorrect causal discovery results. Even when recognized, the existing paradigm for interventional causal discovery still fails to address it. This is because subtle differences inwhenandwhereinterventions happen can lead to significantly different statistical patterns. We capture this dynamic by introducing a graphical model that explicitly accounts for both the observed world (where interventions are applied) and the counterfactual world (where selection occurs while interventions have not been applied). We characterize the Markov property of the model, and propose a provably sound algorithm to identify causal relations as well as selection mechanisms up to the equivalence class, from data with soft interventions and unknown targets. Through synthetic and real-world experiments, we demonstrate that our algorithm effectively identifies true causal relations despite the presence of selection bias.

4588Explainable Graph Representation Learning via Graph Pattern Analysis

[openreview] [pdf]

Abstract Explainable artificial intelligence (XAI) is an important area in the AI community, and interpretability is crucial for building robust and trustworthy AI models. While previous work has explored model-level and instance-level explainable graph learning, there has been limited investigation into explainable graph representation learning. In this paper, we focus on representation-level explainable graph learning and ask a fundamental question: What specific information about a graph is captured in graph representations? Our approach is inspired by graph kernels, which evaluate graph similarities by counting substructures within specific graph patterns. Although the pattern counting vector can serve as an explainable representation, it has limitations such as ignoring node features and being high-dimensional. To address these limitations, we introduce a framework for learning and explaining graph representations through graph pattern analysis. We start by sampling graph substructures of various patterns. Then, we learn the representations of these patterns and combine them using a weighted sum, where the weights indicate the importance of each graph pattern’s contribution. We also provide theoretical analyses of our methods, including robustness and generalization. In our experiments, we show how to learn and explain graph representations for real-world data using pattern analysis. Additionally, we compare our method against multiple baselines in both supervised and unsupervised learning tasks to demonstrate its effectiveness.

4589Inverse Rendering for Shape, Light, and Material Decomposition using Multi-Bounce Path Tracing and Reservoir Sampling

[openreview] [pdf]

Abstract We present a novel two-stage inverse rendering framework that jointly reconstructs and optimizes explicit geometry, materials, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or oversimplified path tracing algorithms, our method first extracts an explicit triangular mesh in the initial stage. Subsequently, it employs a more realistic physically-based inverse rendering model in the second stage, utilizing multi-bounce path tracing and Monte Carlo integration. By leveraging multi-bounce path tracing, our method not only effectively estimates indirect illumination (including self-shadowing and internal reflections). but also enhances the intrinsic decomposition of shape, material, and lighting. Moreover, we incorporate reservoir sampling into our framework to address the noise in Monte Carlo integration, enhancing convergence and facilitating gradient-based optimization with low sample counts. Through both qualitative and quantitative assessments across various scenarios, especially those with complex shadows, we demonstrate that our method achieves state-of-the-art performance in decomposition results. Additionally, our optimized explicit geometry supports further applications in scene editing, relighting, and material editing, compatible with modern graphics engines and CAD software.

4590SOP-Agent: Empower General Purpose AI Agent with Domain-Specific SOPs

[openreview] [pdf]

Abstract Despite significant advancements in general-purpose AI agents, several challenges still hinder their practical application in real-world scenarios. First, the limited planning capabilities of Large Language Models (LLM) restrict AI agents from effectively solving complex tasks that require long-horizon planning (Liu et al. 2023). Second, general-purpose AI agents struggle to efficiently utilize domain-specific knowledge and human expertise. In this paper, we introduce the Standard Operational Procedure-guided Agent (SOP-agent), a novel framework for constructing domain-specific agents through pseudocode-style Standard Operational Procedures (SOPs) written in natural language. Formally, we represent a SOP as a decision graph, which is traversed to guide the agent in completing tasks specified by the SOP. We conduct extensive experiments across tasks in multiple domains, including decision-making, search and reasoning, code generation, data cleaning, and grounded customer service. The SOP-agent demonstrates excellent versatility, achieving performance superior to general-purpose agent frameworks and comparable to domain-specific agent systems. Additionally, we introduce the Grounded Customer Service Benchmark, the first benchmark designed to evaluate the grounded decision-making capabilities of AI agents in customer service scenarios based on SOPs.

4591Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

[openreview] [pdf]

Abstract Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, yet their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Previous attempts to bridge this gap through supervised fine-tuning on curated expert demonstrations often suffer from compounding errors and limited exploration data, resulting in sub-optimal policy outcomes. To overcome these challenges, we propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. Our method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks. We validate our approach in the WebShop environment, a simulated e-commerce platform—where it consistently outperforms behavior cloning and reinforced fine-tuning baseline, and \textbf{beats average human performance} when equipped with the capability to do online search. In real-world booking scenarios, our methodology boosts Llama-3 70B model’s zero-shot performance from \textbf{18.6% to 81.7%} success rate (a \textbf{340% relative increase}) after a single day of data collection and further to \textbf{95.4%} with online search. We believe this represents a substantial leap forward in the capabilities of autonomous agents, paving the way for more sophisticated and reliable decision-making in real-world settings.

4592RoundTable: Investigating Group Decision-Making Mechanism in Multi-Agent Collaboration

[openreview] [pdf]

Abstract This study investigates the efficacy of Multi-Agent Systems in eliciting cross-agent communication and enhancing collective intelligence through group decision-making in a decentralized setting. Unlike centralized mechanisms, where a fixed hierarchy governs social choice, decentralized group decision-making allows agents to engage in joint deliberation. Our research focuses on the dynamics of communication and decision-making within various social choice methods. By applying different voting rules in various environments, we find that moderate decision flexibility yields better outcomes. Additionally, exploring the linguistic features of agent-to-agent conversations reveals indicators of effective collaboration, offering insights into communication patterns that facilitate or hinder collaboration. Finally, we propose various methods for determining the optimal stopping point in multi-agent collaborations based on linguistic cues. Our findings contribute to a deeper understanding of how decentralized decision-making and group conversation shape multi-agent collaboration, with implications for the design of more effective MAS environments.

4593Rethinking logic in AI: A novel benchmark inspired by polynomial analogue of Gandy’s fixed point theorem

[openreview] [pdf]

Abstract This paper introduces a novel benchmark for evaluating the logical reasoning capabilities of Large Language Models (LLMs), grounded in the polynomial analogue of Gandy’s classical fixed point theorem. Drawing on concepts from mathematical logic, we design a parameterized set of recursive problems where the objective is for LLMs to predict the outcome of a Boolean function, achievable within a polynomial number of steps. By varying the parameters, we generate problem instances of differing complexity. Our experiments reveal that current state-of-the-art LLMs fail to reliably solve even the simplest cases, despite an effective deterministic algorithm existing. Notably, even advanced models like GPT-4 exhibit significant biases in solving benchmark problems. These findings highlight the limitations of modern LLMs as code interpreters, even in basic scenarios, and underscore the necessity for hybrid LLM/interpreter systems. Furthermore, they emphasize the importance of developing quantitative tests for reasoning, given the increasing reliance on LLM-based systems in decision-making applications.

4594Generalizable Transferability Estimation of Foundation Vision Models via Implicit Learning

[openreview] [pdf]

Abstract Transferability estimation aims to identify the most suitable model from a collection of pre-trained models for specific downstream tasks, playing a crucial role in the success of the pre-training and fine-tuning paradigm. However, the recent proliferation of pre-trained models with diverse architectures and training strategies poses significant challenges for transferability estimation due to discrepancies in intrinsic model characteristics, making it difficult for existing methods to accurately simulate embedding space evolution within feasible computational limits. To address these challenges, we propose an Implicit Transferability Modeling (ITM) paradigm that incorporates an implicit modeling strategy for the intrinsic properties of pre-trained models, enabling more accurate transferability estimation. ITM employs a Divide-and-Conquer Adaptation (DCA) process to efficiently model the transfer process, reducing both learning complexity and computational cost. Additionally, we introduce a Pseudo-Clustering-based Optimization (PCO) strategy that eliminates the need for extensive fine-tuning, enabling effective estimation without intensive retraining. Our method significantly outperforms state-of-the-art approaches, achieving notable improvements across ten widely used benchmarks and demonstrating its effectiveness and generalizability in enabling accurate and efficient model selection for downstream tasks.

4595FaithEval: Can Your Language Model Stay Faithful to Context, Even If “The Moon is Made of Marshmallows”

[openreview] [pdf]

Abstract Ensuring faithfulness to context in large language models (LLMs) and retrieval-augmented generation (RAG) systems is crucial for reliable deployment in real-world applications, as incorrect or unsupported information can erode user trust. Despite advancements on standard benchmarks, faithfulness hallucination—where models generate responses misaligned with the provided context—remains a significant challenge. In this work, we introduce FaithEval, a novel and comprehensive benchmark tailored to evaluate the faithfulness of LLMs in contextual scenarios across three diverse tasks: unanswerable, inconsistent, and counterfactual contexts. These tasks simulate real-world challenges where retrieval mechanisms may surface incomplete, contradictory, or fabricated information. FaithEval comprises 4.9K high-quality problems in total, validated through a rigorous four-stage context construction and validation framework, employing both LLM-based auto-evaluation and human validation. Our extensive study across a wide range of open-source and proprietary models reveals that even state-of-the-art models often struggle to remain faithful to the given context, and that larger models do not necessarily exhibit improved faithfulness.

4596ACTIVE: Offline Reinforcement Learning via Adaptive Imitation and In-sampleV-Ensemble

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) aims to learn from static datasets and thus faces the challenge of value estimation errors for out-of-distribution actions. The in-sample learning scheme addresses this issue by performing implicit TD backups that does not query the values of unseen actions. However, pre-existing in-sample value learning and policy extraction methods suffer from over-regularization, limiting their performance on suboptimal or compositional datasets. In this paper, we analyze key factors in in-sample learning that might potentially hinder the use of a milder constraint. We propose Actor-Critic with Temperature adjustment and In-sample Value Ensemble (ACTIVE), a novel in-sample offline RL algorithm that leverages an ensemble of VV-functions for critic training and adaptively adjusts the constraint level using dual gradient descent. We theoretically show that the VV-ensemble suppresses the accumulation of initial value errors, thereby mitigating overestimation. Our experiments on the D4RL benchmarks demonstrate that ACTIVE alleviates overfitting of value functions and outperforms existing in-sample methods in terms of learning stability and policy optimality.

4597Efficient Reinforcement Learning with Large Language Model Priors

[openreview] [pdf]

Abstract In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and face challenges in generalizing across diverse environments due to their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, due to their capacity to maintain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques, e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios.

4598RITUAL: Random Image Transformations as a Universal Anti-hallucination Lever in LVLMs

[openreview] [pdf]

Abstract Recent advancements in Large Vision Language Models (LVLMs) have revolutionized how machines understand and generate textual responses based on visual inputs. Despite their impressive capabilities, they often produce “hallucinatory” outputs that do not accurately reflect the visual information, posing challenges in reliability and trustworthiness. Inspired by test-time augmentation, we propose a simple, training-free method termed RITUAL to enhance robustness against hallucinations in LVLMs. RITUAL introduces random image transformations as complementary inputs during the decoding phase. Importantly, these transformations are not employed during the training of the LVLMs. This straightforward strategy reduces the likelihood of hallucinations by exposing the model to varied visual scenarios, enriching its decision-making process. While transformed images alone may initially degrade performance, we empirically find that strategically combining them with the original images mitigates hallucinations. Specifically, in cases where hallucinations occur with the original image, the transformed images help correct misinterpretations by adjusting the probability distribution. By diversifying the visual input space, RITUAL provides a more robust foundation for generating accurate outputs. Notably, our method works seamlessly with existing contrastive decoding methods and does not require external models or costly self-feedback mechanisms, making it a practical addition. While extremely simple, RITUAL significantly outperforms existing contrastive decoding methods across several object hallucination benchmarks, including POPE, CHAIR, and MME.

4599Can One Modality Model Synergize Training of Other Modality Models?

[openreview] [pdf]

Abstract Learning with multiple modalities has recently demonstrated significant gains in many domains by maximizing the shared information across modalities. However, the current approaches strongly rely on high-quality paired datasets, which allow co-training from the paired labels from different modalities. In this context, we raise a pivotal question: Can a model with one modality synergize the training of other models with the different modalities, even without the paired multimodal labels? Our answer is ‘Yes’. As a figurative description, we argue that a writer, i.e., a language model, can promote the training of a painter, i.e., a visual model, even without the paired ground truth of text and image. We theoretically argue that a superior representation can be achieved by the synergy between two different modalities without paired supervision. As proofs of concept, we broadly confirm the considerable performance gains from the synergy among visual, language, and audio models. From a theoretical viewpoint, we first establish a mathematical foundation of the synergy between two different modality models, where each one is trained with its own modality. From a practical viewpoint, our work aims to broaden the scope of multimodal learning to encompass the synergistic usage of single-modality models, relieving a strong limitation of paired supervision.

4600Hawkes process revisited: balancing interpretability and flexibility with contextualized event embeddings and a neural impact kernel

[openreview] [pdf]

Abstract The Hawkes process (HP) is commonly used to model event sequences with selfreinforcing dynamics, including electronic health records, stock trades, and social media interactions. Traditional HPs capture self-reinforcement via parametric impact functions that can be inspected to understand how each event modulates the intensity of others. Neural network-based HPs offer greater flexibility, resulting in improved fit and prediction performance, but at the cost of interpretability, which can be critical in medicine and other high-stakes settings. In this work, we aim to understand and improve upon this tradeoff. We propose a novel HP formulation in which impact functions are modeled by defining a flexible impact kernel, instantiated as a neural network, in event embedding space, which allows us to model large-scale event sequences with many event types. This approach is more flexible than traditional HPs, because we do not assume a particular parametric form for the impact functions, yet more interpretable than other neural network approaches, because self-reinforcing dynamics are still entirely captured by the impact kernel, which can be inspected. If needed, our approach allows us to trade interpretability for flexibility by contextualizing the event embeddings with transformer encoder layers. Results show that our method accurately recovers impact functions in simulations and achieves competitive performance on real-world datasets even without transformer layers. This suggests that our flexible impact kernel is often sufficient to capture self-reinforcing dynamics effectively, implying that interpretability can be maintained without loss of performance.

4601Backpropagation-free training of neural PDE solvers for time-dependent problems

[openreview] [pdf]

Abstract Approximating solutions to time-dependent Partial Differential Equations (PDEs) is one of the most important problems in computational science. Neural PDE solvers have shown promise recently because they are mesh-free and easy to implement. However, backpropagation-based training often leads to poor approximation accuracy and long training time. In particular, capturing high-frequency temporal dynamics and solving over long time spans pose significant challenges. To address these, we present an approach to training neural PDE solvers without back-propagation by integrating two key ideas: separation of space and time variables and random sampling of weights and biases of the hidden layers. We reformulate the PDE as an ordinary differential equation using a neural network ansatz, construct neural basis functions only in the spatial domain, and solve the ODE leveraging classical ODE solvers from scientific computing. We demonstrate that our backpropagation-free algorithm outperforms the iterative, gradient-based optimization of physics-informed neural networks with respect to training time and accuracy, often by several orders of magnitude using different complicated PDEs characterized by high-frequency temporal dynamics, long time span, complex spatial domain, non-linearities, and shocks.

4602Knowledge Graph Finetuning Enhances Knowledge Manipulation in Large Language Models

[openreview] [pdf]

Abstract Despite the impressive performance of general large language models(LLMs), many of their applications in specific domains (e.g., low-data and knowledge-intensive) still confront significant challenges. Supervised fine-tuning (SFT)---where a general LLM is further trained on a small labeled dataset to adapt for specific tasks or domains---has shown great power for developing domain-specific LLMs. However, existing SFT data primarily consist of Question and Answer (Q&A) pairs, which poses a significant challenge for LLMs to comprehend the correlation and logic of knowledge underlying the Q&A. To address this challenge, we propose a conceptually flexible and general framework to boost SFT, namely Knowledge Graph-Driven Supervised Fine-Tuning (KG-SFT). The key idea of KG-SFT is to generate high-quality explanations for each Q&A pair via a structured knowledge graph to enhance the knowledge comprehension and manipulation of LLMs. Specifically, KG-SFT consists of three components: Extractor, Generator, and Detector. For a given Q&A pair, (i) Extractor first identifies entities within Q&A pairs and extracts relevant reasoning subgraphs from external KGs, (ii) Generator then produces corresponding fluent explanations utilizing these reasoning subgraphs, and (iii) finally, Detector performs sentence-level knowledge conflicts detection on these explanations to guarantee the reliability. KG-SFT focuses on generating high-quality explanations to improve the quality of Q&A pair, which reveals a promising direction for supplementing existing data augmentation methods. Extensive experiments on fifteen different domains and six different languages demonstrate the effectiveness of KG-SFT, leading to an accuracy improvement of up to 18% and an average of 10% in low-data scenarios.

4603What Matters for In-Context Learning: A Balancing Act of Look-up and In-Weight Learning

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated impressive performance in various tasks, including In-Context Learning (ICL), where the model performs new tasks by conditioning solely on the examples provided in the context, without updating the model’s weights. While prior research has explored the roles of pretraining data and model architecture, the key mechanism behind ICL remains unclear. In this work, we systematically uncover properties present in LLMs that support the emergence of ICL. To disambiguate these factors, we conduct a study with a controlled dataset and data sequences using a deep autoregressive model. We show that conceptual repetitions in the data sequences are crucial for ICL, more so than previously indicated training data properties like burstiness or long-tail distribution. Conceptual repetitions could refer to nn-gram repetitions in textual data or exact image copies in image sequence data. Such repetitions also offer other previously overlooked benefits such as reduced transiency in ICL performance. Furthermore, we show that the emergence of ICL depends on balancing the in-weight learning objective with the in-context solving ability during training.

4604PRE-TRAIN WITH BACKPROPAGATION AND FINE-TUNE WITH A BIO-PLAUSIBLE LEARNING RULE

[openreview] [pdf]

Abstract Backpropagation (BP) has long been the cornerstone of deep neural network training. While neural networks trained with backpropagation typically have high accuracy and precision, they suffer from limitations in their robustness to adversarial perturbation. Biologically plausible (bio-plausible) learning rules, on the other hand, are more robust. Yet, they typically underperform in terms of accuracy and precision, which has limited their widespread adoption. In this work, we aim to bridge this gap. We propose a novel approach where neural networks are pre-trained using backpropagation and fine-tuned using bio-plausible learning rules. We use several types of Sign-Symmetry learning methods to fine-tune models pre-trained using backpropagation. We explore the effectiveness of this approach in two tasks, image classification and image retrieval, then demonstrate that it improves robustness against gradient-based adversarial attacks while offering comparable accuracy and precision compared to the use of backpropagation alone. These findings show the benefit of mixing backpropagation and bio-plausible learning rules, suggesting the need for further research by the community to evaluate this approach on other tasks.

4605AtmosArena: Benchmarking Foundation Models for Atmospheric Sciences

[openreview] [pdf]

Abstract Deep learning has emerged as a powerful tool for atmospheric sciences, showing significant utility across various tasks in weather and climate modeling. In line with recent progress in language and vision foundation models, there are growing efforts to scale and finetune such models for multi-task spatiotemporal reasoning. Despite promising results, existing works often evaluate their model on a small set of non-uniform tasks, which makes it hard to quantify broad generalization across diverse tasks and domains. To address this challenge, we introduce AtmosArena, the first multi-task benchmark dedicated to foundation models in atmospheric sciences. AtmosArena comprises a suite of tasks that cover a broad spectrum of applications in atmospheric physics and atmospheric chemistry. To showcase the capabilities and key features of our benchmark, we conducted extensive experiments to evaluate two state-of-the-art deep learning models, ClimaX and Stormer on AtmosArena, and compare their performance with other deep learning and traditional baselines. By providing a standardized, open-source benchmark, we aim to facilitate further advancements in the field, much like open-source benchmarks have driven the development of foundation models for language and vision.

4606GRAM: Generalization in Deep RL with a Robust Adaptation Module

[openreview] [pdf]

Abstract The reliable deployment of deep reinforcement learning in real-world settings requires the ability to generalize across a variety of conditions, including both in-distribution scenarios seen during training as well as novel out-of-distribution scenarios. In this work, we present a framework for generalization in deep reinforcement learning that unifies these two distinct types of generalization within a single architecture. We introduce a robust adaptation module that provides a mechanism for identifying and reacting to both in-distribution and out-of-distribution environments, along with a joint training pipeline that combines the goals of in-distribution adaptation and out-of-distribution robustness. Our algorithm GRAM achieves strong generalization performance across in-distribution and out-of-distribution scenarios upon deployment, which we demonstrate on a variety of realistic simulated locomotion tasks with a quadruped robot.

4607Centrality-guided Pre-training for Graph

[openreview] [pdf]

Abstract Self-supervised learning has shown great potential in learning generalizable representations for graph-structured data. However, existing methods largely focus on improving graph representations based on augmentations, which ignores the alignment between the representation and the structure of graphs. To fill this gap, we propose a Centrality-guided Graph Pre-training (CenPre) framework to integrate the structure information into the representations of nodes based on the centrality in graph theory. The proposed CenPre contains three modules for node representation pre-training and alignment. The node-level structure learning module fuses the fine-grained node importance into node representation based on degree centrality, allowing the aggregation of node representations with equal/similar importance. The graph-level structure learning module characterizes the importance between all nodes in the graph based on eigenvector centrality, enabling the exploitation of graph-level structure similarities/differences when learning node representation. Finally, a representation alignment module aligns the pre-trained node representation using the original one, essentially allowing graph representations to learn structural information without losing their original semantic information, thereby leading to better graph representations. Extensive experiments on a series of real-world datasets demonstrate that the proposed CenPre outperforms the state-of-the-art baselines in node classification and achieves better performance in link prediction and graph classification than the baseline models.

4608ROSE: Register-Assisted General Time Series Forecasting with Decomposed Frequency Learning

[openreview] [pdf]

Abstract With the increasing collection of time series data from various domains, there arises a strong demand for general time series forecasting models pre-trained on a large number of time-series datasets to support a variety of downstream prediction tasks. Enabling general time series forecasting faces two challenges: how to obtain unified representations from multi-domian time series data, and how to capture domain-specific features from time series data across various domains for adaptive transfer in downstream tasks. To address these challenges, we propose a Register-Assisted General Time Series Forecasting Model with Decomposed Frequency Learning (ROSE), a novel pre-trained model for time series forecasting. ROSE employs Decomposed Frequency Learning for the pre-training task, which decomposes coupled semantic information in time series with frequency-based masking and reconstruction to obtain unified representations across domains. We also equip ROSE with a Time Series Register, which learns to generate a register to capture domain-specific representations during pre-training and enhances domain-adaptive transfer by selecting related register tokens on downstream tasks. After pre-training on large-scale time series data, ROSE achieves state-of-the-art forecasting performance on 7 real-world benchmarks. Remarkably, it demonstrates competitive or superior few-shot and zero-shot abilities.

4609Physics-informed Temporal Difference Metric Learning for Robot Motion Planning

[openreview] [pdf]

Abstract The robot motion planning problem involves finding a collision-free path between a robot’s initial and target configurations. Recently, self-supervised learning methods have been developed to address planning problems without requiring expensive expert demonstrations. These methods leverage the Eikonal equation for training neural networks and lead to scalable and data-efficient solutions. However, these methods face challenges when applied to complex, cluttered environments due to their inability to preserve key properties of the Eikonal equation, such as its role as an optimal value function and geodesic distance. To overcome these limitations, we propose a novel self-supervised temporal difference metric learning approach that solves the Eikonal equation more accurately and enhances performance in solving complex and unseen motion planning tasks. Our method enforces Bellman’s principle of optimality over finite regions, using temporal difference learning to avoid spurious local minima, while incorporating metric learning to preserve the Eikonal equation’s intrinsic geodesic properties, such as the triangle inequality. We demonstrate that our approach significantly outperforms existing methods in handling complex environments and generalizing to unseen environments, with robot configurations ranging from 2 to 12 degrees of freedom (DOF).

4610Quantifying Prediction Consistency Under Model Multiplicity in Tabular LLMs

[openreview] [pdf]

Abstract Fine-tuning large language models (LLMs) on tabular data for classification can lead to the phenomenon of \emph{fine-tuning multiplicity}, where equally well-performing models make conflicting predictions on the same input. Fine-tuning multiplicity can arise due to variations in the training process, e.g., seed, random weight initialization, retraining on a few additional or deleted data points. This raises critical concerns about the robustness and reliability of Tabular LLMs, particularly when deployed for high-stakes decision-making, such as finance, hiring, education, healthcare, etc. This work formalizes the unique challenge of fine-tuning multiplicity in Tabular LLMs and proposes a novel measure to quantify the robustness of individual predictions without expensive model retraining. Our measure quantifies a prediction’s robustness by analyzing (sampling) the model’s local behavior around the input in the embedding space. Interestingly, we show that sampling in the local neighborhood can be leveraged to provide probabilistic robustness guarantees against a broad class of equally-well-performing fine-tuned models. By leveraging Bernstein’s Inequality, we show that predictions with sufficiently high robustness (as defined by our measure) will remain consistent with high probability. We also provide empirical evaluation on real-world datasets to support our theoretical results. Our work highlights the importance of addressing fine-tuning instabilities to enable trustworthy deployment of Tabular LLMs in high-stakes and safety-critical applications.

4611RestoreGrad: Signal Restoration Using Conditional Denoising Diffusion Models with Jointly Learned Prior

[openreview] [pdf]

Abstract Denoising diffusion probabilistic models (DDPMs) estimate the data distribution by sequentially denoising samples drawn from a prior distribution, which is typically assumed to be the standard Gaussian for simplicity. Owing to their capabilities of generating high-fidelity samples, DDPMs can be utilized for signal restoration tasks in recovering a clean signal from its degraded observation(s), by conditioning the model on the degraded signal. The degraded signals are themselves contaminated versions of the clean signals; due to this correlation, they may encompass certain useful information about the target clean data distribution. However, naively adopting the standard Gaussian as the prior distribution in turn discards such information. In this paper, we propose to improve conditional DDPMs by leveraging a more informative prior that is jointly learned with the diffusion model. The proposed framework, called RestoreGrad, exploits the correlation between the degraded and clean signals to construct a better prior, which is especially useful for signal restoration tasks. In contrast to existing DDPMs that just settle on using pre-defined or handcrafted priors, RestoreGrad learns the prior jointly with the diffusion model. To this end, we first derive a new objective function from a modified evidence lower bound (ELBO) of the data log-likelihood, to incorporate the prior learning process into conditional DDPMs. Then, we suggest a corresponding joint learning paradigm for optimizing the new ELBO. Notably, RestoreGrad requires minimum modifications to the diffusion model itself; thus, it can be flexibly implemented on top of various conditional DDPM-based signal restoration models. On speech and image restoration tasks, we show that RestoreGrad demonstrates faster convergence (5-10 times fewer training steps) to achieve on par or better perceptual quality of restored signals over existing DDPM baselines, along with improved robustness to using fewer sampling steps in inference time (2-2.5 times fewer steps), advocating the advantages of leveraging jointly learned prior for efficiency improvements in the diffusion process.

4612FedPeWS: Personalized Warmup via Subnetworks for Enhanced Heterogeneous Federated Learning

[openreview] [pdf]

Abstract Statistical data heterogeneity is a significant barrier to convergence in federated learning (FL). While prior work has advanced heterogeneous FL through better optimization objectives, these methods fall short when there is \textit{extreme} data heterogeneity among collaborating participants. We hypothesize that convergence under extreme data heterogeneity is primarily hindered due to the aggregation of conflicting updates from the participants in the initial collaboration rounds. To overcome this problem, we propose a warmup phase where each participant learns a personalized mask and updates only a subnetwork of the full model. This \textit{personalized warmup} allows the participants to focus initially on learning specific \textit{subnetworks} tailored to the heterogeneity of their data. After the warmup phase, the participants revert to standard federated optimization, where all parameters are communicated. We empirically demonstrate that the proposed personalized warmup via subnetworks (\texttt{FedPeWS}) approach improves accuracy and convergence speed over standard federated optimization methods.

4613Adaptive Masking Enhances Visual Grounding

[openreview] [pdf]

Abstract In recent years, zero-shot and few-shot learning in visual grounding have garnered considerable attention, largely due to the success of large-scale vision-language pre-training on expansive datasets such as LAION-5B and DataComp-1B. However, the continuous expansion of these datasets presents significant challenges, particularly with respect to data availability and computational overhead, thus creating a bottleneck in the advancement of low-shot learning capabilities. In this paper, we propose a novel approach, \textbf{I}nterpretative \textbf{MA}sking with \textbf{G}aussian Radiation Mod\textbf{E}ling, aimed at enhancing vocabulary grounding in low-shot learning scenarios without necessitating an increase in dataset size. Drawing inspiration from cognitive science and the recent success of masked autoencoders (MAE), our method leverages adaptive masking on salient regions of the feature maps generated by the vision backbone. This enables the model to learn robust, generalized representations through the reconstruction of occluded information, thereby facilitating effective attention to both local and global features. We evaluate the efficacy of our approach on benchmark datasets, including COCO and ODinW, demonstrating its superior performance in zero-shot and few-shot tasks. Experimental results consistently show that IMAGE outperforms baseline models, achieving enhanced generalization and improved performance in low-shot scenarios. These findings highlight the potential of adaptive feature manipulation through attention mechanisms and Gaussian modeling as a promising alternative to approaches that rely on the continual scaling of dataset sizes for the advancement of zero-shot and few-shot learning.

4614Rethinking the Uncertainty: A Critical Review and Analysis in the Era of Large Language Models

[openreview] [pdf]

Abstract In recent years, Large Language Models (LLMs) have become fundamental to a broad spectrum of artificial intelligence applications. As the use of LLMs expands, precisely estimating the uncertainty in their predictions has become crucial. Current methods often struggle to accurately identify, measure, and address the true uncertainty, with many focusing primarily on estimating model confidence. This discrepancy is largely due to an incomplete understanding of where, when, and how uncertainties are injected into models. This paper introduces a comprehensive framework specifically designed to identify and understand the types and sources of uncertainty, aligned with the unique characteristics of LLMs. Our framework enhances the understanding of the diverse landscape of uncertainties by systematically categorizing and defining each type, establishing a solid foundation for developing targeted methods that can precisely quantify these uncertainties. We also provide a detailed introduction to key related concepts and examine the limitations of current methods in mission-critical and safety-sensitive applications. The paper concludes with a perspective on future directions aimed at enhancing the reliability and practical adoption of these methods in real-world scenarios.

4615Derivative Causal Models: Modeling Causality at Mixed Scales of Observation

[openreview] [pdf]

Abstract Causal relations can materialize in many different ways. In their most simple form --typically assumed in classical causal models and discovery approaches--, similar variations of a cause lead to similar variations of an effect. However, this `smoothness’ requires an observation of cause and effect just at the right scales. Unfortunately, this conflicts with records often encountered in the real-world, mixing continuous measurements with once-in-a-while observations of sparse events. Compactly modeling the causal effects between (discrete) events and continuous states is hard to achieve with classical causal models. To ease this situation, we leverage transformations that derive different scales of observables, respectively, to decompose relations and allow for compact causal representations, calledDerivative Causal Models(DCM). We instantiate them using integral and derivative transforms and demonstrate that the resultingDifferential Causal Models(\partialCM) can be discovered automatically from data.

4616Towards Autonomous Agents: Adaptive-planning, Reasoning, and Acting in Language Models

[openreview] [pdf]

Abstract We propose a novel in-context learning algorithm for building autonomous decision-making language agents. The language agent continuously attempts to solve the same task by reasoning, acting, observing and then self-correcting each time the task fails. Our selected language agent demonstrates the ability to solve tasks in a text-based game environment. Our results show that the gemma-2-9b-it language model, using our proposed method, can successfully complete two of six tasks that failed in the first attempt. This highlights the effectiveness of our approach in enhancing the problem-solving capabilities of a single language model through self-correction, paving the way for more advanced autonomous agents. The code is publicly available athttps://anonymous.4open.science/r/AutonomousLLMAgentwithAdaptingPlanning-D613/.

4617Zero-Shot Offline Imitation Learning via Optimal Transport

[openreview] [pdf]

Abstract Zero-shot imitation learning algorithms hold the promise of reproducing unseen behavior from as little as a single demonstration at test time. Existing practical approaches view the expert demonstration as a sequence of goals, enabling imitation with a high-level goal selector, and a low-level goal-conditioned policy. However, this framework can suffer from myopic behavior: the agent’s immediate actions towards achieving individual goals may undermine long-term objectives. We introduce a novel method that mitigates this issue by directly optimizing the occupancy matching objective that is intrinsic to imitation learning. We propose to lift a goal-conditioned value function to a distance between occupancies, which are in turn approximated via a learned world model. The resulting method can learn from offline, suboptimal data, and is capable of non-myopic, zero-shot imitation, as we demonstrate in complex, continuous benchmarks.

4618A Probabilistic Generative Method for Safe Physical System Control Problems

[openreview] [pdf]

Abstract Controlling complex physical systems is a crucial task in science and engineering, often requiring the balance of control objectives and safety constraints. Recently, diffusion models have demonstrated a strong ability to model high-dimensional state spaces, giving them an advantage over recent deep learning and reinforcement learning-based methods in complex control tasks. However, they do not inherently address safety concerns. In contrast, while safe reinforcement learning methods consider safety, they typically fail to provide guarantees for satisfying safety constraints. To address these limitations, we propose Safe Conformal Physical system control (SafeConPhy), which optimizes the diffusion model with a provable safety bound iteratively to satisfy the safety constraint. We pre-train a diffusion model on the training set. Given the calibration set and the specific control targets, we derive a provable safety bound using conformal prediction. After iteratively enhancing the safety of the diffusion model with the progressively updated bound, the model’s output can be certified as safe with a user-defined probability. We evaluate our algorithm on two control tasks: 1D Burgers’ equation and 2D incompressible fluid. Our results show that our algorithm satisfies safety constraints, and outperforms prior control methods and safe offline RL algorithms.

4619LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

[openreview] [pdf]

Abstract Optimization problems are prevalent across various scenarios. Formulating and then solving optimization problems described by natural language often requires highly specialized human expertise, which could block the widespread application of optimization-based decision making. To make problem formulating and solving automated, leveraging large language models (LLMs) has emerged as a potential way. However, this kind of way suffers from the issue of optimization generalization. Namely, the accuracy of most current LLM-based methods and the generality of optimization problem types that they can model are still limited. In this paper, we propose a unified learning-based framework called LLMOPT to boost optimization generalization. Starting from the natural language descriptions of optimization problems and a pre-trained LLM, LLMOPT constructs the introduced five-element formulation as a universal model for learning to define diverse optimization problem types. Then, LLMOPT employs the multi-instruction tuning to enhance both problem formalization and solver code generation accuracy and generality. After that, to prevent hallucinations in LLMs, such as sacrificing solving accuracy to avoid execution errors, model alignment and self-correction mechanism are adopted in LLMOPT. We evaluate the optimization generalization ability of LLMOPT and compared methods across six real-world datasets covering roughly 20 fields such as health, environment, energy and manufacturing, etc. Extensive experiment results show that LLMOPT is able to model various optimization problem types such as linear/nonlinear programming, mixed integer programming and combinatorial optimization, and achieves a notable 11.08% average solving accuracy improvement compared with the state-of-the-art methods. The code is available athttps://anonymous.4open.science/r/LLMOPT.

4620Measuring Language Model Uncertainty With Internal Concepts

[openreview] [pdf]

Abstract We study the problem of evaluating the predictive uncertainty of large language models (LLMs). We assign an uncertainty measure to the correctness of an LLM using a form of entropy that applies to semantic objects (concepts). Unlike prior works, the notion of meaning used to define concepts is derived from the LLM, rather than from an external model. Our notion of conceptual equivalence draws from ideas in Formal Concept Analysis (FCA) and lattice/order theory, and can be used to estimate correctness in closed- and open-ended scenarios. Our method has a relative improvement of up to 4.8% on average across five benchmarks and up to 10.9% on mixtures of closed- and open-ended questions.

4621Proof Search Augmented Language Models

[openreview] [pdf]

Abstract Transformer language models (TLMs) exhibit an impressively general range of capabilities. A growing body of work aims to harness these models for complex reasoning problems expressed in natural language. However, recent theoretical and empirical results have revealed limits to the algorithmic generalization of TLM reasoning. Transformers trained to solve deduction problems from one distribution fail to solve instances of the same problem class drawn from other distributions. We propose to improve the systematic reasoning capabilities of TLMs via a differentiable proof search module, yielding proof-search augmented language models (PSALMs). In a PSALM, a Transformer is responsible for predicting rule and statement representations for a neural theorem prover (NTP). The NTP performs a backward-chaining search over proofs, scoring them based on a soft unification operation. Our principal challenge is to train models to reason without also learning spurious features. Our results show that rule-level supervision allows PSALMs to successfully generalize across problem distributions in deduction tasks where vanilla transformers fail to learn systematic behavior. We also find we only need label supervision to adapt PSALMs to more natural text.

4622Enhancing Graph Self-Supervised Learning with Graph Interplay

[openreview] [pdf]

Abstract Graph self-supervised learning (GSSL) has emerged as a compelling framework for extracting informative representations from graph-structured data without extensive reliance on labeled inputs. In this study, we introduce Graph Interplay (GIP), an innovative and versatile approach that significantly enhances the performance equipped with various existing GSSL methods. To this end, GIP advocates direct graph-level communications by introducing random inter-graph edges within standard batches. Against GIP’s simplicity, we further theoretically show that GIP essentially performs a principled manifold separation via combining inter-graph message passing and GSSL, bringing about more structured embedding manifolds and thus benefits a series of downstream tasks. Our empirical study demonstrates that GIP surpasses the performance of prevailing GSSL methods across multiple benchmarks by significant margins, highlighting its potential as a breakthrough approach. Besides, GIP can be readily integrated into a series of GSSL methods and consistently offers additional performance gain. This advancement not only amplifies the capability of GSSL but also potentially sets the stage for a novel graph learning paradigm in a broader sense.

4623RMB: Comprehensively benchmarking reward models in LLM alignment

[openreview] [pdf]

Abstract Reward models (RMs) guide the alignment of large language models (LLMs), steering them toward behaviors preferred by humans. Evaluating RMs is the key to better aligning LLMs. However, the current evaluation of RMs may not directly correspond to their alignment performance due to the limited distribution of evaluation data and evaluation methods that are not closely related to alignment objectives. To address these limitations, we propose RMB, a comprehensive RM benchmark that covers over 49 real-world scenarios and includes both pairwise and Best-of-N (BoN) evaluations to better reflect the effectiveness of RMs in guiding alignment optimization. We demonstrate a positive correlation between our benchmark and the downstream alignment task performance. Based on our benchmark, we conduct extensive analysis on the state-of-the-art RMs, revealing their generalization defects that were not discovered by previous benchmarks, and highlighting the potential of generative RMs. Furthermore, we delve into open questions in reward models, specifically examining the effectiveness of majority voting for the evaluation of reward models and analyzing the impact factors of generative RMs, including the influence of evaluation criteria and instructing methods. We will release our evaluation code and datasets upon publication.

4624NEPENTHE: Entropy-Based Pruning as a Neural Network Depth’s Reducer

[openreview] [pdf]

Abstract While deep neural networks are highly effective at solving complex tasks, their computational demands can hinder their usefulness in real-time applications and with limited-resources systems. Besides, it is a known fact that, for many downstream tasks, off-the-shelf models are over-parametrized. While classical structured pruning can reduce the network’s width, the computation’s critical path, namely the maximum number of layers encountered at forward propagation, apparently can not be reduced.In this paper, we aim to reduce the depth of over-parametrized deep neural networks: we propose an eNtropy-basEdPruning as a nEuralNetwork depTH’s rEducer (NEPENTHE) to alleviate deep neural networks’ computational burden. Based on our theoretical finding, NEPENTHE leverages "unstructured’’ pruning to bias sparsity enhancement in layers with low entropy to remove them entirely. We validate our approach on popular architectures such as MobileNet, Swin-T and RoBERTa, showing that, when in the overparametrization regime, some layers are linearizable (hence reducing the model’s depth) with little to no performance loss. The code will be publicly available upon acceptance of the article.

4625Dataset Size Recovery from Fine-Tuned Weights

[openreview] [pdf]

Abstract Model inversion and membership inference attacks aim to reconstruct and verify the data on which a model was trained. However, these methods cannot guarantee to find all training samples, as they do not know the training set size. In this paper, we introduce a new task: dataset size recovery, which seeks to identify the number of samples a given model was fine-tuned on. Our core finding is that both the norm and the spectrum of the fine-tuning weight matrices are closely linked to the fine-tuning dataset size. Leveraging this insight, we propose DSiRe, an algorithm that accepts fine-tuned model weights, extracts their spectral features, and then employs a nearest neighbor classifier on top, to predict the dataset size. Although it is training-free, simple, and very easy to implement, DSiRe is broadly applicable across various fine-tuning paradigms and modalities (e.g., DSiRe can predict the number of fine-tuning images with a mean absolute error of 0.36 images). To this end, we develop and release LoRA-WiSE, a new benchmark consisting of over 25k25k weight snapshots from more than 2k2k diverse LoRA fine-tuned models.

4626Riemannian Optimization for Hyperbolic Prototypical Networks

[openreview] [pdf]

Abstract This paper addresses the utilization of hyperbolic geometry within a Prototype Learning framework. Specifically, we introduce Riemannian optimization for Hyperbolic Prototypical Networks (RHPN), a novel approach that leverages Prototype Learning on Riemannian manifolds applied to the Poincare’ ball. RHPN capitalizes on the efficiency and effectiveness of updating prototypes during training, coupled with a regularization term crucial to boost the performances. We set up an extensive experimentation that shows that RHPN is able to outperform the state-of-the-art in Prototype Learning, both in low and high dimensions, extending the impact of hyperbolic spaces to a wider range of scenarios.

4627ConBatch-BAL: Batch Bayesian Active Learning under Budget Constraints

[openreview] [pdf]

Abstract Varying annotation costs among data points and budget constraints can hinder the adoption of active learning strategies in real-world applications. This work introduces two Bayesian active learning strategies for batch acquisition under constraints (ConBatch-BAL), one based on dynamic thresholding and one following greedy acquisition. Both select samples using uncertainty metrics computed via Bayesian neural networks. The dynamic thresholding strategy redistributes the budget across the batch, while the greedy one selects the top-ranked sample at each step, limited by the remaining budget. Focusing on scenarios with costly data annotation and geospatial constraints, we also release two new real-world datasets containing geolocated aerial images of buildings, annotated with energy efficiency or typology classes. The ConBatch-BAL strategies are benchmarked against a random acquisition baseline on these datasets under various budget and cost scenarios. The results show that the developed ConBatch-BAL strategies can reduce active learning iterations and data acquisition costs in real-world settings, and even outperform the unconstrained baseline solutions.

4628Annealing Flow Generative Model Towards Sampling High-Dimensional and Multi-Modal Distributions

[openreview] [pdf]

Abstract Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a continuous normalizing flow-based approach designed to sample from high-dimensional and multi-modal distributions. The key idea is to learn a continuous normalizing flow-based transport map, guided by annealing, to transition samples from an easy-to-sample distribution to the target distribution, facilitating effective exploration of modes in high-dimensional spaces. Unlike many existing methods, AF training does not rely on samples from the target distribution. AF ensures effective and balanced mode exploration, achieves linear complexity in sample size and dimensions, and circumvents inefficient mixing times. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions.

4629NoRA: Nested Low-Rank Adaptation for Efficient Fine-Tuning Large Models

[openreview] [pdf]

Abstract Low-Rank Adaptation (LoRA) has become a popular paradigm for fine-tuning large models, but it still necessitates a substantial number of training parameters. To address this issue, we first conduct comprehensive empirical studies on parameter-efficient LoRA structure. Then, we establish design guidelines that emphasize the use of serial structures, optimal placements, and nested LoRA. Based on these insights, we present NoRA, a nested parameter-efficient LoRA structure that revolutionizes the initialization and fine-tuning of projection matrices. Our NoRA’s innovative approach involves freezing outer layer LoRA weights and employing a serial inner layer design, enabling precise task-specific adaptations while maintaining compact training parameters. In addition, we propose an activation-aware Singular Value Decomposition (AwSVD) that adjusts the weight matrices based on activation distributions for initialization of outer layer LoRA weights. This schema enhances decomposition accuracy and mitigates computational errors. Extensive evaluations across multiple linguistic and visual tasks demonstrate that NoRA outperforms state-of-the-art LoRA variants, achieving significant improvements in efficiency and effectiveness on models such as Mistral-7B, Gemma-7B, and LLaMA-3 8B. Notably, NoRA reduces fine-tuning parameters|training-time|memory-usage by 85.5%|37.5%|8.9% and enhances performance by 1.9%, compared to LoRA on LLaMA-3 8B. Codes are available in the supplementary materials.

4630Semi-Supervised Vision-Centric 3D Occupancy World Model for Autonomous Driving

[openreview] [pdf]

Abstract Understanding world dynamics is crucial for planning in autonomous driving. Recent methods attempt to achieve this by learning a 3D occupancy world model that forecasts future surrounding scenes based on current observation. However, 3D occupancy labels are still required to produce promising results. Considering the high annotation cost for 3D outdoor scenes, we propose a semi-supervised vision-centric 3D occupancy world model,PreWorld, to leverage the potential of 2D labels through a novel two-stage training paradigm: the self-supervised pre-training stage and the fully-supervised fine-tuning stage. Specifically, during the pre-training stage, we utilize an attribute projection head to generate different attribute fields of a scene (e.g., RGB, density, semantic), thus enabling temporal supervision from 2D labels via volume rendering techniques. Furthermore, we introduce a simple yet effective state-conditioned forecasting module to recursively forecast future occupancy and ego trajectory in a direct manner. Extensive experiments on the nuScenes dataset validate the effectiveness and scalability of our method, and demonstrate that PreWorld achieves competitive performance across 3D occupancy prediction, 4D occupancy forecasting and motion planning tasks.

4631Decision-making with speculative opponent model-aided value function factorization

[openreview] [pdf]

Abstract In many real-world scenarios, teams of agents must coordinate their actions while competing against opponents. Traditional multi-agent reinforcement learning (MARL) approaches often treat opponents as part of the environment, causing controlled agents to overlook the impact of their adversaries. Opponent modeling can enhance an agent’s decision-making by constructing predictive models of other agents. However, existing approaches typically rely on centralized learning with access to opponent data, and the process of extracting decentralized policies becomes impractical with larger teams. To address this issue, we propose the Distributional Speculative Opponent-aided mixing framework (DSOMIX), a novel value-based speculative opponent modeling algorithm that relies solely on local information—namely the agent’s own observations, actions, and rewards. DSOMIX uses speculative beliefs to predict the behaviors of unseen opponents, enabling agents to make decisions based on local observations. Additionally, it incorporates distributional value decomposition models to capture a more granular representation of the agent’s return distribution, improving the training process for the speculative opponent models. We formally derive a value-based theorem that underpins the training process. Extensive experiments across four challenging MARL benchmarks, including MPE and Pommerman, demonstrate that DSOMIX outperforms state-of-the-art methods, achieving superior performance and faster convergence.

4632SAR2Earth: A SAR-to-EO Translation Dataset for Remote Sensing Applications

[openreview] [pdf]

Abstract Electro-optical (EO) images are essential to a wide range of remote sensing applications. With the advent of data-driven models, the efficiency of EO image analysis has significantly improved, enabling faster and more effective outcomes in these applications. However, EO images have inherent limitations—they cannot penetrate cloud cover and are unable to capture imagery at night. To overcome these challenges, synthetic aperture radar (SAR) images are employed, as they can operate effectively regardless of weather conditions or time of day. Despite this advantage, SAR images come with their own difficulties: they are affected by speckle noise, complicating analysis, and existing algorithms developed for EO imagery are not directly transferable to SAR data. To address these issues, we introduce SAR2Earth, a benchmark dataset specifically designed for SAR-to-EO translation. By translating SAR images into EO-like representations, SAR2Earth allows the extensive range of algorithms developed for EO imagery to be applied effectively to SAR data. The dataset consists of 18 spatially aligned pairs of SAR and EO images, collected from 8 distinct regions encompassing both urban and rural. We provide comprehensive evaluations, detailed model analyses, and extensive experimental results. All codes and datasets will be made publicly available athttps://sar2earth.github.io.

4633PADriver: Towards Personalized Autonomous Driving

[openreview] [pdf]

Abstract In this paper, we propose PADriver, a novel closed-loop framework for personalized autonomous driving (PAD). Built upon Multi-modal Large Language Model (MLLM), PADriver takes streaming frames and personalized textual prompts as inputs. It autoaggressively performs scene understanding, danger level estimation and action decision. The predicted danger level reflects the risk of the potential action and provides an explicit reference for the final action, which corresponds to the preset personalized prompt. Moreover, we construct a closed-loop benchmark named PAD-Highway based on Highway-Env simulator to comprehensively evaluate the decision performance under traffic rules. The dataset contains 250 hours videos with high-quality annotation to facilitate the development of PAD behavior analysis. Experimental results on the constructed benchmark show that PADriver outperforms state-of-the-art approaches on different evaluation metrics, and enables various driving modes.

4634Fugatto 1: Foundational Generative Audio Transformer Opus 1

[openreview] [pdf]

Abstract Fugatto is a versatile audio synthesis and transformation model capable of following free-form text instructions with optional audio inputs. While large language models (LLMs) trained with text on a simple next-token prediction objective can learn to infer instructions directly from the data, models trained solely on audio data lack this capacity. This is because audio data does not inherently contain the instructions that were used to generate it. To overcome this challenge, we introduce a specialized dataset generation approach optimized for producing a wide range of audio generation and transformation tasks, ensuring the data reveals meaningful relationships between audio and language. Another challenge lies in achieving compositional abilities -- such as combining, interpolating between, or negating instructions -- using data alone. To address it, we propose ComposableART, an inference-time technique that extends classifier-free guidance to compositional guidance. It enables the seamless and flexible composition of instructions, leading to highly customizable audio outputs outside the training distribution. Our evaluations across a diverse set of tasks demonstrate that Fugatto performs competitively with specialized models, while ComposableART enhances its sonic palette and control over synthesis. Most notably, we highlight our framework’s ability to synthesize emergent sounds -- sonic phenomena that transcend conventional audio generation -- unlocking new creative possibilities. \href{https://fugatto.github.io/}{DemoWebsite.}

4635Prompt-Guided Distillation from Multimodal Large Language Models to Task-specific Models for Multimodal Sentiment Analysis

[openreview] [pdf]

Abstract Multimodal Sentiment Analysis (MSA) has made some progress with the advent of Multimodal Large Language Models (MLLMs). However, the scalability and the closed-source nature of some MLLMs imposes challenges for efficient application in the real-word. In this study, we explore an innovative pathway to infuse the capabilities of general MLLMs into task-specific small models for MSA. We introduce the Prompt-Guided Multimodal Framework (PGMF), a refined teacher-student framework designed to transfer knowledge from powerful, general MLLMs to smaller, efficient models. The PGMF-Teacher utilizes MLLM-generated prompts and a tailored conditional alignment module to achieve better MSA, while the PGMF-Student distills this expertise to predict independently of MLLMs’ guidance. Extensive evaluations on two popular MSA datasets including SIMS and MOSI demonstrate that compared to previous task-specific small models, PGMF-Teacher achieves state-of-the-art performance with the help of MLLMs’ prompts, while PGMF-Student achieve competitive results with fewer parameters and without relying on MLLMs’ prompts. The proposed framework offers a novel way to equip task-specific small models with the capability of MLLMs.

4636Robust Deep Reinforcement Learning against ADVERSARIAL BEHAVIOR MANIPULATION

[openreview] [pdf]

Abstract This study investigates the robustness of deep reinforcement learning agents against targeted attacks that aim to manipulate the victim’s behavior through adversarial interventions in state observations. While several methods for such targeted manipulation attacks have been proposed, they all require white-box access to the victim’s policy, and some rely on environment-specific heuristics. Furthermore, no defense method has been proposed to counter these attacks. To address this, we propose a novel targeted attack method for manipulating the victim, which does not depend on environmental heuristics and applies in black-box and no-box settings. Additionally, we introduce a defense strategy against these attacks. Our theoretical analysis proves that the sensitivity of a policy’s action output to state changes affects the defense performance and that the earlier in the trajectory, the greater the effect. Based on this insight, we introduce a time-discounted regularization as a countermeasure for such behavior targeted attacks, which helps to improve robustness against attacks while maintaining task performance in the absence of attacks. Empirical evaluations demonstrate that our proposed attack method outperforms baseline attack methods. Furthermore, our defense strategy shows superior robustness against existing defense methods designed for untargeted attacks.

4637CWPS: Efficient Channel-Wise Parameter Sharing for Knowledge Transfer

[openreview] [pdf]

Abstract Knowledge transfer aims to apply existing knowledge to different tasks or new data, and it has extensive applications in multi-domain and multi-task learning. The key to this task is quickly identifying a fine-grained object for knowledge sharing and efficiently transferring knowledge. Current methods, such as fine-tuning, layer-wise parameter sharing, and task-specific adapters, only offer coarse-grained sharing solutions and struggle to effectively search for shared parameters, thus hindering the performance and efficiency of knowledge transfer. To address these issues, we propose Channel-Wise Parameter Sharing (CWPS), a novel fine-grained parameter-sharing method for Knowledge Transfer, which is efficient for parameter sharing, comprehensive, and plug-and-play. For the coarse-grained problem, we first achieve fine-grained parameter sharing by refining the granularity of shared parameters from the level of layers to the level of neurons. The knowledge learned from previous tasks can be utilized through the explicit composition of the model neurons. Besides, we promote an effective search strategy to minimize computational costs, simplifying the process of determining shared weights. In addition, our CWPS has strong composability and generalization ability, which theoretically can be applied to any network consisting of linear and convolution layers. We introduce several datasets in both incremental learning and multi-task learning scenarios. Our method has achieved state-of-the-art precision-to-parameter ratio performance with various backbones, demonstrating its efficiency and versatility.

4638The Impact of Element Ordering on LM Agent Performance

[openreview] [pdf]

Abstract There has been a surge of interest in language model agents that can navigate virtual environments such as the web or desktop. To navigate such environments, agents benefit from information on the various elements (e.g., buttons, text, or images) present. However, it remains unclear which element attributes have the greatest impact on agent performance, especially in environments that only provide a graphical representation (i.e., pixels). Here we find that the ordering in which elements are presented to the language model is surprisingly impactful—randomizing element ordering in webpages compromises average agent performance to a degree comparable to removing all visible text from webpages. While web agents benefit from the semantic hierarchical ordering of elements available via the browser, agents that parse elements directly from pixels do not have access to any such ordering. Here we endeavor to derive effective orderings and investigate the impact of various element ordering methods in web and desktop environments. We find that dimensionality reduction provides a viable ordering for pixel-only environments. We train a UI element detection model to derive elements from pixels and apply our findings to an agent benchmark—OmniACT—where we only have access to pixels. Our method completes more than two times as many tasks on average relative to the previous state-of-the-art.

4639ARC-RL: Self-Evolution Continual Reinforcement Learning via Action Representation Space

[openreview] [pdf]

Abstract Continual Reinforcement Learning (CRL) is a powerful tool that enables agents to learn a sequence of tasks, accumulating knowledge learned in the past and using it for problemsolving or future task learning. However, existing CRL methods all assume that the agent’s capabilities remain static within dynamic environments, which doesn’t reflect realworld scenarios where capabilities evolve. This paper introducesSelf-Evolution Continual Reinforcement Learning(SE-CRL), a new and realistic problem where the agent’s action space continually changes. It presents a significant challenge for RL agents: How can policy generalization across different action spaces be achieved? Inspired by the cortical functions that lead to consistent human behavior, we propose anActionRepresentationContinualReinforcementLearning framework (ARC-RL) to address this challenge. Our framework builds a representation space for actions by self-supervised learning on transitions, decoupling the agent’s policy from the specific action space. For a new action space, the decoder of the action representation is expanded or masked for adaptation and regularized fine-tuned to improve the stability of the policy. Furthermore, we release a benchmark based on MiniGrid to validate the effectiveness of methods for SE-CRL. Experimental results demonstrate that our framework significantly outperforms popular CRL methods by generalizing the policy across different action spaces.

4640OmniKV: Dynamic Context Selection for Efficient Long-Context LLMs

[openreview] [pdf]

Abstract During the inference phase of Large Language Models (LLMs) with long context, a substantial portion of GPU memory is allocated to the KV cache, with memory usage increasing as the sequence length grows. To mitigate the GPU memory footprint associate with KV cache, some previous studies have discarded less important tokens based on the sparsity identified in attention scores in long context scenarios. However, we argue that attention scores cannot indicate the future importance of tokens in subsequent generation iterations, because attention scores are calculated based on current hidden states. Therefore, we propose OmniKV, a token-dropping-free and training-free inference method that reduces KV cache GPU memory usage by over 75% without performance degradation. Moreover, OmniKV maintains even accelerates inference efficiency in long-text scenarios. The core innovative insight of OmniKV is: Within a single generation iteration, there is a high degree of similarity in the important tokens identified across consecutive layers. Extensive experiments demonstrate that OmniKV achieves state-of-the-art performance across multiple benchmarks, with particularly advantages in chain-of-thoughts scenarios. By using a single A100 and Llama-3-8B, OmniKV can handle a 450K context with a decoding latency of 7.52 tokens/s, which is 1.87 times faster than the original model running on three A100 GPUs in pipeline.

4641On the Power of Federated Learning for Online Sparse Linear Regression with Decentralized Data

[openreview] [pdf]

Abstract In this paper, we study the necessity of federated learning (FL) for online linear regression with decentralized data. Previous work proved that FL is unnecessary for minimizing regret in full information setting, while we prove that it can be necessary if only limited attributes of each instance are observed. We call this problem online sparse linear regression with decentralized data (OSLR-DecD). We propose a federated algorithm for OSLR-DecD, and prove a lower bound on the regret of any noncooperative algorithm. In the case of d=o(M)d=o(M), the upper bound on the regret of our algorithm is smaller than the lower bound, demonstrating the necessity of FL, in which MM is the number of clients and dd is the dimension of data. When M=1M=1, we give the first lower bound on the regret and improve previous upper bounds. We invent three new techniques including an any-time federated online mirror descent with negative entropy regularization, a paradigm for client-server collaboration with privacy protection, and a reduction from online sparse linear regression to prediction with limited advice for establishing the lower bound on the regret, some of which might be of independent interest.

4642TabKANet: Tabular Data Modeling with Kolmogorov-Arnold Network and Transformer

[openreview] [pdf]

Abstract Tabular data is the most common type of data in real-life scenarios. In this study, we propose the TabKANet model for tabular data modeling, which targets the bottlenecks in learning from numerical content. We constructed a Kolmogorov-Arnold Network (KAN) based Numerical Embedding Module and unified numerical and categorical features encoding within a Transformer architecture. TabKANet has demonstrated stable and significantly superior performance compared to Neural Networks (NNs) across multiple public datasets in binary classification, multi-class classification, and regression tasks. Its performance is comparable to or surpasses that of Gradient Boosted Decision Tree models (GBDTs). Our code is publicly available on GitHub:https://github.com/AI-thpremed/TabKANet.

4643Enhancing Learning with Label Differential Privacy by Vector Approximation

[openreview] [pdf]

Abstract Label differential privacy (DP) is a framework that protects the privacy of labels in training datasets, while the feature vectors are public. Existing approaches protect the privacy of labels by flipping them randomly, and then train a model to make the output approximate the privatized label. However, as the number of classes K increases, stronger randomization is needed, thus the performances of these methods become significantly worse. In this paper, we propose a vector approximation approach, which is easy to implement and introduces little additional computational overhead. Instead of flipping each label into a single scalar, our method converts each label into a random vector with K components, whose expectations reflect class conditional probabilities. Intuitively, vector approximation retains more information than scalar labels. A brief theoretical analysis shows that the performance of our method only decays slightly with K. Finally, we conduct experiments on both synthesized and real datasets, which validate our theoretical analysis as well as the practical performance of our method.

4644Generalization v.s. Memorization: Tracing Language Models’ Capabilities Back to Pretraining Data

[openreview] [pdf]

Abstract The impressive capabilities of large language models (LLMs) have sparked debate over whether these models genuinely generalize to unseen tasks or predominantly rely on memorizing vast amounts of pretraining data. To explore this issue, we introduce an extended concept of memorization, distributional memorization, which measures the correlation between the LLM output probabilities and the pretraining data frequency. To effectively capture task-specific pretraining data frequency, we propose a novel task-gram language model, which is built by counting the co-occurrence of semantically related nn-gram pairs from task inputs and outputs in the pretraining corpus. Using the Pythia models trained on the Pile dataset, we evaluate three distinct tasks: machine translation, factual question answering, and reasoning. Our findings reveal varying levels of memorization, with the strongest effect observed in factual question answering. Furthermore, while model performance improves across all tasks as LLM size increases, only factual question answering shows an increase in memorization, whereas machine translation and reasoning tasks exhibit greater generalization, producing more novel outputs. This study demonstrates that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks, providing a scalable method for analyzing large pretraining corpora in greater depth.

4645Improved Methods for Model Pruning

[openreview] [pdf]

Abstract Model pruning is a performance optimization technique for large language and vision models. However, existing pruning methods often lead to significant performance degradation or require extensive retraining and fine-tuning. This technique aims to identify and remove neurons, connections unlikely leading to the contribution during the machine generation phase. Our goal is to obtain a much smaller and faster foundational model that can quickly generate content almost as good as those of the unpruned models. We propose MAMA (short for Movement and Magnitude Analysis), an improved pruning method that effectively reduces model size and network computational complexity while maintaining performance comparable to the original unpruned model even at extreme pruned levels. The improved method is based on weights, bias, activations and proposed novel pruning indicators. Empirical results show that our method outperforms and be comparable to state-of-the-art methods across various pruning levels. All our code, models, dataset, and demo are publicly available.

4646Towards Understanding Why Group Robustness Methods Work

[openreview] [pdf]

Abstract Deep Learning has made remarkable strides, yet models trained under conventional Empirical Risk Minimization (ERM) approaches encounter challenges regarding their generalization capabilities. In particular, a lack of robustness to spurious correlations. In response, Group Robustness Methods (GRMs) have been developed to combat them. These methods partition training datasets into distinct groups based on spurious features and class labels and adjust their weighting in the loss function. These methods show remarkable performance in dealing with spurious correlations. The underlying mechanisms for their success, however, are not so well understood. Our work contributes by shedding light on the learning dynamics of GRMs, through an empirical and theoretical analysis of them that reveals the differences in feature learning and the type of classifiers they learn versus ERM. Surprisingly, both GRMs and ERM models retain spurious information in their representations, even when it is irrelevant to the task at hand. We find evidence that suggests that the key to GRMs’ success is two-fold: distributing prediction across multiple features in representation space to avoid relying on few but spurious attributes and incentivizing the classifier to become orthogonal to spurious features. We verify our findings by proposing an upgrade to the Subsampling baseline method called Group Distributionally Robust Feature Reweighting (GDRFR) that is easy to compute and only requires a fraction of group labels during a finetuning phase and retrieve most of GRMs performance gains over ERM.

4647Zero-shot Imputation with Foundation Inference Models for Dynamical Systems

[openreview] [pdf]

Abstract Dynamical systems governed by ordinary differential equations (ODEs) serve as models for a vast number of natural and social phenomena. In this work, we offer a fresh perspective on the classical problem of imputing missing time series data, whose underlying dynamics are assumed to be determined by ODEs. Specifically, we revisit ideas from amortized inference and neural operators, and propose a novel supervised learning framework forzero-shot time series imputation, through parametric functions satisfying some (hidden) ODEs. Our proposal consists of two components. First, a broad probability distribution over the space of ODE solutions, observation times and noise mechanisms, with which we generate a large, synthetic dataset of (hidden) ODE solutions, along with their noisy and sparse observations. Second, a neural recognition model that is trainedoffline, to map the generated time series onto the spaces of initial conditions and time derivatives of the (hidden) ODE solutions, which we then integrate to impute the missing data. We empirically demonstrate thatone and the same(pretrained) recognition model can perform zero-shot imputation across 63 distinct time series with missing values, each sampled from widely different dynamical systems. Likewise, we demonstrate that it can perform zero-shot imputation of missing high-dimensional data in 10 vastly different settings, spanning human motion, air quality, traffic and electricity studies, as well as Navier-Stokes simulations —without requiring any fine-tuning. What is more, our proposal often outperforms state-of-the-art methods, which are trained on the target datasets.Our pretrained model is available with the supplementary material

4648Characteristic Function-Based Regularization for Probability Function Informed Neural Networks

[openreview] [pdf]

Abstract Regularization is essential in neural network training to prevent overfitting and improve generalization. In this paper, we propose a novel regularization technique that leverages decomposable distribution and central limit theory assumptions by exploiting the properties of characteristic functions. We first define Probability Function Informed Neural Networks as a class of universal function approximators capable of embedding the knowledge of some probabilistic rules constructed over a given dataset into the learning process (a similar concept to Physics-informed neural networks (PINNs), if the reader is familiar with those). We then enforce a regularization framework over this network, aiming to impose structural constraints on the network’s weights to promote greater generalizability in the given probabilistic setting. Rather than replacing traditional regularization methods such as L2 or dropout, our approach is intended to supplement this and other similar classes of neural network architectures by providing instead a contextual delta of generalization. We demonstrate that integrating this method into such architectures helps improve performance on benchmark supervised classification datasets, by preserving essential distributional properties to mitigate the risk of overfitting. This characteristic function-based regularization offers a new perspective for enhancing distribution-aware learning in machine learning models.

4649Generative Verifiers: Reward Modeling as Next-Token Prediction

[openreview] [pdf]

Abstract Verifiers or reward models are often used to enhance the reasoning performance of large language models (LLMs). A common approach is the Best-of-N method, where N candidate solutions generated by the LLM are ranked by a verifier, and the best one is selected. While LLM-based verifiers are typically trained as discriminative classifiers to score solutions, they do not utilize the text generation capabilities of pretrained LLMs. To overcome this limitation, we instead propose training verifiers using the ubiquitous next-token prediction objective, jointly on verification and solution generation. Compared to standard verifiers, such generative verifiers (GenRM) can benefit from several advantages of LLMs: they integrate seamlessly with instruction tuning, enable chain-of-thought reasoning, and can utilize additional test-time compute via majority voting for better verification. We demonstrate that GenRM outperforms discriminative, DPO verifiers, and LLM-as-a-Judge, resulting in a 16-40% improvement in the number of problems solved with Best-of-N on algorithmic and math reasoning tasks. Furthermore, we find that training GenRM with synthetic verification rationales is sufficient to pick out subtle errors on math problems. Finally, we demonstrate that generative verifiers scale favorably with model size and inference-time compute.

4650CALoR: Towards Comprehensive Model Inversion Defense

[openreview] [pdf]

Abstract Model Inversion Attacks (MIAs) aim at recovering privacy-sensitive training data from the knowledge encoded in the released machine learning models. Recent advances in the MIA field have significantly enhanced the attack performance under multiple scenarios, posing serious privacy risks of Deep Neural Networks (DNNs). However, the development of defense strategies against MIAs is relatively backward to resist the latest MIAs and existing defenses fail to achieve further trade-off between model utility and model robustness. In this paper, we provide an in-depth analysis from the perspective of intrinsic vulnerabilities of MIAs, comprehensively uncovering the weaknesses inherent in the basic pipeline, which are partially investigated in the previous defenses. Building upon these new insights, we propose a robust defense mechanism, integratingConfidenceAdaptationandLow-Rank compression(CALoR). Our method includes a novel robustness-enhanced classification loss specially-designed for model inversion defenses and reveals the extraordinary effectiveness of compressing the classification header. With CALoR, we can mislead the optimization objective, reduce the leaked information and impede the backpropagation of MIAs, thus mitigating the risk of privacy leakage. Extensive experimental results demonstrate that our method achieves state-of-the-art (SOTA) defense performance against MIAs and exhibits superior generalization to existing defenses across various scenarios.

4651Sketching for Convex and Nonconvex Regularized Least Squares with Sharp Guarantees

[openreview] [pdf]

Abstract Randomized algorithms are important for solving large-scale optimization problems efficiently. In this paper, we propose a fast sketching algorithm for least square problems regularized by convex or nonconvex regularization functions, Sketching for Regularized Optimization, or SRO. SRO first generates a sketch of the original data matrix, then solves the sketched problem. We prove minimax rates for sparse signal estimation by solving the sketched sparse convex or nonconvex learning problems. A new Iterative SRO algorithm is proposed to geometrically reduce the approximation error for solving the sketched convex regularized problems. To the best of our knowledge, our results are among the first to demonstrate minimax rates for convex or nonconvex sparse learning problem by sketching under a unified theoretical framework. Experimental results demonstrate the effectiveness of the proposed SRO and Iterative SRO algorithms.

4652There and Back Again: On the relation between noises, images, and their inversions in diffusion models

[openreview] [pdf]

Abstract Denoising Diffusion Probabilistic Models (DDPMs) achieve state-of-the-art performance in synthesizing new images from random noise, but they lack meaningful latent space that encodes data into features. Recent DDPM-based editing techniques try to mitigate this issue by inverting images back to their approximated staring noise. In this work, we study the relation between the initial Gaussian noise, the samples generated from it, and their corresponding latent encodings obtained through the inversion procedure. First, we interpret their spatial distance relations to show the inaccuracy of the DDIM inversion technique by localizing latent representations manifold between the initial noise and generated samples. Then, we demonstrate the peculiar relation between initial Gaussian noise and its corresponding generations during diffusion training, showing that the high-level features of generated images stabilize rapidly, keeping the spatial distance relationship between noises and generations consistent throughout the training.

4653Scaling FP8 training to trillion-token LLMs

[openreview] [pdf]

Abstract We train, for the first time, large language models using FP8 precision on datasets up to 2 trillion tokens --- a 20-fold increase over previous limits. Through these extended training runs, we uncover critical instabilities in FP8 training that were not observable in earlier works with shorter durations. We trace these instabilities to outlier amplification by the SwiGLU activation function. Interestingly, we show, both analytically and empirically, that this amplification happens only over prolonged training periods, and link it to a SwiGLU weight alignment process. To address this newly identified issue, we introduce Smooth-SwiGLU, a novel modification that ensures stable FP8 training without altering function behavior. We also demonstrate, for the first time, FP8 quantization of both Adam optimizer moments. Combining these innovations, we successfully train a 7B parameter model using FP8 precision on 256 Intel Gaudi2 accelerators, achieving on-par results with the BF16 baseline while delivering up to a \sim 34 % throughput improvement. A reference implementation is supplied inhttps://github.com/Anonymous1252022/Megatron-DeepSpeed

4654Reflective Gaussian Splatting

[openreview] [pdf]

Abstract Novel view synthesis has experienced significant advancements owing to increasingly capable NeRF- and 3DGS-based methods. However, reflective object reconstruction remains challenging, lacking a proper solution to achieve real-time, high-quality rendering while accommodating inter-reflection. To fill this gap, we introduce a Reflective Gaussian splatting (Ref-Gaussian) framework characterized with two components: (I) Physically based deferred rendering that empowers the rendering equation with pixel-level material properties via formulating split-sum approximation; (II) Gaussian-grounded inter-reflection that realizes the desired inter-reflection function within a Gaussian splatting paradigm for the first time. To enhance geometry modeling, we further introduce material-aware normal propagation and an initial per-Gaussian shading stage, along with 2D Gaussian primitives. Extensive experiments on standard datasets demonstrate that Ref-Gaussian surpasses existing approaches in terms of quantitative metrics, visual quality, and compute efficiency. Further, we illustrate that Ref-Gaussian supports more applications such as relighting and editing.

4655Laplace Sample Information: Data Informativeness Through a Bayesian Lens

[openreview] [pdf]

Abstract Accurately estimating the informativeness of individual samples in a dataset is an important objective in deep learning, as it can guide sample selection, which can improve model efficiency and accuracy by removing redundant or potentially harmful samples. We propose Laplace Sample Information\text{\textit{Laplace Sample Information}} (LSI\mathsf{LSI}) measure of sample informativeness grounded in information theory widely applicable across model architectures and learning settings. LSI\mathsf{LSI} leverages a Bayesian approximation to the weight posterior and the KL divergence to measure the change in the parameter distribution induced by a sample of interest from the dataset. We experimentally show that LSI\mathsf{LSI} is effective in ordering the data with respect to typicality, detecting mislabeled samples, measuring class-wise informativeness, and assessing dataset difficulty. We demonstrate these capabilities of LSI\mathsf{LSI} on image and text data in supervised and unsupervised settings. Moreover, we show that LSI\mathsf{LSI} can be computed efficiently through probes and transfers well to the training of large models.

4656ELICIT: LLM Augmentation Via External In-context Capability

[openreview] [pdf]

Abstract Enhancing the adaptive capabilities of large language models is a critical pursuit in both research and application. Traditional fine-tuning methods require substantial data, computational resources, and specific capabilities, while in-context learning is limited by the need for appropriate demonstrations and efficient token usage. Inspired by the expression of in-context learned capabilities through task vectors and the concept of modular capability or knowledge, we propose ELICIT, a framework consisting of two modules designed to effectively store and reuse task vectors to enhance the diverse adaptive capabilities of models without additional training or inference tokens. Our comprehensive experiments and analysis demonstrate that our pipeline is highly transferable across different input formats, tasks, and model architectures. Externally storing and reusing vectors that represent in-context learned capabilities not only shows the potential to extract modular capabilities but also significantly enhances the performance, versatility, adaptability, and scalability of large language models, paving the way for more efficient and effective use of these models in a wide range of applications.

4657When GNNs meet symmetry in ILPs: an orbit-based feature augmentation approach

[openreview] [pdf]

Abstract A common characteristic in integer linear programs (ILPs) is symmetry, allowing variables to be permuted without altering the underlying problem structure. Recently, GNNs have emerged as a promising approach for solving ILPs. However, a significant challenge arises when applying GNNs to ILPs with symmetry: classic GNN architectures struggle to differentiate between symmetric variables, which limits their predictive accuracy. In this work, we investigate the properties of permutation equivalence and invariance in GNNs, particularly in relation to the inherent symmetry of ILP formulations. We reveal that the interaction between these two factors contributes to the difficulty of distinguishing between symmetric variables. To address this challenge, we explore the potential of feature augmentation and propose several guiding principles for constructing augmented features. Building on these principles, we develop an orbit-based augmentation scheme that first groups symmetric variables and then samples augmented features for each group from a discrete uniform distribution. Empirical results demonstrate that our proposed approach significantly enhances both training efficiency and predictive performance.

4658Parrot: Multilingual Visual Instruction Tuning

[openreview] [pdf]

Abstract The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs’ inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks.

4659LoKO: Low-Rank Kalman Optimizer for Online Fine-Tuning of Large Models

[openreview] [pdf]

Abstract Training large models with millions or even billions of parameters from scratch incurs substantial computational costs. Parameter Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), address this challenge by adapting only a reduced number of parameters to specific tasks with gradient-based optimizers. In this paper, we cast PEFT as an optimal filtering/state estimation problem and present Low-Rank Kalman Optimizer (LoKO) to estimate the optimal trainable parameters in an online manner. We leverage the low-rank decomposition in LoRA to significantly reduce matrix sizes in Kalman iterations and further capitalize on a diagonal approximation of the covariance matrix to effectively decrease computational complexity from quadratic to linear in the number of trainable parameters. Moreover, we discovered that the initialization of the covariance matrix within the Kalman algorithm and the accurate estimation of the observation noise covariance are the keys in this formulation, and we propose robust approaches that work well across a vast range of well-established computer vision and language models. Our results show that LoKO converges with fewer iterations and yields better performance models compared to commonly used optimizers with LoRA in both image classifications and language tasks. Our study opens up the possibility of leveraging the Kalman filter as an effective optimizer for the online fine-tuning of large models.

4660Geometry of the Loss Landscape in Invariant Deep Linear Neural Networks

[openreview] [pdf]

Abstract Equivariant and invariant machine learning models seek to take advantage of symmetries and other structures present in the data to reduce the sample complexity of learning. Empirical work has suggested that data-driven methods, such as regularization and data augmentation, may achieve a comparable performance as genuinely invariant models, but theoretical results are still limited. In this work, we conduct a theoretical comparison of three different approaches to achieve invariance: data augmentation, regularization, and hard-wiring. We focus on mean squared error regression with deep linear networks, which parametrize rank-bounded linear maps and can be hard-wired to be invariant to specific group actions. We show that the optimization problems resulting from hard-wiring and data augmentation have the same critical points, all of which are saddles except for the global optimum. In contrast, regularization leads to a larger number of critical points, again all of which are saddles except for the global optimum. The regularization path is continuous and converges to the hard-wired optimum.

4661Communication-Efficient Heterogeneous Federated Learning with Generalized Heavy-Ball Momentum

[openreview] [pdf]

Abstract Federated Learning (FL) has emerged as the state-of-the-art approach for learning from decentralized data in privacy-constrained scenarios. However, system and statistical challenges hinder real-world applications, which demand efficient learning from edge devices and robustness to heterogeneity. Despite significant research efforts, existing approaches (i) are not sufficiently robust, (ii) do not perform well in large-scale scenarios, and (iii) are not communication efficient. In this work, we propose a novelGeneralized Heavy-Ball Momentum(GHBM), proving that it enjoys an improved theoretical convergence rate w.r.t. existing FL methods based on classical momentum inpartial participation, without relying on bounded data heterogeneity. Then, we present FedHBM as an adaptive, communication-efficient by-design instance of GHBM. Extensive experimentation on vision and language tasks, in both controlled and realistic large-scale scenarios, confirms our theoretical findings, showing that GHBM substantially improves the state of the art, especially in large scale scenarios with high data heterogeneity and low client participation.

4662IRIS: An Iterative and Integrated Framework for Real-Time Causal Discovery

[openreview] [pdf]

Abstract Causal discovery is fundamental to scientific research, yet traditional statistical algorithms face significant challenges, including expensive data collection, redundant examination of known relations, and unrealistic assumptions. Additionally, while recent LLM-based methods excel at identifying commonly known causal relations, they fall short in uncovering novel relations. We introduce IRIS (Iterative Retrieval and Integrated System for Real-Time Causal Discovery), a novel framework that addresses these limitations. Starting with a set of initial variables, IRIS automatically retrieves relevant documents, extracts variable values, and organizes data for statistical algorithms in real-time. Our hybrid causal discovery method combines statistical algorithms and LLM-based methods to discover existing and novel causal relations. The missing variable proposal component identifies missing variables, and subsequently, IRIS expands the causal graphs by including both the initial and the newly suggested variables. Our approach offers a scalable and adaptable solution for causal discovery, enabling the exploration of causal relations from a set of initial variables without requiring pre-existing datasets.

4663LLM-Guided Self-Supervised Tabular Learning With Task-Specific Pre-text Tasks

[openreview] [pdf]

Abstract One of the most common approaches for self-supervised representation learning is defining pre-text tasks to learn data representations. Existing works determine pre-text tasks in a "task-agnostic’’ way, without considering the forthcoming downstream tasks. This offers an advantage of broad applicability across tasks, but can also lead to a mismatch between task objectives, potentially degrading performance on downstream tasks. In this paper, we introduce TST-LLM, a framework that effectively reduces this mismatch when the natural language-based description of the downstream task is given without any ground-truth labels. TST-LLM instructs the LLM to use the downstream task’s description and meta-information of data to discover features relevant to the target task. These discovered features are then treated as ground-truth labels to define "target-specific’’ pre-text tasks. TST-LLM consistently outperforms contemporary baselines, such as STUNT and LFR, with win ratios of 95% and 81%, when applied to 22 benchmark tabular datasets, including binary and multi-class classification, and regression tasks.

4664Latent Boost: Leveraging Latent Space Distance Metrics to Augment Classification Performance

[openreview] [pdf]

Abstract The pursuit of boosting classification performance in Machine Learning has primarily focused on refining model architectures and hyperparameters through probabilistic loss optimization. However, such an approach often neglects the profound, untapped potential embedded in internal structural information, which can significantly elevate the training process. In this work, we introduce Latent Boost, a novel approach that incorporates the very definition of classification via latent representation distance metrics to enhance the conventional dataset-oriented classification training. Thus during training, the model is not only optimized for classification metrics of the discrete data points but also adheres to the rule that the collective representation zones of each class should be sharply clustered. By leveraging the rich structural insights of high-dimensional latent representations, Latent Boost not only improves classification metrics like F1-Scores but also brings additional benefits of improved interpretability with higher silhouette scores and steady-fast convergence with fewer training epochs. Latent Boost brings these performance and latent structural benefits with minimum additional cost and no data-specific requirements.

4665Learning-Augmented Learning of Gaussian Mixture Models

[openreview] [pdf]

Abstract Gaussian mixture models (GMMs) is one of the most fundamental methods to identify and extract latent structure in complex datasets. Unfortunately, well-known hardness results require that any algorithm for learning a mixture of kk multivariate Gaussian distributions in dd-dimensional space requires both runtime and sample complexity exponential in dd, even if the Gaussians are reasonably separated. To overcome this barrier, we consider settings where algorithms are augmented with possibly erroneous ``advice’’ to help learn the underlying GMMs. In particular, we consider a natural predictor that can be easily trained through machine learning models. Specifically, our predictor outputs a list of β\beta possible labels for each sample from the mixture such that, with probability at least 1α1-\alpha, one of the labels in the list is the true label, for a fixed constant α\alpha. We show that to estimate the mixture up to total variation distance O~(ε)\tilde{\mathcal{O}}(\varepsilon), we can use kpoly(d,logk,1ε)k\cdot\text{poly}\left(d,\log k,\frac{1}{\varepsilon}\right) samples from the GMM, provided that β\beta is upper bounded by any fixed constant. Moreover, our algorithm uses polynomial time, thus breaking known computational limitations of algorithms that do not have access to such advice.

4666Multiplicative Logit Adjustment Approximates Neural-Collapse-Aware Decision Boundary Adjustment

[openreview] [pdf]

Abstract Real-world data distributions are often highly skewed. This has spurred a growing body of research on long-tailed recognition, aimed at addressing the imbalance in training classification models. Among the methods studied, multiplicative logit adjustment (MLA) stands out as a simple and effective method. What theoretical foundation explains the effectiveness of this heuristic method? We provide a justification for the effectiveness of MLA with the following two-step process. First, we develop a theory that adjusts optimal decision boundaries by estimating feature spread on the basis of neural collapse. Second, we demonstrate that MLA approximates this optimal method. Additionally, through experiments on long-tailed datasets, we illustrate the practical usefulness of MLA under more realistic conditions. We also offer experimental insights to guide the tuning of MLA hyperparameters.

4667Query Optimization Detection Transformer for Small Objects in Remote Sensing Images

[openreview] [pdf]

Abstract Object detection in remote sensing images is a challenging task. Remote sensing images contain substantial background noise and complex contextual information, which weakens the feature representation of small objects, making detection difficult. To solve these problems, a detection Transformer for small objects in remote sensing images is proposed, called QO-DETR. Specifically, to enhance the feature representation of small objects, a query proposal generation module is designed to select queries based on multi-class classification scores. These queries provide the initial position embeddings for object queries in the decoder, enabling the decoder’s attention mechanism to focus on object regions. To improve the model’s robustness to noise, a group denoising module is designed to add noise into decoder queries during training, enhancing the network’s ability to reconstruct object features from noise. To accurately locate small objects, a query cascade refinement strategy is designed, and each decoder layer refines anchor parameters under the guidance of preceding layers to achieve spatial alignment between the anchor and the object. Experiments have been carried out on DIOR and AI-TOD. The AP and APs on DIOR reach 51.3% and 13.4%, respectively, while on AI-TOD, they reach 23.6% and 30.1%. QO-DETR shows superior performance in detecting small objects.

4668How Low Can You Go? Searching for the Intrinsic Dimensionality of Complex Networks using Metric Node Embeddings

[openreview] [pdf]

Abstract Low-dimensional embeddings are essential for machine learning tasks involving graphs, such as node classification, link prediction, community detection, network visualization, and network compression. Although recent studies have identified exact low-dimensional embeddings, the limits of the required embedding dimensions remain unclear. We presently prove that lower dimensional embeddings are possible when using metric embeddings as opposed to vector-based inner product embeddings such as Logistic PCA (LPCA). We further provide an efficient logarithmic search procedure for identifying the exact embedding dimension and demonstrate how metric embeddings enable inference of the exact embedding dimensions of large-scale networks by exploiting that the metric properties can be used to provide linearithmic scaling. Empirically, we show that our approach extracts substantially lower dimensional representations of networks than previously reported for small-sized networks. For the first time, we demonstrate that even large-scale networks can be effectively embedded in very low-dimensional spaces, and provide examples of scalable, exact reconstruction for graphs with up to a million nodes. Our approach highlights that the intrinsic dimensionality of networks is substantially lower than previously reported and provides a computationally efficient assessment of the exact embedding dimension also of large-scale networks. The surprisingly low dimensional representations achieved demonstrate that networks in general can be losslessly represented using very low dimensional feature spaces, which can be used to guide existing network analysis tasks from community detection and node classification to structure revealing exact network visualizations.

4669Verbosity≠Veracity: Demystify Verbosity Compensation Behavior of Large Language Models

[openreview] [pdf]

Abstract When unsure about an answer, humans often respond with more words than necessary, hoping that part of the response will be correct. We observe a similar behavior in large language models (LLMs), which we term ``Verbosity Compensation" (VC). VC is harmful because it confuses the user understanding, leading to low efficiency, and influences the LLM services by increasing the latency and cost of generating useless tokens. In this paper, we present the first work that defines and analyzes Verbosity Compensation (VC), explores its causes, and proposes a simple mitigating approach. We define Verbosity Compensation (VC) as the behavior of generating responses that can be compressed without information loss when prompted to write concisely. Our experiments, conducted on five datasets of knowledge and reasoning-based QA tasks with 14 newly developed LLMs, reveal three conclusions. 1) We reveal a pervasive presence of verbosity compensation (VC) across all models and all datasets. Notably, GPT-4 exhibits a VC frequency of 50.40%. 2) We reveal the large performance gap between verbose and concise responses, with a notable difference of 27.61% on the Qasper dataset. We also show this difference cannot be naturally mitigated with the capability of LLM increases Both 1) and 2) highlight the urgent need to mitigate the frequency of VC behavior and disentangle verbosity with veracity. We propose a simple yet effective cascade algorithm that replaces the verbose responses with the other model-generated responses. The results show that our approach effectively alleviates the VC of the Mistral model from 63.81% to 16.16% on the Qasper dataset. 3) We also find that verbose responses exhibit higher uncertainty across all five datasets, suggesting a strong connection between verbosity and model uncertainty. Our dataset and code will be released.

4670Auditingf-Differential Privacy in One Run

[openreview] [pdf]

Abstract Privacy-preserving machine learning requires carefully designed and rigorously analyzed algorithms. However, such designs and analyses are often susceptible to errors or imperfections, leading to mechanisms that may not offer the expected level of privacy due to mathematical inaccuracies or implementation flaws. Conversely, some mechanisms might provide stronger privacy guarantees than can be proven through a loose analysis. Empirical privacy auditing has emerged as a means to address this gap. Existing auditing mechanisms, however, are either inefficient—requiring multiple runs of machine learning algorithms—or suboptimal in calculating the empirical privacy of these algorithms. In this work, we present a tight and efficient auditing procedure and analysis that can effectively assess the privacy of mechanisms. Our approach requires only a single run of the mechanism and achieves tight empirical privacy by leveraging the ff-DP curve, which provides a more accurate measure of privacy than the traditional ϵ,δ\epsilon,\delta parameters. Experiments demonstrate that our auditing algorithm delivers tighter empirical privacy guarantees.

4671ACES: Automatic Cohort Extraction System for Event-Stream Datasets

[openreview] [pdf]

Abstract Reproducibility remains a significant challenge in machine learning (ML) for healthcare. Datasets, model pipelines, and even task/cohort definitions are often private in this field, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. This paper addresses a significant part of this problem by introducing the Automatic Cohort Extraction System (ACES) for event-stream data. This library is designed to simultaneously simplify the development of task/cohorts for ML in healthcare and also enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or toanydataset in which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks that learn representations, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality.

4672VICtoR: Learning Hierarchical Vision-Instruction Correlation Rewards for Long-horizon Manipulation

[openreview] [pdf]

Abstract We study reward models for long-horizon manipulation by learning from action-free videos and language instructions, which we term the visual-instruction correlation (VIC) problem. Existing VIC methods face challenges in learning rewards for long-horizon tasks due to their lack of sub-stage awareness, difficulty in modeling task complexities, and inadequate object state estimation. To address these challenges, we introduce VICtoR, a novel hierarchical VIC reward model capable of providing effective reward signals for long-horizon manipulation tasks. Trained solely on primitive motion demonstrations, VICtoR effectively provides precise reward signals for long-horizon tasks by assessing task progress at various stages using a novel stage detector and motion progress evaluator. We conducted extensive experiments in both simulated and real-world datasets. The results suggest that VICtoR outperformed the best existing methods, achieving a 43% improvement in success rates for long-horizon tasks.

4673ToM-agent: Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection

[openreview] [pdf]

Abstract Recent studies have increasingly demonstrated that large language models (LLMs) possess significant theory of mind (ToM) capabilities, showing the potential for simulating the tracking of mental states in generative agents. In this study, we propose a novel paradigm called ToM-agent, designed to empower LLMs-based generative agents to simulate ToM in open-domain conversational interactions. ToM-agent disentangles the confidence from mental states, facilitating the emulation of an agent’s perception of its counterpart’s mental states, such as beliefs, desires, and intentions (BDIs). Using past conversation history and verbal reflections, ToM-Agent can dynamically adjust counterparts’ inferred BDIs, along with related confidence levels. We further put forth a counterfactual intervention method that reflects on the gap between the predicted responses of counterparts and their real utterances, thereby enhancing the efficiency of reflection. Leveraging empathetic and persuasion dialogue datasets, we assess the advantages of implementing the ToM-agent with downstream tasks, as well as its performance in both the first-order and the second-order ToM. Our findings indicate that the ToM-agent can grasp the underlying reasons for their counterpart’s behaviors beyond mere semantic-emotional supporting or decision-making based on common sense, providing new insights for studying large-scale LLMs-based simulation of human social behaviors.

4674In-context KV-Cache Eviction for LLMs via Attention-Gate

[openreview] [pdf]

Abstract The KV-Cache technique has become the standard for the inference of large language models (LLMs). It caches states of self-attention to avoid recomputation. Yet, it is widely criticized that KV-Cache can become a bottleneck of the LLM inference system, especially when confronted with ultra-large models and long-context queries. A natural remedy is to discard the KV-Cache for less important tokens, with StreamingLLM as an example, but the used static eviction strategies cannot flexibly adapt to varying contexts. Remedies like H2O leverage accumulative attention scores to perform dynamic eviction but suffer from the attention bias issue in capturing contextual information. This paper bridges this gap by devising a parameterized KV-Cache eviction mechanism, dubbed asAttention-Gate, which accepts the whole context as input and yields eviction flags for each token to realizein-contexteviction. The subsequent self-attention module proceeds according to the flags and only the KV states for the remaining tokens need to be cached. The Attention-Gates can vary among different heads and layers and be trivially plugged into pre-trained LLMs, tuned by cost-effective continual pre-training or supervised fine-tuning objectives to acquire what to discard. The computational and memory overhead introduced by Attention-Gates is minimal. Our method is validated across multiple tasks, demonstrating both efficiency and adaptability. After a highly efficient continual pre-training, it achieves higher average accuracy and evicts more tokens compared to traditional training-free methods. In supervised fine-tuning, it not only evicts many tokens but also outperforms LoRA-finetuned LLMs on some datasets, such as RTE, where it improves accuracy by 13.9% while evicting 62.8% of tokens, showing that effective eviction of redundant tokens can even enhance performance.

4675Markov Persuasion Processes: Learning to Persuade From Scratch

[openreview] [pdf]

Abstract In Bayesian persuasion, an informed sender strategically discloses information to a receiver so as to persuade them to undertake desirable actions. Recently, Markov persuasion processes (MPPs) have been introduced to capture sequential scenarios where a sender faces a stream of myopic receivers in a Markovian environment. The MPPs studied so far in the literature suffer from issues that prevent them from being fully operational in practice, e.g., they assume that the sender knows receivers’ rewards. We fix such issues by addressing MPPs where the sender has no knowledge about the environment. We design a learning algorithm for the sender, working with partial feedback. We prove that its regret with respect to an optimal information-disclosure policy grows sublinearly in the number of episodes, as it is the case for the loss in persuasiveness cumulated while learning. Moreover, we provide a lower bound for our setting matching the guarantees of our algorithm.

4676Selective Task Group Updates for Multi-Task Optimization

[openreview] [pdf]

Abstract Multi-task learning enables the acquisition of task-generic knowledge by training multiple tasks within a unified architecture. However, training all tasks together in a single architecture can lead to performance degradation, known as negative transfer, which is a main concern in multi-task learning. Previous works have addressed this issue by optimizing the multi-task network through gradient manipulation or weighted loss adjustments. However, their optimization strategy focuses on addressing task imbalance in shared parameters, neglecting the learning of task-specific parameters. As a result, they show limitations in mitigating negative transfer, since the learning of shared space and task-specific information influences each other during optimization. To address this, we propose a different approach to enhance multi-task performance by selectively grouping tasks and updating them for each batch during optimization. We introduce an algorithm that adaptively determines how to effectively group tasks and update them during the learning process. To track inter-task relations and optimize multi-task networks simultaneously, we propose proximal inter-task affinity, which can be measured during the optimization process. We provide a theoretical analysis on how dividing tasks into multiple groups and updating them sequentially significantly affects multi-task performance by enhancing the learning of task-specific parameters. Our methods substantially outperform previous multi-task optimization approaches and are scalable to different architectures and various numbers of tasks.

4677Improving Reasoning Performance in Large Language Models via Representation Engineering

[openreview] [pdf]

Abstract Recent advancements in large language models (LLMs) have resulted in increasingly anthropomorphic language concerning the ability of LLMs to reason. Whether reasoning in LLMs should be understood to be inherently special is, however, widely debated. We propose utilizing a representation engineering approach wherein model activations are read from the residual stream of an LLM when processing a reasoning task. The activations are used to derive a control vector that is applied to the model as an inference-time intervention, modulating the representational space of the model, to improve performance on the specified task. We additionally contribute a framework for deriving control vectors and analyzing model representations. The framework allows us to induce improved reasoning performance and assess how control vectors influence the final logit distribution of a model via metrics such as KL divergence and entropy. We apply control vectors to Mistral-7B-Instruct and a range of Pythia models on a deductive and an inductive reasoning task respectively. We show that an LLM can, to a certain degree, be controlled to improve its perceived reasoning ability by modulating activations. The intervention is dependent upon the ability to reliably extract the model’s typical state when correctly solving a task. Our results suggest that there is no intrinsic difference between the process of reasoning and other information-processing tasks performed by LLMs. They furthermore demonstrate that we are capable of improving the reasoning performance of LLMs via a simple intervention on the residual stream with no additional training.

4678LION: A bidirectional framework that trains like a Transformer and infers like an RNN

[openreview] [pdf]

Abstract We introduce LION, a novel sequence-to-sequence framework that unifies the bidirectionality and parallelized training of Transformers with the fast inference of recurrent neural networks. LION is built upon a mathematical formulation where full kernelized attention with a learnable mask is efficiently computed using a bidirectional selective recurrent model, matching the effectiveness of softmax-based attention with constant-time inference. Our framework naturally accounts for spatial and temporal relationships within input sequences, reducing reliance on heuristic positional embeddings and facilitating straightforward scalability in context length and resolution. Using our framework and inspired by the recent state-space models, we propose LION-S, a transformer with selectiveselective mask and recurrent inference. Numerical evaluations on tasks such as language modeling, the Long-Range Arena, and image classification show that LION-S achieves performance on par with state-of-the-art models while delivering superior inference efficiency.

4679Mastering Syntax, Unlocking Semantics: A Mathematically Provable Two-stage Learning Process in Transformers

[openreview] [pdf]

Abstract Transformers have emerged as a cornerstone across various fields with extensive applications. However, the training dynamics of transformers remain relatively underexplored. In this work, we present a novel perspective on how transformers acquire knowledge during the training dynamics, inspired by the feature learning theory. To this end, we conceptualize each token as embodying two types of knowledge: elementary knowledge represented by syntactic information, and specialized knowledge represented by semantic information. Building on this data structure, we rigorously prove that transformers follow a syntax-then-semantics learning paradigm, i.e., first mastering syntax in the Elementary Stage and then unlocking semantics in the subsequent Specialized Stage. The results are derived from the training dynamics analysis and finite-time convergence within the in-context learning framework for supervised classification. To our best knowledge, this is the \textbf{\emph{first}} rigorous result of a two-stage optimization process in transformers from a feature learning perspective. Empirical findings on real-world language datasets support the theoretical results of the two-stage learning process. Moreover, the spectral properties of attention weights, derived from our theoretical framework, align with the experimental observations, providing further validation.

4680Gradient Inversion Transcript: A Generative Model to Reconstruct Training Data by Gradient Leakage

[openreview] [pdf]

Abstract We propose Gradient Inversion Transcript (GIT), a generic approach for reconstructing training data from gradient leakage in distributed learning using a generative model. Unlike traditional gradient matching techniques, GIT requires only the model architecture information, without access to the model’s parameters, making it more applicable to real-world distributed learning settings. Additionally, GIT operates offline, eliminating the need for intensive gradient requests and online optimization. Compared to existing generative methods, GIT adaptively constructs a generative network, with an architecture specifically tailored to the structure of the distributed learning model. Our extensive experiments demonstrate that GIT significantly improves reconstruction accuracy, especially in the case of deep models. In summary, we offer a more effective and theoretically grounded strategy for exploiting vulnerabilities of gradient leakage in distributed learning, advancing the understanding of privacy risks in collaborative learning environments.

4681Revisiting Nearest Neighbor for Tabular Data: A Deep Tabular Baseline Two Decades Later

[openreview] [pdf]

Abstract The widespread enthusiasm for deep learning has recently expanded into the domain of tabular data. Recognizing that the advancement in deep tabular methods is often inspired by classical methods, e.g., integration of nearest neighbors into neural networks, we investigate whether these classical methods can be revitalized with modern techniques. We revisit a differentiable version of KK-nearest neighbors (KNN) --- Neighbourhood Components Analysis (NCA) --- originally designed to learn a linear projection to capture semantic similarities between instances, and seek to gradually add modern deep learning techniques on top. Surprisingly, our implementation of NCA using SGD and without dimensionality reduction already achieves decent performance on tabular data, in contrast to the results of using existing toolboxes like scikit-learn. Further equipping NCA with deep representations and additional training stochasticity significantly enhances its capability, being on par with the leading tree-based method CatBoost and outperforming existing deep tabular models in both classification and regression tasks on 300 datasets. We conclude our paper by analyzing the factors behind these improvements, including loss functions, prediction strategies, and deep architectures.

4682Mastering Task Arithmetic:τJp as a Key Indicator for Weight Disentanglement

[openreview] [pdf]

Abstract Model-editing techniques using task arithmetic have rapidly gained attention. Through task arithmetic, simply through arithmetic operations on the weights of pre-trained and fine-tuned models create desired models, such as multi-task models, models with specific tasks unsolvable, or domain-transferred models. However, task arithmetic faces challenges, such as low reproducibility and the high cost associated with adjusting coefficients in the arithmetic operations on model parameters, which have limited its practical success. In this paper, we present three key contributions in the context of task addition and task negation within task arithmetic. First, we propose a new metric called τ\tauJp which is based on the product of the task vector (τ\tau) and the Jacobian of the pre-trained model with respect to its weights. We show that τ\tauJp has a causal relationship with the interference that occurs from arithmetic operations. Second, we show that introducing regularization to minimize τ\tauJp significantly mitigates interference between task inferences, which leads to eliminating coefficient tuning and better accuracy on each task. Third, in the context of incremental learning, we confirmed that our τ\tauJp regularization demonstrates more robust performance in environments where future tasks to be learned are not accessible, validating the scalability of the approach. Finally, we demonstrate that the τ\tauJp regularizer further reinforces the performance of task arithmetic by leveraging publicly available fine-tuned models, offering practical benefits for real-world applications.

4683Bridging Visual Communication and Data Exploration through Pose-Driven Query Synthesis

[openreview] [pdf]

Abstract SQL is widely used for managing relational databases and conducting interactive data analysis. Now, various natural language interfaces have emerged, designed to simplify the process of crafting SQL queries by translating natural language commands into executable SQL-Code. However, the communication preferences of the deaf and hard-of-hearing community have been largely overlooked. This paper introduces R-KinetiQuery, a groundbreaking framework for domain-adaptive sign language to SQL query translation, underpinned by a rigorous mathematical foundation synthesizing functional analysis, ergodic theory, and information geometry. At its core, R-KinetiQuery addresses the fundamental challenge of domain adaptation in the context of multimodal language translation, specifically tailored to bridge the gap between sign language communication and database query languages. A key innovation lies in our application of ergodic theory to analyze the long-term behavior of R-KinetiQuery under domain shift. We establish an ergodic theorem for the model’s time-averaged operator, demonstrating its convergence to the expected behavior across domains. This result provides a robust foundation for the model’s stability and adaptability in non-stationary environments. Our information-theoretic analysis reveals a deep connection between R-KinetiQuery and the Information Bottleneck principle. We derive a variational bound that explicitly quantifies the trade-off between compression and prediction in the model’s latent representation, providing insights into its domain-invariant feature learning. Empirically, we demonstrate R-KinetiQuery’s superior performance on a diverse set of domain adaptation tasks, consistently outperforming state-of-the-art baselines. Our experiments span a wide range of domain shifts, from subtle variations in sign language dialects to dramatic changes in database schemas and query complexities.

4684Score-based pullback Riemannian geometry

[openreview] [pdf]

Abstract Data-driven Riemannian geometry has emerged as a powerful tool for interpretable representation learning, offering improved efficiency in downstream tasks. Moving forward, it is crucial to balance cheap manifold mappings with efficient training algorithms. In this work, we integrate concepts from pullback Riemannian geometry and generative models to propose a framework for data-driven Riemannian geometry that is scalable in both geometry and learning: score-based pullback Riemannian geometry. Focusing on unimodal distributions as a first step, we propose a score-based Riemannian structure with closed-form geodesics that pass through the data probability density. With this structure, we construct a Riemannian autoencoder (RAE) with error bounds for discovering the correct data manifold dimension. This framework can naturally be used with anisotropic normalizing flows by adopting isometry regularization during training. Through numerical experiments on various datasets, we demonstrate that our framework not only produces high-quality geodesics through the data support, but also reliably estimates the intrinsic dimension of the data manifold and provides a global chart of the manifold, even in high-dimensional ambient spaces.

4685Dual-cycle Consistency Learning for Weakly Supervised Phrase Grounding

[openreview] [pdf]

Abstract Weakly supervised phrase grounding (WSPG) aims to localize objects referred by phrases without region-level annotations. The state-of-the-art methods use vision-language pre-trained (VLP) models to build pseudo labels. However, their low quality could result in the ineffectiveness of the subsequent learning. In this paper, we propose a novel WSPG framework, Dual-cycle Consistency Learning (DCL). Firstly, we propose a vision-modal cycle consistency to localize the referred objects and reconstruct the pseudo labels. To provide a conditional guidance, we propose a visual prompt engineering to generate marks for input images. To further avoid localizing randomly, we design a confidence-based regularization to filter out redundant information in image and pixel levels. Secondly, we propose a language-modal cycle consistency to correctly recognize the referred objects. To correct their positions, we provide phrase-related boxes as supervision for further learning. Extensive experiments on benchmark datasets show the effectiveness of DCL, as well as its excellent compatibility with various VLP models. The source code will be available at GitHub after double-blind phase.

4686Symmetric Kernels with Non-Symmetric Data: A Data-Agnostic Learnability Bound

[openreview] [pdf]

Abstract Kernel ridge regression (KRR) and Gaussian processes (GPs) are fundamental tools in statistics and machine learning, with recent applications to highly over-parameterized deep neural networks. The ability of these tools to learn a target function is directly related to the eigenvalues of their kernel sampled on the input data. Targets having support on higher eigenvalues are more learnable. While kernels are often highly symmetric objects, the data is often not. Thus, kernel symmetry seems to have little to no bearing on the above eigenvalues or learnability, making spectral analysis on real-world data challenging. Here, we show that contrary to this common lure, one may use eigenvalues and eigenfunctions associated with highly idealized data measures to bound learnability on realistic data. As a demonstration, we give a theoretical lower bound on the sample complexity of copying heads for kernels associated with generic transformers acting on natural language.

4687The Disparate Benefits of Deep Ensembles

[openreview] [pdf]

Abstract Ensembles of Deep Neural Networks, Deep Ensembles, are widely used as a simple way to boost predictive performance. However, their impact on algorithmic fairness is not well understood yet. Algorithmic fairness investigates how a model’s performance varies across different groups, typically defined by protected attributes such as age, gender, or race. In this work, we investigate the interplay between the performance gains from Deep Ensembles and fairness. Our analysis reveals that they unevenly favor different groups in what we refer to as a disparate benefits effect. We empirically investigate this effect with Deep Ensembles applied to popular facial analysis and medical imaging datasets, where protected group attributes are given and find that it occurs for multiple established group fairness metrics, including statistical parity and equal opportunity. Furthermore, we identify the per-group difference in predictive diversity of ensemble members as the potential cause of the disparate benefits effect. Finally, we evaluate different approaches to reduce unfairness due to the disparate benefits effect. Our findings show that post-processing is an effective method to mitigate this unfairness while preserving the improved performance of Deep Ensembles.

4688TranSpa: Towards Efficient Structured Sparse Training for Transformers

[openreview] [pdf]

Abstract Transformers have emerged as the backbone neural network architecture in today’s AI applications. Due to their high complexity, sparsifying transformers, at both pre-training and fine-tuning stages, is very attractive for lower the training and inference costs. In this paper, we propose TranSpa, an efficient structured sparse training approach for language and vision transformers. Unlike prior works focusing on individual building blocks, TranSpa fully considers the correlation between the weight matrices and their component rows/columns, and performs the coupled estimation and coupled sparsification. To achieve that, TranSpa introduces the use of new granularity when calibrating the importance of structural components in the transformer and removing the insignificant parts. Evaluations across different models, in both pre-training and fine-tuning scenarios, demonstrate the effectiveness of the proposed approach. TranSpa can bring 1.6×1.6\times size reduction with 0.6 lower perplexity when training GPT-2 model from scratch. It also enables 1.6×1.6\times training speedup over the existing sparse pre-training method. For training sparse LLaMA-1B from scratch, our approach reduces GPU memory usage by 50%, decreases training time by 21%, and achieves a 1.6×1.6\times speedup in inference throughput while maintaining model performance. Experiments of applying TranSpa for fine-tuning tasks also show significant performance improvement with respect to model accuracy and pruning cost reduction.

4689SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

[openreview] [pdf]

Abstract While prior work has explored whether large language models (LLMs) possess a “theory of mind” (ToM) - the ability to attribute mental states to oneself and others - there has been little work testing whether LLMs can implicitly apply such knowledge to predict behavior, or to judge whether an observed behavior is rational. Such skills are critical for appropriate interaction in social environments. Our approach to study such capabilities is to create a new dataset, called SimpleToM, containing concise, diverse stories (e.g., “The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier.”), each with three questions that test different degrees of ToM reasoning, asking models to predict (a) mental state (“Is Mary aware of the mold?”), (b) behavior (“Will Mary pay for the chips or report the mold?”), and (c) judgment (“Mary paid for the chips. Was that reasonable?”). To our knowledge, SimpleToM is the first dataset to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. Our experimental results are intriguing: While most models can reliably predict mental state on our dataset (a), they often fail to correctly predict the behavior (b), and fare even worse at judging whether given behaviors are reasonable, despite being correctly aware of the protagonist’s mental state should make such secondary predictions obvious. We further show that we can help models do better at (b) and (c) via interventions such as reminding the model of its earlier mental state answer and mental-state-specific chain-of-thought prompting, raising the action prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment accuracies (e.g., from 15.3% to 94.7% in GPT-4o). However, while this shows that models can be coaxed to perform well, it requires task-specific interventions, and the natural model performances remain low, a cautionary tale for LLM deployment. SimpleToM thus breaks new ground in probing real-world ToM reasoning, and reveals surprising, new insights about current model capabilities. We hope the dataset enables further exploration by the community into this critical area of model behavior.

4690ASTrA: Adversarial Self-supervised Training with Adaptive-Attacks

[openreview] [pdf]

Abstract Existing self-supervised adversarial training (self-AT) methods rely on handcrafted adversarial attack strategies for PGD attacks, which fail to adapt to the evolving learning dynamics of the model and do not account for instance specific characteristics of images. This results in sub-optimal adversarial robustness and limits the alignment between clean and adversarial data distributions. To address this, we propose ASTrA (Adversarial Self-supervised Training with Adaptive-Attacks), a novel framework introducing a learnable, self-supervised attack strategy network that autonomously discovers optimal attack parameters through exploration-exploitation in a single training episode. ASTrA leverages a reward mechanism based on contrastive loss, optimized with REINFORCE, enabling adaptive attack strategies without labeled data or additional hyperparameters. We further introduce a mixed contrastive objective to align the distribution of clean and adversarial examples in representation space. ASTrA achieves state-of-the-art results on CIFAR-10, CIFAR-100, and STL-10 while integrating seamlessly as a plug-and-play module for other self-AT methods. ASTrA shows scalability to larger datasets, demonstrates strong semi-supervised performance, and is resilient to robust overfitting, backed by explainability analysis on optimal attack strategies.

4691Extended Flow Matching : a Method of Conditional Generation with Generalized Continuity Equation

[openreview] [pdf]

Abstract Conditional generative modeling (CGM), which approximates the conditional probability distribution of data given a condition, holds significant promise for generating new data across diverse representations. While CGM is crucial for generating images, video, and text, its application to scientific computing, such as molecular generation and physical simulations, is also highly anticipated. A key challenge in applying CGM to scientific fields is the sparseness of available data conditions, which requires extrapolation beyond observed conditions. This paper proposes the Extended Flow Matching (EFM) framework to address this challenge. EFM achieves smooth transitions in distributions when departing from observed conditions, avoiding the unfavorable changes seen in existing flow matching (FM) methods. By introducing a flow with respect to the conditional axis, EFM ensures that the conditional distribution changes gradually with the condition. Specifically, we apply an extended Monge--Kantorovich theory to conditional generative models, creating a framework for learning matrix fields in a generalized continuity equation instead of vector fields. Furthermore, by combining the concept of Dirichlet energy on Wasserstein spaces with Multi-Marginal Optimal Transport (MMOT), we derive an algorithm called MMOT-EFM. This algorithm controls the rate of change of the generated conditional distribution. Our proposed method outperforms existing methods in molecular generation tasks where conditions are sparsely observed.

4692PoTable: Programming Standardly on Table-based Reasoning Like a Human Analyst

[openreview] [pdf]

Abstract Table-based reasoning has garnered substantial research interest, particularly in its integration with Large Language Model (LLM) which has revolutionized the general reasoning paradigm. Numerous LLM-based studies introduce symbolic tools (e.g., databases, Python) as assistants to extend human-like abilities in structured table understanding and complex arithmetic computations. However, these studies can be improved better in simulating human cognitive behavior when using symbolic tools, as they still suffer from limitations of non-standard logical splits and constrained operation pools. In this study, we propose PoTable as a novel table-based reasoning method that simulates a human tabular analyst, which integrates a Python interpreter as the real-time executor accompanied by an LLM-based operation planner and code generator. Specifically, PoTable follows a human-like logical stage split and extends the operation pool into an open-world space without any constraints. Through planning and executing in each distinct stage, PoTable standardly completes the entire reasoning process and produces superior reasoning results along with highly accurate, steply commented and completely executable programs. Accordingly, the effectiveness and explainability of PoTable are fully demonstrated. Extensive experiments over three evaluation datasets from two public benchmarks on two backbones show the outstanding performance of our approach. In particular, GPT-based PoTable achieves over 4% higher absolute accuracy than runner-ups on all evaluation datasets. Our code is available athttps://anonymous.4open.science/r/PoTable-6788.

4693Automatically Generating Visual Hallucination Test Cases for Multimodal Large Language Models

[openreview] [pdf]

Abstract Visual hallucination (VH) occurs when a multimodal large language model (MLLM) generates responses with incorrect visual details for prompts. Existing methods for generating VH test cases primarily rely on human annotations, typically in the form of triples: (image, question, answer). In this paper, we introduce VHExpansion, the first automated method for expanding VH test cases for MLLMs. Given an initial VH test case, VHExpansion automatically expands it by perturbing the question and answer through negation as well as modifying the image using both common and adversarial perturbations. Additionally, we propose a new evaluation metric, symmetric accuracy, which measures the proportion of correctly answered VH test-case pairs. Each pair consists of a test case and its negated counterpart. Our theoretical analysis shows that symmetric accuracy is an unbiased evaluation metric that remains unaffected by the imbalance of VH testing cases with varying answers when an MLLM is randomly guessing the answers, whereas traditional accuracy is prone to such imbalance. We apply VHExpansion to expand three VH datasets annotated manually and use these expanded datasets to benchmark seven MLLMs. Our evaluation shows that VHExpansion effectively identifies more VH test cases. Moreover, symmetric accuracy, being unbiased, leads to different conclusions about the vulnerability of MLLMs to VH compared to traditional accuracy metric. Finally, we show that fine-tuning MLLMs on the expanded VH dataset generated by VHExpansion mitigates VH more effectively than fine-tuning on the original, manually annotated dataset. We will publish code and data upon paper acceptance.

4694Visual Prompting with Iterative Refinement for Design Critique Generation

[openreview] [pdf]

Abstract Feedback is crucial for every design process, such as user interface (UI) design, and automating design critiques can significantly improve the efficiency of the design workflow. Although existing multimodal large language models (LLMs) excel in many tasks, they often struggle with generating high-quality design critiques---a complex task that requires producing detailed design comments that are visually grounded in a given design’s image. Building on recent advancements in iterative refinement of text output and visual prompting methods, we propose an iterative visual prompting approach for UI critique that takes an input UI screenshot and design guidelines and generates a list of design comments, along with corresponding bounding boxes that map each comment to a specific region in the screenshot. The entire process is driven completely by LLMs, which iteratively refine both the text output and bounding boxes using few-shot samples tailored for each step. We evaluated our approach using Gemini-1.5-pro and GPT-4o, and found that human experts generally preferred the design critiques generated by our pipeline over those by the baseline. To assess the generalizability of our approach to other multimodal tasks, we applied our pipeline to open-vocabulary object and attribute detection, and experiments showed that our method also outperformed the baseline, with improvements of up to 82%.

4695Hallucination Detox: Sensitive Neuron Dropout (SeND) for Large Language Model Training

[openreview] [pdf]

Abstract As large language models (LLMs) become increasingly deployed across various industries, concerns regarding their reliability, particularly due to hallucinations—outputs that are factually inaccurate or irrelevant to user input—have grown. Our research investigates the relationship between the training process and the emergence of hallucinations to address a key gap in existing research that focuses primarily on post hoc detection and mitigation strategies. Using models from the Pythia suite (70M–12B parameters) and several hallucination detection metrics, we analyze hallucination trends throughout training and explore LLM internal dynamics. We introduce SEnsitive Neuron Dropout (SeND), a novel training protocol designed to mitigate hallucinations by reducing variance during training. SeND achieves this by deterministically dropping neurons with significant variability on a dataset, referred to as Sensitive Neurons. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore in 2x speed. This efficient metric is integrated into our protocol, allowing SeND to be both computationally scalable and effective at reducing hallucinations. Our empirical evaluation demonstrates that our approach improves LLM reliability at test time by up to 40% compared to normal training while also providing an efficient method to improve factual accuracy when adapting LLMs to domains such as Wikipedia and Medical datasets.

4696IEL: Intra-Model Ensemble Learning For Single Sample Test-Time Adaptation

[openreview] [pdf]

Abstract Test-Time Adaptation (TTA) problems involve adapting pre-trained models to new data distributions in testing time, with access to only model weights and a stream of unlabeled data. In this work, we present IEL, a method for adapting sets of independently pre-trained classifiers to distribution shifted data one sample at a time without labels. We minimize the cross-entropy between the classifier output that has the highest predicted probability for the majority voted class (a high confidence softmax) and all other models in a set of classifiers. The majority voted model that all others learn from may change from sample to sample, allowing the group to collectively learn from each other. Our method uniquely optimizes all trainable parameters in each model and needs only a single sample for adaptation. Using sets of independently pre-trained base classifiers with distinct architectures, we show that our approach can reduce generalization error for image classification tasks on corrupted CIFAR-10, CIFAR-100, and ImageNet while also minimizing the entropy of model outputs.

4697Just How Flexible are Neural Networks in Practice?

[openreview] [pdf]

Abstract Although overparameterization theory suggests that neural networks can fit any dataset with up to as many samples as they have parameters, practical limitations often prevent them from reaching this capacity. In this study, we empirically investigate the practical flexibility of neural networks and uncover several surprising findings. Firstly, we observe that standard optimizers, such as stochastic gradient descent (SGD), often converge to solutions that fit significantly fewer samples than the model’s parameter count, highlighting a gap between theoretical and practical capacity. Secondly, we find that convolutional neural networks (CNNs) are substantially more parameter-efficient than multi-layer perceptrons (MLPs) and Vision Transformers (ViTs), even when trained on randomly labeled data, emphasizing the role of architectural inductive biases. Thirdly, we demonstrate that the difference in a network’s ability to fit correctly labeled data versus incorrectly labeled data is a strong predictor of generalization performance, offering a novel metric for predicting generalization. Lastly, we show that stochastic training methods like SGD enable networks to fit more data than full-batch gradient descent, suggesting that stochasticity enhances flexibility beyond regularization effects. These findings highlight the importance of understanding practical capacity limits and their implications for model generalization, providing new insights into neural network training and architectural design.

4698Adversarial Attacks on Fine-tuned LLMs

[openreview] [pdf]

Abstract Large Language Models (LLMs) have greatly advanced the field of General Artificial Intelligence, yet their security vulnerabilities remain a pressing issue, particularly in fine-tuned models. Adversarial attacks in black-box settings—where model details and training data are obscured—are an emerging area of research, posing a substantial threat to private models’ integrity. In this work, we uncover a new attack vector: adversaries can exploit the similarities between open-source LLMs and fine-tuned private models to transfer adversarial examples. We introduce a novel attack strategy that generates adversarial examples on open-source models and fine-tunes them to target private, black-box models. Our experiments show that these attacks achieve success rates comparable to white-box attacks, even when private models have been trained on proprietary data. Furthermore, our approach demonstrates strong transferability to other models, including LLaMA3 and ChatGPT. These findings highlight the urgent need for more robust defenses when fine-tuning open-source LLMs.

4699FinBench: Benchmarking LLMs in Complex Financial Problem Solving and Reasoning

[openreview] [pdf]

Abstract Solving financial problems demands complex reasoning, multimodal data processing, and a broad technical understanding, presenting unique challenges for current large language models (LLMs). We introduce FinBench, a novel benchmark designed to evaluate LLM’s ability in solving complex, knowledge-intensive financial problems across diverse graduate-level topics with multi-modal context. We identify five core capabilities of LLMs using FinBench,i.e,terminology understanding,temporal reasoning,future forecasting,scenario planning, andnumerical modelling. FinBench features 4,235 examples derived from graduate-level finance textbooks, and consists of three tasks: Statement Judging, Multi-choice Question Answering and Financial Calculation. Upon FinBench, we conduct extensive experiments on 18 leading models. The result shows that o1 is the best-performing text-only model with an overall accuracy of 67.3%, but still lags significantly behind human experts with 12.5%, especially intemporal reasoningandscenario planningcapabilities. We further construct a knowledge bank with 3,032 finance terms for knowledge augmentation analysis, and find that relevant knowledge to the question only brings consistent accuracy improvements across five capabilities to small open-source model. Additionally, our error analysis reveals that rounding errors in middle of calculation and blindness to position and intersection of curves in the image are two primary issues leading to model’s poor performance in calculating and visual-context questions, respectively. These findings underscores the critical role FinBench will play in the development of general-purpose of AI agents of tackling complex, knowledge-intensive financial problems with multi-modal context.

4700MixNAM: Advancing Neural Additive Models with Mixture of Experts

[openreview] [pdf]

Abstract Additive models, such as Neural Additive Models (NAMs), are recognized for their transparency, providing clear insights into the impact of individual features on outcomes. However, they traditionally rely on point estimations and are constrained by their additive nature, limiting their ability to capture the complexity and variability inherent in real-world data. This variability often presents as different influences from the same feature value in various samples, adding complexity to prediction models. To address these limitations, we introduce MixNAM, an innovative framework that enriches NAMs by integrating a mixture of experts, where each expert encodes a different aspect of this variability in predictions from each feature. This integration allows MixNAM to capture the variability in feature contributions through comprehensive distribution estimations and to include feature interactions during expert routing, thus significantly boosting performance. Our empirical evaluation demonstrates that MixNAM surpasses traditional additive models in performance and is comparable to complex black-box approaches. Additionally, it improves the depth and comprehensiveness of feature attribution, setting a new benchmark for balancing interpretability with performance in machine learning. Moreover, the flexibility in MixNAM configuration facilitates the navigation of its trade-offs between accuracy and interpretability, enhancing adaptability to various data scenarios.

4701Dynamic Switching Teacher: How to Generalize Temporal Action Detection Models

[openreview] [pdf]

Abstract Temporal Action Detection (TAD) is a crucial task in video understanding, focusing on the precise identification of the onset and termination of specific actions within video sequences. Despite advancements on certain datasets, existing methods often struggle to maintain their efficacy when applied to datasets from disparate domain. In this study, we introduce, for the first time, the application of source-free domain adaptation (SFDA) techniques to the field of TAD, aiming to enhance the generalization capability of TAD models on unlabeled target datasets without access to source data. Most popular SFDA methods predominantly follow the Mean-Teacher (MT) framework and often falter due to the significant domain shift. The generation of pseudo labels by a pre-trained teacher model on the source domain can lead to a cascade of errors when these labels guide the training of a student model, potentially causing a harmful TAD feedback loop. To address this issue, we propose a novel dynamic switching teacher strategy that integrates both dynamic and static teacher models. The dynamic teacher model updates its parameters by learning knowledge from the student model. Concurrently, the static teacher model engages in periodic weight exchange with the student model, ensuring baseline performance and maintaining the quality of pseudo labels. This approach significantly mitigates the label noise. We establish the first benchmark for SFDA in TAD tasks and conduct extensive experiments across various datasets. Our method demonstrates state-of-the-art performance, substantiating the suitability of our method for TAD.

4702Enhancing Prototype-Based Federated Learning with Structured Sparse Prototypes

[openreview] [pdf]

Abstract Prototype-Based Federated Learning (PBFL) has gained attention for its communication efficiency, privacy preservation, and personalization capabilities in resource-constrained environments. Despite these advantages, PBFL methods face challenges, including high communication costs for high-dimensional prototypes and numerous classes, privacy concerns during aggregation, and uniform knowledge distillation in heterogeneous data settings. To address these issues, we introduce three novel methods, each targeting a specific PBFL stage: 1) Class-wise Prototype Sparsification (CPS) reduces communication costs by creating structured sparse prototypes, where each prototype utilizes only a subset of representation layer dimensions. 2) Privacy-Preserving Prototype Aggregation (PPA) enhances privacy by eliminating the transmission of client class distribution information when aggregating local prototypes. 3) Class-Proportional Knowledge Distillation (CPKD) improves personalization by modulating the distillation strength for each class based on clients’ local data distributions. We integrate these three methods into two foundational PBFL approaches and conduct experimental evaluations. The results demonstrate that this integration achieves up to 10× and 4× reductions in communication costs while outperforming the original and most communication-efficient approaches evaluated, respectively.

4703Uncertainty Modeling in Graph Neural Networks via Stochastic Differential Equations

[openreview] [pdf]

Abstract We propose a novel Stochastic Differential Equation (SDE) framework to address the problem of learning uncertainty-aware representations for graph-structured data. While Graph Neural Ordinary Differential Equations (GNODEs) have shown promise in learning node representations, they lack the ability to quantify uncertainty. To address this, we introduce Latent Graph Neural Stochastic Differential Equations (LGNSDE), which enhance GNODE by embedding randomness through a Bayesian prior-posterior mechanism for epistemic uncertainty and Brownian motion for aleatoric uncertainty. By leveraging the existence and uniqueness of solutions to graph-based SDEs, we prove that the variance of the latent space bounds the variance of model outputs, thereby providing theoretically sensible guarantees for the uncertainty estimates. Furthermore, we show mathematically that LGNSDEs are robust to small perturbations in the input, maintaining stability over time. Empirical results across several benchmarks demonstrate that our framework is competitive in out-of-distribution detection, robustness to noise perturbations, and active learning, underscoring the ability of LGNSDEs to quantify uncertainty reliably.

4704Learning with Multi-Group Guarantees for Clusterable Subpopulations

[openreview] [pdf]

Abstract A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within the overall population. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for which predictions are being made. In this perspective, a population refers to a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, by attributing each individual to the component from which they were most likely drawn, given their features; and second, by attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration for Gaussian mixture models as a case study, we study a multi-objective algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and achieve a O(T1/2)O(T^{1/2}) rate even when the subpopulations are not well-separated. In comparison, the more naturalcluster-then-predictapproach that first recovers the structure of the subpopulations and then makes predictions suffers from a O(T2/3)O(T^{2/3}) rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.

4705GPT Shortcuts: Learning Iterative Text Generation Patterns from a Dialogue

[openreview] [pdf]

Abstract LLM-powered conversational interfaces (e.g., ChatGPT, Claude, and Gemini) support iterative text generation, enabling users to easily generate tailored texts (e.g., texts that should address domain-specific constraints) through a series of follow-up text editing requests. However, generating such tailored texts that address the user-specified constraints across multiple different contexts requires repetitive text generation efforts, which is cumbersome, inefficient, and demanding. To address this challenge, we introduce the concept ofGPT shortcuts, which is designed to 1) learn iterative text generation patterns from a dialogue and 2) apply these learned patterns todirectlygenerate the tailored text. GPT shortcuts generate texts that address necessary constraints while maintaining similar structural appearance to the target text in the dialogue, across different contexts. To assess the capability of language models in generating GPT shortcuts, we present ShortcutBench, a benchmark consisting of 250 crowdsourced iterative text generation dialogues across five text generation tasks. Using ShortcutBench, we conducted an analysis using six LLMs and four prompting methods, varying ways to specify necessary constraints to address in the prompt. We found that 1) larger models generally outperform smaller models, 2) self-explanatory constraints within the target text are effective, and 3) precisely specifying necessary constraints to address is critical for improving the performance.

4706MoTE: Mixture of Task Experts for Embedding Models

[openreview] [pdf]

Abstract Dense embeddings are a crucial component in various natural language processing applications, serving as the foundation for downstream tasks such as Retrieval Augmented Generation (RAG), search, classification, and clustering. To improve the performance of dense embeddings, recent approaches have focused on conditioning their representation with task information (e.g., by adding a task-specific text prefix), which allows to generate embeddings that take the task into account. This paper builds on this work and develops an approach that performs task conditioning by introducing a new architecture that has a dedicated set of parameters for each of the tasks. We refer to this model a Mixture of Task Experts (MoTE). For training, we introduce a task-aware training approach that allows us to optimize the training hyper-parameter for each task. Experiments on highly competitive tasks like MTEB with 56 datasets across 7 tasks show that, on average, MoTE achieves 1.62 higher NDCG@10 on Retrieval datasets, 1.54 higher MAP on Reranking datasets and 0.65 higher overall performance across tasks when compared to contemporary approaches leveraging the same information.

4707YESNO-PRO: A HIGH-PERFORMANCE POINTWISE RERANKING ALGORITHM BRIDGING ENCODERDECODER AND DECODER-ONLY LLMS

[openreview] [pdf]

Abstract Recent research has shown significant progress in the field of zero-shot text reranking for large language models (LLMs). Traditional pointwise approaches prompt the LLM to output relevance labels such as “yes/no” or fine-grained labels, but they have several drawbacks. Firstly, these prompts struggle to capture complex correlations between queries and passages and lack robustness for outputs not covered by predefined labels. Secondly, ranking scores rely solely on the likelihood of relevance labels, leading to potential noise and bias. Lastly, existing pointwise approaches are not supported by decoder-only LLMs, as ranking requires LLMs to output prediction probabilities. In response to these challenges, a novel pointwise approach called yesno-pro has been designed, which redefines both prompt design and score computation mechanisms to better align with the intrinsic nature of text reranking. Additionally, a comprehensive reranking framework based on LLM services has been proposed to support concurrent ranking calls and quickly adapt to any open-source decoder-only large models. Experimental results have demonstrated that this method outperforms existing pointwise and some pairwise/listwise methods on TREC19/20 and BEIR datasets, achieving the state-of-the-art performance. Due to its concurrency features, this work is applicable to practical applications with high real-time requirements.

4708Near-optimal Active Regression of Single-Index Models

[openreview] [pdf]

Abstract The active regression problem of the single-index model is to solve minxf(Ax)bp\min_x \lVert f(Ax)-b\rVert_p, where AA is fully accessible and bb can only be accessed via entry queries, with the goal of minimizing the number of queries to the entries of bb. When ff is Lipschitz, previous results only obtain constant-factor approximations. This work presents the first algorithm that provides a (1+ε)(1+\varepsilon)-approximation solution by querying O~(dp21/εp2)\tilde{O}(d^{\frac{p}{2}\vee 1}/\varepsilon^{p\vee 2}) entries of bb. This query complexity is also shown to be optimal up to logarithmic factors for p[1,2]p\in [1,2] and the ε\varepsilon-dependence of 1/εp1/\varepsilon^p is shown to be optimal for p>2p>2.

4709Learning Mamba as a Continual Learner

[openreview] [pdf]

Abstract Continual learning (CL) aims to efficiently learn and accumulate knowledge from a data stream with different distributions. By formulating CL as a sequence prediction task, meta-continual learning (MCL) enables to meta-learn an efficient continual learner based on the recent advanced sequence models, e.g., Transformers. Although attention-free models (e.g., Linear Transformers) can ideally match CL’s essential objective and efficiency requirements, they usually perform not well in MCL. Considering that the attention-free Mamba achieves excellent performances matching Transformers’ on general sequence modeling tasks, in this paper, we aim to answer a question -- Can attention-free Mamba perform well on MCL? By formulating Mamba with a selective state space model (SSM) for MCL tasks, we propose to meta-learn Mamba as a continual learner, referred to as MambaCL. By incorporating a selectivity regularization, we can effectively train MambaCL. Through comprehensive experiments across various CL tasks, we also explore how Mamba and other models perform in different MCL scenarios. Our experiments and analyses highlight the promising performance and generalization capabilities of Mamba in MCL.

4710Generalization Error Minimized Deep Learning

[openreview] [pdf]

Abstract Despite the vast applications and rapid development of deep learning (DL), understanding and improving the generalization ability of deep neural networks (DNNs) remains a fundamental challenge. To tackle this challenge, in this paper, we first establish a novel bias-variance decomposition framework to analyze the generalization error of DNNs. Based on our new generalization error formula, we then present a new form of DL dubbed generalization error minimized (GEM) DL by jointly minimizing the conventional optimization target and an analytical proxy for the generalization error. Extensive experimental results show that in comparison with DNNs trained within the standard DL, GEM DNNs have smaller generalization errors and better generalization ability, thereby improving DNN prediction accuracy. Notably, GEM DL can increase prediction accuracy by as much as 13.19% on ImageNet in the presence of data distribution shift between training and testing.

4711Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning

[openreview] [pdf]

Abstract Large language models (LLMs), primarily built on decoder-only transformer architectures, excel in natural language generation tasks and have shown promise in adapting to diverse downstream tasks using zero-shot and few-shot prompting techniques. However, these prompting methods often fall short on natural language understanding (NLU) tasks, where smaller encoder-only models like BERT-base consistently outperform LLMs on benchmarks such as GLUE and SuperGLUE. In this paper, we explore two approaches—supervised fine-tuning and proximal policy optimization (PPO)—to enhance the NLU capabilities of LLMs. To reduce the computational cost of full-model fine-tuning, we integrate low-rank adaptation (LoRA) layers, restricting updates to these layers during both supervised fine-tuning and PPO stages. In the supervised fine-tuning approach, task-specific prompts are concatenated with input queries and ground-truth labels from the NLU training corpus, optimizing the model using the next-token prediction objective. Despite this, LLMs still underperform compared to encoder-only models like BERT-base on several NLU tasks. To address this gap, we employ PPO, a reinforcement learning technique that treats each token generation as an action and evaluates the sequence of generated tokens using a reward function based on their alignment with ground-truth answers. PPO then updates the model to maximize these rewards, effectively aligning its outputs with the correct labels. Our experiments with the LLAMA2-7B model demonstrate that PPO-based fine-tuning significantly improves performance, delivering an average gain of 6.3 points over supervised fine-tuning on the GLUE benchmark. PPO surpasses zero-shot prompting by 38.7 points and few-shot prompting by 26.1 points on GLUE, while also outperforming these baselines by 28.8 and 28.5 points on SuperGLUE. Additionally, PPO exceeds the performance of BERT-large, a strong baseline, with an average improvement of 2.7 points on GLUE and 9.3 points on SuperGLUE. These improvements are consistent across models such as Qwen2.5-7B and MPT-7B, highlighting PPO’s robustness and effectiveness in enhancing the NLU capabilities of LLMs.

4712Exploring and Benchmarking Planning Capabilities of Large Language Models

[openreview] [pdf]

Abstract Classical and natural language planning tasks remain a difficult domain for modern large language models (LLMs). In this work, we lay the foundations for improving planning capabilities of LLMs. First, we construct a comprehensive benchmark suite encompassing both classical planning benchmarks and natural language scenarios. This suite includes algorithms to methodically generate instances of tasks with varying levels of difficulty, allowing for rigorous and systematic evaluation of LLM performance. Next, we investigate the use of many-shot in-context learning to enhance LLM planning, exploring the relationship between increased context length and improved planning performance. In addition, we demonstrate the positive impact of fine-tuning LLMs on optimal planning paths. We also probe the efficacy of chain-of-thought reasoning methods to improve LLM planning performance. Moreover, we probe the performance of the proposed methods in out-of-distribution scenarios, assessing the ability to generalize to novel and unseen planning challenges. Finally, we investigate model’s failure modes and reveal insights that hold true across different benchmarks.

4713Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination

[openreview] [pdf]

Abstract A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hallucinations that make them unsuitable for real-world applications. In this paper, we introduce a new paradigm for constructing world models that are explicit representations of the real world and its dynamics. By integrating cutting-edge advances in real-time photorealism with Gaussian Splatting and physics simulators, we propose the first compositional manipulation world model, which we call DreMa. DreMa replicates the observed world and its dynamics, allowing it to imagine novel configurations of objects and predict the future consequences of robot actions. We leverage this capability to generate new data for imitation learning by applying equivariant transformations to a small set of demonstrations. Our evaluations across various settings demonstrate significant improvements in both accuracy and robustness by incrementing actions and object distributions, reducing the data needed to learn a policy and improving the generalization of the agents. As a highlight, we show that a real Franka Emika Panda robot, powered by DreMa ’s imagination, can successfully learn novel physical tasks from just a single example per task variation (one-shot policy learning). Our project page and source code can be found in:https://dreamtomanipulate.github.io/DreMa/.

4714Informing Reinforcement Learning Agents by Grounding Language to Markov Decision Processes

[openreview] [pdf]

Abstract While significant efforts have been made to leverage natural language to accelerate reinforcement learning, utilizing diverse forms of language efficiently remains unsolved. Existing methods focus on mapping natural language to individual elements of MDPs such as reward functions or policies, but such approaches limit the scope of language they consider to make such mappings possible. We present an approach for leveraging general language advice by translating sentences to a grounded formal language for expressing information abouteveryelement of an MDP and its solution including policies, plans, reward functions, and transition functions. We also introduce a new model-based reinforcement learning algorithm, RLang-Dyna-Q, capable of leveraging all such advice, and demonstrate in two sets of experiments that grounding language to every element of an MDP leads to significant performance gains.

4715Temporal Reasoning Transfer from Text to Video

[openreview] [pdf]

Abstract Video Large Language Models (Video LLMs) have shown promising capabilities in video comprehension, yet they struggle with tracking temporal changes and reasoning about temporal relationships. While previous research attributed this limitation to the ineffective temporal encoding of visual inputs, our diagnostic study reveals that video representations contain sufficient information for even small probing classifiers to achieve perfect accuracy. Surprisingly, we find that the key bottleneck in Video LLMs’ temporal reasoning capability stems from the underlying LLM’s inherent difficulty with temporal concepts, as evidenced by poor performance on textual temporal question-answering tasks. Building on this discovery, we introduce the \textbf{T}extual \textbf{T}emporal reasoning \textbf{T}ransfer (\textbf{T3}). T3 synthesizes diverse temporal reasoning tasks in pure text format from existing image-text datasets, addressing the scarcity of video samples with complex temporal scenarios. Remarkably, \emph{without using any video data}, T3 enhances LongVA-7B’s temporal understanding, yielding a 5.3 absolute accuracy improvement on the challenging TempCompass benchmark, which enables our model to outperform ShareGPT4Video-8B trained on 28,000 video samples. Additionally, the enhanced LongVA-7B model achieves competitive performance on comprehensive video benchmarks. For example, it achieves a 49.7 accuracy on the Temporal Reasoning task of Video-MME, surpassing powerful large-scale models such as InternVL-Chat-V1.5-20B and VILA1.5-40B. Further analysis reveals a strong correlation between textual and video temporal task performance, validating the efficacy of transferring temporal reasoning abilities from text to video domains.

4716Hough Voting-based Prompt Learning for Segment Anything Model

[openreview] [pdf]

Abstract Segment Anything Models (SAMs) like SEEM and SAM have achieved great performance on various downstream datasets at the cost of crafting spatial and semantic prompts. Previous prompt learning methods can learn prompts automatically but largely focus on learning semantic prompts, while how to learn effective spatial prompts that are important to SAMs is largely under-explored. Inspired by Hough Voting that detects a complex object by voting from its parts, we propose Hough Voting-based Spatial Prompt Learning (HoughSpaPL) that designs three types of voting mechanisms to learn three distinct spatial prompts for different subregions of the visual concept (e.g., things and stuff), which capture complementary spatial clues and vote together to guide SAMs to generate a precise segmentation mask for the visual concept. Following the same philosophy, we design Hough Voting-based Semantic Prompt Learning (HoughSemPL) that learns distinct semantic prompts for different sub-regions of the visual concept, which capture complementary semantic clues and vote together to predict a accurate semantic label for the generated mask. Extensive experiments show that our proposed techniques achieve superior prompt learning performance over popular segmentation datasets. Codes will be released.

4717Measuring And Improving Persuasiveness Of Generative Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) are increasingly being used in workflows involving generating content to be consumed by humans (e.g., marketing) and also in directly interacting with humans (e.g., through chatbots). The development of such systems that are capable of generating verifiably persuasive messages presents both opportunities and challenges for society. On the one hand, such systems could positively impact domains like advertising and social good, such as addressing drug addiction, and on the other, they could be misused for spreading misinformation and shaping political opinions. To channel LLMs’ impact on society, we need to develop systems to measure and benchmark their persuasiveness. With this motivation, we introduce PersuasionBench and PersuasionArena, the first largescale benchmark and arena containing a battery of tasks to measure the persuasion ability of generative models automatically. We introduce transsuasion (trans = carrying across, suasion = the act of persuading), a novel task of transforming non-persuasive language into persuasive content while preserving other factors determining persuasiveness (sender, receiver, time, and channel). To construct data for transsuasion, we leverage natural experiments in the form of a pair of tweets from the same user, posted in close temporal proximity, with similar semantic content but divergent wording and significantly different like counts. Given such pairs, we investigate to what extent LLMs know and leverage linguistic patterns that can help them generate more persuasive language. Our findings indicate that the persuasiveness of LLMs correlates positively with model size, but smaller models can also be made to have a higher persuasiveness than much larger models. Notably, targeted training using synthetic and natural datasets significantly enhances smaller models’ persuasive capabilities, challenging scale-dependent assumptions. Our findings carry key implications for both model developers and policymakers. For instance, while California’s SB-1047 aims to regulate AI models based on the number of floating point operations, we demonstrate that simple metrics like this alone fail to capture the full scope of AI’s societal impact. We invite the community to explore and contribute to PersuasionArena and PersuasionBench, to advance our understanding of AI-driven persuasion and its societal implications.

4718Why pre-training is beneficial for downstream classification tasks?

[openreview] [pdf]

Abstract It is widely acknowledged that pre-training brings benefits to downstream tasks by boosting accuracy and speeding up convergence, but the exact reasons for these two benefits still remain unclear. To this end, we propose to quantitatively and accurately explain effects of pre-training on the downstream task from a novel game-theoretic view, which also sheds new light into the learning behavior of deep neural networks (DNNs). Specifically, we extract and quantify the knowledge encoded by the pre-trained model, and further track the changes of such knowledge during the fine-tuning process. Interestingly, we discover that only a limited amount of pre-trained model’s knowledge is preserved for the inference of downstream tasks, and such preserved knowledge is very difficult for a model training from scratch to learn. Thus, with the help of this exclusively learned and useful knowledge, the fine-tuned model usually achieves better performance. Besides, we discover that pre-training can guide the fine-tuned model to learn target knowledge of the downstream task more directly and quickly than the model training from scratch, which accounts for the faster convergence of the fine-tuned model. The code will be released when the paper is accepted.

4719MAMBA STATE-SPACE MODELS ARE LYAPUNOV-STABLE LEARNERS

[openreview] [pdf]

Abstract Mamba state-space models (SSMs) were recently shown to outperform state-of-the-art (SOTA) Transformer large language models (LLMs) across various tasks. Despite subsequent widespread adaptation, little work has focused on Mamba LLMs’ amenability for fine-tuning frameworks ubiquitously used for Transformer-based LLMs, e.g., mixed-precision fine-tuning (MPFT) and parameter-efficient fine-tuning (PEFT). For the former, it currently remains an open question whether Mamba’s recurrent dynamics are robust to small input changes, such as those encountered during MPFT. Using dynamical systems theory (in particular, Lyapunov exponents), we answer this question in the affirmative. We empirically validate this result through several experiments, showing that Mamba SSMs are significantly more stable to changes introduced by mixed-precision than comparable Transformers, even when both MPFT and PEFT are combined. For PEFT, we show how targeting specific memory buffers in Mamba’s customized CUDA kernels for low-rank adaptation regularizes SSM parameters, thus providing both parameter efficient learning and computational savings. Finally, with both MPFT and PEFT enabled, we explore the impact of instruction tuning Mamba SSMs for in-context learning (ICL) on natural language tasks. While pretrained Mamba and Mamba-2 models only achieve 38% and 82% (respectively) of the ICL improvements of comparable Transformer-based LLMs, we show that instruction tuning allows Mamba models to narrow this gap to 81% and Mamba-2 models to skyrocket over this gap to 132%.

4720Bag-level Self-supervised instance based distance in Multiple Instance Learning

[openreview] [pdf]

Abstract Multiple Instance Learning (MIL) methods are typically supervised. However, a bag-to-bag metric is needed in many applications, including clustering, statistical tests, and dimension reduction. Such a metric should differentiate between bags, regardless of the sparsity or overlap between the instances of the bags. We propose SUMIT (Self sUpervised MIL dIsTance) as an instance-embedding-based distance that maximizes the distinction between bags. SUMIT is optimized using five criteria: self-similarity within a bag, quality of instance reconstruction, robustness to sampling depth, conservation of triangle inequality, and separation of instances to clusters. We show using current standard MIL datasets and a novel wiki-based set of wiki topics that the within bag-similarity loss is the most important for a bag-to-bag metric that best separates bags of similar classes. SUMIT bridges the gap between instance-level and bag-level approaches, by keeping the embedding of all instances but ensuring their proximity within a bag.

4721Intrinsic Explanation of Random Subspace Method for Enhanced Security Applications

[openreview] [pdf]

Abstract Random subspace method has wide security applications such as providing certified defenses against adversarial and backdoor attacks, and building robustly aligned LLM against jailbreaking attacks. However, the explanation of random subspace method lacks sufficient exploration. Existing state-of-the-art feature attribution methods such as Shapley value and LIME are computationally impractical and lacks security guarantee when applied to random subspace method. In this work, we propose EnsembleSHAP, an intrinsically faithful and secure feature attribution for random subspace method that reuses its computational byproducts. Specifically, our feature attribution method is 1) computationally efficient, 2) maintains essential properties of effective feature attribution (such as local accuracy), and 3) offers guaranteed protection against attacks on feature attribution methods. We perform comprehensive evaluations for our explanation’s effectiveness when faced with different empirical attacks. Our experimental results demonstrates that our explanation not only faithfully reports the most important features, but also certifiably detects the harmful features embedded in the input sample.

4722RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

[openreview] [pdf]

Abstract Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Transformer (RDT), a pioneering diffusion foundation model for bimanual manipulation. It is built on scalable Diffusion Transformers (DiTs), which can effectively represent multi-modality, with innovative designs to deal with the heterogeneity of multi-modal inputs and to capture the nonlinearity and high frequency of robotic data. To address data scarcity, we first introduce a Physically Interpretable Unified Action Space, which can unify the action representations of various robots while preserving the physical meanings of original actions, facilitating learning transferrable physical knowledge. With the above designs, we managed to pre-train RDT on the largest collection of multi-robot datasets to date and scaled it up to 1.2B parameters, which is the largest diffusion-based foundation model for robotic manipulation. We further fine-tuned RDT on a self-created multi-task bimanual dataset with over 6K+ episodes to refine its manipulation capabilities. Experiments on real robots demonstrate that RDT significantly outperforms existing methods. It exhibits zero-shot generalization to unseen objects and scenes, understands and follows language instructions, learns new skills with just 1\sim5 demonstrations, and effectively handles complex, dexterous tasks. Code and a Demo video are provided in the supplementary materials.

4723FAVEN: Fast Audio-Visual Embodied Navigation in 3D Environments

[openreview] [pdf]

Abstract Achieving fast audio-visual embodied navigation in 3D environments is still a challenging problem. Existing methods typically rely on separate audio and visual data processing merged in late stages, leading to suboptimal path planning and increased time to locate targets. In this paper, we introduce FavEN, a novel transformer and mamba architecture that combines audio and visual data into early fusion\textit{early fusion} tokens. These tokens are passed through the entire network from the initial layer on and cross-attend to both data modalities. The effect of our early fusion approach is that the network can correlate information from the two data modalities from the get-go, which vastly improves its downstream navigation performance. We demonstrate this empirically through experimental results on the Replica and Matterport3D benchmarks. Furthermore, for the first time, we demonstrate the effectiveness of early fusion in improving the path search speed of audio-visual embodied navigation systems in real-world settings. Across various benchmarks, in comparison to previous approaches, FavEN reduces the search time by 93.6% and improves the SPL metrics by 10.4 and 6.5 on heard and unheard sounds.

4724UNAST: Unified framework for Neural Architecture Search for Transformers

[openreview] [pdf]

Abstract We introduce the UNAST, a new approach to optimize Large Language Models (LLMs) post-training. UNAST combines Neural Architecture Search (NAS) with sparsity and quantization for LLM compression. Starting with a trained model, UNAST replaces layers (e.g., attention and MLP) with more efficient alternatives by adjusting attention heads, KV projection dimensions, and MLP expansion factors. Local distillation pretrains layer candidates to mimic original layers. Scores and costs (latency, number of parameters, etc.) of each operator are fed into an Integer Linear Optimizer to find the optimal architecture under predefined constraints (latency, number of parameters, etc.). Our experiments show that UNAST scales to large models, reducing training costs by up to 10 times compared to training smaller models from scratch. Validation on GPT-3 and LLaMa models demonstrate that UNAST improves latency and memory footprint by up to 60% with minimal accuracy loss. UNAST also provides insights into the effects of different compression types on Transformer layers, aiding in the development of non-uniform models.

4725Same Accuracy, Twice As Fast: Continual Learning Surpasses Retraining From Scratch

[openreview] [pdf]

Abstract Continual learning aims to enable models to adapt to new datasets without losing performance on previously learned data, often assuming prior data is no longer available. However, in many practical scenarios, both old and new data are accessible. In such cases, good performance on both datasets is typically achieved by abandoning the model trained on the previous data and re-training a new model from scratch on both datasets. This training from scratch is computationally expensive. In contrast, methods that leverage the previously trained model are worthy of investigation as they could significantly reduce computational costs. Our evaluation framework quantifies the computational savings of such methods while maintaining or exceeding the performance of training from scratch. We identify key optimization aspects - initialization, regularization, data selection, and hyper-parameters - that can each contribute to reducing computational costs. For each aspect, we propose effective first-step methods that already yield substantial computational savings. By combining these strategies, we achieve up to 2.7x reductions in computation time across various computer vision tasks, highlighting the potential for further advancements in this area.

4726One Stone Three Birds:Three-Dimensional Implicit Neural Network for Compression and Continuous Representation of Multi-Altitude Climate Data

[openreview] [pdf]

Abstract Wind energy stands out as a promising clean and renewable energy alternative, not only for its potential to combat global warming but also for its capacity to meet the ever-growing demand for energy. However, analysis of wind data to fully harness the benefits of wind energy demands tackling several related challenges: (1) Current data resolution is inadequate for capturing the detailed information needed across diverse climatic conditions; (2) Efficient management and storage of real-time measurements are currently lacking; (3) Extrapolating wind data across spatial specifications enables analysis at costly-to-measure, unobserved points is necessary. In response to these challenges, we introduce a modality-agnostic learning framework utilizing implicit neural networks. Our model effectively compresses a large volume of climate data into a manageable latent codec. It also learns underlying continuous climate patterns, enabling reconstruction at any scale and supporting modality transfer and fusion. Extensive experimental results show consistent performance improvements over existing baselines.

4727HeurAgenix: A Multi-Agent LLM-Based Paradigm for Adaptive Heuristic Evolution and Selection in Combinatorial Optimization

[openreview] [pdf]

Abstract Combinatorial Optimization (CO) is a class of problems where the goal is to identify an optimal solution from a finite set of feasible solutions under specific constraints. Despite its ubiquity across industries, existing heuristic algorithms struggle with limited adaptability, complex parameter tuning, and limited generalization to novel problems. Recent approaches leveraging machine learning have made incremental improvements but remain constrained by extensive data requirements and reliance on historical problem-specific adjustments. Large Language Models (LLMs) offer a new paradigm to overcome these limitations due to their ability to generalize across domains, autonomously generate novel insights, and adapt dynamically to different problem contexts. To harness these capabilities, we introduce HeurAgenix\textbf{HeurAgenix}, a novel multi-agent hyper-heuristic framework that leverages LLMs to generate, evolve, evaluate, and select heuristics for solving CO problems. Our framework comprises four key agents: heuristic generation, heuristic evolution, benchmark evaluation, and heuristic selection. Each agent is designed to exploit specific strengths of LLMs, such as their capacity for synthesizing knowledge from diverse sources, autonomous decision-making, and adaptability to new problem instances. Experiments on both classic and novel CO tasks show that HeurAgenix significantly outperforms state-of-the-art approaches by enabling scalable, adaptable, and data-efficient solutions to complex optimization challenges.

4728KDA: A Knowledge-Distilled Attacker for Scalable LLM Red Teaming

[openreview] [pdf]

Abstract Jailbreak attacks exploit specific prompts to bypass LLM safeguards and generate harmful or inappropriate content. Recently, numerous approaches have emerged for generating jailbreak attacks across diverse malicious scenarios. However, these methods often suffer from critical limitations such as the reliance on handcrafted prompts, the necessity for white-box access to target LLMs, the generation of monotonous prompts, or the dependence on expensive queries to commercial LLMs. Moreover, these methods typically require considerable time to generate jailbreak attacks. In this paper, we propose a Knowledge-Distilled Attacker (KDA) that leverages existing realistic and semantically meaningful prompts to learn a model that efficiently produces successful attacks. Specifically, we finetune an open-source LLM on a diverse set of attack prompts, enabling our framework to automatically generate black-box, coherent, and diverse attack prompts independent of commercial LLMs. Our KDA achieves a 100% success rate on multiple state-of-the-art LLMs while only requiring less than 10 seconds per attack generation. Further, using KDA, we introduce the RedTeam-10k dataset, a large-scale dataset of 10,000 harmful attack prompts inducing malicious LLM behavior spanning 12 categories such as bias, hate, and illegal activities. This dataset is 20x larger than any existing attack prompt dataset, positioning KDA as a powerful tool for large-scale adversarial testing.

4729Performance Heterogeneity in Message-Passing and Transformer-based Graph Neural Networks

[openreview] [pdf]

Abstract Graph Neural Networks have emerged as the most popular architecture for graph-level learning, including graph classification and regression tasks, which frequently arise in areas such as biochemistry and drug discovery. Achieving good performance in practice requires careful model design. Due to gaps in our understanding of the relationship between model and data characteristics, this often requires manual architecture and hyperparameter tuning. This is particularly pronounced in graph-level tasks, due to much higher variation in the input data than in node-level tasks. To work towards closing these gaps, we begin with a systematic analysis of individual performance in graph-level tasks. Our results establish significant performance heterogeneity in both message-passing and transformer-based architectures. We then investigate the interplay of model and data characteristics as drivers of the observed heterogeneity. Our results suggest that graph topology alone cannot explain heterogeneity. Using the Tree Mover’s Distance, which jointly evaluates topological and feature information, we establish a link between class-distance ratios and performance heterogeneity in graph classification. These insights motivate model and data preprocessing choices that account for heterogeneity between graphs. We propose a selective rewiring approach, which only targets graphs whose individual performance benefits from rewiring. We further show that the optimal network depth depends on the graph’s spectrum, which motivates a heuristic for choosing the number of GNN layers. Our experiments demonstrate the utility of both design choices in practice.

4730Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems

[openreview] [pdf]

Abstract Although large language models (LLMs) demonstrate impressive proficiency in various tasks, they present potential safety risks, such as ‘jailbreaks’, where malicious inputs can coerce LLMs into generating harmful content. To address these issues, many LLM developers have implemented various safety measures to align these models. This alignment involves several techniques, including data filtering during pre-training, supervised fine-tuning, reinforcement learning from human feedback, and red-teaming exercises. These methods often introduce deliberate and intentional biases similar to Political Correctness (PC) to ensure the ethical behavior of LLMs. In this paper, we delve into the intentional biases injected into LLMs for safety purposes and examine methods to circumvent these safety alignment techniques. Notably, these intentional biases result in a jailbreaking success rate in GPT-4o models that differs by 20% between non-binary and cisgender keywords and by 16% between white and black keywords, even when the other parts of the prompts are identical. We introduce the concept ofPCJailbreak, highlighting the inherent risks posed by these safety-induced biases. Additionally, we propose an efficient defense methodPCDefense, which prevents jailbreak attempts by injecting defense prompts prior to generation.PCDefensestands as an appealing alternative to Guard Models, such as Llama-Guard, that require additional inference cost after text generation. Our findings emphasize the urgent need for LLM developers to adopt a more responsible approach when designing and implementing safety measures. To enable further research and improvements, we open-source ourcode and artifactsof PCJailbreak, providing the community with tools to better understand and mitigate safety-induced biases in LLMs.

4731Identifying Feedforward and Feedback Controllable Subspaces of Neural Population Dynamics

[openreview] [pdf]

Abstract There is overwhelming evidence that cognition, perception, and action rely on feedback control. However, if and how neural population dynamics are amenable to different control strategies is poorly understood, in large part because machine learning methods to directly assess controllability in neural population dynamics are lacking. To address this gap, we developed a novel dimensionality reduction method, Feedback Controllability Components Analysis (FCCA), that identifies subspaces of linear dynamical systems that are most feedback controllable based on a new measure of feedback controllability. We further show that PCA identifies subspaces of linear dynamical systems that maximize a measure of feedforward controllability. As such, FCCA and PCA are data-driven methods to identify subspaces of neural population data (approximated as linear dynamical systems) that are most feedback and feedforward controllable respectively, and are thus natural contrasts for hypothesis testing. We developed new theory that proves that non-normality of underlying dynamics determines the divergence between FCCA and PCA solutions, and confirmed this in numerical simulations. Applying FCCA to diverse neural population recordings, we find that feedback controllable dynamics are geometrically distinct from PCA subspaces and are better predictors of animal behavior. Our methods provide a novel approach towards analyzing neural population dynamics from a control theoretic perspective, and indicate that feedback controllable subspaces are important for behavior.

4732Words in Motion: Extracting Interpretable Control Vectors for Motion Transformers

[openreview] [pdf]

Abstract Transformer-based models generate hidden states that are difficult to interpret. In this work, we aim to interpret these hidden states and control them at inference, with a focus on motion forecasting. We leverage the phenomenon of neural collapse and use linear probes to measure interpretable features in hidden states. Our experiments reveal meaningful directions and distances between hidden states of opposing features, which we use to fit control vectors for activation steering. We further refine our approach using sparse autoencoders to optimize our control vectors. Notably, we show that enforcing sparsity leads to a more linear relationship between control vector temperatures and forecasts. Our approach not only enables mechanistic interpretability but also zero-shot generalization to unseen dataset characteristics.

4733Learning Representations for Independence Testing

[openreview] [pdf]

Abstract Many tools exist that attempt to detect dependence between random variables, a core question across a wide range of machine learning, statistical, and scientific endeavors. Although several statistical tests guarantee eventual detection of any dependence with enough samples, standard tests may require an exorbitant amount of samples for detecting subtle dependencies between high-dimensional random variables with complex distributions. In this work, we study two related ways to learn powerful independence tests. First, we show how to construct powerful statistical tests with finite-sample validity by using variational estimators of mutual information, such as the InfoNCE or NWJ estimators. Second, we establish a close relationship between these variational mutual information-based tests and tests based on the Hilbert-Schmidt Independence Criterion (HSIC), showing that learning a variational bound in the former case is closely related to learning kernels, typically parameterized by deep networks, in the latter. Finally, we show how to find a representation that maximizes the asymptotic power of an HSIC test, prove that this procedure works, and demonstrate empirically the practical improvement of our tests (with HSIC tests generally outperforming the variational ones) on difficult problems of detecting structured dependence.

4734Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach

[openreview] [pdf]

Abstract As online video content rapidly grows, the task of text-video retrieval (TVR) becomes increasingly important. A key challenge in TVR is the information asymmetry between video and text: videos are inherently richer in information, while their textual descriptions often capture only fragments of this complexity. This paper introduces a novel, data-centric framework to bridge this gap by enriching textual representations to better match the richness of video content. During training, videos are segmented into event-level clips and captioned to ensure comprehensive coverage. During retrieval, a large language model (LLM) generates semantically diverse queries to capture a broader range of possible matches. To enhance retrieval efficiency, we propose a query selection mechanism that identifies the most relevant and diverse queries, reducing computational cost while improving accuracy. Our method achieves state-of-the-art results across multiple benchmarks, demonstrating the power of data-centric approaches in addressing information asymmetry in TVR. This work paves the way for new research focused on leveraging data to improve cross-modal retrieval.

4735Transformers Learn Low Sensitivity Functions: Investigations and Implications

[openreview] [pdf]

Abstract Transformers achieve state-of-the-art accuracy and robustness across many tasks, but an understanding of their inductive biases and how those biases differ from other neural network architectures remains elusive. In this work, we identify the sensitivity of the model to token-wise random perturbations in the input as a unified metric which explains the inductive bias of transformers across different data modalities and distinguishes them from other architectures. We show that transformers have lower sensitivity than MLPs, CNNs, ConvMixers and LSTMs, across both vision and language tasks. We also show that this low-sensitivity bias has important implications: i) lower sensitivity correlates with improved robustness; it can also be used as an efficient intervention to further improve the robustness of transformers; ii) it corresponds to flatter minima in the loss landscape; and iii) it can serve as a progress measure for grokking. We support these findings with theoretical results showing (weak) spectral bias of transformers in the NTK regime, and improved robustness due to the lower sensitivity.

4736CViT: Continuous Vision Transformer for Operator Learning

[openreview] [pdf]

Abstract Operator learning, which aims to approximate maps between infinite-dimensional function spaces, is an important area in scientific machine learning with applications across various physical domains. Here we introduce the Continuous Vision Transformer (CViT), a novel neural operator architecture that leverages advances in computer vision to address challenges in learning complex physical systems. CViT combines a vision transformer encoder, a novel grid-based coordinate embedding, and a query-wise cross-attention mechanism to effectively capture multi-scale dependencies. This design allows for flexible output representations and consistent evaluation at arbitrary resolutions. We demonstrate CViT’s effectiveness across a diverse range of partial differential equation (PDE) systems, including fluid dynamics, climate modeling, and reaction-diffusion processes. Our comprehensive experiments show that CViT achieves state-of-the-art performance on multiple benchmarks, often surpassing larger foundation models, even without extensive pretraining and roll-out fine-tuning. Taken together, CViT exhibits robust handling of discontinuous solutions, multi-scale features, and intricate spatio-temporal dynamics. Our contributions can be viewed as a significant step towards adapting advanced computer vision architectures for building more flexible and accurate machine learning models in the physical sciences.

4737Controllable Context Sensitivity and the Knob Behind It

[openreview] [pdf]

Abstract When making predictions, a language model must trade off how much it relies on its context vs. its prior knowledge. Choosing how sensitive the model is to its context is a fundamental functionality, as it enables them to excel at tasks like retrieval-augmented generation and question-answering. In this paper, we search for a knob which controls this sensitivity, determining whether these models answer from the context or its prior knowledge. To guide this search, we design a task for controllable context sensitivity. In this task, we first feed the model a context (“Paris is in England.”) and a question (“Where is Paris?”); we then instruct the model to either use its prior or contextual knowledge and evaluate whether it generates the correct answer for both intents (either England or France). When fine-tuned on this task, instruction-tuned versions of Llama-3.1, Mistral-v0.3, and Gemma-2 can solve it with high accuracy (85-95%). Analyzing these high-performing models, we narrow down which layers may be important to context sensitivity using a novel linear time algorithm. Then, in each model, we identify a 1-D subspace in a single layer that encodes whether the model follows context or prior knowledge. Interestingly, while we identify this subspace in a fine-tuned model, we find that setting its value serves as an effective knob in not only that model but also non-fine-tuned instruct and base models of that model family. Finally, we show a strong correlation between a model’s performance and how distinctly it separates context-agreeing from context-ignoring answers in this subspace. These results suggest a single subspace facilitates how the model chooses between context and prior knowledge, hinting at a simple fundamental mechanism that controls this behavior.

4738Smooth Probabilistic Interpolation Benefits Generative Modeling for Discrete Graphs

[openreview] [pdf]

Abstract Though typically represented by the discrete node and edge attributes, the graph topological information can be sufficiently captured by the graph spectrum in a continuous space. It is believed that incorporating the continuity of graph topological information into the generative process design could establish a superior paradigm for graph generative modeling. Motivated by such prior and recent advancements in the generative paradigm, we propose Graph Bayesian Flow Networks (GraphBFN) in this paper, a principled generative framework that designs an alternative generative process emphasizing the dynamics of topological information. Unlike recent discrete-diffusion-based methods, GraphBFNemploys the continuous counts derived from sampling infinite times from a categorical distribution as latent to facilitate a smooth decomposition of topological information, demonstrating enhanced effectiveness. To effectively realize the concept, we further develop an advanced sampling strategy and new time-scheduling techniques to overcome practical barriers and boost performance. Through extensive experimental validation on both generic graph and molecular graph generation tasks, GraphBFN could consistently achieve superior or competitive performance with significantly higher training and sampling efficiency.

4739Differentiable and Learnable Wireless Simulation with Geometric Transformers

[openreview] [pdf]

Abstract Modelling the propagation of electromagnetic wireless signals is critical for designing modern communication systems. Wireless ray tracing simulators model signal propagation based on the 3D geometry and other scene parameters, but their accuracy is fundamentally limited by underlying modelling assumptions and correctness of parameters. In this work, we introduce Wi-GATr, a fully-learnable neural simulation surrogate designed to predict the channel observations based on scene primitives (e. g., surface mesh, antenna position and orientation). Recognizing the inherently geometric nature of these primitives, Wi-GATr leverages an equivariant Geometric Algebra Transformer that operates on a tokenizer specifically tailored for wireless simulation. We evaluate our approach on a range of tasks (i. e., signal strength and delay spread prediction, receiver localization, and geometry reconstruction) and find that Wi-GATr is accurate, fast, sample-efficient, and robust to symmetry-induced transformations. Remarkably, we find our results also translate well to the real world: Wi-GATr demonstrates more than 35% lower error than hybrid techniques, and 70% lower error than a calibrated wireless tracer.

4740Efficient Controlled Language Generation with Low-Rank Autoregressive Reward Models

[openreview] [pdf]

Abstract Language models trained on large amounts of data are known to produce inappropriate content in some cases and require careful tuning to be used in the real world. We revisit the reward augmented decoding (RAD) approach to control the generation from a language model using the scores from a task-specific reward model. We investigate the training objective of RAD, and reformulate it as a task of learning a reward matrix. We show that RAD is designed to support high flexibility when representing the reward matrices, which leads to a higher computational costs during decoding. However, we demonstrate that RAD does not use its full flexibility. Motivated by this, we propose a simpler but more efficient low-rank parametrization of the reward model enabling fast and effective guided decoding. For the detoxification and sentiment control tasks, we show that our low-rank reward model performs on par with the more flexible RAD parametrization, while requiring only a single reward model call per generated token.

4741Efficient optimization with orthogonality constraint: a randomized Riemannian submanifold method

[openreview] [pdf]

Abstract Optimization with orthogonality constraints frequently arise in various fields such as machine learning, signal processing and computer vision. Riemannian optimization offers a powerful framework for solving these problems by equipping the constraint set with a Riemannian manifold structure and performing optimization intrinsically on the manifold. This approach typically involves computing a search direction in the tangent space and updating variables via a retraction operation. However, as the size of the variables increases, the computational cost of the retraction can become prohibitively high, limiting the applicability of Riemannian optimization to large-scale problems. To address this challenge and enhance scalability, we propose a novel approach that restricts each update on a random submanifold, thereby significantly reducing the per-iteration complexity. We introduce two sampling strategies for selecting the random submanifold and theoretically analyze the convergence of the proposed method. We provide convergence results for general nonconvex functions and functions that satisfy Riemannian Polyak–Łojasiewicz condition as well as for stochastic optimization settings. Extensive experiments verify the benefits of the proposed method, showcasing its effectiveness across a wide variety of problem instances.

4742Model-Driven Labeled Data Free Fine-tuning

[openreview] [pdf]

Abstract Supervised fine-tuning is a prevalent technique for boosting model performance. However, it heavily depends on extensive training over labeled data. This paper introduces a novel model-driven fine-tuning method that operates independently of supervised training and labeled data. By harnessing the collective intelligence of a diverse model pool, our method enhances individual model performance through a two-phase process. Initially, we consolidate the expertise of the models within the pool to create a general meta-model. This meta-model then serves as a guide for iteratively fine-tuning the original models in a few shots, promoting a synergistic improvement in performance. Our experimental results show that this model-driven approach not only surpasses the performance of full-parameter fine-tuning models but also does so without the need for supervised training. This breakthrough offers a cost-effective and scalable alternative to traditional supervised fine-tuning, addressing the challenge of data scarcity and paving the way for future research in unsupervised model enhancement. Our work represents a significant step towards making fine-tuning techniques more accessible and practical in environments where labeled data is limited or even unavailable.

4743Coreset Selection via Reducible Loss in Continual Learning

[openreview] [pdf]

Abstract A natural solution for rehearsal-based continual learning is to select a coreset as memory. A coreset serves as an informative summary of a large dataset, enabling a model trained solely on the coreset to achieve performance comparable to training on the full dataset. Previous bi-level coreset selection methods adjust sample weights or probabilities to minimize the outer loss, which is computed over the entire dataset. For non-representative samples like ambiguous or noisy samples, since these samples are not well learned even training model on the full dataset, loss of these samples in the outer loss are not worthy to be reduced. However, their high loss values may cause them to be selected in an attempt to minimize the outer loss, which may lead to suboptimal performance for models trained on the coreset. To address this issue, we first investigate how the performance of a trained model changes when a sample is added to the training dataset and approximate this performance gain using reducible loss. We then select samples with the highest performance gain in the coreset so that performance of model trained on coreset could be maximized. We show that samples with high performance gain are informative and representative. Furthermore, reducible loss requires only forward computation, making it significantly more efficient than previous methods. To better apply coreset selection in continual learning, we extend our method to address key challenges such as task interference, streaming data, and knowledge distillation. Experiments on data summarization and continual learning demonstrate the effectiveness and efficiency of our approach.

4744Enhancing Multilingual Reasoning in LLMs: Insights from Cross-Linguistic Correlations and Optimal Data Proportions

[openreview] [pdf]

Abstract Large language models (LLMs) typically rely on fine-tuning to enhance their reasoning capabilities across various languages. However, limited research has been conducted on the optimal balance of language proportions within multilingual reasoning datasets. To fill this gap, we performed a systematic study to examine how different proportions of language data in multilingual reasoning datasets influence fine-tuning performance. Our study revealed a clear relationship between language proportions in datasets and the fine-tuning performance of LLMs. By fine-tuning multiple LLMs using the appropriate language distributions and data volumes identified in our study, we achieved state-of-the-art performance in both multilingual mathematical reasoning and solving mathematical problems using Python code. Furthermore, our approach significantly reduced data volume requirements and translation costs compared to existing methods, providing a valuable reference for future research.

4745Self-Boosting Large Language Models with Synthetic Preference Data

[openreview] [pdf]

Abstract Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.

4746Measuring Effects of Steered Representation in Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) show advanced performance and adaptability across various tasks. As the model size becomes more extensive, precise control by editing the forward process of LLMs is a challenging problem. Recent research has focused on steering hidden representations during forward propagation to guide model outputs in desired directions, yielding precise control over specific responses. Although steering shows a broader impact on diverse tasks, the influence of steered representations remains unclear. For instance, steering towards a refusal direction might lead the model to refuse even benign requests in subsequent generations. This work tackles the problem of evaluating activation steering. We introduce a counterfactual-based steering evaluation framework that compares the output of base and steered generations. Within the framework, we propose a steering effect matrix that eases the selection of generations base and steered output types. We experimentally evaluate the effects of steered representation for consequence generation with Llama3-8B, Llama2-7B, and Exaone-8B across diverse datasets. We conclude that steered representation changes the original output severely in longer contexts.

4747Q* Agent: Optimizing Language Agents with Q-Guided Exploration

[openreview] [pdf]

Abstract Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QAgent, leveraging an estimated Q value to generate intermediate annotations for open language agents. By introducing a reasoning tree and performing process reward modeling, QAgent provides effective intermediate guidance for each step. This guidance aims to automatically annotate data in a step-wise manner. Besides, we propose a Q-guided exploration strategy that can significantly boost model performance by providing process guidance during inference. Notably, even with almost half the annotated data, QAgent retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QAgent can lead to more accurate decision making through qualitative analysis.

4748Strong denoising of financial time-series

[openreview] [pdf]

Abstract In this paper we introduce a method for improving the signal to noise ratio of financial data. The approach relies on combining a target variable with different context variables and using auto-encoders (AEs) to learn reconstructions of the combined inputs. The idea is to seek agreement among multiple AEs which are trained on related but different inputs for which they are forced to find common ground. The training process is set up as a conversation where models take turns at producing a prediction (speaking) or reconciling own predictions with the output of the other AE (listening), until an agreement is reached. This leads to “mutual regularization” among the AEs. Unlike standard regularization which relies on including a complexity penalty into the loss function, the proposed method uses the partner network to detect and amend the lack of generality in the data representation. As only true regularities can be agreed upon by the AEs, the replication of noise is costly and will therefore be avoided.

4749Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-like and Personalised Early Adolescences

[openreview] [pdf]

Abstract The capabilities of large language models (LLMs) have been applied in expert systems across various domains, providing new opportunities for AI in Education (AI4Education). Educational interactions involve a cyclical exchange between teachers and students. Current research predominantly focuses on using LLMs to simulate teachers, leveraging their expertise to enhance student learning outcomes. However, the simulation of students, which could improve teachers’ instructional skills, has received insufficient attention due to the challenges of modeling and evaluating virtual students. This research poses the question: “\textit{Can LLMs be utilized to develop virtual student agents that mimic human-like behavior and individual variability?}” Unlike expert systems focusing on knowledge delivery, virtual students must replicate learning difficulties, emotional responses, and linguistic uncertainties. These traits present significant challenges in both modeling and evaluation. To address these issues, this study focuses on language learning as a context for modeling virtual student agents. We propose a novel AI4Education framework, termed \textbf{SOE} (\textbf{S}cene - \textbf{O}bject - \textbf{E}valuation), to systematically construct \textbf{LVSA} (\textbf{L}LM-based \textbf{V}irtual \textbf{S}tudent \textbf{A}gents). By curating a dataset of personalized teacher-student interactions with various personality traits, question types, and learning stages, and fine-tuning LLMs using LoRA, we conduct multi-dimensional evaluation experiments. Specifically, we: (1) develop a theoretical framework for generating LVSA; (2) integrate human subjective evaluation metrics into GPT-4 assessments, demonstrating a strong correlation between human evaluators and GPT-4 in judging LVSA authenticity; and (3) validate that LLMs can generate human-like, personalized virtual student agents in educational contexts, laying a foundation for future applications in pre-service teacher training and multi-agent simulation environments.

4750Efficient Long-range Language Modeling with Self-supervised Causal Retrieval

[openreview] [pdf]

Abstract Recently, retrieval-based language models (RLMs) have received much attention. However, most of them leverage a pre-trained retriever with fixed parameters, which may not adapt well to causal language models. In this work, we propose Grouped Cross-Attention, a novel module enabling joint pre-training of the retriever and causal LM, and apply it to long-context modeling. For a given input sequence, we split it into chunks and use the current chunk to retrieve past chunks for subsequent text generation. Our innovation allows the retriever to learn how to retrieve past chunks that better minimize the auto-regressive loss of subsequent tokens in an end-to-end manner. By integrating top-kk retrieval, our model can be pre-trained efficiently from scratch with context lengths up to 64K tokens. Our experiments show our model, compared with long-range LM baselines, can achieve lower perplexity with comparable or lower pre-training and inference costs.

4751Open-Set Graph Anomaly Detection via Normal Structure Regularisation

[openreview] [pdf]

Abstract This paper considers an important Graph Anomaly Detection (GAD) task, namely open-set GAD, which aims to train a detection model using a small number of normal and anomaly nodes (referred to asseen anomalies) to detect both seen anomalies andunseen anomalies(i.e., anomalies that cannot be illustrated the training anomalies). Those labelled training data provide crucial prior knowledge about abnormalities for GAD models, enabling substantially reduced detection errors. However, current supervised GAD methods tend to over-emphasise fitting the seen anomalies, leading to many errors of detecting the unseen anomalies as normal nodes. Further, existing open-set AD models were introduced to handle Euclidean data, failing to effectively capture discriminative features from graph structure and node attributes for GAD. In this work, we propose a novel open-set GAD approach, namely normal\underline{n}ormal structure\underline{s}tructure regularisation\underline{reg}ularisation (NSReg), to achieve generalised detection ability to unseen anomalies, while maintaining its effectiveness on detecting seen anomalies. The key idea in NSReg is to introduce a regularisation term that enforces the learning of compact, semantically-rich representations of normal nodes based on their structural relations to other nodes. When being optimised with supervised anomaly detection losses, the regularisation term helps incorporate strong normality into the modelling, and thus, it effectively avoids over-fitting the seen anomalies and learns a better normality decision boundary, largely reducing the false negatives of detecting unseen anomalies as normal. Extensive empirical results on seven real-world datasets show that NSReg significantly outperforms state-of-the-art competing methods by at least 14% AUC-ROC on the unseen anomaly classes and by 10% AUC-ROC on all anomaly classes.

4752No more hard-prompts: SoftSRV prompting for synthetic data generation

[openreview] [pdf]

Abstract We present a novel soft-prompt based framework, SoftSRV, that leverages a frozen pre-trained large language model (LLM) to generate targeted synthetic text sequences. Given a sample from the target distribution, our proposed framework uses data-driven loss minimization to train a parameterized ``variable’’ soft-prompt. This soft-prompt is then used to steer the frozen LLM to generate synthetic sequences that are similar to the target distribution. We argue that SoftSRV provides a practical improvement over common hard-prompting approaches that rely on human-curated prompt-templates, which can be idiosyncratic, labor intensive to craft, and may need to be specialized per domain. We empirically evaluate SoftSRV and other baselines, using a frozen large decoder-only model to generate synthetic fine-tuning data for a small Gemma model. To test generality, we evaluate across three different domains (coding, math, reasoning) without any particular specialization to each domain. In this challenging setting, SoftSRV significantly improves upon hard-prompt baselines, generating data with superior fine-tuning performance and that better matches the target distribution according to the {\sc mauve} similarity metric.

4753MCCE: Missingness-aware Causal Concept Explainer

[openreview] [pdf]

Abstract Causal concept effect estimation is gaining increasing interest in the field of interpretable machine learning. This general approach explains the behaviors of machine learning models by estimating the causal effect of human-understandable concepts, which represent high-level knowledge more comprehensibly than raw inputs like tokens. However, existing causal concept effect explanation methods assume complete observation of all concepts involved within the dataset, which can fail in practice due to incomplete annotations or missing concept data. We theoretically demonstrate that unobserved concepts can bias the estimation of the causal effects of observed concepts. To address this limitation, we introduce the Missingness-aware Causal Concept Explainer (MCCE), a novel framework specifically designed to estimate causal concept effects when not all concepts are observable. Our framework learns to account for residual bias resulting from missing concepts and utilizes a linear predictor to model the relationships between these concepts and the outputs of black-box machine learning models. It can offer explanations on both local and global levels. We conduct validations using a real-world dataset, demonstrating that MCCE outperforms existing state-of-the-art explanation methods in causal concept effect estimation.

4754Continual Learning via Learning a Continual Memory in Vision Transformer

[openreview] [pdf]

Abstract This paper studies task-incremental continual learning (TCL) using Vision Transformers (ViTs). Our goal is to improve the overall streaming-task performance without catastrophic forgetting by learning task synergies (e.g., a new task learns to automatically reuse/adapt modules from previous similar tasks, or to introduce new modules when needed, or to skip some modules when it appears to be an easier task). One grand challenge is how to tame ViTs at streaming diverse tasks in terms of balancing their plasticity and stability in a task-aware way while overcoming the catastrophic forgetting. To address the challenge, we propose a simple yet effective approach that identifies a lightweight yet expressive ``sweet spot’’ in the ViT block as the task-synergy memory in TCL. We present a Hierarchical task-synergy Exploration-Exploitation (HEE) sampling based neural architecture search (NAS) method for effectively learning task synergies by structurally updating the identified memory component with respect to four basic operations (reuse, adapt, new and skip) at streaming tasks. The proposed method is thus dubbed as CHEEM (Continual Hierarchical-Exploration-Exploitation Memory). In experiments, we test the proposed CHEEM on the challenging Visual Domain Decathlon (VDD) benchmark and the 5-Dataset benchmark. It obtains consistently better performance than the prior art with sensible CHEEM learned continually.

4755Talking Vehicles: Cooperative Driving via Natural Language

[openreview] [pdf]

Abstract Using natural language as a vehicle-to-vehicle (V2V) communication protocol offers the potential for autonomous vehicles to drive cooperatively not only with each other but also with human drivers. Simple and effective messages for sharing critical observations or negotiating plans to achieve coordination could improve traffic safety and efficiency compared to methods without communication. In this work, we propose a suite of traffic tasks in vehicle-to-vehicle autonomous driving where vehicles in a traffic scenario need to communicate in natural language to facilitate coordination in order to avoid an imminent collision and/or support efficient traffic flow, which we model as a general-sum partially observable stochastic game. To this end, this paper introduces a novel method, LLM+Debrief, to learn a message generation and control policy for autonomous vehicles through multi-agent discussion. To evaluate our method, we developed a gym-like simulation environment that contains a range of accident-prone driving scenarios that could be alleviated by communication. Our experimental results demonstrate that our method is more effective at generating meaningful and human-understandable natural language messages to facilitate cooperation and coordination than untrained LLMs. Our anonymous code is available in supplementary materials.

4756MLOT: Extending the Bipartite Structure towards Multi-Layered Structure for Optimal Transport

[openreview] [pdf]

Abstract Despite its remarkable success and widespread adoption in various domains, optimal transport (OT) has a rather simple structure, relying on bipartite graphs with only two layers of nodes for transportation. In this paper, we propose a multi-layered OT approach that extends the original two-layer structure to handle transportation problems across multiple hierarchical levels. Within this framework, the source distribution flows through intermediate layers, before reaching the target distribution. Unlike previous variants of OT that involve multiple distributions, our multi-layered OT typically involves uncertain intermediate distributions, which need to be computed based on the relationships between the preceding and succeeding distributions. Under entropic regularization, MLOT-Sinkhorn algorithm is further proposed for multi-layered OT, which can be accelerated using GPUs and significantly outperforms general solvers such as Gurobi. The theoretical results of our entropic MLOT are also given in this paper. In the experiments, we validate its speed advantage and convergence performance. We further validate its feasibility through Text-Image retrieval and intermediate image computing task, which demonstrates reformulating the problems as MLOT can achieve better results. Source code will be made available.

4757Discrete Codebook World Models for Continuous Control

[openreview] [pdf]

Abstract In reinforcement learning (RL), world models serve as internal simulators, enabling agents to predict environment dynamics and future outcomes in order to make informed decisions. While previous approaches leveraging discrete latent spaces, such as DreamerV3, have achieved strong performance in discrete action environments, they are typically outperformed in continuous control tasks by models with continuous latent spaces, like TD-MPC2. This paper explores the use of discrete latent spaces for continuous control with world models. Specifically, we demonstrate that quantized discrete codebook encodings are more effective representations for continuous control, compared to alternative encodings, such as one-hot and label-based encodings. Based on these insights, we introduce DCWM: Discrete Codebook World Model, a model-based RL method which surpasses recent state-of-the-art algorithms, including TD-MPC2 and DreamerV3, on continuous control benchmarks.

4758VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for MLLMs

[openreview] [pdf]

Abstract With the recent introduction of vision understanding capabilities in large language models, multimodal LLMs (MLLMs) have inherited and advanced a series of intriguing capabilities from classical LLMs. Among these capabilities, visual spatial planning - the ability to comprehend the spatial arrangements of objects and devise action plans to achieve specific desired outcomes - remains under-explored in MLLMs. In our study, we introduce VSP, a benchmark specifically designed to 1) evaluate the spatial planning capability in these models in general, and 2) break down the visual planning task into finer-grained sub-tasks, including perception and reasoning, and measure their capabilities in these sub-tasks. Contrary to expectations that MLLMs should naturally process scene images and reason effectively, evaluation on the benchmark shows that both open-source and private MLLMs fail to generate effective plans for even simple spatial planning tasks. The fine-grained analysis further reveals that while MLLMs have flaws in both perception and reasoning, the deficiency in the former capabilities is significantly worse. Evaluations on these tasks reveal fundamental deficiencies in the models’ visual perception and reasoning abilities, explaining their worse performance in the general spatial planning tasks. Our work illuminates future directions for improving multimodal LLMs’ abilities in spatial planning.

4759What Makes Large Language Models Reason in (Multi-Turn) Code Generation?

[openreview] [pdf]

Abstract Prompting techniques such as chain-of-thought have established themselves as a popular vehicle for improving the outputs of large language models (LLMs). For code generation, however, their exact mechanics and efficacy are under-explored using unified metrics and benchmarks. We thus investigate the effects of a wide range of prompting strategies with a focus on automatic re-prompting over multiple turns and computational requirements. After systematically decomposing reasoning, instruction, and execution feedback prompts, we conduct an extensive grid search on the competitive programming benchmarks CodeContests and TACO for multiple LLM families and sizes (Llama 3.0 and 3.1, 8B, 70B, 405B, and GPT-4o). Our study reveals strategies that consistently improve performance across all models with small and large sampling budgets. We then show how finetuning with such an optimal configuration allows models to internalize the induced reasoning process and obtain improvements in performance and scalability for multi-turn code generation.

4760Verbalized Graph Representation Learning: A Fully Interpretable Graph Model Based on Large Language Models Throughout the Entire Process

[openreview] [pdf]

Abstract Representation learning on text-attributed graphs (TAGs) has attracted significant interest due to its wide-ranging real-world applications, particularly through Graph Neural Networks (GNNs). Traditional GNN methods focus on encoding the structural information of graphs, often using shallow text embeddings for node or edge attributes. This limits the model to understand the rich semantic information in the data and its reasoning ability for complex downstream tasks, while also lacking interpretability. With the rise of large language models (LLMs), an increasing number of studies are combining them with GNNs for graph representation learning and downstream tasks. While these approaches effectively leverage the rich semantic information in TAGs datasets, their main drawback is that they are only partially interpretable, which limits their application in critical fields. In this paper, we propose a verbalized graph representation learning (VGRL) method which is fully interpretable. In contrast to traditional graph machine learning models, which are usually optimized within a continuous parameter space, VGRL constrains this parameter space to be text description which ensures complete interpretability throughout the entire process, making it easier for users to understand and trust the decisions of the model. We conduct several studies to empirically evaluate the effectiveness of VGRL and we believe these method can serve as a stepping stone in graph representation learning.

4761Source Attribution for Large Language Model-Generated Data

[openreview] [pdf]

Abstract The impressive performances of Large Language Models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the Intellectual Property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data being used to train the LLMs. To this end, it is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text by an LLM. In this paper, we show that this problem can be tackled by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries), and propose a source attribution framework that satisfies these key properties due to our algorithmic designs. Our framework enables an LLM to learn an accurate mapping from the generated texts to data providers, which sets the foundation for effective source attribution. Extensive empirical evaluations show that our framework achieves effective source attribution.

4762AFlow: Automating Agentic Workflow Generation

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable potential in solving complex tasks across diverse domains, typically by employing agentic workflows that follow detailed instructions and operational sequences. However, constructing these workflows requires significant human effort, limiting scalability and generalizability. Recent research has sought to automate the generation and optimization of these workflows, but existing methods still rely on initial manual setup and fall short of achieving fully automated and effective workflow generation. To address this challenge, we reformulate workflow optimization as a search problem over code-represented workflows, where LLM-invoking nodes are connected by edges. We introduce \textbf{AFlow}, an automated framework that efficiently explores this space using Monte Carlo Tree Search, iteratively refining workflows through code modification, tree-structured experience, and execution feedback. Empirical evaluations across six benchmark datasets demonstrate AFlow’s efficacy, yielding a 5.7% average improvement over state-of-the-art baselines. Furthermore, AFlow enables smaller models to outperform GPT-4o on specific tasks at 4.55% of its inference cost in dollars. The code will be made available as open-source upon publication.

4763Robust Decentralized VFL Over Dynamic Device Environment

[openreview] [pdf]

Abstract Robust collaborative learning on a network of edge devices, for vertically split datasets, is challenging because edge devices may fail due to environment conditions or events such as extreme weather. The current Vertical Federated learning (VFL) approaches assume a centralized learning setup or assume the active party or server cannot fail. To address these limitations, we first formalize the problem of VFL under dynamic network conditions such as faults (named DN-VFL). Then, we develop a novel DN-VFL method calledMultipleAggregation withGossip Rounds andSimulated Faults (MAGS) that synthesizes faults via dropout, replication, and gossiping to improve robustness significantly over baselines. We also theoretically analyze our proposed approaches to explain why they enhance robustness. Extensive empirical results validate that MAGS is robust across a range of fault rates—including extreme fault rates—compared to prior VFL approaches.

4764GraphProp: Training the Graph Foundation Models using Graph Properties

[openreview] [pdf]

Abstract In this work, we focus on training Graph Foundation Models (GFMs) for graph-level tasks like protein classification. Effective GFM training requires capturing information consistent across different domains. We have discovered that graph structures provide more consistent cross-domain information compared to node features and graph labels. However, traditional in-context learning methods primarily focus on transferring node features from various domains into a unified representation space but often lack structural cross-domain generalization. To address this, we introduce a method called GraphProp, which emphasizes structural generalization. The GraphProp training process consists of two main phases: initially, it trains a structural GFM through the supervised prediction of graph structural properties. It then uses the structural representation from this GFM as positional encoding to train a comprehensive GFM. This phase of training utilizes in-context learning with domain-specific node features and graph labels to improve cross-domain node feature generalization. Additionally, employing data augmentation in training the structural GFM helps address the scarcity of labeled graph data and facilitates explicit cross-domain structural generalization. Our experimental results demonstrate that GraphProp significantly outperforms traditional in-context learning methods, especially in handling graphs without node features.

4765Generalization, Expressivity, and Universality of Graph Neural Networks on Attributed Graphs

[openreview] [pdf]

Abstract We analyze the universality and generalization of graph neural networks (GNNs) on attributed graphs, i.e., with node attributes. To this end, we propose pseudometrics over the space of all attributed graphs that describe the fine-grained expressivity of GNNs. Namely, GNNs are both Lipschitz continuous with respect to our pseudometrics and can separate attributed graphs that are distant in the metric. Moreover, we prove that the space of all attributed graphs is relatively compact with respect to our metrics. Based on these properties, we prove a universal approximation theorem for GNNs and generalization bounds for GNNs on any data distribution of attributed graphs. The proposed metrics compute the similarity between the structures of attributed graphs via a hierarchical optimal transport between computation trees. Our work extends and unites previous approaches which either derived theory only for graphs with no attributes, derived compact metrics under which GNNs are continuous but without separation power, or derived metrics under which GNNs are continuous and separate points but the space of graphs is not relatively compact, which prevents universal approximation and generalization analysis.

4766LOGO --- Long cOntext aliGnment via efficient preference Optimization

[openreview] [pdf]

Abstract Long-context models (LCMs) have shown great potential in processing long input sequences (even more than 100M tokens) conveniently and effectively. With significant progress, recent research has pointed out that LCMs can accurately locate token-level salient information within the context. Yet, the generation performance of these LCMs is far from satisfactory and might result in misaligned responses, such as hallucinations. To enhance the generation capability of LCMs, existing works have investigated the effects of data size and quality for both pre-training and instruction tuning. Though achieving meaningful improvement, previous methods fall short in either effectiveness or efficiency. In this paper, we introduce LOGO (Long cOntext aliGnment via efficient preference Optimization), a training strategy that first introduces preference optimization for long-context alignment. To overcome the GPU memory-bound issue caused by the long sequence, LOGO employs a reference-free preference optimization strategy and adopts a position synthesis method to construct the training data. By training with only 0.3B data on a single 8×\timesA800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve comparable performance with GPT-4 in real-world long-context tasks while preserving the model’s original capabilities on other tasks, e.g., language modeling and MMLU. Moreover, LOGO can extend the model’s context window size while enhancing its generation performance.

4767Tuning Frequency Bias of State Space Models

[openreview] [pdf]

Abstract State space models (SSMs) leverage linear, time-invariant (LTI) systems to effectively learn sequences with long-range dependencies. By analyzing the transfer functions of LTI systems, we find that SSMs exhibit an implicit bias toward capturing low-frequency components more effectively than high-frequency ones. This behavior aligns with the broader notion of frequency bias in deep learning model training. We show that the initialization of an SSM assigns it an innate frequency bias and that training the model in a conventional way does not alter this bias. Based on our theory, we propose two mechanisms to tune frequency bias: either by scaling the initialization to tune the inborn frequency bias; or by applying a Sobolev-norm-based filter to adjust the sensitivity of the gradients to high-frequency inputs, which allows us to change the frequency bias via training. Using an image-denoising task, we empirically show that we can strengthen, weaken, or even reverse the frequency bias using both mechanisms. By tuning the frequency bias, we can also improve SSMs’ performance on learning long-range sequences, averaging an 88.26%88.26\% accuracy on the Long-Range Arena (LRA) benchmark tasks.

4768Towards Specialized Web Agents Using Production-Scale Workflow Data

[openreview] [pdf]

Abstract Large Language Model (LLM) agents are rapidly improving to handle increasingly complex web-based tasks. Most of these agents rely on general-purpose, proprietary models like GPT-4 and focus on designing better prompts to improve their planning abilities. However, general-purpose LLMs are not specifically trained to understand specialized web contexts such as HTML, and they often struggle with long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs using production-scale workflow data collected from over 250 domains corresponding to 6 billion tokens. This simple yet effective approach shows substantial gains over prompting-based agents on existing benchmarks---our WorkflowAgent achieves state-of-the-art performance on Mind2Web and substantially improves the baseline task success rate from 37.2% to 51.3% on WebArena. We further perform detailed ablation studies on various fine-tuning design choices and provide insights into LLM selection, training recipes, context window optimization, and effect of dataset sizes.

4769Affine Steerable Equivariant Layer for Canonicalization of Neural Networks

[openreview] [pdf]

Abstract In the field of equivariant networks, achieving affine equivariance, particularly for general group representations, has long been a challenge. In this paper, we propose the steerable EquivarLayer, a generalization of InvarLayer, by extending the concept of invariants to equivariants. The steerable EquivarLayer supports affine equivariance with arbitrary input and output representations, marking the first model to incorporate steerability into networks for the affine group. To integrate it with canonicalization, a promising approach for making pre-trained models equivariant, we introduce a novel Det-Pooling module, expanding both the applicability of EquivarLayer and the range of groups suitable for canonicalization. We conduct experiments on image classification tasks involving group transformations to validate the steerable EquivarLayer in the role of a canonicalization function, demonstrating its effectiveness over data augmentation.

4770Unbiased Attribution with Intrinsic Information

[openreview] [pdf]

Abstract The importance of attribution algorithms in the AI field lies in enhancing model transparency, diagnosing and improving models, ensuring fairness, and increasing user understanding. Gradient-based attribution methods have become the most critical because of their high computational efficiency, continuity, wide applicability, and flexibility. However, current gradient-based attribution algorithms require the introduction of additional class information to interpret model decisions, which can lead to issues of information ignorance and extra information. Information ignorance can obscure important features relevant to the current model decision, while extra information introduces irrelevant data that can cause feature leakage in the attribution process. To address these issues, we propose the Attribution with Intrinsic Information (AII) algorithm, which analyzes model decisions without the need for specified class information. Additionally, to better evaluate the potential of current attribution algorithms, we introduce the metrics of insertion confusion and deletion confusion alongside existing mainstream metrics. To continuously advance research in the field of explainable AI (XAI), our algorithm is open-sourced athttps://anonymous.4open.science/r/AII-787D/.

4771Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

[openreview] [pdf]

Abstract Chain-of-thought (CoT) prompting has become a widely used strategy for working with large language and multimodal models. While CoT has been shown to improve performance across many tasks, determining the settings in which it is effective remains an ongoing effort. In particular, it is still an open question in what settings CoT systematically reduces model performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, looking at cases where (i) verbal thinking or deliberation hurts performance in humans, and (ii) the constraints governing human performance generalize to language models. Three such cases are implicit statistical learning, visual recognition, and classifying with patterns containing exceptions. In extensive experiments across all three settings, we find that a diverse collection of state-of-the-art models exhibit significant drop-offs in performance (e.g., up to 36.3% absolute accuracy for GPT-o1 compared to GPT-4o) when using CoT compared to zero-shot counterparts. We also identify three tasks that satisfy condition (i) but not (ii), and find that while verbal thinking reduces human performance in these tasks, CoT retains or increases model performance. Overall, our results show that even though there is not an exact parallel between the cognitive processes of models and those of humans, considering cases where thinking has negative consequences for human performance can help us identify settings where it has negative consequences for models. By connecting the literature on human deliberation with evaluation of CoT, we offer a new tool that can be used in understanding the impact of prompt choices and inference-time reasoning.

4772Generalizing Stochastic Smoothing for Differentiation and Gradient Estimation

[openreview] [pdf]

Abstract We deal with the problem of gradient estimation for stochastic differentiable relaxations of algorithms, operators, simulators, and other non-differentiable functions. Stochastic smoothing conventionally perturbs the input of a non-differentiable function with a differentiable density distribution with full support, smoothing it and enabling gradient estimation. Our theory starts at first principles to derive stochastic smoothing with reduced assumptions, without requiring a differentiable density nor full support, and presenting a general framework for relaxation and gradient estimation of non-differentiable black-box functions f:RnRmf:\mathbb{R}^n\to\mathbb{R}^m. We develop variance reduction for gradient estimation from 3 orthogonal perspectives. Empirically, we benchmark 6 distributions and up to 24 variance reduction strategies for differentiable sorting and ranking, differentiable shortest-paths on graphs, differentiable rendering for pose estimation, as well as differentiable cryo-electron tomography simulations.

4773Vocabulary-Defined Semantics: Latent Space Clustering for Improving In-Context Learning

[openreview] [pdf]

Abstract In-context learning enables language models (LM) to adapt to downstream data or tasks by incorporating few samples as demonstrations within the prompts. It offers strong performance without the expense of fine-tuning. However, the performance of in-context learning can be unstable depending on the quality, format, or order of demonstrations, which in turn exacerbates the difficulty of optimization. Prior work, such as Knn Prompting, index samples based on the similarities of logits at the output-side, in addition to the regular retrieval operation at the input-side. They improve in-context learning by leveraging the core ability of next-token prediction, rather than relying solely on the emergent capacity to make analogies. Despite this, the hard-to-optimize issue of in-context learning still exists. In our view, it stems from the process of selecting demonstrations. To address this, we propose complementing in-context learning with an additional clustering operation. We propose a novel approach ``vocabulary-defined semantics’'. Grounded in LM vocabulary, which is the label space of model outputs, the proposed approach computes semantically equivalent latent representations for output labels. Then, taking the representations as centroids, a clustering operation is performed to align the semantic properties between the language model and the downstream data/tasks. Based on extensive experiments across diverse textual understanding datasets and multiple models, our approach outperforms the state-of-the-art in terms of effectiveness and efficiency. On average, it achieves 3%-49% improvements while requiring only half of the computation time.

4774HiRA: Parameter-Efficient Hadamard High-Rank Adaptation for Large Language Models

[openreview] [pdf]

Abstract We propose Hadamard High-Rank Adaptation (HiRA), a parameter-efficient fine-tuning (PEFT) method that enhances the adaptability of Large Language Models (LLMs). While Low-rank Adaptation (LoRA) is widely used to reduce resource demands, its low-rank updates may limit its expressiveness for new tasks. HiRA addresses this by using a Hadamard product to retain high-rank update parameters, improving the model capacity. Empirically, HiRA outperforms LoRA and its variants on several tasks, with extensive ablation studies validating its effectiveness. Our code will be released.

4775The Price of Freedom: Exploring Tradeoffs in Equivariant Tensor Products with Spherical Signals

[openreview] [pdf]

Abstract E(3)E(3)-equivariant neural networks have demonstrated success across a wide range of 3D modelling tasks. A fundamental operation in these networks is the tensor product, which interacts two geometric features in an equivariant manner to create new features. Due to the high computational complexity of the tensor product, significant effort has been invested to optimize the runtime of this operation. \citet{gaunt} recently proposed the Gaunt tensor product (GTP) which promises a significant speedup over the naive implementation of the tensor product. However, this method is unable to perform antisymmetric operations which are crucial for tasks involving chirality. In this work, we introduce vector signal tensor product (VSTP) to solve this issue and show how it generalizes to a class of irrep signal tensor products (ISTPs). Finally, we investigate why these tensor products are faster. We find most of the speedup comes at the price of expressivity. Further, we microbenchmarked the various tensor products and find that the theoretical runtime guarantees may differ wildly from empirical performance, demonstrating the need for careful application-specific benchmarking. Our code is linked \href{https://anonymous.4open.science/r/vector-spherical-harmonics-1231/}{here}.

4776Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solver

[openreview] [pdf]

Abstract This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar decouples reasoning into a self-play mutual generation-discrimination process. First, a target SLM augments the Monte Carlo Tree Search (MCTS) with a rich set of human-like reasoning actions to construct higher quality reasoning trajectories. Next, another SLM, with capabilities similar to the target SLM, acts as a discriminator to verify each trajectory generated by the target SLM. The mutually agreed reasoning trajectories are considered mutual consistent, thus are more likely to be correct. Extensive experiments across five SLMs demonstrate rStar can effectively solve diverse reasoning problems, including GSM8K, GSM-Hard, MATH, SVAMP, and StrategyQA. Remarkably, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B, from 36.46% to 81.88% for Mistral-7B, from 74.53% to 91.13% for LLaMA3-8B-Instruct.

4777Which Tasks Should Be Compressed Together? A Causal Discovery Approach for Efficient Multi-Task Representation Compression

[openreview] [pdf]

Abstract Traditional image compression methods often overlook the intricate interdependencies in multi-task learning, resulting in inefficiency and redundancy. In this paper, we propose a novel compression framework that leverages causal graph models to uncover conditional relationships between mutually beneficial task clusters. By constructing directed acyclic graphs (DAGs) based on conditional entropy, we capture the causal links among tasks, enabling progressive, context-aware compression. Parent representations act as hyperpriors for their dependents, reducing redundancy, enhancing scalability, and boosting compression efficiency. Extensive experiments across key computer vision tasks, including segmentation, depth zbuffer, and autoencoding demonstrate superior bitrate reduction and task performance. Our findings underscore the importance of disentangling task representations and modelling causal relationships for efficient multi-task compression, offering a new perspective on compact representation learning for advanced intelligent systems. Code will be available at:https://github.com.

4778CoLoRA: A Competitive Learning Approach for Enhancing LoRA

[openreview] [pdf]

Abstract We propose a Competitive Low-Rank Adaptation (CoLoRA) framework to address the limitations of the LoRA method, which either lacks capacity with a single rank-rr LoRA or risks inefficiency and overfitting with a larger rank-KrKr LoRA, where KK is an integer larger than 1. The proposed CoLoRA method initializes KK distinct LoRA components, each with rank rr, and allows them to compete during training. This competition drives each LoRA component to outperform the others, improving overall model performance. The best-performing LoRA is selected based on validation metrics, ensuring that the final model outperforms a single rank-rr LoRA and matches the effectiveness of a larger rank-KrKr LoRA, all while avoiding extra computational overhead during inference. To the best of our knowledge, this is the first work to introduce and explore competitive learning in the context of LoRA optimization. The CoLoRA’s code will be released later.

4779SingNet: Towards a Large-Scale, Diverse, and In-the-Wild Singing Voice Dataset

[openreview] [pdf]

Abstract The lack of a publicly-available large-scale and diverse dataset has long been a significant bottleneck for singing voice applications like Singing Voice Synthesis (SVS) and Singing Voice Conversion (SVC). To tackle this problem, we present SingNet, an extensive, diverse, and in-the-wild singing voice dataset. Specifically, we propose a data processing pipeline to extract ready-to-use training data from sample packs and songs on the internet, forming 3000 hours of singing voices in various languages and styles. Furthermore, to facilitate the use and demonstrate the effectiveness of SingNet, we pre-train and open-source various state-of-the-art (SOTA) models on Wav2vec2, BigVGAN, and NSF-HiFiGAN based on our collected singing voice data. We also conduct benchmark experiments on Automatic Lyric Transcription (ALT), Neural Vocoder, and Singing Voice Conversion (SVC). Audio demos are available at:https://singnet-dataset.github.io/.

4780TypedThinker: Typed Thinking Improves Large Language Model Reasoning

[openreview] [pdf]

Abstract Despite significant advancements in the reasoning capabilities of Large Language Models (LLMs), the exploration of diverse reasoning solutions remains understudied. In this paper, we propose TypedThinker, a novel framework that enhances LLMs’ problem-solving abilities by incorporating multiple reasoning types (deductive, inductive, abductive, and analogical). Our analysis across four benchmarks reveals that different reasoning types uniquely solve distinct sets of problems, highlighting the importance of diverse thinking approaches. TypedThinker addresses two key challenges: selecting appropriate reasoning types for given problems and effectively implementing specific reasoning types. The framework employs a meta-thinker for reasoning type selection and a reasoner for execution, supported by an explicit memory for experience retrieval. Through self-training on successful experiences, TypedThinker learns an implicit policy for reasoning type selection and application. Experimental results demonstrate significant improvements over baseline models, with accuracy increases of 3.4% for Mistral 7B and 16.7% for LLaMA3 8B across logical and mathematical benchmarks. Notably, TypedThinker shows effective generalization to new benchmarks and can enhance even powerful models like GPT-4o.

4781Enhancing Cross-Lingual and Cross-Domain Adaptability in Large Language Models for Software Engineering

[openreview] [pdf]

Abstract This paper presents a groundbreaking mathematical framework for unsupervised domain adaptation (UDA) in the context of cross-lingual and cross-domain code modeling. We introduce the Enhanced Dynamic Code Modeling (UDA-EDCM) system, which leverages advanced concepts from measure theory, differential geometry, and information geometry to address the challenges posed by the diversity of natural and programming languages. At the core of UDA-EDCM is a novel measure-theoretic formulation of domain adaptation, utilizing optimal transport theory to minimize the discrepancy between source and target domains. We develop a Riemannian manifold approach to feature space alignment, introducing a Geodesic Flow Kernel that captures the intrinsic geometry of the code representation space. The UDA-EDCM operator is analyzed through the lens of functional analysis, revealing its spectral properties and their implications for generalization. Our information-theoretic bound on domain adaptation provides insights into the fundamental limits of knowledge transfer in code modeling. We present a unified theorem that synthesizes these diverse mathematical perspectives, offering a comprehensive characterization of UDA-EDCM’s performance in terms of Wasserstein distance, empirical Rademacher complexity, and Fisher information. This theoretical foundation is complemented by an innovative optimization framework based on the Fisher Information Metric, ensuring efficient convergence in the probabilistic manifold of model parameters. Extensive experiments demonstrate that UDA-EDCM significantly outperforms existing approaches in zero-shot and few-shot learning scenarios across a wide range of programming languages and coding tasks. Our work not only advances the baselines in domain adaptation for code intelligence but also establishes a rigorous mathematical basis for future research in adaptive AI systems for software engineering.

4782Disentangling the Roles of Representation and Selection in Data Pruning (for Fine-Tuning)

[openreview] [pdf]

Abstract Data pruning, the process of carefully selecting a small subset of training data, has been shown to improve both training efficiency and performance. It typically involves two steps: (1) obtaining a representation for each instance, and (2) applying a selection algorithm using these representations. However, the distinct roles of these two steps, as well as their interactions, remain unclear. To address this, we conduct a systematic study of data pruning, focusing on NLP fine-tuning. Our theoretical and empirical findings reveal that data representation often plays a more fundamental role than the selection algorithm: gradients, despite being computationally expensive, provide stronger pruning signals than other representations, making gradient-based methods consistently outperform cheaper alternatives. We also demonstrate that different selection algorithms excel in specific scenarios but are heavily influenced by the chosen representation. These insights provide clear guidelines for future research and practical applications.

4783Batched Bayesian optimization with correlated candidate uncertainties

[openreview] [pdf]

Abstract Batched Bayesian optimization (BO) can accelerate molecular design by efficiently identifying top-performing compounds from a large chemical library. Existing acquisition strategies for batch design in BO aim to balance exploration and exploitation. This often involves optimizing non-additive batch acquisition functions, necessitating approximation via myopic construction and/or diversity heuristics. In this work, we propose an acquisition strategy for discrete optimization that is motivated by pure exploitation, qPO (multipoint Probability of Optimality). qPO maximizes the probability that the batch includes the true optimum, which is expressible as the sum over individual acquisition scores and thereby circumvents the combinatorial challenge of optimizing a batch acquisition function. We differentiate the proposed strategy from parallel Thompson sampling and discuss how it implicitly captures diversity. Finally, we apply our method to the model-guided exploration of large chemical libraries and provide empirical evidence that it performs better than or on par with state-of-the-art methods in batched Bayesian optimization.

4784EMOE: Expansive Matching of Experts for Robust Uncertainty Based Rejection

[openreview] [pdf]

Abstract Expansive Matching of Experts (EMOE) is a novel method that utilizes support-expanding, extrapolatory pseudo-labeling to improve prediction and uncertainty based rejection on out-of-distribution (OOD) points. We propose an expansive data augmentation technique that generates OOD instances in a latent space, and an empirical trial based approach to filter out augmented expansive points for pseudo-labeling. EMOE utilizes a diverse set of multiple base experts as pseudo-labelers on the augmented data to improve OOD performance through a shared MLP with multiple heads (one per expert). We demonstrate that EMOE achieves superior performance compared to state-of-the-art methods on both image and tabular data.

4785Story-Adapter: A Training-free Iterative Framework for Long Story Visualization

[openreview] [pdf]

Abstract Story visualization, the task of generating coherent images based on a narrative, has seen significant advancements with the emergence of text-to-image models, particularly diffusion models. However, maintaining semantic consistency, generating high-quality fine-grained interactions, and ensuring computational feasibility remain challenging, especially in long story visualization (i.e., up to 100 frames). In this work, we propose a training-free and computationally efficient framework, termed Story-Adapter, to enhance the generative capability of long stories. Specifically, we propose an iterative paradigm to refine each generated image, leveraging both the text prompt and all generated images from the previous iteration. Central to our framework is a training-free global reference crossattention module, which aggregates all generated images from the previous iteration to preserve semantic consistency across the entire story, while minimizing computational costs with global embeddings. This iterative process progressively optimizes image generation by repeatedly incorporating text constraints, resulting in more precise and fine-grained interactions. Extensive experiments validate the superiority of Story-Adapter in improving both semantic consistency and generative capability for fine-grained interactions, particularly in long story scenarios.

4786OVTR: End-to-End Open-Vocabulary Multiple Object Tracking with Transformer

[openreview] [pdf]

Abstract Open-vocabulary multiple object tracking aims to generalize trackers to unseen categories during training, enabling their application across a variety of real-world scenarios. However, the existing open-vocabulary tracker, which relies on off-the-shelf open-vocabulary detector, perceives categories and locations independently in each frame, causing instability and making it vulnerable to similar appearances and irregular motion in diverse scenes. In this paper, we propose OVTR (End-to-End Open-Vocabulary Multiple Object Tracking with TRansformer), the first end-to-end open-vocabulary tracker that models motion, appearance, and category simultaneously. To achieve stable classification and continuous tracking, we designed the CIP (Category Information Propagation) strategy, which establishes multiple high-level category information priors for subsequent frames. Additionally, we introduce a dual-branch structure for generalization capability and deep multimodal interaction, and incorporate protective strategies in the decoder to enhance performance. Notably, our method does not require proposals that contain novel categories, yet still achieves strong results on the open-vocabulary MOT benchmark. Moreover, experiment transferring the model to other dataset demonstrates its effective adaptability.

4787Dynamic Contrastive Learning for Time Series Representation

[openreview] [pdf]

Abstract Understanding events in time series is an important task in a variety of contexts. However, human analysis and labeling are expensive and time-consuming. Therefore, it is advantageous to learn embeddings for moments in time series in an unsupervised way, which allows for good performance in classification or detection tasks after later minimal human labeling. In this paper, we propose dynamic contrastive learning (DynaCL), an unsupervised representation learning framework for time series that uses temporal adjacent steps to define positive pairs. DynaCL adopts N-pair loss to dynamically treat all samples in a batch as positive or negative pairs, enabling efficient training and addressing the challenges of complicated sampling of positives. We demonstrate that DynaCL embeds instances from time series into well-defined, semantically meaningful clusters, which allows superior performance on downstream tasks on a variety of public time series datasets. Our findings also reveal that high scores on unsupervised clustering metrics do not guarantee that the representations are useful in downstream tasks.

4788EEEC: Emotion-Experiencer-Event-Cause multi-step chain reasoning for Emotion-Cause Pair Extraction

[openreview] [pdf]

Abstract Emotion-cause pair extraction (ECPE) aims to identify all emotion and cause clauses in documents, forming the ECPs. Although existing methods have achieved some success, they face issues such as overlooking the impact of emotion experiencers, failing to leverage specific domain knowledge, and tending to spurious correlations. To address these issues, we transform the ECPE task into a multi-step reasoning problem and propose the Emotion-Experience-Event-Cause (EEEC) framework. We introduce an experiencer identification task to understand the source of emotions and enhance the association between emotion and cause clauses. In addition, by combining both prior knowledge and induced reasoning, EEEC guides a large-scale language model (LLM) to perform the emotion-reason pair extraction task efficiently. Experimental results demonstrate that EEEC achieves performance close to current state-of-the-art supervised fine-tuning methods. The data and code are released athttps://anonymous.4open.science/r/EEEC-EB80/.

4789Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

[openreview] [pdf]

Abstract Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability---which, in part, aims to identify model components (circuits) associated to specific interpretable mechanisms that make up a model capability---can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the \textit{lookup-table mechanism} for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models. We also find that certain localized edits disrupt the latent knowledge in the model more than any other baselines, making unlearning more robust to various attacks.

4790Long-Term 3D Point Tracking By Cost Volume Fusion

[openreview] [pdf]

Abstract Long-term point tracking is essential to understand non-rigid motion in the physical world better. Deep learning approaches have recently been incorporated into long-term point tracking, but most prior work predominantly functions in 2D. Although these methods benefit from the well-established backbones and matching frameworks, the motions they produce do not always make sense in the 3D physical world. In this paper, we propose the first deep learning framework for long-term point tracking in 3D that generalizes to new points and videos without requiring test-time fine-tuning. Our model contains a cost volume fusion module that effectively integrates multiple past appearances and motion information via a transformer architecture, significantly enhancing overall tracking performance. In terms of 3D tracking performance, our model significantly outperforms simple scene flow chaining and previous 2D point tracking methods, even if one uses ground truth depth and camera pose to backproject 2D point tracks in a synthetic scenario.

4791Efficient transformer with reinforced position embedding for language models

[openreview] [pdf]

Abstract In this paper, we propose an efficient transformer architecture that uses reinforced positional embedding to obtain superior performance with half the number of encoder decoder layers. We demonstrate that concatenating positional encoding with trainable token embeddings, normalizing across tokens in the token embedding matrix, and using the normalized token embedding matrix as the value of the attention layer improve the training and validation loss and the training time in an encoder-decoder Transformer model for a Portuguese-English translation task with 10 epochs or 12 hours of training across 10 trials. Our method, with roughly a threefold parameter reduction compared to the baseline model, yields a mean training loss of 1.21, a mean validation loss of 1.51, and an average training time of 1352.27 seconds per epoch, surpassing the baseline model with the same embedding dimension that employs addition of positional encoding and token embeddings, which achieves a mean training loss of 1.96, a validation loss of 2.18, and an average training time of 4297.79 seconds per epoch. Additionally, we evaluated our proposed architecture and the baseline across 14 diverse translation datasets from TensorFlow. The results indicate that our method consistently achieves lower or comparable training and validation losses, suggesting enhanced learning efficiency.

4792Primphormer: Leveraging Primal Representation for Graph Transformers

[openreview] [pdf]

Abstract Graph Transformers (GTs) have emerged as a promising approach for graph representation learning. Despite their successes, the quadratic complexity of GTs limits scalability on large graphs due to their pair-wise computations. To fundamentally reduce the computational burden of GTs, we introduce Primphormer, a primal-dual framework that interprets the self-attention mechanism on graphs as a dual representation and then models the corresponding primal representation with linear complexity. Theoretical evaluations demonstrate that Primphormer serves as a universal approximator for functions on both sequences and graphs, showcasing its strong expressive power. Extensive experiments on various graph benchmarks demonstrate that Primphormer achieves competitive empirical results while maintaining a more user-friendly memory and computational costs.

4793AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment

[openreview] [pdf]

Abstract With the growing adoption of reinforcement learning with human feedback (RLHF) for aligning large language models (LLMs), the risk of backdoor installation during alignment has increased, leading to unintended and harmful behaviors. Existing backdoor triggers are typically limited to fixed word patterns, making them detectable during data cleaning and easily removable post-poisoning. In this work, we explore the use of prompt-specific paraphrases as backdoor triggers, enhancing their stealth and resistance to removal during LLM alignment. We propose AdvBDGen, an adversarially fortified generative fine-tuning framework that automatically generates prompt-specific backdoors that are effective, stealthy, and transferable across models. AdvBDGen employs a generator-detector pair, fortified by an adversary, to ensure the installability and stealthiness of backdoors. It enables the crafting of complex triggers using as little as 3% of the fine-tuning data. Once installed, these backdoors can jailbreak LLMs during inference, demonstrate improved stability against perturbations compared to traditional constant triggers, and are harder to remove. These properties highlight the greater risks posed by such an adversarially crafted backdoors to LLM alignment.

4794DynaPrompt: Dynamic Test-Time Prompt Tuning

[openreview] [pdf]

Abstract Test-time prompt tuning enhances zero-shot generalization of vision-language models but tends to ignore the relatedness among test samples during inference. Online test-time prompt tuning provides a simple way to leverage the information in previous test samples, albeit with the risk of prompt collapse due to error accumulation. To enhance test-time prompt tuning, we propose DynaPrompt, short for dynamic test-time prompt tuning, exploiting relevant data distribution information while reducing error accumulation. Built on an online prompt buffer, DynaPrompt adaptively selects and optimizes the relevant prompts for each test sample during tuning. Specifically, we introduce a dynamic prompt selection strategy based on two metrics: prediction entropy and probability difference. For unseen test data information, we develop dynamic prompt appending, which allows the buffer to append new prompts and delete the inactive ones. By doing so, the prompts are optimized to exploit beneficial information on specific test data, while alleviating error accumulation. Experiments on fourteen datasets demonstrate the effectiveness of dynamic test-time prompt tuning.

4795Instruct-SkillMix: A Powerful Pipeline for LLM Instruction Tuning

[openreview] [pdf]

Abstract We introduce INSTRUCT-SKILLMIX, an automated approach for creating diverse, high quality SFT data for instruction-following. The pipeline involves two stages, each leveraging an existing powerful LLM: (1) Skill extraction: uses the LLM to extract core “skills” for instruction-following by directly prompting the model. This is inspired by “LLM metacognition” of (Didolkar et al., 2024); (2) Data generation: uses the powerful LLM to generate (instruction, response) data that exhibit a randomly chosen pair of these skills. Here, the use of random skill combinations promotes diversity and difficulty. The estimated cost of creating the dataset is under $600.Vanilla SFT (i.e., no PPO, DPO, or RL methods) on data generated from INSTRUCT-SKILLMIX leads to strong gains on instruction following benchmarks such as AlpacaEval 2.0, MT-Bench, and WildBench. With just 4K examples, LLaMA-3-8B-Base achieves 42.76% length-controlled win rate on AlpacaEval 2.0, a level similar to frontier models like Claude 3 Opus and LLaMA-3.1-405B-Instruct. Ablation studies also suggest plausible reasons for why creating open instruction-tuning datasets via naive crowd-sourcing has proved difficult. In our dataset,adding 20% low quality answers (“shirkers”) causes a noticeable degradation in performance.The INSTRUCT-SKILLMIX pipeline seems flexible and adaptable to other settings.

4796SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation

[openreview] [pdf]

Abstract LLM inference for popular enterprise use cases, such as summarization, RAG, and code-generation, typically observes orders of magnitude longer prompt lengths than generation lengths. This characteristic leads to high cost of prefill and increased response latency. In this paper, we present SwiftKV, a novel model transformation and distillation procedure specifically designed to reduce the time and cost of processing prompt tokens while preserving high quality of generated tokens. SwiftKV combines three key mechanisms: i) SingleInputKV, which prefills later layers’ KV cache using a much earlier layer’s output, allowing prompt tokens to skip much of the model computation, ii) AcrossKV, which merges the KV caches of neighboring layers to reduce the memory footprint and support larger batch size for higher throughput, and iii) a knowledge-preserving distillation procedure that can adapt existing LLMs for SwiftKV with minimal accuracy impact and low compute and data requirement. For Llama-3.1-8B and 70B, SwiftKV reduces the compute requirement of prefill by 50% and the memory requirement of the KV cache by 62.5% while incurring minimum quality degradation across a wide range of tasks. In the end-to-end inference serving using an optimized vLLM implementation, SwiftKV realizes up to 2x higher aggregate throughput and 60% lower time per output token. It can achieve a staggering 560 TFlops/GPU of normalized inference throughput, which translates to 16K tokens/s for Llama-3.1-70B in 16-bit precision on 4x H100 GPUs. Our training, inference, and model implementations are open-sourced athttps://anonymized.link.

4797Hypercone Assisted Contour Generation for Out-of-Distribution Detection

[openreview] [pdf]

Abstract Recent advances in the field of out-of-distribution (OOD) detection have placed great emphasis on learning better representations suited to this task. While there have been distance-based approaches, distributional awareness has seldom been exploited for better performance. We present HACk-OOD, a novel OOD detection method that makes no distributional assumption about the data, but automatically adapts to its distribution. Specifically, HACk-OOD constructs a set of hypercones by maximizing the angular distance to neighbors in a given data-point’s vicinity, to approximate the contour within which in-distribution (ID) data-points lie. Experimental results show state-of-the-art FPR@95 and AUROC performance on Near-OOD detection and on Far-OOD detection on the challenging CIFAR-100 benchmark without explicitly training for OOD performance.

4798The Role of Deductive and Inductive Reasoning in Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved substantial progress in artificial intelligence, particularly in reasoning tasks. However, their reliance on static prompt structures, coupled with limited dynamic reasoning capabilities, often constrains their adaptability to complex and evolving problem spaces. In this paper, we propose the Deductive and InDuctive (DID) method, which enhances LLM reasoning by dynamically integrating both deductive and inductive reasoning within the prompt construction process. Drawing inspiration from cognitive science, the DID approach mirrors human adaptive reasoning mechanisms, offering a flexible framework that allows the model to adjust its reasoning pathways based on task context and performance. We empirically validate the efficacy of DID on established datasets such as AIW and MR-GSM8K, as well as on our custom dataset, Holiday Puzzle, which presents tasks about different holiday date calculating challenges. By leveraging DID’s hybrid prompt strategy, we demonstrate significant improvements in both solution accuracy and reasoning quality, achieved without imposing substantial computational overhead. Our findings suggest that DID provides a more robust and cognitively aligned framework for reasoning in LLMs, contributing to the development of advanced LLM-driven problem-solving strategies informed by cognitive science models.

4799KooNPro: A Variance-Aware Koopman Probabilistic Model Enhanced by Neural Processes for Time Series Forecasting

[openreview] [pdf]

Abstract The probabilistic forecasting of time series is a well-recognized challenge, particularly in disentangling correlations among interacting time series and addressing the complexities of distribution modeling. By treating time series as temporal dynamics, we introduce \textbf{KooNPro}, a novel probabilistic time series forecasting model that combines variance-aware deep \textbf{Koo}pman model with \textbf{N}eural \textbf{Pro}cesses. KooNPro introduces a variance-aware continuous spectrum using Gaussian distributions to capture complex temporal dynamics with improved stability. It further integrates the Neural Processes to capture fine dynamics, enabling enhanced dynamics capture and prediction. Extensive experiments on nine real-world datasets demonstrate that KooNPro consistently outperforms state-of-the-art baselines. Ablation studies highlight the importance of the Neural Process component and explore the impact of key hyperparameters. Overall, KooNPro presents a promising novel approach for probabilistic time series forecasting.

4800SQuBa: Speech Mamba Language Model with Querying-Attention for Efficient Summarization

[openreview] [pdf]

Abstract Abstractive Speech Summarization (SSum) becomes increasingly difficult as the input speech length grows. To address this, we present SQuBa (Speech Querying Mamba Architecture), an end-to-end model designed explicitly for efficient speech summarization. SQuBa leverages a querying-attention Mamba projector to condense extended acoustic features into compact semantic tokens, which are subsequently summarized by the Mamba Large Language Model (LLM). The architecture’s computational complexity scales linearly with input length, enabling efficient handling of longer inputs. A two-stage training framework, complemented by bootstrapped Direct Preference Optimization (DPO) fine-tuning, empowers SQuBa to generate concise and coherent summaries. Experimental results demonstrate that SQuBa delivers competitive performance while significantly improving inference speed, making it ideal for real-world applications such as podcast and meeting transcriptions.

4801MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark

[openreview] [pdf]

Abstract Evaluating instruction following capabilities for multimodal, multi-turn dialogue is challenging. With potentially multiple instructions in the input model context, the task is time-consuming for human raters and we show LLM based judges are biased towards answers from the same model. We propose MMMT-IF, an image based multi-turn Q&A evaluation set with added global instructions between questions, constraining the answer format. This challenges models to retrieve instructions dispersed across long dialogues and reason under instruction constraints. All instructions are objectively verifiable through code execution. We introduce the Programmatic Instruction Following (PIF\operatorname{PIF}) metric to measure the fraction of the instructions that are correctly followed while performing a reasoning task. The PIF-N-K\operatorname{PIF-N-K} set of metrics further evaluates robustness by measuring the fraction of samples in a corpus where, for each sample, at least K out of N generated model responses achieve a PIF\operatorname{PIF} score of one. The PIF\operatorname{PIF} metric aligns with human instruction following ratings, showing 60 percent correlation. Experiments show Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet, have a PIF\operatorname{PIF} metric that drops from 0.81 on average at turn 1 across the models, to 0.64 at turn 20. Across all turns, when each response is repeated 4 times (PIF-4-4\operatorname{PIF-4-4}), GPT-4o and Gemini successfully follow all instructions only 11% of the time. When all the instructions are also appended to the end of the model input context, the PIF\operatorname{PIF} metric improves by 22.3 points on average, showing that the challenge with the task lies not only in following the instructions, but also in retrieving the instructions spread out in the model context. We plan to open source the MMMT-IF dataset and metric computation code.

4802Autoencoder-Based Hybrid Replay for Class-Incremental Learning

[openreview] [pdf]

Abstract In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks tt increases. Current exemplar replay strategies impose O(t)\mathcal{O}(t) memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) to function as a compressor to alleviate the requirement for large memory, achieving O(0.1t)\mathcal{O}(0.1 t) at the worst case with the computing complexity of O(t)\mathcal{O}(t) while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged particle system energy minimization equations and repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets.

4803Stealthy Shield Defense: A Conditional Mutual Information-Based Approach against Black-Box Model Inversion Attacks

[openreview] [pdf]

Abstract Model inversion attacks (MIA) aim to uncover private training data by accessing public models, raising increasing concerns about privacy breaches. Black-box MIA, where attackers can generate inputs and obtain the model’s outputs arbitrarily, has gained more attention due to its closer alignment with real-world scenarios and greater potential threats. Existing defenses primarily focus on white-box attacks, with a lack of specialized defenses to address the latest black-box attacks. To fill this technological gap, we propose a post-processing defense algorithm based on conditional mutual information (CMI). We have theoretically proven that our CMI framework serves as a special information bottleneck, making outputs less dependent on inputs and more dependent on true labels. To further reduce the modifications to outputs, we introduce an adaptive rate-distortion framework and optimize it by water-filling method. Experimental results show that our approach outperforms existing defenses, in terms of both MIA robustness and model utility, across various attack algorithms, training datasets, and model architectures. In particular, on CelebA dataset, our defense lowers the attack accuracy of LOKT to 0% while other defenses remain 50-75%.

4804Leveraging VLMs for MUDA:A Category-Specific Prompting with Multi-Modal Low-Rank Adapter

[openreview] [pdf]

Abstract Multi-Source Domain Adaptation (MSDA) aims to adaptively apply knowledge from multiple source pre-trained models to an unlabeled target domain. Current MSDA methods typically require extensive parameter tuning for each source model, which becomes computationally expensive, especially when dealing with numerous source domains or larger source models. With the recent advancements of Vision-Language Models (VLMs) as natural source models, the challenges of cross-domain tasks based on multi-source domains have evolved: 1) VLMs rapidly adapt to downstream tasks through prompt tuning, yet learnable prompt tokens are prone to overfftting due to limited training samples; 2) Rapidly leveraging knowledge from multiple source domains and encouraging the learning of invariant representations across these domains is a central issue; 3) The presence of visual and textual domain gaps, as well as cross-modal misalignment, can signiffcantly impact model performance. In this paper, we propose a ffnetuning framework that integrates prompts with multimodal Low-Rank Adaptation (LoRA). This framework employs learnable prompt features as shared characteristics across different domains and utilizes multimodal LoRA matrices to represent domain-speciffc features for individual ffne-tuning of VLMs across multiple source domains. Furthermore, it encourages interaction between ffne-tuning parameters from different domains and modalities to enhance consistency. We combine all source domain-speciffc LoRA modules into an integrated module using a set of coefffcients and adapt this integrated module to learn on the target domain. Extensive experiments demonstrate that our approach achieves signiffcant improvements on standard image classiffcation benchmark datasets, highlighting its effectiveness in multi-source domain adaptation tasks.

4805OmniRe: Omni Urban Scene Reconstruction

[openreview] [pdf]

Abstract We introduce OmniRe, a comprehensive system for efficiently creating high-fidelity digital twins of dynamic real-world scenes from on-device logs. Recent methods using neural fields or Gaussian Splatting primarily focus on vehicles, hindering a holistic framework for all dynamic foregrounds demanded by downstream applications, e.g., the simulation of human behavior. OmniRe extends beyond vehicle modeling to enable accurate, full-length reconstruction of diverse dynamic objects in urban scenes. Our approach builds scene graphs on 3DGS and constructs multiple Gaussian representations in canonical spaces that model various dynamic actors, including vehicles, pedestrians, cyclists, and others. OmniRe allows holistically reconstructing any dynamic object in the scene, enabling advanced simulations (~60 Hz) that include human-participated scenarios, such as pedestrian behavior simulation and human-vehicle interaction. This comprehensive simulation capability is unmatched by existing methods. Extensive evaluations on the Waymo dataset show that our approach outperforms prior state-of-the-art methods quantitatively and qualitatively by a large margin. We further extend our results to 5 additional popular driving datasets to demonstrate its generalizability on common urban scenes. We will make the code and data publicly available.

[openreview] [pdf]

Abstract End-to-End Autonomous Driving (E2EAD) methods typically rely on supervised perception tasks to extract explicit scene information (e.g., objects, maps). This reliance necessitates expensive annotations and constrains deployment and data scalability in real-time applications. In this paper, we introduce SSR, a novel framework that utilizes only 16 navigation-guided tokens as Sparse Scene Representation, efficiently extracting crucial scene information for E2EAD. Our method eliminates the need for supervised sub-tasks, allowing computational resources to concentrate on essential elements directly related to navigation intent. We further introduce a temporal enhancement module that employs a Bird’s-Eye View (BEV) world model, aligning predicted future scenes with actual future scenes through self-supervision. SSR achieves state-of-the-art planning performance on the nuScenes dataset, demonstrating a 27.2% relative reduction in L2 error and a 51.6% decrease in collision rate to the leading E2EAD method, UniAD. Moreover, SSR offers a 10.9× faster inference speed and 13× faster training time. This framework represents a significant leap in real-time autonomous driving systems and paves the way for future scalable deployment. Code will be released.

4807Knowledge-localized Unlearning for Faithful Forgetting in Language Models

[openreview] [pdf]

Abstract Large language models are exposed to privacy risks since they are trained on large text corpus, which may include sensitive or private information. Therefore, existing studies have attempted to unlearn undesirable knowledge exposed without permission from a language model. However, they are limited in that they have overlooked the complex and interconnected nature of knowledge, where related knowledge must be carefully examined. Specifically, they have failed to evaluate whether an unlearning method faithfully erases interconnected knowledge that should be removed, retaining knowledge that appears relevant but exists in a completely different context. To resolve this problem, we first define a new concept called superficial unlearning, which refers to the phenomenon where an unlearning method either fails to erase the interconnected knowledge it should remove or unintentionally erases irrelevant knowledge. Based on the definition, we introduce a new benchmark, FaithUnBench, to analyze and evaluate the faithfulness of unlearning in real-world knowledge QA settings. Furthermore, we propose a novel unlearning method, KLUE, which identifies and updates only knowledge-related neurons to achieve faithful unlearning. KLUE categorizes knowledge neurons using an explainability method and updates only those neurons using selected unforgotten samples. Experimental results demonstrate that widely-used unlearning methods fail to ensure faithful unlearning, while our method shows significant effectiveness in real-world QA settings.

4808RotRNN: Modelling Long Sequences with Rotations

[openreview] [pdf]

Abstract Linear recurrent neural networks, such as State Space Models (SSMs) and Linear Recurrent Units (LRUs), have recently shown state-of-the-art performance on long sequence modelling benchmarks. Despite their success, their empirical performance is not well understood and they come with a number of drawbacks, most notably their complex initialisation and normalisation schemes. In this work, we address some of these issues by proposing RotRNN – a linear recurrent model which utilises the convenient properties of rotation matrices. We show that RotRNN provides a simple and efficient model with a robust normalisation procedure, and a practical implementation that remains faithful to its theoretical derivation. RotRNN also achieves competitive performance to state-of-the-art linear recurrent models on several long sequence modelling datasets.

4809T2V2: A Unified Non-Autoregressive Model for Speech Recognition and Synthesis via Multitask Learning

[openreview] [pdf]

Abstract We introduce T2V2 (Text toVoice andVoice toText), a unified non-autoregressive model capable of performing both automatic speech recognition (ASR) and text-to-speech (TTS) synthesis within the same framework. T2V2 uses a shared Conformer backbone with rotary positional embeddings to efficiently handle these core tasks, with ASR trained using Connectionist Temporal Classification (CTC) loss and TTS using masked language modeling (MLM) loss. The model operates on discrete tokens, where speech tokens are generated by clustering features from a self-supervised learning model. To further enhance performance, we introduce auxiliary tasks: CTC error correction to refine raw ASR outputs using contextual information from speech embeddings, and unconditional speech MLM, enabling classifier free guidance to improve TTS. Our method is self-contained, leveraging intermediate CTC outputs to align text and speech using Monotonic Alignment Search, without relying on external aligners. We perform extensive experimental evaluation to verify the efficacy of the T2V2 framework, achieving state-of-the-art performance on TTS task and competitive performance in discrete ASR.

4810U3D: Unlocking the Video Prior for High Fidelity Sparse Novel View Synthesis and 3D Generation

[openreview] [pdf]

Abstract Trained on massive datasets, video diffusion models have shown strong generative priors for novel view synthesis tasks. Existing methods finetune these models to synthesize 360-degree orbit videos from input images. While these methods demonstrate the pretrained models’ generalization ability, they are limited by the assumption of temporal attention and struggle to generate highly consistent results. Additionally, generating novel views as a sequence of twenty or more frames incurs high computational costs compared to sparse view synthesis methods. Sparse novel view synthesis methods finetuned from traditional 2D diffusion models, on the other hand, can generate highly consistent images from arbitrary camera positions but suffer from poor generalization, leading to unsatisfactory results on out-of-domain inputs. In this paper, we explore leveraging video diffusion models’ rich generative priors to enhance sparse novel view generation models. Specifically, we investigate the generation process of video diffusion models and unearth key observations to extract geometrical priors from them. Based on this, we propose a novel framework, U3D, for sparse novel view synthesis. U3D includes a geometrical reference network to integrate these priors into the sparse novel view synthesis network and a temporal enhanced sparse view generation network to preserve pretrained temporal knowledge. By leveraging the significant generative priors from video diffusion models, our framework can synthesize highly consistent sparse novel views with strong generalization ability, which can be reconstructed into high-quality 3D assets using feed-forward sparse view reconstruction methods.

4811Proactive Privacy Amnesia for Large Language Models: Safeguarding PII with Negligible Impact on Model Utility

[openreview] [pdf]

Abstract With the rise of large language models (LLMs), increasing research has recognized their risk of leaking personally identifiable information (PII) under malicious attacks. Although efforts have been made to protect PII in LLMs, existing methods struggle to balance privacy protection with maintaining model utility. In this paper, inspired by studies of amnesia in cognitive science, we propose a novel approach, Proactive Privacy Amnesia (PPA), to safeguard PII in LLMs while preserving their utility. This mechanism works by actively identifying and forgetting key memories most closely associated with PII in sequences, followed by a memory implanting using suitable substitute memories to maintain the LLM’s functionality. We conduct evaluations across multiple models to protect common PII, such as phone numbers and physical addresses, against prevalent PII-targeted attacks, demonstrating the superiority of our method compared with other existing defensive techniques. The results show that our PPA method completely eliminates the risk of phone number exposure by 100% and significantly reduces the risk of physical address exposure by 9.8% – 87.6%, all while maintaining comparable model utility performance.

4812FeedSign: Full-parameter Federated Fine-tuning of Large Models with Extremely Low Communication Overhead of One Bit

[openreview] [pdf]

Abstract Federated fine-tuning (FFT) aims to fine-tune a pre-trained model with private data from distributed clients by exchanging models rather than data under the orchestration of a parameter server (PS). However, as large models are acing in almost every machine learning task, the communication overhead and memory demand are surging accordingly, hindering the practical deployment on consumer devices. To overcome the bottleneck forged by the growing communication overhead of federated learning and lower the high memory demand of large model fine-tuning, we propose FeedSign, an FFT algorithm where a client uploads its update model and downloads the global model of any size using exactly 1 bit per step, while the memory demand is squeezed to the amount needed for inference. This is realized by utilizing zeroth-order (ZO) optimizers on large models and shared pseudo-random number generators (PRNG) across devices to split the gradient estimate from the clients to 1) a direction corresponding to a designated random seed and 2) a binary vote from the client indicating whether the seed-corresponding direction grants a local loss descent, which is the only information the clients should convey to the PS. We conduct theoretical analysis on FeedSign and show that it converges at an exponential rate O(et)\mathcal{O}(e^{-t}), where tt is the number of elapsed steps, the same rate as in first-order (FO) methods can attain in big O\mathcal{O} notation. Moreover, it is also found that FeedSign enjoys good robustness against data heterogeneity and Byzantine attacks. We conduct extensive experiments on models across different structures and sizes (11M to 13B) and found that the proposed method performs better or closely, depending on scenarios, compared to its ZO and FO counterparts albeit an orders-of-magnitude lower communication overhead. We also discuss some interesting advantages as byproducts guaranteed by the minimalistic design of FeedSign.

4813PoincareNorm: Rethinking Over-smoothing beyond Dirichlet energy

[openreview] [pdf]

Abstract Dirichlet energy is intuitive and commonly used to measure over-smoothing. However, Dirichlet energy can only capture information about the first-order derivative of features. In light of this, we propose a series of node similarity measures which are the energy of higher-order derivatives of features and generalize Dirichlet energy. After we rigorously analyze the property of proposed measures and its application to establish the sharp decay rate of Dirichlet energy under continuous diffusion or discrete random walk which is closely related to the first nonzero eigenvalue of graph Laplacian. Lastly, to address over-smoothing with respect to these measures, we propose a normalization termed PoincareNorm which generalizes PairNorm to control our proposed measures. We consider the semi-supervised node classification task in the scenario without missing features, PoincareNorm outperforms existing normalization methods.

4814Enhancing Multimodal LLM for Detailed and Accurate Video Captioning using Multi-Round Preference Optimization

[openreview] [pdf]

Abstract Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through directed preference optimization (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimized using DPO. To further improve training, we introduce a novel multi-round DPO (mrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initializing the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilize the process. To address potential catastrophic forgetting of non-captioning abilities due to mrDPO, we propose rebirth tuning, which finetunes the pre-DPO LLM by using the captions generated by the mrDPO-trained model as supervised labels. Experiments show that mrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing global and local error rates by 40% and 20%, respectively, while decreasing the repetition rate by 35%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining competitive performance to the state-of-the-art on widely used video question-answering benchmark among models of similar size. Upon acceptance, we will release the code, model checkpoints, and training and test data. Demos are available athttps://video-salmonn-2.github.io.

4815T-JEPA: Augmentation-Free Self-Supervised Learning for Tabular Data

[openreview] [pdf]

Abstract Self-supervision is often used for pre-training to foster performance on a downstream task by constructing meaningful representations of samples. Self-supervised learning (SSL) generally involves generating different views of the same sample and thus requires data augmentations that are challenging to construct for tabular data. This constitutes one of the main challenges of self-supervision for structured data. In the present work, we propose a novel augmentation-free SSL method for tabular data. Our approach, T-JEPA, relies on a Joint Embedding Predictive Architecture (JEPA) and is akin to mask reconstruction in the latent space. It involves predicting the latent representation of one subset of features from the latent representation of a different subset within the same sample, thereby learning rich representations without augmentations. We use our method as a pre-training technique and train several deep classifiers on the obtained representation. Our experimental results demonstrate a substantial improvement in both classification and regression tasks, outperforming models trained directly on samples in their original data space. Moreover, T-JEPA enables some methods to consistently outperform or match the performance of traditional methods likes Gradient Boosted Decision Trees. To understand why, we extensively characterize the obtained representations and show that T-JEPA effectively identifies relevant features for downstream tasks without access to the labels. Additionally, we introduce regularization tokens, a novel regularization method critical for training of JEPA-based models on structured data.

4816Which Network is Trojaned? Increasing Trojan Evasiveness for Model-Level Detectors

[openreview] [pdf]

Abstract Trojan attacks can pose serious risks by injecting deep neural networks with hidden, adversarial functionality. Recent methods for detecting whether a model is trojaned appear highly successful. However, a concerning and relatively unexplored possibility is that trojaned networks could be made harder to detect. To better understand the scope of this risk, we develop a general method for making trojans more evasive based on several novel techniques and observations. In experiments, we find that our evasive trojans reduce the efficacy of a wide range of detectors across numerous evaluation settings while maintaining high attack success rates. Surprisingly, we also find that our evasive trojans are substantially harder to reverse-engineer despite not being explicitly designed with this attribute in mind. These findings underscore the importance of developing more robust monitoring mechanisms for hidden functionality and clarifying the offense-defense balance of trojan detection.

4817Let the Rule Speak: Enhancing In-context Learning Debiasing with Interpretability

[openreview] [pdf]

Abstract In-context learning, which allows large language models to perform diverse tasks with a few demonstrations, is found to have imbalanced per-class prediction accuracy on multi-class text classification. Although notable output correction methods have been developed to tackle the issue and simultaneously improve downstream prediction accuracy, they may fail to answer the core interpretability challenges: why and which certain classes need corrections, and more importantly, to have an easy-to-understand transformation for correcting each class. To address such interpretability gaps, we first find that the imbalance arises from certain classes consistently receiving high ICL output probabilities, whereas others receiving lower or mixed ranges, so the former is more frequently chosen, resulting in higher accuracy; more crucially, we find that these ranges have significantly varying degrees of influence on the accuracy bias, highlighting the need for precise, interpretable probability corrections by range. Motivated by this, we propose FuRud, a fuzzy rule optimization based debiasing method, that (1) detects which classes need corrections, and (2) for each correction-needed class, detects its probability ranges and applies asymmetric amplifications or reductions to correct them interpretably. Notably, across seven benchmark datasets, FuRud reduces the pairwise class accuracy bias (COBias) by more than half (56%), while achieving a relative increase of 21% in accuracy, outperforming state-of-the-art debiasing methods. Moreover, FuRud can optimize a downstream task in a few-shot manner, with as few as 10 optimization examples. Furthermore, FuRud can work for prompt formats that lead to highly skewed predictions. For example, FuRud greatly improves ICL outputs which use letter options, with 44% relative accuracy increase and 54% relative COBias reduction.

4818DriveE2E: Benchmarking Closed-Loop End-to-End Autonomous Driving Based-on Real-World Traffic Scenarios

[openreview] [pdf]

Abstract End-to-end learning has demonstrated considerable promise in advancing autonomous driving by fully leveraging sensor data. Recently, many end-to-end models have been developed, with a substantial number evaluated using the nuScenes dataset in an open-loop manner. However, open-loop evaluations, which lack interaction with the environment, fail to fully capture the driving capabilities of these models. While closed-loop evaluations, such as those using the CARLA simulator, allow for interaction with the environment, they often rely on rule-based, manually configured traffic scenarios. This approach leads to evaluations that diverge significantly from real-world driving conditions, thus limiting their ability to reflect actual driving performance. To address these limitations, we introduce a novel closed-loop evaluation framework that closely integrates real-world driving scenarios with the CARLA simulator, effectively bridging the gap between simulated environments and real-world driving conditions. Our approach involves the creation of digital twins for 15 real-world intersections and the incorporation of 800 real-world traffic scenarios selected from a comprehensive 100-hour video dataset captured with highly installed infrastructure sensors. These digital twins accurately replicate the physical and environmental characteristics of their real-world counterparts, while the traffic scenarios capture a diverse range of driving behaviors, locations, weather conditions, and times of day. Within this twinned environment, CARLA enables realistic simulations where autonomous agents can dynamically interact with their surroundings. Furthermore, we have established a comprehensive closed-loop benchmark that evaluates end-to-end autonomous driving models across these diverse scenarios. Notably, this is the first closed-loop end-to-end autonomous driving benchmark based on real-world traffic scenarios. Video demos are provided in the supplementary materials.

4819Curvature Diversity-Driven Deformation and Domain Alignment for Point Cloud

[openreview] [pdf]

Abstract Unsupervised Domain Adaptation (UDA) is crucial for reducing the need for extensive manual data annotation when training deep networks on point cloud data. A significant challenge of UDA lies in effectively bridging the domain gap. To tackle this challenge, we propose Curvature Diversity-Driven Nuclear-Norm Wasserstein Domain Alignment (CDND). Our approach first introduces a Curvature Diversity-driven Deformation Reconstruction (CurvRec) task, which effectively mitigates the gap between the source and target domains by enabling the model to extract salient features from semantically rich regions of a given point cloud. We then propose Deformation-based Nuclear-norm Wasserstein Discrepancy (D-NWD), which applies the Nuclear-norm Wasserstein Discrepancy to both deformed and original data samples to align the source and target domains. Furthermore, we contribute a theoretical justification for the effectiveness of D-NWD in distribution alignment and demonstrate that it is generic enough to be applied to any deformations. To validate our method, we conduct extensive experiments on two public domain adaptation datasets for point cloud classification and segmentation tasks. Empirical experiment results show that our CDND achieves state-of-the-art performance by a noticeable margin over existing approaches.

4820Parameter-Efficient Fine-Tuning of State Space Models

[openreview] [pdf]

Abstract Deep State Space Models (SSMs), such as Mamba(Gu & Dao, 2023), have emerged as powerful tools for language modeling, offering high performance with efficient inference and linear scaling in sequence length. However, the application of parameter-efficient fine-tuning (PEFT) methods to SSM-based models remains largely unexplored. This paper aims to systematically study two key questions: (i) How do existing PEFT methods perform on SSM-based models? (ii) Which modules are most effective for fine-tuning? We conduct an empirical benchmark of four basic PEFT methods on SSM-based models. Our findings reveal that prompt-based methods (e.g., prefix-tuning) are no longer effective, an empirical result further supported by theoretical analysis. In contrast, LoRA remains effective for SSM-based models. We further investigate the optimal application of LoRA within these models, demonstrating both theoretically and experimentally that applying LoRA to linear projection matrices without modifying SSM modules yields the best results, as LoRA is not effective at tuning SSM modules. To further improve performance, we introduce LoRA with Selective Dimension tuning (SDLoRA), which selectively updates certain channels and states on SSM modules while applying LoRA to linear projection matrices. Extensive experimental results show that this approach outperforms standard LoRA.

4821INS: Interaction-aware Synthesis to Enhance Offline Multi-agent Reinforcement Learning

[openreview] [pdf]

Abstract Data scarcity in offline multi-agent reinforcement learning (MARL) is a key challenge for real-world applications. Recent advances in offline single-agent reinforcement learning (RL) demonstrate the potential of data synthesis to mitigate this issue. However, in multi-agent systems, interactions between agents introduce additional challenges. These interactions complicate the synthesis of multi-agent datasets, leading to data distortion when inter-agent interactions are neglected. Furthermore, the quality of the synthetic dataset is often constrained by the original dataset. To address these challenges, we proposeINteraction-aware Synthesis (INS), which synthesizes high-quality multi-agent datasets using diffusion models. Recognizing the sparsity of inter-agent interactions, INS employs a sparse attention mechanism to capture these interactions, ensuring that the synthetic dataset reflects the underlying agent dynamics. To overcome the limitation of diffusion models requiring continuous variables, INS implements a bit action module, enabling compatibility with both discrete and continuous action spaces. Additionally, we incorporate a select mechanism to prioritize transitions with higher estimated values, further enhancing the dataset quality. Experimental results across multiple datasets in MPE and SMAC environments demonstrate that INS consistently outperforms existing methods, resulting in improved downstream policy performance and superior dataset metrics. Notably, INS can synthesize high-quality data using only 10% of the original dataset, highlighting its efficiency in data-limited scenarios.

4822Context-Aware Kernel Search for Bayesian Optimization with Large Language Models

[openreview] [pdf]

Abstract The efficiency of Bayesian optimization (BO) relies on careful selection of the surrogate model to balance exploration and exploitation under limited budget. Traditional BO methods often struggle with sub-optimal kernel choices when using Gaussian processes (GPs) as the surrogate model. When the kernel is inadequately chosen, BO may converge slowly or even get stuck at an undesired local minimum. To address such drawback, we propose the novel Context-Aware Kernel Search (CAKES) to automate optimal kernel design in BO with large language models (LLMs). Concretely, CAKES exploits LLMs as crossover and mutation operators to adaptively generate and refine GP kernels based on the observed data. CAKES works entirely in-context and can be easily integrated into existing systems without requiring any fine-tuning. We further present a theoretical analysis demonstrating that our method achieves sub-linear regret relative to the budget for any input dimension. Experimental results demonstrate that CAKES outperforms various salient baseline methods in numerous synthetic and real-world optimization tasks. Notably, CAKES improves the overall performance on benchmark functions by roughly 9%. In hyperparameter tuning tasks, CAKES can effectively leverage fewer data samples to quickly identify high-performing configurations and consistently ranks first across various datasets. As an encouraging real application, we successfully applied CAKES to design photonic chips, achieving significant improvements in key performance indicators while speeding up the design cycle by a factor of ten compared to the baselines. Our code is accessible athttps://github.com/cakes4bo/cakes.

4823Learning High-dimensional Gaussian Mixture Models via a Fourier Approach

[openreview] [pdf]

Abstract In this paper, we address the challenge of learning high-dimensional Gaussian mixture models (GMMs), with a specific focus on estimating both the model order and the mixing distribution from i.i.d. samples. We propose a novel algorithm that achieves linear complexity relative to the sample size nn, significantly improving computational efficiency. Unlike traditional methods, such as the method of moments or maximum likelihood estimation, our algorithm leverages Fourier measurements from the samples, facilitating simultaneous estimation of both the model order and the mixing distribution. The difficulty of the learning problem can be quantified by the minimum separation distance Δ\Delta and minimal mixing weight wminw_{\min}. For stable estimation, a sample size of Ω(1wmin2Δ4K4)\Omega\left(\frac{1}{w_{\min}^2 \Delta^{4K-4}}\right) is required for the model order, while Ω(1wmin2Δ4K2)\Omega\left(\frac{1}{w_{\min}^2 \Delta^{4K-2}}\right) is necessary for the mixing distribution. This highlights the distinct sample complexities for the two tasks. For DD-dimensional mixture models, we propose a PCA-based approach to reduce the dimension, reducing the algorithm’s complexity to O(nD2)O(nD^2), with potential further reductions through random projections. Numerical experiments demonstrate the efficiency and accuracy compared with the EM algorithm. In particular, we observe a clear phase transition in determining the model order, as our method outperforms traditional information criteria. Additionally, our framework is flexible and can be extended to learning mixtures of other distributions, such as Cauchy or exponential distributions.

4824Conflict-Aware Adversarial Training

[openreview] [pdf]

Abstract Adversarial training is the most effective method to obtain adversarial robustness for deep neural networks by directly involving adversarial samples in the training procedure. To obtain an accurate and robust model, the weighted-average method is applied to optimize standard loss and adversarial loss simultaneously. In this paper, we argue that the weighted-average method does not provide the best tradeoff for the standard performance and adversarial robustness. We argue that the failure of the weighted-average method is due to the conflict between the gradients derived from standard and adversarial loss, and further demonstrate such a conflict increases with attack budget theoretically and practically. To alleviate this problem, we propose a new trade-off paradigm for adversarial training with a conflict-aware factor for the convex combination of standard and adversarial loss, named \textbf{Conflict-Aware Adversarial Training~(CA-AT)}. Comprehensive experimental results show that CA-AT consistently offers a superior trade-off between standard performance and adversarial robustness under the settings of adversarial training from scratch and parameter-efficient finetuning.

4825Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts

[openreview] [pdf]

Abstract Time series forecasting plays a vital role for decision-making across a wide range of real-world domains, which has been extensively studied. Most existing single-modal models rely solely on numerical series, which suffer from the limitations imposed by insufficient information. Recent studies have revealed that multimodal models can address the core issue by integrating textual information. However, these models focus on either historical or future textual information, overlooking the unique contributions each plays in time series forecasting. Besides, these models fail to grasp the intricate relationships between textual and time series data, constrained by their moderate capacity for multimodal comprehension. To tackle these challenges, we propose Dual-Forecaster, a pioneering multimodal time series model that combines both descriptively historical textual information and predictive textual insights, leveraging advanced multimodal comprehension capability. We begin by developing the historical text-time series contrastive loss to align the descriptively historical textual data and corresponding time series data, followed by encoding multimodal text-time series representations between them through the history-oriented modality interaction module, and then combining predictive textual data through the future-oriented modality interaction module to ensure textual insights-following forecasting. Our comprehensive evaluations on synthetic dataset and captioned-public datasets demonstrate that Dual-Forecaster is a distinctly effective multimodal time series model that outperforms or is comparable to other state-of-the-art models, highlighting the superiority of integrating textual information for time series forecasting. This work opens new avenues in the integration of textual information with numerical time series data for multimodal time series analysis.

4826Improving Soft Unification with Knowledge Graph Embedding Methods

[openreview] [pdf]

Abstract Neural Theorem Provers (NTPs) present a promising framework for neuro-symbolic reasoning, combining end-to-end differentiability with the interpretability of symbolic logic programming. However, optimizing NTPs remains a significant challenge due to their complex objective landscape and gradient sparcity. On the other hand, Knowledge Graph Embedding (KGE) methods offer smooth optimization with well-defined learning objectives but often lack interpretability. In this work, we propose several strategies to integrate the strengths of NTPs and KGEs. By incorporating KGE objectives into the NTP framework, we demonstrate substantial improvements in both accuracy and computational efficiency.

4827Multi-Label Test-Time Adaptation with Bound Entropy Minimization

[openreview] [pdf]

Abstract Mainstream test-time adaptation (TTA) techniques endeavor to mitigate distribution shifts via entropy minimization for multi-class classification, inherently increasing the probability of the most confident class. However, when encountering multi-label instances, the primary challenge stems from the varying number of labels per image, and prioritizing only the highest probability class inevitably undermines the adaptation of other positive labels. To address this issue, we investigate TTA within multi-label scenario (ML--TTA), developing Bound Entropy Minimization (BEM) objective to simultaneously increase the confidence of multiple top predicted labels. Specifically, to determine the number of labels for each augmented view, we retrieve a paired caption with yielded textual labels for that view. These labels are allocated to both the view and caption, called weak label set and strong label set with the same size k. Following this, the proposed BEM considers the highest top-k predicted labels from view and caption as a single entity, respectively, learning both view and caption prompts concurrently. By binding top-k predicted labels, BEM overcomes the limitation of vanilla entropy minimization, which exclusively optimizes the most confident class. Across the MSCOCO, VOC, and NUSWIDE multi-label datasets, our ML--TTA framework equipped with BEM exhibits superior performance compared to the latest SOTA methods, across various model architectures, prompt initialization, and varying label scenarios. The code is available athttps://anonymous.4open.science/r/ML-TTA-10BE.

4828Exposing the Achilles’ Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning

[openreview] [pdf]

Abstract Large Language Models (LLMs) have significantly impacted the field of Math Word Problems (MWPs), transforming how these problems are approached and solved, particularly in educational contexts. However, existing evaluations often focus on final accuracy, neglecting the critical aspect of reasoning capabilities. This work addresses that gap by evaluating LLMs’ abilities to detect and correct reasoning mistakes. We present a novel dataset, MWP-MISTAKE, containing MWPs with both correct and incorrect reasoning steps generated through rule-based methods and smaller language models. Our comprehensive benchmarking of state-of-the-art models such as GPT-4o and GPT4 uncovers important insights into their strengths and limitations. While GPT-4o excels in mistake detection and rectification, gaps remain, particularly in handling complex datasets and novel problems. Additionally, we identify concerns with data contamination and memorization, which affect LLM reliability in real-world applications. While OpenAI’s O1 model demonstrates 90% accuracy in reasoning and final answers on complex tasks, it still remains weak in mistake detection. Our findings highlight the need for improved reasoning evaluations and suggest ways to enhance LLM generalization and robustness in math problem-solving.

4829Correlation and Navigation in the Vocabulary Key Representation Space of Language Models

[openreview] [pdf]

Abstract Language model (LM) decoding is based on the next-token prediction (NTP) probability distribution. For neural LMs (e.g., Transformer-based), NTP distribution is essentially a softmax-regularized dot product between an encoded input context (query) and fixed vocabulary representations (keys). In this paper, we study the effect of the key distribution on the NTP distribution, with a focus on whether the similarity between keys will trigger spurious correlations in NTP. Through knowledge-probing tasks, we show that in the NTP distribution, the few top-ranked tokens are typically accurate. However, the middle-ranked prediction is highly biased towards the tokens that are distributionally (not necessarily semantically) similar to these top ones. For instance, if “P” is predicted as the top-1 token, “A”-“Z” will all be ranked high in NTP, no matter whether they can lead to correct decoding results. This hurts the sampling diversity and makes the sampling of correct, long-tail results hopeless and noisy. We attempt to alleviate this issue via a novel in-context method that iteratively pushes the query representation away from explored regions. Specifically, we include the explored decoding results in the context and prompt the LM to generate something else, which encourages the LM to produce a query representation that has small dot products with explored keys. Experiments on knowledge-probing tasks show that our method leads to efficient navigation away from explored keys to correct new keys. We further extend our method to open-ended and chain-of-thought (for reasoning) generation. Experiment results show that ICN contributes to better generation diversity and improved self-consistency voting performance. Finally, we discuss potential training issues caused by the fixed key space together with the challenges and possible ways to address them in future research.

4830Energy-based Backdoor Defense Against Federated Graph Learning

[openreview] [pdf]

Abstract Federated Graph Learning is rapidly evolving as a privacy-preserving collaborative approach. However, backdoor attacks are increasingly undermining federated systems by injecting carefully designed triggers that lead to the model making incorrect predictions. Trigger structures and injection locations in Federated Graph Learning are more diverse, making traditional federated defense methods less effective. In our work, we propose an effective Federated Graph Backdoor Defense using Topological Graph Energy (FedTGE). At the local client level, it injects distribution knowledge into the local model, assigning low energy to benign samples and high energy to the constructed malicious substitutes, and selects benign clients through clustering. At the global server level, the energy elements uploaded by each client are treated as new nodes to construct a global energy graph for energy propagation, making the selected clients’ energy elements more similar and further adjusting the aggregation weights. Our method can handle high data heterogeneity, does not require a validation dataset, and is effective under both small and large malicious proportions. Extensive results on various settings of federated graph scenarios under backdoor attacks validate the effectiveness of this approach.

4831Learning Interpretable Hierarchical Dynamical Systems Models from Time Series Data

[openreview] [pdf]

Abstract In science, we are often interested in obtaining a generative model of the underlying system dynamics from observed time series. While powerful methods for dynamical systems reconstruction (DSR) exist when data come from a single domain, how to best integrate data from multiple dynamical regimes and leverage it for generalization is still an open question. This becomes particularly important when individual time series are short, and group-level information may help to fill in for gaps in single-domain data. At the same time, averaging is not an option in DSR, as it will wipe out crucial dynamical properties (e.g., limit cycles in one domain vs. chaos in another). Hence, a framework is needed that enables to efficiently harvest group-level (multi-domain) information while retaining all single-domain dynamical characteristics. Here we provide such a hierarchical approach and showcase it on popular DSR benchmarks, as well as on neuroscientific and medical time series. In addition to faithful reconstruction of all individual dynamical regimes, our unsupervised methodology discovers common low-dimensional feature spaces in which datasets with similar dynamics cluster. The features spanning these spaces were further dynamically highly interpretable, surprisingly in often linear relation to control parameters that govern the dynamics of the underlying system. Finally, we illustrate transfer learning and generalization to new parameter regimes.

4832Tight Time Complexities in Parallel Stochastic Optimization with Arbitrary Computation Dynamics

[openreview] [pdf]

Abstract In distributed stochastic optimization, where parallel and asynchronous methods are employed, we establish optimal time complexities under virtually any computation behavior of workers/devices/CPUs/GPUs, capturing potential disconnections due to hardware and network delays, time-varying computation powers, and any possible fluctuations and trends of computation speeds. These real-world scenarios are formalized by our new universal computation model. Leveraging this model and new proof techniques, we discover tight lower bounds that apply to virtually all synchronous and asynchronous methods, including Minibatch SGD, Asynchronous SGD (Recht et al., 2011), and Picky SGD (Cohen et al., 2021). We show that these lower bounds, up to constant factors, are matched by the optimal Rennala SGD and Malenia SGD methods (Tyurin & Richtárik, 2023).

4833Model Growth Schedule learning via Optimal Path (SLOP) for Efficient LLM Pre-Training

[openreview] [pdf]

Abstract Existing training methods for Transformer-based large language models (LLMs) rely on massive amounts of data training from scratch, which requires a high cost in terms of compute and time. Recent studies have demonstrated the great potential of improving the LLM’s training efficiency by growing from small pre-trained models to large ones—a technique known as model growth. There are two main research problems associated with model growth: growth schedule and growth operators. Existing research focuses on growth operators, detailing specific manipulations of potential dimensions to expand Transformer parameters. Few studies have investigated the optimal growth schedule, which involves integrating all possible growth operators to create an optimal multi-staged growth path. This work introduces SLOP, a growth Schedule Learning methodology via Optimal Path, for multi-stage growth of models with minimal experimental training. SLOP utilizes marginal utility as an appropriate measure for an optimal schedule that balances training costs and model performance after multi-stage growth. With this measurement, the objective of determining the optimal model growth path is converted into a dynamic programming problem, which is then addressed mathematically in polynomial time. Empirical results demonstrate SLOP’s theoretical validity and show that it is an efficient approach that outperforms alternative schedules in a variety of settings.

4834Discrimination-free Insurance Pricing with Privatized Sensitive Attributes

[openreview] [pdf]

Abstract Fairness has emerged as a critical consideration in the landscape of machine learning algorithms, particularly as AI continues to transform decision-making across societal domains. To ensure that these algorithms are free from bias and do not discriminate against individuals based on sensitive attributes such as gender and race, the field of algorithmic bias has introduced various fairness concepts, including demographic parity and equalized odds, along with methodologies to achieve these notions in different contexts. Despite the rapid advancement in this field, not all sectors have embraced these fairness principles to the same extent. One specific sector that merits attention in this regard is insurance. Within the realm of insurance pricing, fairness is defined through a distinct and specialized framework. Consequently, achieving fairness according to established notions does not automatically ensure fair pricing in insurance. In particular, regulators are increasingly emphasizing transparency in pricing algorithms and imposing constraints on insurance companies on the collection and utilization of sensitive consumer attributes. These factors present additional challenges in the implementation of fairness in pricing algorithms. To address these complexities and comply with regulatory demands, we propose an efficient method for constructing fair models that align with the specific fairness criteria unique to the insurance pricing domain. Notably, our approach only relies on privatized sensitive attributes and offers statistical guarantees. Further, it does not require insurers to have direct access to sensitive attributes, and it can be tailored to accommodate varying levels of transparency as required. This methodology seeks to meet the growing demands for privacy and transparency from regulators while ensuring fairness in insurance pricing practices.

4835ME-LORA: MEMORY-EFFICIENT BAYESIAN LOW- RANK ADAPTATION FOR LARGE LANGUAGE MODELS

[openreview] [pdf]

Abstract Bayesian Low-Rank Adaptation (LoRA) has shown excellent performance in reducing the overconfidence of inference by large language models as it can accurately quantify the inference uncertainty. However, the general Bayesian LoRA technique requires huge memory as it fine-tunes three low-rank matrices with large size: two matrices have size of n×rn\times r and the other has size of r×mr\times m, where rr denotes rank, and n,mn, m denote the size of input and output, respectively. The large amount of memory required by this technique precludes its practical applications especially for the cases with long input or output. Here, we propose a memory efficient Bayesian LoRA technique (called Me-LoRA) that needs only two low-rank matrices plus two small matrices with size of only r×rr\times r. The key idea of our approach is that we introduce a small matrix (with size r×rr\times r) to describe the variance estimates required by Bayesian LoRA, which is calculated through sampling two other samll matrices. Compared with the general Bayesian LoRA technique, our approach reduces the memory requirement by nearly 13\frac{1}{3} as the rank rr is generally very small. Experimental results using both LlaMA-7B and LlaMA-13B models on representative data sets suggest that our approach achieves the same performance as the original Bayesian LoRA techniques and outperforms the existing approaches. In summary, the memory-efficient Bayesian LoRA presented in this study circumvents the challenge of high memory requirement and thus paves a new way to the practical applications of Bayesian LoRA in the cases with larger input and output size.

4836ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains

[openreview] [pdf]

Abstract Large language models (LLMs) have brought significant changes to many aspects of our lives. However, assessing and ensuring their chronological knowledge remains challenging. Existing approaches fall short in addressing the accumulative nature of knowledge, often relying on a single time stamp. To overcome this, we introduce ChroKnowBench, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, temporal state. Our benchmark distinguishes between knowledge that evolves (e.g., scientific discoveries, amended laws) and knowledge that remain constant (e.g., mathematical truths, commonsense facts). Building on this benchmark, we present ChroKnowledge (Chronological Categorization of Knowledge), a novel sampling-based framework for evaluating and updating LLMs’ non-parametric chronological knowledge. Our evaluation led to the following observations: (1) The ability of eliciting temporal knowledge varies depending on the data format that model was trained on. (2) LLMs partially recall knowledge or show a cut-off at temporal boundaries rather than recalling all aspects of knowledge correctly. Thus, we apply our ChroKnowPrompt, an in-depth prompting to elicit chronological knowledge by traversing step-by-step through the surrounding time spans. We observe that our framework successfully updates the overall knowledge across the entire timeline in both the biomedical domain (+11.9%) and the general domain (+2.8%), highlighting its positive effect in refining temporal knowledge. This non-parametric approach also enables knowledge updates not only in open-source models but also in proprietary LLMs, ensuring comprehensive applicability across model types. We perform a comprehensive analysis based on temporal characteristics of ChroKnowPrompt and validate the potential of various models to elicit intrinsic temporal knowledge through our method.

4837Integrating Distributed Acoustic Sensing and PINN Frameworks for Enhanced Indoor Sound Source Localization

[openreview] [pdf]

Abstract Distributed Acoustic Sensing (DAS) is an emerging technology that transforms standard optical fibers into dense arrays of acoustic sensors, offering unprecedented opportunities for smart city applications, indoor monitoring of human activity, and surveillance without compromising privacy. In this paper, we integrate DAS with Physics-Informed Neural Networks (PINNs) for indoor sound source localization. By embedding the acoustic wave equation and impedance boundary conditions into the neural network architecture, we exploit physical laws to guide the learning process, improving accuracy and generalization. We propose two strategies for real-time sound source localization using DAS data. The first strategy involves training the PINN on all available data simultaneously, while the second strategy incrementally feeds data over time, simulating real-time data acquisition. Using real indoor DAS measurements, we demonstrate the effectiveness of our approach in deciphering complex room acoustics and accurately inferring sound source locations under both strategies. Our framework provides a novel solution for real-time indoor positioning and human activity surveillance, offering significant advantages over traditional camera-based systems by preserving individual privacy.

4838Modeling dynamic social vision highlights gaps between deep learning and humans

[openreview] [pdf]

Abstract Deep learning models trained on computer vision tasks are widely considered the most successful models of human vision to date. The majority of work that supports this idea evaluates how accurately these models predict behavior and brain responses to static images of objects and scenes. Real-world vision, however, is highly dynamic, and far less work has evaluated deep learning models on human responses to moving stimuli, especially those that involve more complicated, higher-order phenomena like social interactions. Here, we extend a dataset of natural videos depicting complex multi-agent interactions by collecting human-annotated sentence captions for each video, and we benchmark 350+ image, video, and language models on behavior and neural responses to the videos. As in prior work, we find that many vision models reach the noise ceiling in predicting visual scene features and responses along the ventral visual stream (often considered the primary neural substrate of object and scene recognition). In contrast, vision models poorly predict human action and social interaction ratings and neural responses in the lateral stream (a neural pathway theorized to specialize in dynamic, social vision), though video models show a striking advantage in predicting mid-level lateral stream regions. Language models (given human sentence captions of the videos) predict action and social ratings better than image and video models, but perform poorly at predicting neural responses in the lateral stream. Together, these results identify a major gap in AI’s ability to match human social vision and provide insights to guide future model development for dynamic, natural contexts.

4839Safe Bayesian Optimization for Complex Control Systems via Additive Gaussian Processes

[openreview] [pdf]

Abstract Controller tuning and optimization have been among the most fundamental problems in robotics and mechatronic systems. The traditional methodology is usually model-based, but its performance heavily relies on an accurate mathematical system model. In control applications with complex dynamics, obtaining a precise model is often challenging, leading us towards a data-driven approach. While various researchers have explored the optimization of a single controller, it remains a challenge to obtain the optimal controller parameters safely and efficiently when multiple controllers are involved. In this paper, we propose SafeCtrlBO to optimize multiple controllers simultaneously and safely. We simplify the exploration process in safe Bayesian optimization, reducing computational effort without sacrificing expansion capability. Additionally, we use additive kernels to enhance the efficiency of Gaussian process updates for unknown functions. Hardware experimental results on a permanent magnet synchronous motor (PMSM) demonstrate that compared to existing safe Bayesian optimization algorithms, SafeCtrlBO can obtain optimal parameters more efficiently while ensuring safety.

4840Diffusion2: Dynamic 3D Content Generation via Score Composition of Video and Multi-view Diffusion Models

[openreview] [pdf]

Abstract Recent advancements in 3D generation are predominantly propelled by improvements in 3D-aware image diffusion models. These models are pretrained on Internet-scale image data and fine-tuned on massive 3D data, offering the capability of producing highly consistent multi-view images. However, due to the scarcity of synchronized multi-view video data, it remains challenging to adapt this paradigm to 4D generation directly. Despite that, the available video and 3D data are adequate for training video and multi-view diffusion models separately that can provide satisfactory dynamic and geometric priors respectively. To take advantage of both, this paper presents Diffusion2^2, a novel framework for dynamic 3D content creation that reconciles the knowledge about geometric consistency and temporal smoothness from these models to directly sample dense multi-view multi-frame images which can be employed to optimize continuous 4D representation. Specifically, we design a simple yet effective denoising strategy via score composition of pretrained video and multi-view diffusion models based on the probability structure of the target image array. To alleviate the potential conflicts between two heterogeneous scores, we further introduce variance-reducing sampling via interpolated steps, facilitating smooth and stable generation. Owing to the high parallelism of the proposed image generation process and the efficiency of the modern 4D reconstruction pipeline, our framework can generate 4D content within few minutes. Notably, our method circumvents the reliance on expensive and hard-to-scale 4D data, thereby having the potential to benefit from the scaling of the foundation video and multi-view diffusion models. Extensive experiments demonstrate the efficacy of our proposed framework in generating highly seamless and consistent 4D assets under various types of conditions.

4841ML4MILP: A Benchmark Dataset for Machine Learning-based Mixed-Integer Linear Programming

[openreview] [pdf]

Abstract Machine learning (ML)-based approaches for solving mixed integer linear programming (MILP) problems have shown significant potential and are growing in sophistication. Despite this advancement, progress in this field is often hindered by the mixed and unsorted nature of current benchmark datasets, which typically lack carefully categorized collections of homogeneous instances. To bridge this gap, we propose ML4MILP, the premier open-source benchmark dataset specifically designed for evaluating ML-based optimization algorithms in the MILP domain. Based on the proposed structure and embedding similarity metrics, we used a novel classification algorithm to carefully categorize the collected and generated instances, resulting in a benchmark dataset encompassing 100,000 instances across more than 70 heterogeneous classes. We demonstrate the utility of ML4MILP through extensive benchmarking against a comprehensive suite of algorithms in the baseline library, consisting of traditional exact solvers and heuristic algorithms, as well as ML-based approaches. Our ML4MILP is open-source and accessible at:https://anonymous.4open.science/r/ML4MILP-6BE0.

4842Fitting Networks with a Cancellation Trick

[openreview] [pdf]

Abstract The degree-corrected block model (DCBM), latent space model (LSM), and β\beta-model are all popular network models. We combine their modeling ideas and propose the logit-DCBM as a new model. Similar as the β\beta-model and LSM, the logit-DCBM contains nonlinear factors, where fitting the parameters is a challenging open problem. We resolve this problem by introducing a cancellation trick. We also propose R-SCORE as a recursive community detection algorithm, where in each iteration, we first use the idea above to update our parameter estimation, and then use the results to remove the nonlinear factors in the logit-DCBM so the renormalized model approximately satisfies a low-rank model, just like the DCBM. Our numerical study suggests that R-SCORE significantly improves over existing spectral approaches in many cases. Also, theoretically, we show that the Hamming error rate of R-SCORE is faster than that of SCORE in a specific sparse region, and is at least as fast outside this region.

4843MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking

[openreview] [pdf]

Abstract Although end-to-end multi-object trackers like MOTR enjoy the merits of simplicity, they suffer from the conflict between detection and association, resulting in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it demands an additional detector. In this work, we serve as the first to reveal this conflict arises from unfair label assignment between detect and track queries, where detect queries are responsible for recognizing newly appearing targets and track queries are to associate them in following frames. Based on this observation, we propose MOTRv3, which balances the label assignment using the proposed release-fetch supervision strategy. In this strategy, labels are first released for detection and gradually fetched back for association. Besides, another two strategies named pseudo label distillation and track group denoising are designed to further strengthen the supervision for detection and association. Without extra detector during inference, MOTRv3 achieves impressive performance across diverse benchmarks, showing scaling up capability.

4844Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models

[openreview] [pdf]

Abstract Multimodal Large Language Models (LLMs) are pivotal in revolutionizing customer support and operations by integrating multiple modalities such as text, images, and audio. Federated Prompt Learning (FPL) is a recently proposed approach that combines pre-trained multimodal LLMs such as vision-language models with federated learning to create personalized, privacy-preserving AI systems. However, balancing the competing goals of personalization, generalization, and privacy remains a significant challenge. Over-personalization can lead to overfitting, reducing generalizability, while stringent privacy measures, such as differential privacy, can hinder both personalization and generalization. In this paper, we propose a Differentially Private Federated Prompt Learning (DP-FPL) approach to tackle this challenge by leveraging a low-rank adaptation scheme to capture generalization while maintaining a residual term that preserves expressiveness for personalization. To ensure privacy, we introduce a novel method where we apply local differential privacy to the two low-rank components of the local prompt, and global differential privacy to the global prompt. Our approach mitigates the impact of privacy noise on the model performance while balancing the tradeoff between personalization and generalization. Extensive experiments demonstrate the effectiveness of our approach over other benchmarks.

4845Identifiability Guarantees For Time Series Representation via Contrastive Sparsity-inducing

[openreview] [pdf]

Abstract Time series representations learned from high-dimensional data are generally expected to be more robust and better at generalizing to new and potentially out-of-distribution scenarios. However, this is not always the case, as variations in unseen data or prior assumptions may insufficiently constrain the posterior probability distribution, resulting in ill-defined representations and consequently weaker predictions. While disentangled representations for time series are often said to be beneficial for generalizing downstream tasks, the current empirical and theoretical understanding remains limited. In this work, we provide new results on identifiability that guarantee disentangled representations via Contrastive Sparsity-inducing Learning, which improves generalization and interpretability. Motivated by this result, we propose the TimeCSL framework to learn a disentangled representation that generalizes and maintains compositionality. We conduct a large-scale study on time series source separation, investigating whether sufficiently disentangled representations enhance the ability to generalize to source combinations in downstream tasks for both training data and unseen combinations in the testing distribution. Our results show that sufficient identifiability in time series representations leads to improved performance under shifted distributions.

4846A Large-scale Dataset with Behavior, Attributes, and Content of Mobile Short-video Platform

[openreview] [pdf]

Abstract Short-video platforms show an increasing impact on people’s daily life nowadays, with billions of active users spending plenty of time each day. The interactions between users and online platforms give rise to many scientific problems across computational social science and artificial intelligence. However, despite the rapid development of short-video platforms, currently there are serious shortcomings in existing relevant datasets on three aspects: inadequate user-video feedback, limited user attributes and lack of video content. To address these problems, we provide a large-scale dataset with rich user behavior, attributes and video content from a real mobile short-video platform. This dataset covers 10,000 voluntary users and 153,561 videos, and we conduct three-fold technical validations of the dataset. First, we verify the richness of the behavior data including interaction frequency and feedback distribution. Second, we validate the wide coverage of user-side and video-side attribute data. Third, we confirm the representing ability of the content features. We believe the dataset could support the broad research community, including user modeling, social science, human behavior understanding, etc. Our dataset is available at this anonymous link:http://101.6.70.16:8080/.

4847Demystifying the Token Dynamics of Deep Selective State Space Models

[openreview] [pdf]

Abstract Selective state space models (SSM), such as Mamba, have gained prominence for their effectiveness in modeling sequential data. Despite their outstanding empirical performance, a comprehensive theoretical understanding of deep selective SSM remains elusive, hindering their further development and adoption for applications that need high fidelity. In this paper, we investigate the dynamical properties of tokens in a pre-trained Mamba model. In particular, we derive the dynamical system governing the continuous-time limit of the Mamba model and characterize the asymptotic behavior of its solutions. In the one-dimensional case, we prove that only one of the following two scenarios happens: either all tokens converge to zero, or all tokens diverge to infinity. We provide criteria based on model parameters to determine when each scenario occurs. For the convergent scenario, we empirically verify that this scenario negatively impacts the model’s performance. For the divergent scenario, we prove that different tokens will diverge to infinity at different rates, thereby contributing unequally to the updates during model training. Based on these investigations, we propose two refinements for the model: excluding the convergent scenario and reordering tokens based on their importance scores, both aimed at improving practical performance. Our experimental results validate these refinements, offering insights into enhancing Mamba’s effectiveness in real-world applications.

4848OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?

[openreview] [pdf]

Abstract Large language models (LLMs) are driving substantial advancements in software engineering, with successful applications like Copilot and Cursor transforming real-world development practices. However, current research predominantly focuses on the early stages of development, such as code generation, while overlooking the post-development phases that are crucial to user experience. To explore the potential of LLMs in this direction, we propose OpenRCA, a benchmark dataset and evaluation framework for assessing LLMs’ ability to identify the root cause of software failures. OpenRCA includes 335 failures from three enterprise software systems, along with over 68 GB of telemetry data (logs, metrics, and traces). Given a failure case and its associated telemetry, the LLM is tasked to identify the root cause that triggered the failure, requiring comprehension of software dependencies and reasoning over heterogeneous, long-context telemetry data. Our results show substantial room for improvement, as current models can only handle the simplest cases. Even with the specially designed RCA-agent, the best-performing model, Claude 3.5, solved only 11.34% failure cases. Our work paves the way for future research in this direction.

4849LLMs Can Achieve High-quality Simultaneous Machine Translation as Efficiently as Offline

[openreview] [pdf]

Abstract When the complete source sentence is provided, Large Language Models (LLMs) perform excellently in offline machine translation even with a simple prompt"Translate the following sentence from [src lang] into [tgt lang]:". However, in many real scenarios, the source tokens arrive in a streaming manner and simultaneous machine translation (SiMT) is required, then theefficiencyandperformanceof decoder-only LLMs are significantly limited by their auto-regressive nature. To enable LLMs to achieve high-quality SiMT as efficiently as offline translation, we propose a novel paradigm that includes constructing supervised fine-tuning (SFT) data for SiMT, along with new training and inference strategies. To replicate the token input/output (I/O) stream in SiMT, the source and target tokens are rearranged into an interleaved sequence, separated by special tokens according to varying latency requirements. This enables powerful LLMs to learn read and write operations adaptively, based on varying latency prompts, while still maintaining efficient auto-regressive decoding. Experimental results demonstrate that, even with limited SFT data, our approach achieves state-of-the-art performance across various simultaneous translation benchmarks and different evaluation metrics, and preserves the original capabilities of offline translation. Moreover, EAST generalizes well to document-level SiMT without requiring specific fine-tuning, even beyond the offline translation model.

4850GBO: A Multi-Granularity Optimization Algorithm via Granular-ball for Continuous Problems

[openreview] [pdf]

Abstract Optimization problems aim to find the optimal solution, which is becoming increasingly complex and difficult to solve. Traditional evolutionary optimization methods always overlook the granular characteristics of solution space. In the real scenario of numerous optimization, the solution space is typically partitioned into sub-regions characterized by varying degree distributions. These sub-regions present different granularity characteristics at search potential and difficulty. Considering the granular characteristics of the solution space, the number of coarse-grained regions is smaller than the number of points, so the calculation is more efficient. On the other hand, coarse-grained characteristics are not easily affected by fine-grained sample points, so the calculation is more robust. To this end, this paper proposes a new multi-granularity evolutionary optimization method, namely Granular-ball Optimization (GBO) algorithm, which characterizes and searches the solution space from coarse to fine. Specifically, using granular-balls instead of traditional points for optimization increases the diversity and robustness of the random search process. At the same time, the search range in different iteration processes is limited by the radius of granular-balls, covering the solution space from large to small. And the mechanism of granular-ball splitting is applied to continuously split and evolve the large granular-balls into smaller for refining the solution space. Extensive experiments on commonly used benchmarks have shown that GBO outperforms popular and advanced evolutionary algorithms. The code is available in the Supplementary Materials.

4851Solving Continual Offline RL through Selective Weights Activation on Aligned Spaces

[openreview] [pdf]

Abstract Continual offline reinforcement learning (CORL) has shown impressive ability in diffusion-based lifelong learning systems by modeling the joint distributions of trajectories. However, most research only focuses on limited continual task settings where the tasks have the same observation and action space, which deviates from the realistic demands of training agents in various environments. In view of this, we propose Vector-Quantized Continual Diffuser, named VQ-CD, to break the barrier of different spaces between various tasks. Specifically, our method contains two complementary sections, where the quantization spaces alignment provides a unified basis for the selective weights activation. In the quantized spaces alignment, we leverage vector quantization to align the different state and action spaces of various tasks, facilitating continual training in the same space. Then, we propose to leverage a unified diffusion model attached by the inverse dynamic model to master all tasks by selectively activating different weights according to the task-related sparse masks. Finally, we conduct extensive experiments on 15 continual learning (CL) tasks, including conventional CL task settings (identical state and action spaces) and general CL task settings (various state and action spaces). Compared with 16 baselines, our method reaches the SOTA performance.

4852A New Look at Low-Rank Recurrent Neural Networks

[openreview] [pdf]

Abstract Low-rank recurrent neural networks (RNNs) have recently gained prominence as a framework for understanding how neural systems solve complex cognitive tasks. However, training and interpreting these networks remains an important open problem. Here we address these challenges by adopting a view of low-rank RNNs as parametrizing a low-dimensional ordinary differential equation (ODE) using a set of nonlinear basis functions. This perspective, which arises from an approach known as the ``neural engineering framework’', reveals that low-rank RNNs are equivalent to neural ODEs with a single hidden layer. We show that training a low-rank RNN to implement a particular dynamical system can thus be formalized as least-squares regression in a random basis. This allows us to propose a new method for finding the smallest RNN capable of implementing a dynamical system using a variant of orthogonal matching pursuit. More generally, our perspective clarifies limits on the expressivity of low-rank RNNs, such as the fact that without inputs, a low-rank RNN with sigmoidal nonlinearity can only implement odd-symmetric functions. We delve further into the role of inputs in shaping network dynamics and show that RNNs can produce identical trajectories using a wide variety of static or time-varying dynamics; this highlights the importance of perturbations for inferring dynamics from observed neural trajectories. Finally, we highlight the usefulness of our framework by comparing to RNNs trained using backprop-through-time on neuroscience-inspired tasks, showcasing that our method achieves faster and more accurate learning with smaller networks than gradient-based training.

4853Gnothi Seauton: Empowering Faithful Self-Interpretability in Black-Box Models

[openreview] [pdf]

Abstract The debate between self-interpretable models and post-hoc explanations for black-box models is central to Explainable AI (XAI). Self-interpretable models, such as concept-based networks, offer insights by connecting decisions to human-understandable concepts but often struggle with performance and scalability. Conversely, post-hoc methods like Shapley values, while theoretically robust, are computationally expensive and resource-intensive. To bridge the gap between these two lines of research, we propose a novel method that combines their strengths, providing theoretically guaranteed self-interpretability for black-box models without compromising prediction accuracy. Specifically, we introduce a parameter-efficient pipeline, AutoGnothi, which integrates a small side network into the black-box model, allowing it to generate Shapley value explanations without changing the original network parameters. This side-tuning approach significantly reduces memory, training, and inference costs, outperforming traditional parameter-efficient methods, where full fine-tuning serves as the optimal baseline. AutoGnothi enables the black-box model to predict and explain its predictions with minimal overhead. Extensive experiments show that AutoGnothi offers accurate explanations for both vision and language tasks, delivering superior computational efficiency with comparable interpretability.

4854An Empirical Study of Deep Reinforcement Learning in Continuing Tasks

[openreview] [pdf]

Abstract In reinforcement learning (RL), continuing tasks refer to tasks where the agent-environment interaction is ongoing and can not be broken down into episodes. These tasks are suitable when environment resets are unavailable, agent-controlled, or predefined but where all rewards—including those beyond resets—are critical. These scenarios frequently occur in real-world applications and can not be modeled by episodic tasks. While modern deep RL algorithms have been extensively studied and well understood in episodic tasks, their behavior in continuing tasks remains underexplored. To address this gap, we provide an empirical study of several well-known deep RL algorithms using a suite of continuing task testbeds based on Mujoco and Atari environments, highlighting several key insights concerning continuing tasks. Using these testbeds, we also investigate the effectiveness of a method for improving temporal-difference-based reinforcement learning (RL) algorithms in continuing tasks by centering rewards, as introduced by \citet{naik2024reward}. While their work primarily focused on this method in conjunction with Q-learning, our results extend their findings by demonstrating that this method is effective across a broader range of algorithms, scales to larger tasks, and outperforms two other reward-centering approaches.

4855HyperINF: Unleashing the HyperPower of the Schulz’s Method for Data Influence Estimation

[openreview] [pdf]

Abstract Influence function provides a principled method to assess the contribution of individual training samples to a specific target, yet their high computation costs limits its applications on large-scale models or datasets. Existing methods proposed for influence function approximation have significantly reduce the computation overheads. However, they mostly suffer from a unsatisfied accuracy due to the lack of strong convergence guarantees. The family of hyperpower methods are well-known for their rigorous convergence guarantees on matrix inverse approximation, while the matrix multiplication operation can involve intractable memory and computation costs on large-scale models. We propose HyperINF, an efficient and accurate influence function approximation method which leverages the hyperpower method, specifically the Schulz’s iterative algorithm. To deal with the computation-intensive matrix multiplication, we incorporate the generalized fisher information (GFIM) as a low-rank approximation of the hessian matrix, which reduces the memory and computation overheads to a constant costs independent of ranks on LoRA-tuned models. We first demonstrate the superior accuracy and stability of HyperINF compared to other baselines through a synthetic convergence simulation of matrix inversion. We further validate the efficacy of HyperINFthrough extensive real-world data attribution tasks, including mislabeled data detection and data selection for LLM and VLM fine-tuning. On LoRA-tuned models, HyperINF achieves superior downstream performance with minimal memory and computational overhead, while other baselines suffer from significant degradation. The codebase is available at \url{https://anonymous.4open.science/r/HyperINF-B702}.

4856YOLO-MARL: You Only LLM Once for Multi-agent Reinforcement Learning

[openreview] [pdf]

Abstract Advancements in deep multi-agent reinforcement learning (MARL) have positioned it as a promising approach for decision-making in cooperative games. However, it still remains challenging for MARL agents to learn cooperative strategies for some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among the agents. However, due to the model size of LLMs, it can be expensive to frequently infer LLMs for actions that agents can take. In this work, we propose You Only LLM Once for MARL (YOLO-MARL), a novel framework that leverages the high-level task planning capabilities of LLMs to improve the policy learning process of multi-agents in cooperative games. Notably, for each game environment, YOLO-MARL only requires one time interaction with LLMs in the proposed strategy generation, state interpretation and planning function generation modules, before the MARL policy training process. This avoids the ongoing costs and computational time associated with frequent LLMs API calls during training. Moreover, the trained decentralized normal-sized neural network-based policies operate independently of the LLM. We evaluate our method across three different environments and demonstrate that YOLO-MARL outperforms traditional MARL algorithms.

4857Towards zero shot multivariate time series anomaly detection - A Realistic Evaluation

[openreview] [pdf]

Abstract A long line of multivariate timeseries anomaly detection (MTAD) approaches use performance enhancement techniques that are not feasible in practical scenarios. In specific, a) point adjustment technique is employed which uses ground truth to forcefully convert false negatives to true positives and inflates precision to unrealistic proportions, and b) significant data leakage is introduced where anomaly score threshold is determined using the test data and test labels. In this paper, we show the real world performance of existing MTAD techniques when point adjustment and threshold learning on test data is disabled. Moreover, we show that anomalies introduced in real world benchmark datasets result in significant distribution shift between normal and anomalous data, and when point adjustment and threshold learning are used even untrained deterministic methods can perform on par or even beat baseline techniques. We then introduce six synthetic benchmark examples derived from real world systems, where anomalous data and normal data have statistically insignificant distribution shift. We propose, sparse model identification enhanced anomaly detection (SPIE-AD), a model recovery and conformance based zero shot MTAD approach that outperforms state of art MTAD techniques on three real world benchmark datasets without using point adjustment and threshold learning on test data. We evaluate state-of-art MTAD and SPIE-AD on the novel synthetic benchmarks. SPIE-AD outperforms state-of-art MTAD techniques on both standard and novel benchmarks.

4858Diffusion Curriculum: Synthetic-to-Real Data Curriculum via Image-Guided Diffusion

[openreview] [pdf]

Abstract Low-quality or scarce data has posed significant challenges for training deep neural networks in practice. While classical data augmentation cannot contribute very different new data, diffusion models opens up a new door to build self-evolving AI by generating high-quality and diverse synthetic data through text-guided prompts. However, text-only guidance cannot control synthetic images’ proximity to the original images, resulting in out-of-distribution data detrimental to the model performance. To overcome the limitation, we study image guidance to achieve a spectrum of interpolations between synthetic and real images. With stronger image guidance, the generated images are similar to the training data but hard to learn. While with weaker image guidance, the synthetic images will be easier for model but contribute to a larger distribution gap with the original data. The generated full spectrum of data enables us to build a novel “Diffusion CurricuLum (DisCL)”. DisCL adjusts the image guidance level of image synthesis for each training stage: It identifies and focuses on hard samples for the model and assesses the most effective guidance level of synthetic images to improve hard data learning. We apply DisCL to two challenging tasks: long-tail (LT) classification and learning from low-quality data. It focuses on lower-guidance images of high-quality to learn prototypical features as a warm-up of learning higher-guidance images that might be weak on diversity or quality. Extensive experiments showcase a gain of 2.7% and 2.1% in OOD and ID macro-accuracy when applying DisCL to iWildCam dataset. On ImageNet-LT, DisCL improves the base model’s tail-class accuracy from 4.4% to 23.64% and leads to a 4.02% improvement in all-class accuracy.

4859iAgent: LLM Agent as a Shield between User and Recommender Systems

[openreview] [pdf]

Abstract Traditional recommender systems usually take the user-platform paradigm, where users are directly exposed under the control of the platform’s recommendation algorithms. However, the defect of recommendation algorithms may put users in very vulnerable positions under this paradigm. First, many sophisticated models are often designed with commercial objectives in mind, focusing on the platform’s benefits, which may hinder their ability to protect and capture users’ true interests. Second, these models are typically optimized using data from all users, which may overlook individual user’s preferences. Due to these shortcomings, users may experience several disadvantages under the traditional user-platform direct exposure paradigm, such as lack of control over the recommender system, potential manipulation by the platform, echo chamber effects, or lack of personalization for less active users due to the dominance of active users during collaborative learning. Therefore, there is an urgent need to develop a new paradigm to protect user interests and alleviate these issues. Recently, some researchers have introduced LLM agents to simulate user behaviors, these approaches primarily aim to optimize platform-side performance, leaving core issues in recommender systems unresolved. To address these limitations, we propose a new user-agent-platform paradigm, where agent serves as the protective shield between user and recommender system that enables indirect exposure. To this end, we first construct four recommendation datasets, denoted as InstructRec, along with user instructions for each record. To understand user’s intention, we design an Instruction-aware Agent (iAgent) capable of using tools to acquire knowledge from external environments. Moreover, we introduce an Individual Instruction-aware Agent ( i2^2Agent), which incorporates a dynamic memory mechanism to optimize from individual feedback. Results on four InstructRec datasets demonstrate that i2Agent consistently achieves an average improvement of 16.6% over SOTA baselines across ranking metrics. Moreover, i2^2Agent mitigates echo chamber effects and effectively alleviates the model bias in disadvantaged users (less-active), serving as a shield between user and recommender systems.

4860GeoILP: A Synthetic Dataset to Guide Large-Scale Rule Induction

[openreview] [pdf]

Abstract Inductive logic programming (ILP) is a machine learning approach aiming to learn explanatory rules from data. While existing ILP systems can successfully solve small-scale tasks, large-scale applications with various language biases are rarely explored. Besides, it is crucial for a large majority of current ILP systems to require expert-defined language bias, which hampers the development of ILP towards broader utilizations. In this paper, we introduce GeoILP, a large-scale synthetic dataset of diverse ILP tasks involving numerous aspects of language bias. % including complex rule forms, high deduction complexity, and more realistic assumptions. The ILP tasks are built from geometry problems, at the level from textbook exercise to regional International Mathematical Olympiad (IMO), with the help of a deduction engine. These problems are elaborately selected to cover all challenging language biases, such as recursion, predicate invention, and high arity. Experimental results show that no existing method can solve GeoILP tasks. In addition, along with classic symbolic-form data, we provide image-form data to boost the development of the joint learning of neural perception and symbolic rule induction.

4861The Renaissance of Classic Feature Aggregations for Visual Place Recognition in the Era of Foundation Models

[openreview] [pdf]

Abstract Visual Place Recognition (VPR) addresses the retrieval problem in large-scale geographic image databases through feature representations. Recent approaches have leveraged visual foundation models and have proposed novel feature aggregations. However, these methods have failed to grasp the core concepts of foundational models, such as leveraging extensive training sets, and have also neglected the potential of classical feature aggregations, such as GeM and NetVLAD, for low-dimensional representations. Building on these insights, we revive classic aggregation methods and create more fundamental VPR models, abbreviated SuperPlace. First, we introduce a supervised label alignment method that combines grid partitioning and local feature matching. This allows models to be trained on diverse VPR datasets within a unified framework, similar to the design principles of foundation models. Second, we introduce G2^2M, a compact feature aggregation with two GeMs, in which one GeM learns the principal components of feature maps along the channel direction and calibrates the other GeM’s output. Third, we propose the secondary fine-tuning (FT2^2) strategy for NetVLAD-Linear (NVL). NetVLAD first learns feature vectors in a high-dimensional space and then compresses them into a low-dimensional space using a single linear layer. G2^2M excels in large-scale applications requiring rapid response and low latency, while NVL-FT2^2 is optimized for scenarios demanding high precision across a broad range of conditions. Extensive experiments (12 test sets, 14 previous methods, and 11 tables) highlight our contributions and demonstrate the superiority of SuperPlace. Specifically, SuperPlace-G2^2M achieves state-of-the-art results with only one-tenth of the feature dimensions compared to recent methods. Moreover, SuperPlace-NVL-FT2^2 holds the top rank on the MSLS challenge leaderboard. We have submitted a ranking screenshot, the source code, and the original experimental records in the supplementary materials.

4862TSTTC: A Large-Scale Dataset for Time-to-Contact Estimation in Driving Scenarios

[openreview] [pdf]

Abstract Time-to-Contact (TTC) estimation is a critical task for assessing collision risk and is widely used in various driver assistance and autonomous driving systems. The past few decades have witnessed development of related theories and algorithms. The prevalent learning-based methods call for a large-scale TTC dataset in real-world scenarios. In this work, we present a large-scale object oriented TTC dataset in the driving scene for promoting the TTC estimation by a monocular camera. To collect valuable samples and make data with different TTC values relatively balanced, we go through thousands of hours of driving data and select over 200K sequences with a preset data distribution. To augment the quantity of small TTC cases, we also generate clips using the latest Neural rendering methods. Additionally, we provide several simple yet effective TTC estimation baselines and evaluate them extensively on the proposed dataset to demonstrate their effectiveness.

4863Stable Consistency Tuning: Understanding and Improving Consistency Models

[openreview] [pdf]

Abstract Diffusion models achieve superior generation quality but suffer from slow generation speed due to the iterative nature of denoising. In contrast, consistency models, a new generative family, achieve competitive performance with significantly faster sampling. These models are trained either through consistency distillation, which leverages pretrained diffusion models, or consistency training/tuning directly from raw data. In this work, we propose a novel framework for understanding consistency models by modeling the denoising process of the diffusion model as a Markov Decision Process (MDP) and framing consistency model training as the value estimation through Temporal Difference (TD) Learning. More importantly, this framework allows us to analyze the limitations of current consistency training/tuning strategies. Built upon Easy Consistency Tuning (ECT), we propose Stable Consistency Tuning (SCT), which incorporates variance-reduced learning using the score identity. SCT leads to significant performance improvements on benchmarks such as CIFAR-10 and ImageNet-64. On ImageNet-64, SCT achieves 1-step FID 2.42 and 2-step FID 1.55, a new SoTA for consistency models.

4864Detecting Problematic Questions to Support Math Word Problem Design

[openreview] [pdf]

Abstract When designing math word problems, teachers must ensure the clarity and precision of the question to avoid multiple interpretations and unanswerable situations, thereby maintaining consistent grading standards and effectiveness. We address these issues to provide comprehensive support to teachers in creating clear, solvable, and formal math word problems. In this paper, we present MathError, a dataset of real-world math word problems annotated with error types to investigate the need for question correction. Our work explores how large language models (LLMs) can assist teachers in detecting problematic questions to support math word problem design in scenarios with limited data, simulating real-world conditions with minimal training samples. Preliminary results demonstrate the models’ capabilities in detecting problematic questions and identify areas for further research and development in educational applications.

4865RILe: Reinforced Imitation Learning

[openreview] [pdf]

Abstract Reinforcement Learning has achieved significant success in generating complex behavior but often requires extensive reward function engineering. Adversarial variants of Imitation Learning and Inverse Reinforcement Learning offer an alternative by learning policies from expert demonstrations via a discriminator. However, these methods struggle in complex tasks where randomly sampling expert-like behaviors is challenging. This limitation stems from their reliance on policy-agnostic discriminators, which provide insufficient guidance for agent improvement, especially as task complexity increases and expert behavior becomes more distinct. We introduce RILe (Reinforced Imitation Learning environment), a novel trainer-student system that learns a dynamic reward function based on the student’s performance and alignment with expert demonstrations. In RILe, the student learns an action policy while the trainer, using reinforcement learning, continuously updates itself via the discriminator’s feedback to optimize the alignment between the student and the expert. The trainer optimizes for long-term cumulative rewards from the discriminator, enabling it to provide nuanced feedback that accounts for the complexity of the task and the student’s current capabilities. This approach allows for greater exploration of agent actions by providing graduated feedback rather than binary expert/non-expert classifications. By reducing dependence on policy-agnostic discriminators, RILe enables better performance in complex settings where traditional methods falter, outperforming existing methods by 2x in complex simulated robot-locomotion tasks.

4866Accelerating Block Coordinate Descent for LLM Finetuning via Landscape Correction

[openreview] [pdf]

Abstract Training and finetuning large language models (LLMs) are resource-intensive tasks, with memory limitations being a key bottleneck. A classic optimization method, block coordinate descent (BCD), offers solutions by segmenting the trainable parameters into multiple blocks and optimizing one active block at a time while freezing the others, thereby significantly reducing memory cost. However, we identify that blindly applying BCD to train LLMs can be inefficient for two reasons. First, optimizing only the active block requires backpropagating through multiple deeper yet inactive blocks, resulting in wasteful computations. Second, the frozen blocks, when they are not quite close to optimality, can narrow the optimization landscape, potentially misguiding the training of the active block. To address these issues simultaneously, we propose integrating BCD withlandscape correction, which unfreezes the inactive blocks and updates them in a cost-efficient manner during the same backpropagation as the update to the active block. We show that our method empirically improves vanilla BCD with minimal additional computation and memory. Experiments on 8B and 70B models demonstrate that our proposed method surpasses memory efficient baselines and matches Adam’s downstream performance while reducing memory cost by 80% compared to Adam.

4867An Exploration of Speech Conditioned Large Language Models (SLMs)

[openreview] [pdf]

Abstract Efforts to enable Large Language Models (LLMs) to understand human speech have spurred the development of an increasing number of Speech-Conditioned Large Language Models (SLMs). While these models have demonstrated success on various speech-related tasks, such as automatic speech recognition (ASR), the design space of SLMs has not been thoroughly explored. In this work, we revisit key design choices for SLMs, aiming to gain insights into how these choices impact the performance of SLMs and how we could optimize them for better results. Surprisingly, our experiments reveal that current SLMs struggle to follow speech instructions or respond to speech inputs, even for simple queries like ”who has been to the moon?”. Our experimental findings indicate that speech instruction following data is crucial for improving these capabilities. Leveraging this insight, we propose to use synthetic speech instruction following data to enhance speech instruction following capability. Combining the findings from our other experiments, we provide an effective recipe for developing SLMs. Our model, called SiM, not only achieves strong ASR performance, but also significantly outperforms existing SLMs in speech instruction following.

4868Task-oriented Sequential Grounding in 3D Scenes

[openreview] [pdf]

Abstract Grounding natural language in physical 3D environments is essential for the advancement of embodied artificial intelligence. Current datasets and models for 3D visual grounding predominantly focus on identifying and localizing objects from static, object-centric descriptions. These approaches do not adequately address the dynamic and sequential nature of task-oriented grounding necessary for practical applications. In this work, we propose a new task: Task-oriented Sequential Grounding in 3D scenes, wherein an agent must follow detailed step-by-step instructions to complete daily activities by locating a sequence of target objects in indoor scenes. To facilitate this task, we introduce SG3D, a large-scale dataset containing 22,346 tasks with 112,236 steps across 4,895 real-world 3D scenes. The dataset is constructed using a combination of RGB-D scans from various 3D scene datasets and an automated task generation pipeline, followed by human verification for quality assurance. We adapted three state-of-the-art 3D visual grounding models to the sequential grounding task and evaluated their performance on SG3D. Our results reveal that while these models perform well on traditional benchmarks, they face significant challenges with task-oriented sequential grounding, underscoring the need for further research in this area.

4869Context-Alignment: Activating and Enhancing LLMs Capabilities in Time Series

[openreview] [pdf]

Abstract Recently, leveraging pre-trained Large Language Models (LLMs) for time series (TS) tasks has gained increasing attention, which involves activating and enhancing LLMs’ capabilities. Many methods aim to activate LLMs’ capabilities based on token-level alignment, but overlook LLMs’ inherent strength on natural language processing — their deep understanding of linguistic logic and structure rather than superficial embedding processing. We propose Context-Alignment, a new paradigm that aligns TS with a linguistic component in the language environments familiar to LLMs to enable LLMs to contextualize and comprehend TS data, thereby activating their capabilities. Specifically, such context-level alignment comprises structural alignment and logical alignment, which is achieved by a Dual-Scale Context-Alignment GNNs (DSCA-GNNs) applied to TS-language multimodal inputs. Structural alignment utilizes dual-scale nodes to describe hierarchical structure in TS-language, enabling LLMs treat long TS data as a whole linguistic component while preserving intrinsic token features. Logical alignment uses directed edges to guide logical relationships, ensuring coherence in the contextual semantics. Demonstration examples prompt are employed to construct Demonstration Examples based Context-Alignment (DECA) following DSCA-GNNs framework. DECA can be flexibly and repeatedly integrated into various layers of pre-trained LLMs to improve awareness of logic and structure, thereby enhancing performance. Extensive experiments show the effectiveness of DECA and the importance of Context-Alignment across tasks, particularly in few-shot and zero-shot forecasting, confirming that Context-Alignment provide powerful prior knowledge on context. We will release the source code upon publication.

4870Learning stochastic dynamics from snapshots through regularized unbalanced optimal transport

[openreview] [pdf]

Abstract Reconstructing dynamics using samples from sparsely time-resolved snapshots is an important problem in both natural sciences and machine learning. Here, we introduce a new deep learning approach for solving regularized unbalanced optimal transport (RUOT) and inferring continuous unbalanced stochastic dynamics from observed snapshots. Based on the RUOT form, our method models these dynamics without requiring prior knowledge of growth and death processes or additional information, allowing them to be learnt directly from data. Theoretically, we explore the connections between the RUOT and Schrödinger bridge problem and discuss the key challenges and potential solutions. The effectiveness of our method is demonstrated with a synthetic gene regulatory network, high-dimensional Gaussian Mixture Model, and single-cell RNA-seq data from blood development. Compared with other methods, our approach accurately identifies growth and transition patterns, eliminates false transitions, and constructs the Waddington developmental landscape.

4871Chemistry-Inspired Diffusion with Non-Differentiable Guidance

[openreview] [pdf]

Abstract Recent advances in diffusion models have shown remarkable potential in the conditional generation of novel molecules. These models can be guided in two ways: (i) explicitly, through additional features representing the condition, or (ii) implicitly, using a property predictor. However, training property predictors in conditional diffusion models requires an abundance of labeled data and is inherently challenging in real-world applications. We propose a novel approach that attenuates the limitations of acquiring large labeled datasets by leveraging domain knowledge from quantum chemistry as a non-differentiable oracle to guide an unconditional diffusion model. Instead of relying on neural networks, the oracle provides accurate guidance in the form of estimated gradients, allowing the diffusion process to sample from a conditional distribution specified by quantum chemistry. We show that this results in more precise conditional generation of novel and stable molecular structures. Our experiments demonstrate that our method: (1) significantly reduces atomic forces, enhancing the validity of generated molecules when used for stability optimization; (2) is compatible with both explicit and implicit guidance in diffusion models, enabling joint optimization of molecular properties and stability; and (3) generalizes effectively to molecular optimization tasks beyond stability optimization.

4872Logic Agent: Enhancing Validity with Logic Rule Invocation

[openreview] [pdf]

Abstract Chain-of-Thought (CoT) prompting has become a key strategy for enhancing the inferential abilities of large language models (LLMs) in reasoning tasks. However, it often struggles with ensuring reasoning validity and maintaining informativeness. This paper presents the Logic Agent (LA), a novel framework designed to boost the validity of reasoning in LLMs through strategic logic function calls. Distinct from traditional methods, LA converts LLMs into dynamic agents that apply propositional logic rules, transforming natural language inputs into structured logical forms. The agent utilizes a robust suite of predefined functions to guide the reasoning process effectively. This approach can enhance the structured and coherent generation of reasoning outputs, improving their interpretability and logical consistency. Through detailed experiments, we showcase LA’s ability to adapt across different LLM sizes, significantly enhancing the accuracy of complex reasoning tasks across various domains.

4873LAIA-SQL: Enhancing Natural Language to SQL Generation in Multi-Table QA via Task Decomposition and Keyword Extraction

[openreview] [pdf]

Abstract Natural Language to SQL (NL2SQL) provides an effective solution for multi-table question answering (Table QA) to automate data retrieval by transforming simple user queries into SQL commands. It enhances data accessibility and decision-making processes across various industries. Large Language Model (LLM) based NL2SQL methods have been shown to outperform rule-based or neural network-based NL2SQL methods. However, existing LLM-based NL2SQL approaches face challenges like inaccurate interpretation of user questions, slow retrieval speeds, erroneous SQL generation, and high operational costs. As there is a lack of datasets specifically designed to evaluate natural language understanding (NLU) in NL2SQL tasks and no models optimized for user question understanding in Table QA, we introduce LAIA-NLU, a novel dataset that dissects NLU into task decomposition and keyword extraction. LAIA-NLU contains 1,500 high-quality QA pairs, created through manual review. Using this dataset, we developed LAIA-NLUer, which is capable of effectively interpreting user intent in table-based queries. To further enhance NL2SQL performance in terms of speed, cost, and accuracy, we also present LAIA-SQL, a retrieval-augmented based NL2SQL framework. Experimental results show that LAIA-SQL outperforms state-of-the-art models, achieving an accuracy improvement to 67.28% in BIRD dataset, a 52.4% reduction in runtime, and a 97% decrease in operational costs. These improvements demonstrate the potential of our approach to advance multi-table data retrieval and analysis. Our code, dataset, and model will be publicly available to encourage further research in this field.

4874No Free Lunch: Fundamental Limits of Learning Non-Hallucinating Generative Models

[openreview] [pdf]

Abstract Generative models have shown impressive capabilities in synthesizing high-quality outputs across various domains. However, a persistent challenge is the occurrence of “hallucinations,” where the model produces outputs that are plausible but invalid. While empirical strategies have been explored to mitigate this issue, a rigorous theoretical understanding remains elusive. In this paper, we develop a theoretical framework to analyze thelearnabilityof non-hallucinating generative models from a learning-theoretic perspective. Our results reveal that non-hallucinating learning is statisticallyimpossiblewhen relying solely on the training dataset, even for a hypothesis class of size two and when the entire training set is truthful. To overcome these limitations, we show that incorporatinginductive biasesaligned with the actual facts into the learning process is essential. We provide a systematic approach to achieve this by restricting the fact set to a concept class of finite VC-dimension and demonstrate its effectiveness under various learning paradigms. Although our findings are primarily conceptual, they represent a first step towards a principled approach to addressing hallucinations in learning generative models.

4875Gaussian Ensemble Belief Propagation for Efficient Inference in High-Dimensional, Black-box Systems

[openreview] [pdf]

Abstract Efficient inference in high-dimensional models is a central challenge in machine learning. We introduce the Gaussian Ensemble Belief Propagation (GEnBP) algorithm, which combines the strengths of the Ensemble Kalman Filter (EnKF) and Gaussian Belief Propagation (GaBP) to address this challenge. GEnBP updates ensembles of prior samples into posterior samples by passing low-rank local messages over the edges of a graphical model, enabling efficient handling of high-dimensional states, parameters, and complex, noisy, black-box generation processes. By utilizing local message passing within a graphical model structure, GEnBP effectively manages complex dependency structures and remains computationally efficient even when the ensemble size is much smaller than the inference dimension--a common scenario in spatiotemporal modeling, image processing, and physical model inversion. We demonstrate that GEnBP can be applied to various problem structures, including data assimilation, system identification, and hierarchical models, and show through experiments that it outperforms existing methods in terms of accuracy and computational efficiency.

4876Knowledge Benchmark Graph: Assisting Large Language Models in Designing Models by Retrieving Benchmark Knowledge

[openreview] [pdf]

Abstract In recent years, the design and transfer of neural network models have been widely studied due to their exceptional performance and capabilities. However, the complex nature of datasets and the vast architecture space pose significant challenges for both manual and automated algorithms in creating high-performance models. Inspired by researchers who design, train, and document the performance of various models across different datasets, this paper introduces a novel schema that transforms the benchmark data into a Knowledge Benchmark Graph (KBG), which primarily stores the facts in the form of performance(data, model). Constructing the KBG facilitates the structured storage of design knowledge, aiding subsequent model design and transfer. However, it is a non-trivial task to retrieve or design suitable neural networks based on the KBG, as real-world data are often off the records. To tackle this challenge, we propose transferring existing models stored in KBG by establishing correlations between unseen and previously seen datasets. Given that measuring dataset similarity is a complex and open-ended issue, we explore the potential for evaluating the correctness of the similarity function. Then, we further integrate the KBG with Large Language Models (LLMs), assisting LLMs to think and retrieve existing model knowledge in a manner akin to humans when designing or transferring models. We demonstrate our method specifically in the context of Graph Neural Network (GNN) architecture design, constructing a KBG (with 26,206 models, 211,669 performance records, and 2,540,064 facts) and validating the effectiveness of leveraging the KBG to promote GNN architecture design.

4877Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing

[openreview] [pdf]

Abstract Role-Playing Agents (RPAs) have shown remarkable performance in various applications, yet they often struggle to recognize and appropriately respond to hard queries that conflict with their role-play knowledge. To investigate RPAs’ performance when faced with different types of conflicting requests, we develop an evaluation benchmark that includes contextual knowledge conflicting requests, parametric knowledge conflicting requests, and non-conflicting requests to assess RPAs’ ability to identify conflicts and refuse to answer appropriately without over-refusing. Through extensive evaluation, we find that most RPAs behave significant performance gaps toward different conflict requests. To elucidate the reasons, we conduct an in-depth representation-level analysis of RPAs under various conflict scenarios. Our findings reveal the existence of rejection regions and direct response regions within the model’s forwarding representation, and thus influence the RPA’s final response behavior. Therefore, we introduce a lightweight representation editing approach that conveniently shifts conflicting requests to the rejection region, thereby enhancing the model’s refusal accuracy. The experimental results validate the effectiveness of our editing method, improving RPAs’ refusal ability of conflicting requests while maintaining their general role-playing capabilities.

4878Efficient Alternating Minimization with Applications to Weighted Low Rank Approximation

[openreview] [pdf]

Abstract Weighted low rank approximation is a fundamental problem in numerical linear algebra, and it has many applications in machine learning. Given a matrix MRn×nM \in \mathbb{R}^{n \times n}, a non-negative weight matrix WR0n×nW \in \mathbb{R}_{\geq 0}^{n \times n}, a parameter kk, the goal is to output two matrices X,YRn×kX,Y\in \mathbb{R}^{n \times k} such that W(MXY)F\| W \circ (M - X Y^\top) \|_F is minimized, where denotes the Hadamard product. It naturally generalizes the well-studied low rank matrix completion problem. Such a problem is known to be NP-hard and even hard to approximate assuming the Exponential Time Hypothesis. Meanwhile, alternating minimization is a good heuristic solution for weighted low rank approximation. In particular, [Li, Liang and Risteski, ICML’16] shows that, under mild assumptions, alternating minimization does provide provable guarantees. In this work, we develop an efficient and robust framework for alternating minimization that allows the alternating updates to be computed approximately. For weighted low rank approximation, this improves the runtime of [Li, Liang and Risteski, ICML’16] from W0k2\|W\|_0k^2 to W0k\|W\|_0 k where W0\|W\|_0 denotes the number of nonzero entries of the weight matrix. At the heart of our framework is a high-accuracy multiple response regression solver together with a robust analysis of alternating minimization.

4879What Makes Your Model a Low-empathy or Warmth Person: Exploring the Origins of Personality in LLMs

[openreview] [pdf]

Abstract Large language models (LLMs) have demonstrated remarkable capabilities in generating human-like text and exhibiting personality traits similar to those in humans. However, the mechanisms by which LLMs encode and express traits such as agreeableness and impulsiveness remain poorly understood. Drawing on the theory of social determinism, we investigate how long-term background factors, such as family environment and cultural norms, interact with short-term pressures like external instructions, shaping and influencing LLMs’ personality traits. By steering the output of LLMs through the utilization of interpretable features within the model, we explore how these background and pressure factors lead to changes in the model’s traits without the need for further fine-tuning. Additionally, we suggest the potential impact of these factors on model safety from the perspective of personality.

4880Modeling Real-Time Interactive Conversations as Timed Diarized Transcripts

[openreview] [pdf]

Abstract Chatbots built upon language models have exploded in popularity, but they have largely been limited to synchronous, turn-by-turn dialogues. In this paper we present a simple yet general method to simulate real-time interactive conversations using pretrained text-only language models, by modeling timed diarized transcripts and decoding them with causal rejection sampling. We demonstrate the promise of this method with two case studies: instant messenger dialogues and spoken conversations, which require generation at about 30 tok/s and 20 tok/s respectively to maintain real-time interactivity. These capabilities can be added into language models using relatively little data and run on commodity hardware.

4881Unsupervised Multi-Agent Diversity With Wasserstein Distance

[openreview] [pdf]

Abstract In cooperative Multi-Agent Reinforcement Learning (MARL), agents sharing policy network parameters are observed to learn similar behaviors, which impedes efficient exploration and easily results in the local optimum of cooperative policies. In order to encourage multi-agent diversity, many recent efforts have contributed to distinguishing different trajectories by maximizing the mutual information objective, given agent identities. Despite their successes, these mutual information-based methods do not necessarily promote exploration. To encourage multi-agent diversity and sufficient exploration, we propose a novel Wasserstein Multi-Agent Diversity (WMAD) exploration method that maximizes the Wasserstein distance between the trajectory distributions of different agents in a latent representation space. Since the Wasserstein distance is defined over two distributions, we further extend it to learn diverse policies for multiple agents. We empirically evaluate our method in various challenging multi-agent tasks and demonstrate its superior performance and sufficient exploration compared to existing state-of-the-art methods.

4882Privacy Auditing of Large Language Models

[openreview] [pdf]

Abstract Current techniques for privacy auditing of large language models (LLMs) have limited efficacy---they rely on basic approaches to generate canaries which leads to weak membership inference attacks that in turn give loose lower bounds on the empirical privacy leakage. We develop canaries that are far more effective than those used in prior work under threat models that cover a range of realistic settings. We demonstrate through extensive experiments on multiple families of fine-tuned LLMs that our approach sets a new standard for detection of privacy leakage. For measuring the memorization rate of non-privately trained LLMs, our designed canaries largely surpassing the prior SOTA. For example, on the Qwen2.5-0.5B model, our designed canaries achieves 26.026.0% TPR at 11% FPR, largely surpassing the prior SOTA of 1.31.3% TPR at 11% FPR. Our method can be used to provide a privacy audit of ε1\varepsilon \approx 1 for a model trained with theoretical ε\varepsilon of 4. To the best of our knowledge, this is the first time that a privacy audit of LLM training has achieved nontrivial auditing success in the setting where the attacker cannot train shadow models, insert gradient canaries, or access the model at every iteration.

4883A Proxy Matrix-based Framework for Contextual Stochastic Optimization under Confounding Effect

[openreview] [pdf]

Abstract Data-driven decision-making in real-world scenarios often faces the challenge of endogeneity between decisions and outcomes, introducing confounding effects. While existing literature typically assumes unconfoundedness, this is often unrealistic. In practice, decision-making relies on high-dimensional, heterogeneous-type proxy features of confounders, leading to suboptimal decisions due to limited predictive power for uncertainty. We propose a novel semi-parametric decision framework to mitigate confounding effects. Our approach combines exponential family matrix completion to infer the confounders matrix from proxy features, with non-parametric prescriptive methods for decision-making based on the estimated confounders. We derive a non-convergent regret bound for data-driven decisions under confounding effects and demonstrate how our framework improves this bound. Experiments on both synthetic and real datasets validate our method’s efficacy in reducing confounding effects across various proxy dimensions. We also show that our approach consistently outperforms benchmarks in practical applications.

4884Rational Metareasoning for Large Language Models

[openreview] [pdf]

Abstract Being prompted to engage in reasoning has emerged as a core technique for using large language models (LLMs), deploying additional inference-time compute to improve task performance. However, as LLMs increase in both size and adoption, inference costs are correspondingly becoming increasingly burdensome. How, then, might we optimize reasoning’s cost-performance tradeoff? This work introduces a novel approach based on computational models of metareasoning used in cognitive science, training LLMs to selectively use intermediate reasoning steps only when necessary. We first develop a reward function that incorporates the Value of Computation by penalizing unnecessary reasoning, then use this reward function with Expert Iteration to train the LLM. Compared to few-shot chain-of-thought prompting and STaR, our method significantly reduces inference costs (20-37% fewer tokens generated across three models) while maintaining task performance across diverse datasets.

4885Monte Carlo Planning with Large Language Model for Text-Based Games

[openreview] [pdf]

Abstract Text-based games provide valuable environments for language-based autonomous agents. However, planning-then-learning paradigms, such as those combining Monte Carlo Tree Search (MCTS) and reinforcement learning (RL), are notably time-consuming due to extensive iterations. Additionally, these algorithms perform uncertainty-driven exploration but lack language understanding and reasoning abilities. In this paper, we introduce the Monte Carlo planning with Dynamic Memory-guided Large language model (MC-DML) algorithm. MC-DML leverages the language understanding and reasoning capabilities of Large Language Models (LLMs) alongside the exploratory advantages of tree search algorithms. Specifically, we enhance LLMs with in-trial and cross-trial memory mechanisms, enabling them to learn from past experiences and dynamically adjust action evaluations during planning. We conduct experiments on a series of text-based games from the Jericho benchmark. Our results demonstrate that the MC-DML algorithm significantly enhances performance across various games at the initial planning phase, outperforming strong contemporary methods that require multiple iterations. This demonstrates the effectiveness of our algorithm, paving the way for more efficient language-grounded planning in complex environments.

4886Enhancing Multi-Agent Learning in Real-World Interactive Environments through Process Reward Decomposition

[openreview] [pdf]

Abstract LLM-based agents have made significant advancements in interactive environments, such as mobile operations and web browsing, with multi-agent systems further boosting performance. However, current agent learning techniques heavily rely on in-domain data and struggle to generalize across tasks and environments. Moreover, existing multi-agent learning methods are limited by fixed role assignments, which restrict their flexibility and generalization. Furthermore, the multi-step nature of interactive tasks, combined with sparse end-to-end reward signals, hinder effective learning to a great extent. To address these issues, we propose CollabUIAgents\textit{CollabUIAgents}, a two-stage multi-agent learning framework for interactive environments. In the first stage, the base model is adapted to the environment using curriculum learning on multi-level instruction data. In the second stage, a novel process reward decomposition strategy is introduced during reinforcement learning, allowing rewards to be distributed at both the agent and conversation round levels. This granular feedback fosters collaborative awareness among agents without predefined roles and improves learning efficacy. Experimental results show that our method significantly enhances the performance of multi-agent systems based on open-source models, achieving notable improvements both within and across domains, while also exhibiting strong cross-environment generalization capabilities. Moreover, our best-performing systems achieve results on par with or exceed those of the strong closed-source models, while maintaining the flexibility to be integrated with prompt-based multi-agent systems for future research.

4887Seeker: Enhancing Exception Handling in Code with a LLM-based Multi-Agent Approach

[openreview] [pdf]

Abstract In real-world software development, improper or missing exception handling can severely impact the robustness and reliability of code. Exception handling mechanisms require developers to detect, capture, and manage exceptions according to high standards, but many developers struggle with these tasks, leading to fragile code. This problem is particularly evident in open-source projects and impacts the overall quality of the software ecosystem. To address this challenge, we explore the use of large language models (LLMs) to improve exception handling in code. Through extensive analysis, we identify three key issues: Insensitive Detection of Fragile Code, Inaccurate Capture of Exception Types, and Distorted Handling Solutions. These problems are widespread across real-world repositories, suggesting that robust exception handling practices are often overlooked or mishandled. In response, we propose \emph{Seeker}, a multi-agent framework inspired by expert developer strategies for exception handling. Seeker uses agents—Scanner, Detector, Predator, Ranker, and Handler—to assist LLMs in detecting, capturing, and resolving exceptions more effectively. Our work is the first systematic study on leveraging LLMs to enhance exception handling practices, providing valuable insights for future improvements in code reliability.

4888Disentangling and Integrating Relational and Sensory Information in Transformer Architectures

[openreview] [pdf]

Abstract Relational reasoning is a central component of generally intelligent systems, enabling robust and data-efficient inductive generalization. Recent empirical evidence shows that many existing neural architectures, including Transformers, struggle with tasks requiring relational reasoning. In this work, we distinguish between two types of information:sensoryinformation about the properties of individual objects, andrelationalinformation about the relationships between objects. While neural attention provides a powerful mechanism for controlling the flow of sensory information between objects, the Transformer lacks an explicit computational mechanism for routing and processing relational information. To address this limitation, we propose an architectural extension of the Transformer framework that we call theDual Attention Transformer (DAT), featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information. We empirically evaluateDATon a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing. Our results demonstrate that integrating explicit relational computational mechanisms into the Transformer architecture leads to significant performance gains in terms of data efficiency and parameter efficiency.

4889DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

[openreview] [pdf]

Abstract The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.

4890Exploring Large Action Sets with Hyperspherical Embeddings using von Mises-Fisher Sampling

[openreview] [pdf]

Abstract This paper introduces von Mises-Fisher exploration (vMF-exp), a scalable method for exploring large action sets in reinforcement learning problems where hyperspherical embedding vectors represent actions. vMF-exp involves initially sampling a state embedding representation using a von Mises-Fisher distribution, then exploring this representation’s nearest neighbors, which scales to virtually unlimited numbers of candidate actions. We show that, under theoretical assumptions, vMF-exp asymptotically maintains the same probability of exploring each action as Boltzmann Exploration (B-exp), a popular alternative that, nonetheless, suffers from scalability issues as it requires computing softmax values for each action. Consequently, vMF-exp serves as a scalable alternative to B-exp for exploring large action sets with hyperspherical embeddings. In the final part of this paper, we further validate the empirical relevance of vMF-exp by discussing its successful deployment at scale on a music streaming service. On this service, vMF-exp has been employed for months to recommend playlists inspired by initial songs to millions of users, from millions of possible actions for each playlist.

4891LD-SDM: Language-Driven Hierarchical Species Distribution Modeling

[openreview] [pdf]

Abstract We focus on species distribution modeling using global-scale presence-only data, leveraging geographical and environmental features to map species ranges, as in previous studies. However, we innovate by integrating taxonomic classification into our approach. Specifically, we propose using a large language model to extract a latent representation of the taxonomic classification from a textual prompt. This allows us to map the range of any taxonomic rank, including unseen species, without additional supervision. We also present a new proximity-aware evaluation metric, suitable for evaluating species distribution models, which addresses critical shortcomings of traditional metrics. We evaluated our model for species range prediction, zero-shot prediction, and geo-feature regression and found that it outperforms several state-of-the-art models. We will share code, data, and model checkpoints after acceptance.

4892Understanding Factual Recall in Transformers via Associative Memories

[openreview] [pdf]

Abstract Large language models have demonstrated an impressive ability to perform factual recall. Prior work has found that transformers trained on factual recall tasks can store information at a rate proportional to their parameter count. In our work, we show that shallow transformers can use a combination of associative memories to obtain such near optimal storage capacity. We begin by proving that the storage capacities of both linear and MLP associative memories scale linearly with parameter count. We next introduce a synthetic factual recall task, and prove that a transformer with a single layer of self-attention followed by an MLP can obtain 100% accuracy on the task whenever either the total number of self-attention parameters or MLP parameters scales (up to log factors) linearly with the number of facts. In particular, the transformer can trade off between using the value matrices or the MLP as an associative memory to store the dataset of facts. We complement these expressivity results with an analysis of the gradient flow trajectory of a simplified linear attention model trained on our factual recall task, where we show that the model exhibits sequential learning behavior.

4893Radar: Fast Long-Context Decoding for Any Transformer

[openreview] [pdf]

Abstract Transformer models have demonstrated exceptional performance across a wide range of applications. Though forming the foundation of Transformer models, the dot-product attention does not scale well to long-context data since its time requirement grows quadratically with context length. In this work, we propose Radar, a training-free approach that accelerates inference by dynamically searching for the most important context tokens. For any pre-trained Transformer, Radar can reduce the decoding time complexity without training or heuristically evicting tokens. Moreover, we provide theoretical justification for our approach, demonstrating that Radar can reliably identify the most important tokens with high probability. We conduct extensive comparisons with the previous methods on a wide range of tasks. The results demonstrate that Radar achieves the state-of-the-art performance across different architectures with reduced time complexity, offering a practical solution for efficient long-context processing of Transformers.

4894Factorized Implicit Global Convolution for Automotive Computational Fluid Dynamics Prediction

[openreview] [pdf]

Abstract Computational Fluid Dynamics (CFD) is crucial for automotive design, requiring the analysis of large 3D point clouds to study how vehicle geometry affects pressure fields and drag forces. However, existing deep learning approaches for CFD struggle with the computational complexity of processing high-resolution 3D data. We propose Factorized Implicit Global Convolution (FIGConv), a novel architecture that efficiently solves CFD problems for very large 3D meshes with arbitrary input and output geometries. FIGConv achieves quadratic complexity O(N2)O(N^2), a significant improvement over existing 3D neural CFD models that require cubic complexity O(N3)O(N^3). Our approach combines Factorized Implicit Grids to approximate high-resolution domains, efficient global convolutions through 2D reparameterization, and a U-shaped architecture for effective information gathering and integration. We validate our approach on the industry-standard Ahmed body dataset and the large-scale DrivAerNet dataset. On DrivAerNet, our model achieves an R2R^2 value of 0.95 for drag prediction, outperforming the previous state-of-the-art by a significant margin. This represents a 40% improvement in relative mean squared error and a 70% improvement in absolute mean squared error over prior methods.

4895AcademicEval: Live Long-Context LLM Benchmark

[openreview] [pdf]

Abstract Large Language Models (LLMs) have achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length and labor-intensive annotation, and the label leakage issue in LLM training also poses a pressing challenge. Therefore, we propose \textsc{AcademicEval}, a live benchmark for evaluating LLMs over long-context generation tasks. \textsc{AcademicEval} adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, \textit{i.e.}, \textsc{Title}, \textsc{Abstract}, \textsc{Introduction}, and \textsc{Related Work}, which cover a wide range of abstraction levels and require no manual labeling. Moreover, \textsc{AcademicEval} integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. Especially, \textsc{AcademicEval} features an efficient live evaluation, ensuring no label leakage. We conduct holistic experiments on \textsc{AcademicEval}, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, illustrating the challenge of our benchmark. We also provide insightful analysis for enhancing LLMs’ long-context modeling capabilities.

4896DAG-SHAP: Feature Attribution in DAG based on Edge Intervention

[openreview] [pdf]

Abstract Shapley value-based feature attribution methods face challenges in scenarios with complex feature interactions and causal relationships, even when a causal structure is provided. The assumption on the attribution objects of existing methods often deviates from practical scenarios as they cannot capture the exogenous influence of features through each edge in the causal graph, leading to unreasonable interpretations. To overcome these limitations, we propose a novel feature attribution method called DAG-SHAP, which is based on edge intervention. DAG-SHAP treats the exogenous contributions in each ongoing feature edge as an individual attribution object ensuring that both externality and exogenous contributions of features are appropriately captured. Additionally, we introduce an approximation method for efficiently computing DAG-SHAP. Extensive experiments on both synthetic and real datasets validate the effectiveness of DAG-SHAP. Our code can be found in the anonymous repository at \url{https://anonymous.4open.science/r/dag-30F2}.

4897PolyPythias: Stability and Outliers across Fifty Language Model Pre-Training Runs

[openreview] [pdf]

Abstract Understanding the stability of language model pre-training and its effects on downstream performance is still understudied. Prior work shows that the training process can yield significantly different results in response to slight variations in initial conditions, e.g., the random seed. Crucially, resources to study pre-training stability in language models are still lacking, especially for decoder-only models. We introduce the PolyPythias, a set of 45 new training runs for the Pythia model suite: 9 new seeds across 5 model sizes, from 14M to 410M parameters, resulting in about 7k new checkpoints that we release. Using these new 45 training runs, in addition to the 5 already available, we study the effects of different initial conditions determined by the seed---i.e., parameters’ initialisation and data order---on (i) downstream performance, (ii) learned linguistic representations, and (iii) emergence of training phases. In addition to common scaling behaviours, our analyses generally reveal highly consistent training dynamics across both model sizes and initial conditions. Additionally, the new seeds for each model allow us to identify outlier training runs and delineate their characteristics. Our findings show the potential of using these methods to predict training stability.

4898Comparisons Are All You Need for Optimizing Smooth Functions

[openreview] [pdf]

Abstract When optimizing machine learning models, there are various scenarios where gradient computations are challenging or even infeasible. Furthermore, in reinforcement learning (RL), preference-based RL that only compares between options has wide applications, including reinforcement learning with human feedback in large language models. In this paper, we systematically study optimization of a smooth function f ⁣:RnRf\colon\mathbb{R}^n\to\mathbb{R} only assuming an oracle that compares function values at two points and tells which is larger. When ff is convex, we give two algorithms using O~(n/ϵ)\tilde{O}(n/\epsilon) and O~(n2)\tilde{O}(n^{2}) comparison queries to find an ϵ\epsilon-optimal solution, respectively. When ff is nonconvex, our algorithm uses O~(n/ϵ2)\tilde{O}(n/\epsilon^2) comparison queries to find an ϵ\epsilon-approximate stationary point. All these results match the best-known zeroth-order algorithms with function evaluation queries in nn dependence, thus suggesting that \emph{comparisons are all you need for optimizing smooth functions using derivative-free methods}. In addition, we also give an algorithm for escaping saddle points and reaching an ϵ\epsilon-second order stationary point of a nonconvex ff, using O~(n1.5/ϵ2.5)\tilde{O}(n^{1.5}/\epsilon^{2.5}) comparison queries.

4899DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models

[openreview] [pdf]

Abstract Learning useful representations for continuous-time dynamic graphs (CTDGs) is challenging, due to the concurrent need to span long node interaction histories and grasp nuanced temporal details. In particular, two problems emerge: (1) Encoding longer histories requires more computational resources, making it crucial for CTDG models to maintain low computational complexity to ensure efficiency; (2) Meanwhile, more powerful models are needed to identify and select the most critical temporal information within the extended context provided by longer histories. To address these problems, we propose a CTDG representation learning model named DyGMamba, originating from the popular Mamba state space model (SSM). DyGMamba first leverages a node-level SSM to encode the sequence of historical node interactions. Another time-level SSM is then employed to exploit the temporal patterns hidden in the historical graph, where its output is used to dynamically select the critical information from the interaction history. We validate DyGMamba experimentally on the dynamic link prediction task. The results show that our model achieves state-of-the-art in most cases. DyGMamba also maintains high efficiency in terms of computational resources, making it possible to capture long temporal dependencies with a limited computation budget.

4900Node-wise Filtering in Graph Neural Networks: A Mixture of Experts Approach

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) have proven to be highly effective for node classification tasks across diverse graph structural patterns. Traditionally, GNNs employ a uniform global filter—typically a low-pass filter for homophilic graphs and a high-pass filter for heterophilic graphs. However, real-world graphs often exhibit a complex mix of homophilic and heterophilic patterns, rendering a single filter approach suboptimal. In this work, we theoretically demonstrate that a global filter optimized for one pattern can adversely affect performance on nodes with differing patterns. To address this, we introduce a novel GNN framework Node-MoE that utilizes a mixture of experts to adaptively select the appropriate filters for different nodes. Extensive experiments demonstrate the effectiveness of the proposed Node-MoE on both homophilic and heterophilic graphs.

4901Diffusing to the Top: Boost Graph Neural Networks with Minimal Hyperparameter Tuning

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) are proficient in graph representation learning and achieve promising performance on versatile tasks such as node classification and link prediction. Usually, a comprehensive hyperparameter tuning is essential for fully unlocking GNN’s top performance, especially for complicated tasks such as node classification on large graphs and long-range graphs. This is usually associated with high computational and time costs and careful design of appropriate search spaces. This work introduces a graph-conditioned latent diffusion framework (GNN-Diff) to generate high-performing GNNs based on the model checkpoints of sub-optimal hyperparameters selected by a light-tuning coarse search. We validate our method through 166 experiments across four graph tasks: node classification on small, large, and long-range graphs, as well as link prediction. Our experiments involve 10 classic and state-of-the-art target models and 20 publicly available datasets. The results consistently demonstrate that GNN-Diff: (1) boosts the performance of GNNs with efficient hyperparameter tuning; and (2) presents high stability and generalizability on unseen data across multiple generation runs. The code is available athttps://anonymous.4open.science/r/GNN-Diff-1AD3.

4902SynHING: Synthetic Heterogeneous Information Network Generation for Graph Learning and Explanation

[openreview] [pdf]

Abstract Graph Neural Networks (GNNs) excel in modeling graph structures across diverse domains, such as community analysis and recommendation systems. As the need for GNN interpretability grows, there is an increasing demand for robust baselines and comprehensive graph datasets, especially within the realm of Heterogeneous Information Networks (HIN). To address this, we introduce SynHING, a framework for Synthetic Heterogeneous Information Network Generation designed to advance graph learning and explanation. After identifying key motifs in a target HIN, SynHING systematically employs a bottom-up generation process with intra-cluster and inter-cluster merge modules. This process, along with post-pruning techniques, ensures that the synthetic HIN accurately mirrors the structural and statistical properties of the original graph. The effectiveness of SynHING is validated using four datasets - IMDB, Recipe, ACM, and DBLP - spanning three distinct application categories, demonstrating both its generality and practicality. Furthermore, SynHING provides ground-truth motifs for evaluating GNN explainer models, establishing a new benchmark for explainable, synthetic HIN generation. This contributes significantly to advancing interpretable machine learning in complex network environments.

4903Optimizing Dynamic Treatment Strategies with Reinforcement Learning and Dual-Hawkes Process in Clinical Environments

[openreview] [pdf]

Abstract Modeling the timing of critical events and controlling associated risks through treatment options are crucial aspects of healthcare. However, current methods fall short in optimizing dynamic treatment plans to improve clinical outcomes. A key challenge lies in modeling the intensity functions of critical events throughout disease progression and capturing the dynamic interactions between patient conditions and treatments. To address this, we propose integrating reinforcement learning with a Generative Adversarial Network (GAN) and a dual-Hawkes process model to develop intelligent agents capable of delivering personalized and adaptive treatment strategies. The dual-Hawkes process allows us to model the intensity of both disease progression and recovery, while accounting for long-term dependencies. The GAN simulates real-world clinical environments using raw time-to-event data, without requiring detailed treatment annotations. By interacting with GAN, our model-based reinforcement learning agent learns an optimal dynamic policy that leverages long-term historical dependencies. When applied to the MIMIC-III dataset, our approach significantly increased the duration that patients remained in a healthy state, outperforming established machine learning policies.

4904Minimax-optimal trust-aware multi-armed bandits

[openreview] [pdf]

Abstract Multi-armed bandit (MAB) algorithms have achieved significant success in sequential decision-making applications, under the premise that humans perfectly implement the recommended policy. However, existing methods often overlook the crucial factor of human trust in learning algorithms. When trust is lacking, humans may deviate from the recommended policy, leading to undesired learning performance. Motivated by this gap, we study the trust-aware MAB problem by integrating a dynamic trust model into the standard MAB framework. Specifically, it assumes that the recommended and actually implemented policy differs depending on human trust, which in turn evolves with the quality of the recommended policy. We establish the minimax regret in the presence of the trust issue and demonstrate the suboptimality of vanilla MAB algorithms such as the upper confidence bound (UCB) algorithm. To overcome this limitation, we introduce a novel two-stage trust-aware procedure that provably attains near-optimal statistical guarantees. A simulation study is conducted to illustrate the benefits of our proposed algorithm when dealing with the trust issue.

4905Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models

[openreview] [pdf]

Abstract Graph-structured data plays a vital role in numerous domains, such as social networks, citation networks, commonsense reasoning graphs and knowledge graphs. While graph neural networks have been employed for graph processing, recent advancements have explored integrating large language models for graph-based tasks. In this paper, we propose a novel approach named Learnable Graph Pooling Token (LGPT), which addresses the limitations of the scalability issues in node-level projection and information loss in graph-level projection. LGPT enables flexible and efficient graph representation by introducing learnable parameters that act as tokens in large language models, balancing fine-grained and global graph information. Additionally, we investigate an Early Query Fusion technique, which fuses query context before constructing the graph representation, leading to more effective graph embeddings. Our method achieves a 4.13% performance improvement on the GraphQA benchmark without training the large language model, demonstrating significant gains in handling complex textual-attributed graph data.

4906Regularizing Energy among Training Samples for Out-of-Distribution Generalization

[openreview] [pdf]

Abstract The energy-based model provides a unified framework for various learning models where an energy value is assigned to each configuration of random variables based on probability. Recently, different methods have been proposed to derive an energy value out of the logits of a classifier for out-of-distribution (OOD) detection or OOD generalization. However, these methods mainly focus on the energy difference between in-distribution and OOD data samples, neglecting the energy difference among in-distribution data samples. In this paper, we show that the energy among in-distribution data also requires attention. We propose to investigate the energy difference between in-distribution data samples. Both empirically and theoretically, we show that previous methods for subpopulation shift (\emph{e.g.}, long-tail classification) such as data re-weighting and margin control apply implicit energy regularization and we provide a unified framework from the energy perspective. With the influence function, we further extend the energy regularization framework to OOD generalization scenarios where the distribution shift is more implicit compared to the long-tail recognition scenario. We conduct experiments on long-tail datasets, subpopulation shift benchmarks, and OOD generalization benchmarks to show the effectiveness of the proposed energy regularization method. The source code will be made publically available.

4907MambaExtend: A Training-Free Approach to Improve Long Context Extension of Mamba

[openreview] [pdf]

Abstract The inherent quadratic complexity of the attention mechanism in transformer models has driven the research community to explore alternative architectures with sub-quadratic complexity, such as state-space models. Within this emerging paradigm, Mamba has established itself as a leading model, achieving state-of-the-art results in various language modeling benchmarks. However, despite its impressive performance, Mamba’s effectiveness is significantly limited by its pre-training context length, resulting in a pronounced degradation when the model is tasked with handling longer contexts. Our investigation reveals that Mamba’s inability to generalize effectively to long contexts is primarily due to the out-of-distribution (OOD) discretization steps. To address this critical limitation, we introduceMambaExtend, a novel framework designed to enhance context extension capabilities of Mamba. Specifically, MambaExtend leverages atraining-freeapproach to calibrateonlythe scaling factors of discretization modules for different layers. We demonstrate both gradient-based and gradient-free zeroth-order optimization to learn the optimal scaling factors for each Mamba layer, requiring orders of magnitude fewer updates as opposed to the parameter fine-tuning based alternatives. With this, for the first time, we can enable a training-free context extension of up to 32×\mathbf{32}\times from 2k to 64k, that too without any significant increase in perplexity. Compared to the existing alternative approach of fine-tuning, due to only selective calibration of the scaling factors, MambaExtend requires up to \mathord{\sim}5.42106×\mathbf{{5.42*10^{6}}}\times fewer parameter update costing up to 3.87×\mathbf{3.87}\times lower peak-memory while maintaining similar or better long-context performance evaluated across multiple tasks. Code will be released soon.

4908Language models scale reliably with over-training and on downstream tasks

[openreview] [pdf]

Abstract Scaling laws are useful guides for derisking expensive training runs, as they predict performance of large models using cheaper, small-scale experiments. However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime (i.e., “Chinchilla optimal” regime). In contrast, models are often over-trained to reduce inference costs. Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance. To address both shortcomings, we create a testbed of 104 models with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we fit scaling laws that extrapolate in both the amount of over-training and the number of model parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token run (i.e., 32×\times over-trained) and a 6.9B parameter, 138B token run (i.e., a compute-optimal run)––each from experiments that take 300×\times less compute. Second, we relate the perplexity of a language model to its downstream task performance by proposing a power law. We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models, using experiments that take 20×\times less compute.

4909RL2Grid: Benchmarking Reinforcement Learning in Power Grid Operations

[openreview] [pdf]

Abstract Reinforcement learning (RL) has the potential to transform power grid operations by providing adaptive, scalable controllers essential for decarbonization and grid resilience. However, despite their promise, today’s RL methods struggle to deal with complex dynamics, aleatoric uncertainty, long-horizon goals, and hard physical constraints, hindering their application in power grids and other real-world settings. In this work, we present RL2Grid, a benchmark representing realistic power grid operations that aims to foster the maturity of RL methods. This work builds upon Grid2Op, a power grid simulation framework developed by RTE France, to provide standardized tasks, state and action spaces, and rewards within a common interface, and thereby provide a common basis for monitoring and promoting progress. We evaluate and compare widely adopted RL algorithms across the increasingly complex grid settings represented within RL2Grid, establishing reference performance metrics and offering insights into the effectiveness of different approaches (including pure RL approaches and hybrid approaches incorporating heuristics). Our findings indicate that power grids present substantial challenges for modern RL, underscoring the need for novel methods capable of dealing with complex real-world physical systems.

4910Towards Foundation Models for Mixed Integer Linear Programming

[openreview] [pdf]

Abstract Mixed Integer Linear Programming (MILP) is essential for modeling complex decision-making problems but faces challenges in computational tractability and requires expert formulation. Current deep learning approaches for MILP focus on specific problem classes and do not generalize to unseen classes. To address this shortcoming, we take a foundation model training approach, where we train a single deep learning model on a diverse set of MILP problems to generalize across problem classes. As existing datasets for MILP lack diversity and volume, we introduce MILP-Evolve, a novel LLM-based evolutionary framework that is capable of generating a large set of diverse MILP classes with an unlimited amount of instances. We study our methodology on three key learning tasks that capture diverse aspects of MILP: (1) integrality gap prediction, (2) learning to branch, and (3) a new task of aligning MILP instances with natural language descriptions. Our empirical results show that models trained on the data generated by MILP-Evolve achieve significant improvements on unseen problems, including MIPLIB benchmarks. Our work highlights the potential of moving towards a foundation model approach for MILP that can generalize to a broad range of MILP applications. We are committed to fully open-sourcing our work to advance further research.

4911ADAM: An Embodied Causal Agent in Open-World Environments

[openreview] [pdf]

Abstract In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their interpretability and generalization capability. To this end, we introduce ADAM, An emboDied causal Agent in Minecraft, that can autonomously navigate the open world, perceive multimodal contexts, learn causal world knowledge, and tackle complex tasks through lifelong learning. ADAM is empowered by four key components: 1) an interaction module, enabling the agent to execute actions while documenting the interaction processes; 2) a causal model module, tasked with constructing an ever-growing causal graph from scratch, which enhances interpretability and diminishes reliance on prior knowledge; 3) a controller module, comprising a planner, an actor, and a memory pool, which uses the learned causal graph to accomplish tasks; 4) a perception module, powered by multimodal large language models, which enables ADAM to perceive like a human player. Extensive experiments show that ADAM constructs an almost perfect causal graph from scratch, enabling efficient task decomposition and execution with strong interpretability. Notably, in our modified Minecraft games where no prior knowledge is available, ADAM maintains its performance and shows remarkable robustness and generalization capability. ADAM pioneers a novel paradigm that integrates causal methods and embodied agents in a synergistic manner.

4912Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks

[openreview] [pdf]

Abstract Multimodal foundation models, such as Gemini and GPT-4, have revolutionized human-machine interactions by seamlessly integrating various forms of data. Developing a universal spoken language model that comprehends a wide range of natural language instructions is critical for bridging communication gaps and facilitating more intuitive interactions. However, the absence of a comprehensive evaluation benchmark poses a significant challenge. We present Dynamic-SUPERB Phase-2, an open and evolving benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks contributed collaboratively by the global research community, expanding the benchmark to a total of 180 tasks, making it the largest benchmark for speech and audio evaluation. While the first generation of Dynamic-SUPERB was limited to classification tasks, Dynamic-SUPERB Phase-2 broadens its evaluation capabilities by introducing a wide array of novel and diverse tasks, including regression and sequence generation, across speech, music, and environmental audio. Evaluation results indicate that none of the models performed well universally. SALMONN-13B excelled in English ASR, while WavLLM demonstrated high accuracy in emotion recognition, but current models still require further innovations to handle a broader range of tasks. We open-source all task data and the evaluation pipeline, which will be available after the paper is published.

4913Reinforcement Learning from Wild Animal Videos

[openreview] [pdf]

Abstract We propose to learn legged robot locomotion skills by watching thousands of wild animal videos from the internet, such as those featured in nature documentaries. Indeed, such videos offer a rich and diverse collection of plausible motion examples, which could inform how robots should move. To achieve this, we introduce Reinforcement Learning from Wild Animal Videos (RLWAV), a method to ground these motions into physical robots. We first train a video classifier on a large-scale animal video dataset to recognize actions from RGB clips of animals in their natural habitats. We then train a multi-skill policy to control a robot in a physics simulator, using the classification score of a third-person camera capturing videos of the robot’s movements as a reward for reinforcement learning. Finally, we directly transfer the learned policy to a real quadruped Solo. Remarkably, despite the extreme gap in both domain and embodiment between animals in the wild and robots, our approach enables the policy to learn diverse skills such as walking, jumping, and keeping still, without relying on reference trajectories nor hand-designed rewards.

4914Beatrix: Out-of-Distribution Generalization of Large EEG Model via Invariant Contrastive Fine-Tuning

[openreview] [pdf]

Abstract The advent of large-scale foundation models has revolutionized EEG analysis; however, their ability to generalize to Out-of-Distribution (OoD) brain signals remains limited due to the inherent variability in physiological states, individual differences, and experimental setups. To address these challenges, we introduce Beatrix, a novel spectral EEG foundation model that achieves state-of-the-art OoD generalization across diverse brain activity tasks. Beatrix leverages a unique analytic wavelet-based spectral tokenization that captures the intricate non-stationary dynamics of EEG signals, and employs a semi-causal generative modeling approach during pre-training, enabling it to learn expressive latent representations capable of both interpolation and extrapolation across temporal and frequency domains. For fine-tuning, we propose an innovative Contrastive Invariant Fine-Tuning (CIFT) method that enhances domain-invariant learning without the need for explicit environment labels, thus significantly improving OoD generalizability in a parameter-efficient manner. Our multi-view Transformer architecture further integrates both spectral and temporal information, allowing Beatrix to comprehensively model EEG signals across channels. Extensive experiments demonstrate that Beatrix consistently outperforms existing EEG models in tasks such as seizure detection and forecasting, auditory neural decoding, motor imagery, and sleep staging, showcasing its robustness and broad applicability. By achieving superior performance with reduced fine-tuning costs, Beatrix represents a significant advancement in the field of EEG foundation models.

4915Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis

[openreview] [pdf]

Abstract While recent zero-shot text-to-speech (TTS) models have significantly improved speech quality and expressiveness, mainstream systems still suffer from issues related to speech-text alignment modeling: 1) autoregressive large language models are inefficient and not robust in long-sentence inference; 2) non-autoregressive diffusion models without explicit speech-text alignment require substantial model capacity for alignment learning; 3) predefined alignment-based diffusion models suffer from limited expressiveness and a complicated inference pipeline. This paper introduces \textit{S-DiT}, a TTS system featuring an innovative sparse alignment algorithm that guides the latent diffusion transformer (DiT). Specifically, 1) we provide sparse alignment boundaries to S-DiT to reduce the difficulty of alignment learning without limiting the model’s expressiveness; 2) to simplify the overall pipeline, we propose a unified frontend language model (F-LM) training framework to cover various speech processing tasks required by TTS models. Additionally, we adopt the piecewise rectified flow technique to accelerate the generation process and employ a multi-condition classifier-free guidance strategy for accent intensity adjustment. Experiments demonstrate that S-DiT matches state-of-the-art zero-shot TTS speech quality while maintaining a more efficient pipeline. Moreover, our system can generate high-quality one-minute speech with only 8 sampling steps. Audio samples are available athttps://sditdemo.github.io/sditdemo/.

4916Hierarchical Graph Learners for Cardinality Estimation

[openreview] [pdf]

Abstract Cardinality estimation -- the task of estimating the number of records that a database query will return -- is core to performance optimization in modern database systems. Traditional optimizers used in commercial systems use heuristics that can lead to large errors. Recently, neural network based models have been proposed that outperform the traditional optimizers. These neural network based estimators perform well if they are trained with large amounts of query samples. In this work, we observe that data warehouse workloads contain highly repetitive queries, and propose a hierarchy of localized on-line models to target these repetitive queries. At the core, these models use an extension of Merkle-Trees to hash query plans which are directed acyclic graphs. The hash values can divisively partition a large set of graphs into many sets, each containing few (whole) graphs. We learn an online model for each partition of the hierarchy. No upfront training is needed; on-line models learn as the queries are executed. When a new query comes, we check the partitions it is hashed to and if no such local model was sufficiently confident along the hierarchy, we fall-back onto a default model at the root. Our experimental results show that not only our hierarchical on-line models perform better than the traditional optimizers, they also outperform neural models, with robust errors rates at the tail.

4917MEMREASONER: A MEMORY-AUGMENTED LANGUAGE MODEL ARCHITECTURE FOR MULTI-HOP REASONING

[openreview] [pdf]

Abstract Recent benchmarks suggest that there remains significant room to improve large language models’ ability to robustly reason across facts distributed in extremely long documents. In this work, we propose MemReasoner, a new memory-augmented LLM architecture that is trained to perform temporal reasoning, along with multiple computational steps, over the context stored in the memory. Experiments show that MemReasoner trained on the core reasoning facts generalizes better, when compared to off-the-shelf large language models and existing recurrent models, on a test distribution where the required facts are scattered across long natural text up to 128k tokens. Further, MemReasoner demonstrates robust reasoning performance relative to the baselines, when the answer distribution in test samples differs from that in the training set.

4918First-Step Advantage: Importance of Starting Right in Multi-Step Math Reasoning

[openreview] [pdf]

Abstract Language models can solve complex reasoning tasks better by learning to generate rationales for their predictions. Often these models know how to solve a task but their auto-regressive decoding nature leads to incorrect results if they start incorrectly. We observe that smaller models in particular, when corrected, can solve a task that they would have otherwise struggled with. We demonstrate this phenomenon by using a larger model to guide smaller models, which leads to significantly improved performance (up to+24points on the GSM8K dataset by 7B models). To assist smaller models in initiating the starting step, we propose QuestCoT, where a smaller model firstasks itself how to start, before proceeding with a chain of reasoning. On various multistep mathematical reasoning datasets over multiple smaller models, we show that getting the right start can lead to significant performance gains across all models (gains of up to+6points on GSM8K,+9on SVAMP,+5on ASDiv, and+7on MultiArith).

4919Subtask-Aware Visual Reward Learning from Segmented Demonstrations

[openreview] [pdf]

Abstract Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This paper introduces REDS: REward learning from Demonstration with Segmentations, a novel reward learning framework that leverages action-free videos with minimal supervision. Specifically, REDS employs video demonstrations segmented into subtasks from diverse sources and treats these segments as ground-truth rewards. We train a dense reward function conditioned on video segments and their corresponding subtasks to ensure alignment with ground-truth reward signals by minimizing the Equivalent-Policy Invariant Comparison distance. Additionally, we employ contrastive learning objectives to align video representations with subtasks, ensuring precise subtask inference during online interactions. Our experiments show that REDS significantly outperforms baseline methods on complex robotic manipulation tasks in Meta-World and more challenging real-world tasks, such as furniture assembly in FurnitureBench, with minimal human intervention. Moreover, REDS facilitates generalization to unseen tasks and robot embodiments, highlighting its potential for scalable deployment in diverse environments.

4920Pretrained Transformers are Deep Optimizers: Provable In-Context Learning for Deep Model Training

[openreview] [pdf]

Abstract We investigate the transformer’s capability for in-context learning (ICL) to simulate the training process of deep models. Our key contribution is providing a positive example of using a pretrained transformer to train a deep neural network by gradient descent in an implicit fashion via ICL. Specifically, we provide an explicit construction of a (2N+4)L(2N+4)L-layer transformer capable of simulating LL gradient descent steps of an NN-layer ReLU network through ICL. We also give the theoretical guarantees for the approximation within any given error and the convergence of the ICL gradient descent. Additionally, we extend our analysis to the more practical setting using Softmax-based transformers. We validate our findings on synthetic datasets for 3-layer, 4-layer, and 6-layer neural networks. The results show that ICL performance matches that of direct training.

4921Q-Supervised Contrastive Representation: A State Decoupling Framework for Safe Offline Reinforcement Learning

[openreview] [pdf]

Abstract Safe offline reinforcement learning (RL), which aims to learn the safety-guaranteed policy without risky online interaction with environments, has attracted growing recent attention for safety-critical scenarios. However, existing approaches encounter out-of-distribution problems during the testing phase, which can result in potentially unsafe outcomes. This issue arises due to the infinite possible combinations of reward-related and cost-related states. In this work, we proposeState Decoupling with Q-supervised Contrastive representation(SDQC), a novel framework that decouples the global observations into reward- and cost-related representations for decision-making, thereby improving the generalization capability for unfamiliar global observations. Compared with the classical representation learning methods, which typically require model-based estimation (e.g., bisimulation), we theoretically prove that our Q-supervised method generates a coarser representation while preserving the optimal policy, resulting in improved generalization performance. Experiments on DSRL benchmark problems provide compelling evidence that SDQC surpasses other baseline algorithms, especially for its exceptional ability to achieve almost zero violations in more than half of the tasks, while the state-of-the-art algorithm can only achieve the same level of success in a quarter of the tasks. Further, we demonstrate that SDQC possesses superior generalization ability when confronted with unseen environments.

4922Tropical Expressivity of Neural Networks

[openreview] [pdf]

Abstract We propose an algebraic geometric framework to study the expressivity of linear activation neural networks. A particular quantity of neural networks that has been actively studied is the number of linear regions, which gives a quantification of the information capacity of the architecture. To study and evaluate information capacity and expressivity, we work in the setting of tropical geometry---a combinatorial and polyhedral variant of algebraic geometry---where there are known connections between tropical rational maps and feedforward neural networks. Our work builds on and expands this connection to capitalize on the rich theory of tropical geometry to characterize and study various architectural aspects of neural networks. Our contributions are threefold: we provide a novel tropical geometric approach to selecting sampling domains among linear regions; an algebraic result allowing for a guided restriction of the sampling domain for network architectures with symmetries; and a new open source OSCAR library to analyze neural networks symbolically using their tropical representations, where we present a new algorithm that computes the exact number of their linear regions. We provide a comprehensive set of proof-of-concept numerical experiments demonstrating the breadth of neural network architectures to which tropical geometric theory can be applied to reveal insights on expressivity characteristics of a network. Our work provides the foundations for the adaptation of both theory and existing software from computational tropical geometry and symbolic computation to neural networks and deep learning.

4923Scattered Forest Search: Smarter Code Space Exploration with LLMs

[openreview] [pdf]

Abstract We propose a novel approach to scaling LLM inference for code generation. We frame code generation as a black box optimization problem within the code space, and employ optimization-inspired techniques to enhance exploration. Specifically, we introduce SCATTERED FOREST SEARCH to enhance solution diversity while searching for solutions. Our theoretical analysis illustrates how these methods avoid local optima during optimization. Extensive experiments on HumanEval, MBPP, APPS, CodeContests, and Leetcode reveal significant performance improvements. For instance, our method achieves a pass@1 rate of 67.1% on HumanEval+ and 87.2% on HumanEval with GPT-3.5, marking improvements of 8.6% and 4.3% over the state-of-the-art, while also halving the iterations needed to find the correct solution. Furthermore, our method scales more efficiently than existing search techniques, including tree search, line search, and repeated sampling.

4924Shared Memory for Multi-agent Lifelong Pathfinding

[openreview] [pdf]

Abstract Multi-agent reinforcement learning (MARL) demonstrates significant progress in solving cooperative and competitive multi-agent problems in various environments. One of the main challenges in MARL is the need to explicitly predict other agents’ behavior to achieve cooperation. As a solution to this problem, we propose the Shared Recurrent Memory Transformer (SRMT), which extends memory transformers to multi-agent settings by pooling and globally broadcasting individual working memories, enabling agents to implicitly exchange information and coordinate actions. We evaluate SRMT on the Partially Observable Multi-Agent Path Finding problem, both in a toy bottleneck navigation task requiring agents to pass through a narrow corridor and on a set of mazes from the POGEMA benchmark. In the bottleneck task, SRMT consistently outperforms a range of reinforcement learning baselines, especially under sparse rewards, and generalizes effectively to longer corridors than those seen during training. On POGEMA maps, including Mazes, Random, and Warehouses, SRMT is competitive with a variety of recent MARL, hybrid, and planning-based algorithms. These results suggest that incorporating shared memory into transformer-based architectures can enhance coordination in decentralized multi-agent systems.

4925Cascaded Learned Bloom filter for Optimal Model-Filter Size Balance and Fast Rejection

[openreview] [pdf]

Abstract Recent studies have demonstrated that learned Bloom filters, which combine machine learning with the classical Bloom filter, can achieve superior memory efficiency. However, existing learned Bloom filters face two critical unresolved challenges: the balance between the machine learning model size and the Bloom filter size is not optimal, and the reject time cannot be minimized effectively. We propose the Cascaded Learned Bloom Filter (CLBF) to address these issues. Our optimization approach based on dynamic programming automatically selects configurations that achieve an optimal balance between the model and filter sizes while minimizing reject time. Experiments with real-world datasets show that CLBF reduces memory usage by up to 24% and decreases reject time by up to 14 times compared to the state-of-the-art learned Bloom filter.

4926Large Language Models Are Stronger Entropy Models for Transform Coding

[openreview] [pdf]

Abstract Large language models (LLMs) have shown promising advancements in lossless compression due to their excellent next-token prediction capabilities. However, there is a gap between LLM-based compressors and classical transform-based codecs. Existing LLM-based compressors function solely as entropy coders, focusing on compressing redundant data in the raw domain. In contrast, classical codecs typically transform raw data into more compact features in the latent domain before applying entropy coding. But LLM-based compressors have not discussed this case. To our knowledge, this is the first work to introduce an LLM-based entropy model for transform coding. Specifically, we propose a simple yet effective fine-tuning strategy, tested across various codecs for both images and speeches. With less than 2% parameters are fine-tuned, the LLMs can serve as highly effective entropy models for well-established transform-based compression codecs. For instance, LLaMA3-8B paired with arithmetic coding compresses latent image codes on Kodak to 4.62% and speech codes on LibriTTS to 42.53% of their transformed sizes after fine-tuning. Our proposed methods achieve notable BD-rate improvements of 54.07% over JPEG, 17.61% over VQGAN, and 34.61% over SpeechTokenizer. These findings highlight the great potential of integrating LLMs into codecs to significantly improve coding efficiency. Source codes will be released upon acceptance.

4927Synergistic Weak-Strong Collaboration by Aligning Preferences

[openreview] [pdf]

Abstract Current Large Language Models (LLMs) demonstrate exceptional general reasoning and problem-solving abilities but often struggle with specialized tasks or domains requiring proprietary information due to their generalized training and size constraints. Fine-tuning large models for every specific domain is impractical because of inaccessibility to black-box model parameters and high computational costs. We explore a solution to this challenge: can a collaborative framework between a specialized weak model and a general strong model effectively extend LLMs’ capabilities to niche but critical tasks? We propose a dynamic interaction where the weak model, tailored to specific domains, generates detailed initial drafts and background information, while the strong model refines and enhances these drafts using its advanced reasoning skills. To optimize this collaboration, we introduce a feedback loop by fine-tuning the weak model based on the strong model’s preferences, fostering an adaptive and synergistic relationship. We validate our framework through experiments on three datasets. We find that the collaboration significantly outperforms each model alone by leveraging complementary strengths. Moreover, fine-tuning the weak model with strong model’s preference further enhances overall performance. Our collaborative approach achieves an average F1 score improvement of 3.24% over the weak model alone and 12.17% over the strong model alone across all benchmarks.

4928PPT: Patch Order Do Matters In Time Series Pretext Task

[openreview] [pdf]

Abstract Recently, patch-based models have been widely discussed in time series analysis. However, existing pretext tasks for patch-based learning, such as masking, may not capture essential time and channel-wise patch interdependencies in time series data, presumed to result in subpar model performance. In this work, we introducePatch order-aware Pretext Task (PPT), a new self-supervised patch order learning pretext task for time series classification. PPT exploits the intrinsic sequential order information among patches across time and channel dimensions of time series data, where model training is aided by channel-wise patch permutations. The permutation disrupts patch order consistency across time and channel dimensions with controlled intensity to provide supervisory signals for learning time series order characteristics. To this end, we propose two patch order-aware learning methods: patch order consistency learning, which quantifies patch order correctness, and contrastive learning, which distinguishes weakly permuted patch sequences from strongly permuted ones. With patch order learning, we observe enhanced model performance, e.g., improving up to 7% accuracy for the supervised cardiogram task and outperforming mask-based learning by 5% in the self-supervised human activity recognition task. We also propose ACF-CoS, an evaluation metric that measures theimportance of ordernessfor time series datasets, which enables pre-examination of the efficacy of PPT in model training.

4929Learning to Filter Outlier Edges in Global SfM

[openreview] [pdf]

Abstract This paper introduces a novel approach to improve camera position estimation in global Structure-from-Motion (SfM) frameworks by filtering inaccurate pose graph edges, representing relative translation estimates, before applying translation averaging. In SfM, pose graph vertices represent cameras and edges relative poses (rotation and translation) between cameras. We formulate the edge filtering problem as a vertex filtering in the dual graph -- a line graph where the vertices stem from edges in the original graph, and the edges from cameras. Exploiting such a representation, we frame the problem as a binary classification over nodes in the dual graph. To learn such a classification and find outlier edges, we employ a Transformer architecture-based technique. To address the challenge of memory overflow often caused by converting to a line graph, we introduce a clustering-based graph processing approach, enabling the application of our method to arbitrarily large pose graphs. The proposed method outperforms existing relative translation filtering techniques in terms of final camera position accuracy and can be seamlessly integrated with any other filter. The code will be made public.

4930Riemann-Lebesgue Forest for Regression

[openreview] [pdf]

Abstract We propose a novel ensemble method called Riemann-Lebesgue Forest (RLF) for regression. The core idea in RLF is to mimic the way how a measurable function can be approximated by partitioning its range into a few intervals. With this idea in mind, we develop a new tree learner named Riemann-Lebesgue Tree (RLT) which has a chance to perform Lebesgue type cutting,i.e splitting the node from response Y at certain non-terminal nodes. In other words, we introduce the “splitting type randomness” in our ensemble method. We show that the optimal Lebesgue type cutting results in larger variance reduction in response Y than ordinary CART cutting (an analogue of Riemann partition). Such property is beneficial to the ensemble part of RLF. We also generalize the asymptotic normality of RLF under different parameter settings. Two one-dimensional examples are provided to illustrate the flexibility of RLF. The competitive performance of RLF against original random forest is demonstrated by experiments in simulation data and real world datasets.

4931Learning Parameter Sharing with Tensor Decompositions and Sparsity

[openreview] [pdf]

Abstract Large neural networks achieve remarkable performance, but their size hinders deployment on resource-constrained devices. While various compression techniques exist, parameter sharing remains relatively unexplored. This paper introduces Sparsity-enabled Parameter Sharing (SParS), a novel algorithm that leverages the relationship between parameter sharing, tensor decomposition, and sparsity to efficiently compress large vision transformer models. SParS employs a shared base and sparse factors to represent shared neurons across multilayer perceptrons (MLP). Shared parameterization is initialized via Singular Value Decomposition (SVD) and optimized by minimizing block-wise reconstruction error. Experiments demonstrate that SParS compresses DeiT-B and Swin-L MLPs to 25–40% of their original parameter count while maintaining accuracy within 1 percentage point of the original models.

4932Derivatives Are All You Need For Learning Physical Models

[openreview] [pdf]

Abstract Physics-Informed Neural Networks (PINNs) explicitly incorporate Partial Differential Equations (PDEs) into the loss function, thus learning representations that are inherently consistent with the physical system. We claim that it is possible to learn physically consistent models without explicit knowledge about the underlying equations. We propose Derivative Learning (DERL) to model a physical system by learning its partial derivatives, as they contain all the necessary information to determine the system’s dynamics. Like in PINNs, we also train the learning model on the initial and boundary conditions of the system. We provide theoretical guarantees that our approach learns the true solution and is consistent with the underlying physical laws, even when using empirical derivatives. DERL outperforms PINNs and other state-of-the-art approaches in tasks ranging from simple dynamical systems to PDEs. Finally, we show that distilling the derivatives enables the transfer of physical information from one model to another. Distillation of higher-order derivatives improves physical consistency. Ultimately, learning and distilling the derivatives of physical systems turns out to be a powerful tool to learn physical models.

4933Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues

[openreview] [pdf]

Abstract Linear Recurrent Neural Networks (LRNNs), such as Mamba, RWKV, GLA, mLSTM, and DeltaNet have emerged as efficient alternatives to transformers in large language modeling, offering linear scaling with sequence length and improved training efficiency. However, LRNNs struggle with state-tracking which is important for, e.g., code comprehension or tracking chess pieces across a board. Even parity, the simplest state-tracking task, which non-linear RNNs like LSTMs handle effectively, cannot be solved by current LRNNs. Recently, Sarrof et al. (2024) demonstrated that the failure of LRNNs like Mamba to solve parity stems from restricting the eigenvalue range of their diagonal state-transition matrices to [0,1][0, 1], and that incorporating negative eigenvalues can resolve this issue. We generalize this result to full matrix LRNNs, which have recently shown promise in models such as DeltaNet. We prove that no finite-precision LRNN with state-transition matrices having only positive eigenvalues can solve parity, while complex eigenvalues are needed to count modulo 3. Notably, we also prove that LRNNs can learn any regular language when their state-transition matrices are products of identity plus vector outer product matrices with eigenvalues in the range [1,1][-1, 1]. Our empirical results confirm that extending the eigenvalue range of models like Mamba and DeltaNet to include negative values not only enables them to solve parity but consistently improves their performance on state-tracking tasks. Furthermore, pre-training LRNNs with an extended eigenvalue range for language modeling achieves comparable performance and stability while showing promise for coding tasks. Our work enhances the expressivity of modern LRNNs, broadening their applicability without changing the cost of training or inference.

4934BiLO: Bilevel Local Operator Learning for PDE inverse problems

[openreview] [pdf]

Abstract We propose a new neural network based method for solving inverse problems for partial differential equations (PDEs) by formulating the PDE inverse problem as a bilevel optimization problem. At the upper level, we minimize the data loss with respect to the PDE parameters. At the lower level, we train a neural network to locally approximate the PDE solution operator in the neighborhood of a given set of PDE parameters, which enables an accurate approximation of the descent direction for the upper level optimization problem. The lower level loss function includes the L2 norms of both the residual and its derivative with respect to the PDE parameters. We apply gradient descent simultaneously on both the upper and lower level optimization problems, leading to an effective and fast algorithm. The method, which we refer to as BiLO (Bilevel Local Operator learning), is also able to efficiently infer unknown functions in the PDEs through the introduction of an auxiliary variable. We demonstrate that our method enforces strong PDE constraints, is robust to sparse and noisy data, and eliminates the need to balance the residual and the data loss, which is inherent to soft PDE constraints.

4935Relaxing Representation Alignment with Knowledge Preservation for Multi-Modal Continual Learning

[openreview] [pdf]

Abstract In continual learning, developing robust representations that adapt to new distributions or classes while retaining prior knowledge is crucial. While most traditional approaches focus on single-modality data, multi-modal learning offers significant advantages by leveraging diverse sensory inputs, akin to human perception. However, transitioning to multi-modal continual learning introduces additional challenges as the model needs to effectively combine new information from different modalities while avoiding catastrophic forgetting. In this work, we propose a relaxed cross-modality representation alignment loss and utilize a dual-learner framework to preserve the relation between previously learned representations. We validate our framework using several multi-modal datasets that encompass various types of input modalities. Results show that we consistently outperform baseline continual learning methods in both class incremental and domain incremental learning scenarios. Further analysis highlights the effectiveness of our solution in preserving prior knowledge while incorporating new information.

4936Exploring Edge Probability Graph Models Beyond Edge Independency: Concepts, Analyses, and Algorithms

[openreview] [pdf]

Abstract Desirable random graph models (RGMs) should(i)generaterealisticstructures such as high clustering (i.e., high subgraph densities),(ii)generatevariable(i.e., not overly similar) graphs, and(iii)remaintractableto compute and control graph statistics. A common class of RGMs (e.g., Erd\H{o}s-R’{e}nyi and stochastic Kronecker) outputs edge probabilities, and we need to realize (i.e., sample from) the edge probabilities to generate graphs. Typically, each edge’s existence is assumed to be determined independently for simplicity and tractability. However, with edge independency, RGMs theoretically cannot produce high subgraph densities and high output variability simultaneously. In this work, we explore realization beyond edge independence that can produce more realistic structures while maintaining high traceability and variability. Theoretically, we propose an edge-dependent realization framework calledbindingthat provably preserves output variability, and deriveclosed-formtractability results on subgraph (e.g., triangle) densities in generated graphs. Practically, we propose algorithms for graph generation with binding and parameter fitting of binding. Our empirical results demonstrate that binding exhibits high tractability and generates realistic graphs with high clustering, significantly improving upon existing RGMs assuming edge independency.

4937Action Typicality and Uniqueness Learning for Zero-Shot Video Anomaly Detection

[openreview] [pdf]

Abstract Zero-Shot Video Anomaly Detection (ZS-VAD) is an urgent task in scenarios where the target video domain lacks training data due to various concerns, \emph{e.g.}, data privacy. The skeleton-based approach is a promising way to achieve ZS-VAD as it eliminates domain disparities in both background and human appearance. However, existing methods only learn low-level skeleton representation and rely on the domain-specific normality boundary, which cannot generalize well to new scenes with different normal and abnormal behavior patterns. In this paper, we propose a novel skeleton-based zero-shot video anomaly detection framework, which captures both scene-generic typical anomalies and scene-adaptive unique anomalies. Firstly, we introduce a language-guided typicality modeling module that projects skeleton snippets into action semantic space and learns generalizable typical distributions of normal and abnormal behavior. Secondly, we propose a test-time context uniqueness analysis module to finely analyze the spatio-temporal differences between skeleton snippets and then derive scene-adaptive boundaries. Without using any training samples from the target domain, our method achieves state-of-the-art results on four large-scale VAD datasets: ShanghaiTech, UBnormal, NWPU, and UCF-Crime. The Code will be publicly available.

4938On Large Language Model Continual Unlearning

[openreview] [pdf]

Abstract While large language models have demonstrated impressive performance across various domains and tasks, their security issues have become increasingly severe. Machine unlearning has emerged as a representative approach for model safety and security, removing the influence of undesired data on the target model. However, these methods do not sufficiently consider that unlearning requests in real-world scenarios are continuously emerging, especially in the context of LLMs, which may lead to accumulated model utility loss that eventually becomes unacceptable. Moreover, existing LLM unlearning methods often ignore previous data access limitations due to privacy concerns and copyright protection. Without previous data, the utility preservation during unlearning is much harder. To overcome these challenges, we propose the O3 framework that includes an \underline{\textit{O}}rthogonal low-rank adapter (LoRA) for continually unlearning requested data and an \underline{\textit{O}}ut-\underline{\textit{O}}f-Distribution (OOD) detector to measure the similarity between input and unlearning data. The orthogonal LoRA achieves parameter disentanglement among continual unlearning requests. The OOD detector is trained with a novel contrastive entropy loss and utilizes a glocal-aware scoring mechanism. During inference, our O3 framework can decide whether and to what extent to load the unlearning LoRA based on the OOD detector’s predicted similarity between the input and the unlearned knowledge. Notably, O3’s effectiveness does not rely on any retained data. We conducted extensive experiments on O3 and state-of-the-art LLM unlearning methods across three tasks and seven datasets. The results indicate that O3 consistently achieves the best unlearning effectiveness and utility preservation, especially when facing continuous unlearning requests. The source codes can be found at \url{https://anonymous.4open.science/r/O3-A02B}.

4939Instance-Level Smoothing for Enhanced Privacy in Deep Learning: Theoretical Insights and Empirical Validation

[openreview] [pdf]

Abstract In this paper, we address the dual challenge of maintaining high accuracy and ensuring fairness in differentially private (DP) deep learning models. The optimization process is inherently complicated by the necessity of injecting random noise and limiting training iterations, particularly for over-parameterized models. Moreover, DP mechanisms frequently exacerbate accuracy disparities across subpopulations, complicating the balance between privacy and fairness. To tackle these challenges, we introduce a novel framework that systematically addresses the trade-off between privacy and utility in DP deep learning. At the core of our approach is the concept of instance-level smoothing, which enhances privacy protections without compromising performance. Our theoretical contributions include deep insights into sample complexity, instance-level smoothing factors, and error bounds required to achieve a given privacy budget. These insights provide a robust foundation for optimizing the delicate balance between privacy and utility. Our method demonstrates remarkable robustness, independent of iteration counts, model parameters, batch normalization processes, and subpopulation disparities. This flexibility enables an optimal balance between privacy preservation and utility, adaptable to a wide range of scenarios. Through extensive empirical studies on the large-scale medical imaging dataset CheXpert, we validate the effectiveness of our approach. Our findings align with theoretical predictions, showing that our method can effectively meet stringent privacy requirements while maintaining high performance. By bridging the gap between formal privacy guarantees and practical deep learning applications, our work lays the groundwork for future advancements in the field. This research empowers practitioners to protect sensitive data during model training and ensures both data privacy and model generality, paving the way for more secure and equitable AI systems.

4940Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation

[openreview] [pdf]

Abstract Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by fast rigid-body dynamics. Simulation techniques for soft bodies are comparatively several orders of magnitude slower, thereby limiting the use of RL due to sample complexity requirements. To address this challenge, this paper presents both a novel RL algorithm and a simulation platform to enable scaling RL on tasks involving rigid bodies and deformables. We introduce Soft Analytic Policy Optimization (SAPO), a maximum entropy first-order model-based actor-critic RL algorithm which uses first-order analytic gradients from differentiable simulation to train a stochastic actor to maximize expected return and entropy. Alongside our approach, we develop Rewarped, a parallel differentiable multiphysics simulation platform that supports simulating various materials beyond rigid bodies. We re-implement challenging manipulation & locomotion tasks in Rewarped, and show that SAPO outperforms baselines over a range of tasks that involve interaction between rigid bodies, articulations, and deformables.

4941Video Generation with Learned Action Prior

[openreview] [pdf]

Abstract Long-term stochastic video generation remains challenging, especially with moving cameras. This scenario introduces complex interactions between camera movement and observed pixels, resulting in intricate spatio-temporal dynamics and partial observability issues. Current approaches often focus on pixel-level image reconstruction, neglecting explicit modeling of camera motion dynamics. Our proposed solution incorporates camera motion or action as an extended part of the observed image state, employing a multi-modal learning framework to simultaneously model both image and action. We introduce three models: (i) Video Generation with Learning Action Prior (VG-LeAP) that treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; (ii) Causal-LeAP, which establishes a causal relationship between action and the observed image frame, and learns a seperate action prior, conditioned on the observed image states along with the image prior; and (iii) RAFI, which integrates the augmented image-action state concept with a conditional flow matching framework, demonstrating that this action-conditioned image generation concept can be extended to other transformer-based architectures. Through comprehensive empirical studies on robotic video dataset, RoAM, we highlight the importance of multi-modal training in addressing partially observable video generation problems.

4942Fine-Tuning Attention Modules Only: Enhancing Weight Disentanglement in Task Arithmetic

[openreview] [pdf]

Abstract For the past several years,task arithmetichas gained increasing attention. This approach edits pre-trained models directly in weight space by combining the fine-tuned weights of various tasks into aunified model. Its efficiency and cost-effectiveness stem from its training-free combination, contrasting with traditional methods that require model training on large datasets for multiple tasks. However, applying such a unified model to individual tasks can lead to interference from other tasks (lack ofweight disentanglement). To address this issue, Neural Tangent Kernel (NTK) linearization has been employed to leverage a ‘‘kernel behavior’’, facilitating weight disentanglement and mitigating adverse effects from unrelated tasks. Despite its benefits, NTK linearization presents drawbacks, including doubled training costs, as well as reduced performance of individual models. To tackle this problem, we propose a simple yet effective and efficient method that is to finetune the attention modules only in the Transformer. Our study reveals that the attention modules exhibit kernel behavior, and fine-tuning the attention modules only significantly improves weight disentanglement. To further understand how our method improves the weight disentanglement of task arithmetic, we present a comprehensive study of task arithmetic by differentiating the role of the representation module and task-specific module. In particular, we find that the representation module plays an important role in improving weight disentanglement whereas the task-specific modules such as the classification heads can degenerate the weight disentanglement performance.

4943Ferret: Federated Full-Parameter Tuning at Scale for Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) have become indispensable in numerous real-world applications. Unfortunately, fine-tuning these models at scale, especially in federated settings where data privacy and communication efficiency are critical, presents significant challenges. Existing methods often resort to parameter-efficient fine-tuning (PEFT) to mitigate communication overhead, but this typically comes at the cost of model accuracy. To address these limitations, we propose federated full-parameter tuning at scale for LLMs (Ferret), the first first-order method with shared randomness to enable scalable full-parameter tuning of LLMs across decentralized data sources while maintaining competitive model accuracy. Ferret accomplishes this through three aspects: (1) it employs widely applied first-order methods for efficient local updates; (2) it projects these updates into a low-dimensional space to considerably reduce communication overhead; and (3) it reconstructs local updates from this low-dimensional space with shared randomness to facilitate effective full-parameter global aggregation, ensuring fast convergence and competitive final performance. Our rigorous theoretical analyses and insights along with extensive experiments, show that Ferret significantly enhances the scalability of existing federated full-parameter tuning approaches by achieving high computational efficiency, reduced communication overhead, and fast convergence, all while maintaining competitive model accuracy.

4944A Closer Look at Machine Unlearning for Large Language Models

[openreview] [pdf]

Abstract Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Due to the high cost of retraining from scratch, researchers attempt to employ machine unlearning to remove specific content from LLMs while preserving the overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches. To address the issue of inadequate evaluation of model outputs after unlearning, we introduce three additional metrics to evaluate token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods into untargeted and targeted, and discuss their issues respectively. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, and existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose using the objective of maximizing entropy (ME) for untargeted unlearning and incorporate answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios, i.e., fictitious unlearning, continual unlearning, and real-world unlearning, demonstrate the effectiveness of our approaches.

4945Time After Time: Scalable Effect Estimation for Interventions on When and What to do

[openreview] [pdf]

Abstract Decision support in fields such as healthcare and finance requires reasoning about treatment timing. Artificial Intelligence holds great potential for supporting such decisions by estimating the causal effect of policies such as medication regimens, or resource allocation schedules. However, existing methods for effect estimation are limited in their ability to handle \emph{irregular time}. While treatments and observations in data are often irregularly spaced across the timeline, existing techniques either discretize time, do not scale gracefully to large models, or disregard the effect of treatment time.We present a solution for effect estimation of sequential treatment times called Earliest Disagreement Q-Evaluation (EDQ). The method is based on Dynamic Programming and is compatible with flexible sequence models, such as transformers. It provides accurate estimates under the assumptions of ignorability, overlap, and no-instantaneous effects. We validate the approach through experiments on a survival time prediction task.

4946The Uncanny Valley: Exploring Adversarial Robustness from a Flatness Perspective

[openreview] [pdf]

Abstract Flatness of the loss surface not only correlates positively with generalization, but is also related to adversarial robustness, since perturbations of inputs relate non-linearly to perturbations of weights. In this paper, we empirically analyze the relation between adversarial examples and relative flatness with respect to the parameters of one layer. We observe a peculiar property of adversarial examples in the context of relative flatness: during an iterative first-order white-box attack, the flatness of the loss surface measured around the adversarial examplefirstbecomes sharper until the label is flipped, but if we keep the attack running, it runs into a flatuncanny valleywhere the label remains flipped. In extensive experiments, we observe this phenomenon across various model architectures and datasets, even for adversarially trained models. Our results also extend to large language models (LLMs), but due to the discrete nature of the input space and comparatively weak attacks, adversarial examples rarely reach truly flat regions. Most importantly, this phenomenon shows that flatness alone cannot explain adversarial robustness unless we can also guarantee the behavior of the function around the examples. We therefore theoretically connect relative flatness to adversarial robustness by bounding the third derivative of the loss surface, underlining the need for flatness in combination with a low global Lipschitz constant for a robust model.

4947Network-based Active Inference for Adaptive and Cost-efficient Real-World Applications: A Benchmark Study of a Valve-turning Task Against Deep Reinforcement Learning

[openreview] [pdf]

Abstract This paper introduces Network-based Active Inference (NetAIF), a novel approach that integrates Active Inference (AIF) principles with network dynamics to enable adaptive, cost-efficient real-world applications. In benchmark tests against Deep Reinforcement Learning (DRL), NetAIF outperforms DRL in both computational efficiency and task performance. Leveraging random attractor dynamics, NetAIF generates real-time trajectories, allowing robots to adapt to complex, dynamic environments without the need for extensive pre-training. We demonstrate NetAIF’s superiority in industrial valve manipulation, achieving over 99% accuracy in goal position and orientation in untrained dynamic environments, with a 45,000-fold reduction in computational costs. NetAIF is approximately 100,000 times more efficient in iteration count than DRL, making it a highly robust and efficient solution for industrial applications.

4948Synergistic Approach for Simultaneous Optimization of Monolingual, Cross-lingual, and Multilingual Information Retrieval

[openreview] [pdf]

Abstract Information retrieval across different languages is an increasingly important challenge in natural language processing. Recent approaches based on multilingual pre-trained language models have achieved remarkable success, yet they often optimize for either monolingual, cross-lingual, or multilingual retrieval performance at the expense of others. This paper proposes a novel hybrid batch training strategy to simultaneously improve zero-shot retrieval performance across monolingual, cross-lingual, and multilingual settings while mitigating language bias. The approach fine-tunes multilingual language models using a mix of monolingual and cross-lingual question-answer pair batches sampled based on dataset size. Experiments on XQuAD-R, MLQA-R, and MIRACL benchmark datasets show that the proposed method consistently achieves comparable or superior results in zero-shot retrieval across various languages and retrieval tasks compared to monolingual-only or cross-lingual-only training. Hybrid batch training also substantially reduces language bias in multilingual retrieval compared to monolingual training. These results demonstrate the effectiveness of the proposed approach for learning language-agnostic representations that enable strong zero-shot retrieval performance across diverse languages.

4949RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval

[openreview] [pdf]

Abstract Transformer-based Large Language Models (LLMs) have become increasingly important. However, due to the quadratic time complexity of attention computation, scaling LLMs to longer contexts incurs extremely slow inference speed and high GPU memory consumption for caching key-value (KV) vectors. This paper proposes RetrievalAttention, a training-free approach to both accelerate attention computation and reduce GPU memory consumption. By leveraging the dynamic sparsity of attention mechanism, RetrievalAttention proposes to build approximate nearest neighbour search (ANNS) indexes for KV vectors in CPU memory and retrieve the most relevant ones through vector search during generation. Unfortunately, we observe that the off-the-shelf ANNS indexes are often ineffective for such retrieval tasks due to the out-of-distribution (OOD) between query vectors and key vectors in the attention mechanism. RetrievalAttention addresses the OOD challenge by designing an attention-aware vector search algorithm that can adapt to the distribution of query vectors. Our evaluation demonstrates that RetrievalAttention achieves near full attention accuracy while only requiring access to 1–3% of the data. This leads to a significant reduction in the inference cost of long-context LLMs, with a much lower GPU memory footprint. In particular, RetrievalAttention only needs a single NVIDIA RTX4090 (24GB) to serve 128K tokens for LLMs with 8B parameters, which is capable of generating one token in 0.188 seconds.

4950Pretraining A Shared Q-Network for Data Efficient Offline Reinforcement Learning

[openreview] [pdf]

Abstract Recent breakthroughs in supervised learning domains such as computer vision and natural language processing follow the consistent paradigm: pretrain a neural network with a large dataset and fine-tune it onto downstream tasks with a relatively small dataset. Offline reinforcement learning (RL) can be an alternative approach for learning the best policy with the static dataset in sequential decision-making problems, akin to supervised learning. Following the paradigm, previous works have focused on constructing a large dataset or pretraining networks with the static dataset and fine-tuning them with online interactions. However, it is still vague that offline RL can exhibit data efficiency, e.g. robustness to static dataset size. In this paper, we propose a simple yet effective plug-and-play method that pretrains a Q-network under an offline RL scheme, improving task performance and data efficiency. Our method consists of two core functionalities: Transforming the Q-network structure to a shared network architecture and pretraining weights of the shared network by a supervised regression task that predicts the forward dynamics of a task. We provide an analysis of how our method enables improved performance even in a small dataset in terms of the projected Bellman equation. We also empirically demonstrate that the proposed method improves the performance of existing popular offline RL methods on the D4RL and Robomimic benchmarks with an average improvement of 135.94% on the D4RL benchmark. Moreover, we demonstrate the proposed method boosts data efficiency in offline RL with varying data collection strategies.

4951Learnability of Discrete Dynamical Systems under High Classification Noise

[openreview] [pdf]

Abstract Due to the important role of discrete dynamical systems in modeling real-world cascading phenomena on networks, problems for learning such systems have garnered considerable attention in ML. However, existing studies on this topic typically assume that the training data is noise-free, an assumption that is often impractical. In this work, we address this gap by investigating a more realistic and challenging setting: learning discrete dynamical systems from data contaminated with noise. Towards this end, we present efficient noise-tolerant learning algorithms that provide provable performance guarantees under the PAC model, and establish tight bounds on sample complexity. We show that, even in the presence of noise, the proposed learner only needs a small training set to infer a system. Notably, the number of training samples required by the algorithm in the noisy setting is the same (to within a constant factor) as the information-theoretic upper bound in the noise-free scenario. Further, the number of noisy training samples used by the algorithm is only a logarithmic factor higher than the best-known lower bound. Through experimental studies, we evaluate the empirical performance of the algorithms on both synthetic and real-world networks.

4952CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring (the Lack of) Cultural Knowledge of LLMs

[openreview] [pdf]

Abstract To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs’ cultural knowledge, covering 45 global regions including the underrepresented ones like Bangladesh, Zimbabwe, and Peru. Questions - each verified by five independent annotators - span 17 diverse topics ranging from food preferences to greeting etiquettes. We evaluate models on two setups: CulturalBench-Easy and CulturalBench-Hard which share the same questions but asked differently. We find that LLMs are sensitive to such difference in setups (e.g., GPT-4o with 27.3% difference). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs with the best performing model (GPT-4o) at only 61.5% and the worst (Llama3-8b) at 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge to a single answer. Our results also indicate that OpenAI GPT-4o substantially outperform other proprietary and open source models in questions related to all but one region (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.

4953Scaling Instruction-tuned LLMs to Million-token Contexts via Hierarchical Synthetic Data Generation

[openreview] [pdf]

Abstract Large Language Models (LLMs) struggle with long-context reasoning, not only due to the quadratic scaling of computational complexity with sequence length but also because of the scarcity and expense of annotating long-context data. There has been barely any open-source work that systematically ablates long-context data, nor is there any openly available instruction tuning dataset with contexts surpassing 100K tokens. To bridge this gap, we introduce a novel post-training synthetic data generation strategy designed to efficiently extend the context window of LLMs while preserving their general task performance. Our approach scalably extends to arbitrarily long context lengths, unconstrained by the length of available real-world data, which effectively addresses the scarcity of raw long-context data. Through a step-by-step rotary position embedding (RoPE) scaling training strategy, we demonstrate that our model, with a context length of up to 1M tokens, performs well on the RULER benchmark and InfiniteBench and maintains robust performance on general language tasks.

4954Aligning to Constraints for Data-Efficient Language Model Customization

[openreview] [pdf]

Abstract General-purpose language models (LMs) are aligned to diverse user intents, but fall short when it comes to specific applications. While finetuning is the default method for customized alignment, human annotations are often unavailable in various customization scenarios. Based on the observation that one of the main issues of LM customization is constraint adherence, we investigate the feasibility of using constraints as a bridge from general LMs to customized ones. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collect preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs’ capability to adhere to different classes of constraints, thereby improving task performance comparable to or approaching that of finetuning with labeled data.

4955Aioli: A Unified Optimization Framework for Language Model Data Mixing

[openreview] [pdf]

Abstract Language model performance depends on identifying the optimal mixture of data groups to train on (e.g., law, code, math). Prior work has proposed a diverse set of methods to efficiently learn mixture proportions, ranging from fitting regression models over training runs to dynamically updating proportions throughout training. Surprisingly, we find that no existing method consistently outperforms a simple stratified sampling baseline in terms of average test perplexity per group. In this paper, we study the cause of this inconsistency by unifying existing methods into a standard optimization framework. We show that all methods set proportions to minimize total loss, subject to a method-specific mixing law---an assumption on how loss is a function of mixture proportions. We find that existing parameterizations of mixing laws can express the true loss-proportion relationship empirically, but the methods themselves often set the mixing law parameters inaccurately, resulting in poor and inconsistent performance. Finally, we leverage the insights from our framework to derive a new online method named Aioli, which directly estimates the mixing law parameters throughout training and uses them to dynamically adjust proportions. Empirically, Aioli outperforms stratified sampling on 6 out of 6 datasets by an average of 0.28 test perplexity points, whereas existing methods fail to consistently beat stratified sampling, doing up to 6.9 points worse. Moreover, in a practical setting where proportions are learned on shorter runs due to computational constraints, Aioli can dynamically adjust these proportions over the full training run, consistently improving performance over existing methods by up to 12.012 test perplexity points.

4956Real-World Data and Calibrated Simulation Suite for Offline Training of Reinforcement Learning Agents to Optimize Energy and Emission in Buildings for Environmental Sustainability

[openreview] [pdf]

Abstract Commercial office buildings contribute 17 percent of Carbon Emissions in the US, according to the US Energy Information Administration (EIA), and improving their efficiency will reduce their environmental burden and operating cost. A major contributor of energy consumption in these buildings are the Heating, Ventilation, and Air Conditioning (HVAC) devices. HVAC devices form a complex and interconnected thermodynamic system with the building and outside weather conditions, and current setpoint control policies are not fully optimized for minimizing energy use and carbon emission. Given a suitable training environment, a Reinforcement Learning (RL) agent is able to improve upon these policies, but training such a model, especially in a way that scales to thousands of buildings, presents many practical challenges. Most existing work on applying RL to this important task either makes use of proprietary data, or focuses on expensive and proprietary simulations that may not be grounded in the real world. We present the Smart Buildings Control Suite, the first open source interactive HVAC control dataset extracted from live sensor measurements of devices in real office buildings. The dataset consists of two components: six years of real-world historical data from three buildings, for offline RL, and a lightweight interactive simulator for each of these buildings, calibrated using the historical data, for online and model-based RL. For ease of use, our RL environments are all compatible with the OpenAI gym environment standard. We also demonstrate a novel method of calibrating the simulator, as well as baseline results on training an RL agent on the simulator, predicting real-world data, and training an RL agent directly from data. We believe this benchmark will accelerate progress and collaboration on building optimization and environmental sustainability research.

4957On the Transfer of Object-Centric Representation Learning

[openreview] [pdf]

Abstract The goal of object-centric representation learning is to decompose visual scenes into a structured representation that isolates the entities into individual vectors. Recent successes have shown that object-centric representation learning can be scaled to real-world scenes by utilizing features from pre-trained foundation models like DINO. However, so far, these object-centric methods have mostly been applied in-distribution, with models trained and evaluated on the same dataset. This is in contrast to the underlying foundation models, which have been shown to be applicable to a wide range of data and tasks. Thus, in this work, we answer the question of whether current real-world capable object-centric methods exhibit similar levels of transferability by introducing a benchmark comprising seven different synthetic and real-world datasets. We analyze the factors influencing performance under transfer and find that training on diverse real-world images improves generalization to unseen scenarios. Furthermore, inspired by the success of task-specific fine-tuning in foundation models, we introduce a novel fine-tuning strategy to adapt pre-trained vision encoders for the task of object discovery. We find that the proposed approach results in state-of-the-art performance for unsupervised object discovery, exhibiting strong zero-shot transfer to unseen datasets.

4958Learning on One Mode: Addressing Multi-Modality in Offline Reinforcement Learning

[openreview] [pdf]

Abstract Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose Weighted Imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.

4959Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency

[openreview] [pdf]

Abstract With the introduction of video diffusion model, audio-conditioned human video generation has recently achieved significant breakthroughs in both the naturalness of motion and the synthesis of portrait details. Due to the limited control of audio signals in driving human motion, existing methods often add auxiliary spatial signals such as movement regions to stabilize movements, which compromise the naturalness and freedom of motion. To address this issue, we propose an end-to-end audio-only conditioned video diffusion model named Loopy. Specifically, we designed two key modules: an inter- and intra-clip temporal module and an audio-to-latents module. These enable the model to better utilize long-term motion dependencies and establish a stronger audio-portrait movement correlation. Consequently, the model can generate more natural and stable portrait videos with subtle facial expressions, without the need for manually setting movement constraints. Extensive experiments show that Loopy outperforms recent audio-driven portrait diffusion models, delivering more lifelike and high-quality results across various scenarios. Video samples are available athttps://loopyavataranony.github.io/

4960Don’t Say No: Jailbreaking LLM by Suppressing Refusal

[openreview] [pdf]

Abstract Ensuring the safety alignment of Large Language Models (LLMs) is crucial to generating responses consistent with human values. Despite their ability to recognize and avoid harmful queries, LLMs are vulnerable to jailbreaking attacks, where carefully crafted prompts seduce them to produce toxic content. One category of jailbreak attacks is reformulating the task as an optimization by eliciting the LLM to generate affirmative responses. However, such optimization objective has its own limitations, such as the restriction on the predefined objectionable behaviors, leading to suboptimal attack performance. In this study, we first uncover the reason why vanilla target loss is not optimal, then we explore and enhance the loss objective and introduce the DSN\textit{DSN} (Don’t Say No) attack, which achieves successful attack by suppressing refusal. Another challenge in studying jailbreak attacks is the evaluation, as it is difficult to directly and accurately assess the harmfulness of the responses. The existing evaluation such as refusal keyword matching reveals numerous false positive and false negative instances. To overcome this challenge, we propose an Ensemble Evaluation pipeline that novelly incorporates Natural Language Inference (NLI) contradiction assessment and two external LLM evaluators. Extensive experiments demonstrate the potential of the DSN\textit{DSN} and effectiveness of Ensemble Evaluation compared to baseline methods.

4961ParaSolver: A Hierarchical Parallel Integral Solver for Diffusion Models

[openreview] [pdf]

Abstract This paper explores the challenge of accelerating the sequential inference process of Diffusion Probabilistic Models (DPMs). We tackle this critical issue from a dynamic systems perspective, in which the inherent sequential nature is transformed into a parallel sampling process. Specifically, we propose a unified framework that generalizes the sequential sampling process of DPMs as solving a system of banded nonlinear equations. Under this generic framework, we reveal that the Jacobian of the banded nonlinear equations system possesses a unit-diagonal structure, enabling further approximation for acceleration. Moreover, we theoretically propose an effective initialization approach for parallel sampling methods. Finally, we construct ParaSolver, a hierarchical parallel sampling technique that enhances sampling speed without compromising quality. Extensive experiments show that ParaSolver achieves up to 12.1× speedup in terms of wall-clock time. The source code will be publicly available.

4962Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos

[openreview] [pdf]

Abstract Simulation offers a promising approach for cheaply scaling training data for robotic generalist policies. To scalably generate data from diverse and realistic tasks, existing algorithms either rely on large language models (LLMs) that may hallucinate tasks not interesting for robotics; or digital twins, which require careful real-to-sim alignment and are hard to scale. To address these challenges, we introduce Video2Policy, a novel framework that leverages large amounts of internet RGB videos to reconstruct tasks based on everyday human behavior. Our approach comprises two phases: (1) task generation through object mesh reconstruction and 6D position tracking; and (2) reinforcement learning utilizing LLM-generated reward functions and iterative in-context reward reflection for the task. We demonstrate the efficacy of Video2Policy by reconstructing over 60 videos from the Something-Something-v2 (SSv2) dataset, which depicts diverse and complex human behaviors on 9 different tasks. Our method can successfully train RL policies on such tasks, including complex and challenging tasks such as throwing. Furthermore, we show that a generalist policy trained on the collected sim data generalizes effectively to new tasks and outperforms prior approaches. Finally, we show the performance of our policies improves by simply including more internet videos. We believe that the proposed Video2Policy framework is a step towards generalist policies that can execute practical robotic tasks based on everyday human behavior.

4963Cross-Cultural Recipe Transformation via Neural Network and Encoder-Based Models

[openreview] [pdf]

Abstract Every cuisine has a culinary fingerprint characterized by its idiosyncratic ingredient composition. Transforming the culinary signature of a recipe is a creative endeavor. Traditionally, such fusion recipes have arisen from creative human interventions as a product of trial and error. Herein, we present a framework to transform the culinary signature of a recipe from one regional cuisine to another. A clustering-based computational strategy was developed, which replaces the ingredients of a recipe, one at a time, to achieve the transformation of the cuisine. We used a neural network-based Word2Vec-Doc2Vec model and three encoder-based BERT models to capture the context of an ingredient within the culinary landscape. The performance of recipe transformation strategies was evaluated by scoring their success at ‘Recipe Transformation’ and manually assessing the most frequent ingredient replacements for every fusion experiment. We observe that the encoder-based models perform better at transforming recipes with fewer ingredient replacements needed, suggesting that BERT-based models are better at providing more meaningful ingredient replacements to transform the culinary signature of recipes. The percentage of successful recipe transformations in the case of Word2Vec-Doc2Vec, BERT-Mean Pooling, BERT-CLS Pooling, and BERT-SBERT model are 99.95%, 43.1%, 41.65%, and 41.45% respectively, indicating that the neural network-based model can better cluster the cuisine-wise ingredient embeddings. On the other hand, for a successful recipe transformation, the average percentage of ingredients replaced for Word2Vec-Doc2Vec, BERT-Mean Pooling, BERT-CLS Pooling, and BERT-SBERT model are 77%, 52.3%, 51.6% and 51.5%, respectively. Our study shows a way forward for implementing cross-cultural fusion of recipes.

4964CAAP: Context-Aware Action Planning Prompting to Solve Computer Tasks with Front-End UI Only

[openreview] [pdf]

Abstract Software robots have long been used in Robotic Process Automation (RPA) to automate mundane and repetitive computer tasks. With the advent of Large Language Models (LLMs) and their advanced reasoning capabilities, these agents are now able to handle more complex or previously unseen tasks. However, LLM-based automation techniques in recent literature frequently rely on HTML source code for input or application-specific API calls for actions, limiting their applicability to specific environments. We propose an LLM-based agent that mimics human behavior in solving computer tasks. It perceives its environment solely through screenshot images, which are then converted into text for an LLM to process. By leveraging the reasoning capability of the LLM, we eliminate the need for large-scale human demonstration data typically required for model training. The agent only executes keyboard and mouse operations on Graphical User Interface (GUI), removing the need for pre-provided APIs to function. To further enhance the agent’s performance in this setting, we propose a novel prompting strategy called Context-Aware Action Planning (CAAP) prompting, which enables the agent to thoroughly examine the task context from multiple perspectives. Our agent achieves an average success rate of 94.5% on MiniWoB++ and an average task score of 62.3 on WebShop, outperforming all previous studies of agents that rely solely on screen images. This method demonstrates potential for broader applications, particularly for tasks requiring coordination across multiple applications on desktops or smartphones, marking a significant advancement in the field of automation agents. Codes and models are accessible athttps://github.com/caap-agent/caap-agent.

4965In-N-Out: Robustness to In-Domain Noise and Out-of-Domain Generalization

[openreview] [pdf]

Abstract Training on real-world data is challenging due to its complex nature, where data is often noisy and may require understanding diverse domains. Methods focused on Learning with Noisy Labels (LNL) may help with noise, but they often assume no domain shifts. In contrast, approaches for Domain Generalization (DG) could help with domain shifts, but these methods either consider label noise but prioritize out-of-domain (OOD) gains at the cost of in-domain (ID) performance, or they try to balance ID and OOD performance, but do not consider label noise at all. Thus, no work explores the combined challenge of balancing ID and OOD performance in the presence of label noise, limiting their impact. We refer to this challenging task as In-N-Out, and this work provides the first exploration of its unique properties. We find that combining the settings explored in LNL and DG poses new challenges not present in either task alone, and thus, requires direct study. Our findings are based on a study comprised of three real-world datasets and one synthesized noise dataset, where we benchmark a dozen unique methods along with many combinations that are sampled from both the LNL and DG literature. We find that the best method for each setting varies, with older DG and LNL methods often beating the SOTA. A significant challenge we identified stems from unbalanced noise sources and domain-specific sensitivities, which makes using traditional LNL sample selection strategies that often perform well on LNL benchmarks a challenge. While we show this can be mitigated when domain labels are available, we find that LNL and DG regularization methods often perform better.

4966A Deep Generative Learning Approach for Two-stage Adaptive Robust Optimization

[openreview] [pdf]

Abstract Two-stage adaptive robust optimization (ARO) is a powerful approach for planning under uncertainty, balancing first-stage decisions with recourse decisions made after uncertainty is realized. To account for uncertainty, modelers typically define a simple uncertainty set over which potential outcomes are considered. However, classical methods for defining these sets unintentionally capture a wide range of unrealistic outcomes, resulting in overly-conservative and costly planning in anticipation of unlikely contingencies. In this work, we introduce AGRO, a solution algorithm that performs adversarial generation for two-stage adaptive robust optimization using a variational autoencoder. AGRO generates high-dimensional contingencies that are simultaneously adversarial and realistic, improving the robustness of first-stage decisions at a lower planning cost than standard methods. To ensure generated contingencies lie in high-density regions of the uncertainty distribution, AGRO defines a tight uncertainty set as the image of ``latent’’ uncertainty sets under the VAE decoding transformation. Projected gradient ascent is then used to maximize recourse costs over the latent uncertainty sets by leveraging differentiable optimization methods. We demonstrate the cost-efficiency of AGRO by applying it to both a synthetic production-distribution problem and a real-world power system expansion setting. We show that AGRO outperforms the standard column-and-constraint algorithm by up to 1.8% in production-distribution planning and up to 11.6% in power system expansion.

4967ANALOGXPERT: AUTOMATING ANALOG TOPOLOGY SYNTHESIS BY INCORPORATING CIRCUIT DESIGN EXPERTISE INTO LARGE LANGUAGE MODELS

[openreview] [pdf]

Abstract Analog circuits are crucial in modern electronic systems, and automating their design has attracted significant research interest. One of major challenges is topology synthesis, which determines circuit components and their connections. Recent studies explore large language models (LLM) for topology synthesis. However, the scenarios addressed by these studies do not align well with practical applications. Specifically, existing work uses vague design requirements as input and outputs an ideal model, but detailed structural requirements and device-level models are more practical. Moreover, current approaches either formulate topology synthesis as graph generation or Python code generation, whereas practical topology design is a complex process that demands extensive design knowledge. In this work, we propose AnalogXpert, a LLM-based agent aiming at solving practical topology synthesis problem by incorporating circuit design expertise into LLMs. First, we represent analog topology as SPICE code and introduce a subcircuit library to reduce the design space, in the same manner as experienced designers. Second, we decompose the problem into two sub-task (i.e., block selection and block connection) through the use of CoT and in-context learning techniques, to mimic the practical design process. Third, we introduce a proofreading strategy that allows LLMs to incrementally correct the errors in the initial design, akin to human designers who iteratively check and adjust the initial topology design to ensure accuracy. Finally, we construct a high-quality benchmark containing both real data (30) and synthetic data (2k). AnalogXpert achieves 40% and 23% success rates on the synthetic dataset and real dataset respectively, which is markedly better than those of GPT-4o (3% on both the synthetic dataset and the real dataset).

4968A Statistical Framework for Ranking LLM-based Chatbots

[openreview] [pdf]

Abstract Evaluating large language models (LLMs) effectively is essential for advancing their development and ensuring alignment with human preferences. Platforms like Chatbot Arena have made significant strides by gathering millions of votes through crowdsourced pairwise comparisons to rank LLMs, offering valuable data for assessing model performance. However, the statistical methods employed rely on simplistic approaches, such as the Elo rating system, which inadequately handles ties in competitions and overlooks the underlying relationships between competing models. In this paper, we introduce a more rigorous statistical framework that builds upon the data from Chatbot Arena while correcting these methodological shortcomings. We apply well-established statistical models to properly account for ties within an axiomatic framework. Additionally, we introduce a novel factor analysis that captures the complexity of ties across pairs of competitors, significantly improving the overall model performance. These improvements not only enhance the handling of ties but also increase the accuracy of win and loss predictions compared to previous methods. Additionally, we incorporate Thurstonian representations to model covariance structures between competitors, allowing for deeper insights beyond rankings. We also address previously unrecognized symmetries in the likelihood function that can hinder optimization and propose constraints to ensure stable parameter estimation. Finally, we provide a Python package, leaderbot, to facilitate reproducibility. Our experiments demonstrate significant improvements in accuracy for both ties and win-loss outcomes, offering a robust alternative to existing methods.

4969Unmasking Trees for Tabular Data

[openreview] [pdf]

Abstract Despite much work on advanced deep learning and generative modeling techniques for tabular data generation and imputation, traditional methods have continued to win on imputation benchmarks. We herein present UnmaskingTrees, a simple method for tabular imputation (and generation) employing gradient-boosted decision trees which are used to incrementally unmask individual features. This approach offers state-of-the-art performance on imputation, and on generation given training data with missingness; and it has competitive performance on vanilla generation. To solve the conditional generation subproblem, we propose a tabular probabilistic prediction method, BaltoBot, which fits a balanced tree of boosted tree classifiers. Unlike older methods, it requires no parametric assumption on the conditional distribution, accommodating features with multimodal distributions; unlike newer diffusion methods, it offers fast sampling, closed-form density estimation, and flexible handling of discrete variables. We finally consider our two approaches as meta-algorithms, demonstrating in-context learning-based generative modeling with TabPFN.

4970MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data

[openreview] [pdf]

Abstract Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed ``MathOdyssey’’ dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and evaluation code are publicly available.

4971The Ability of Large Language Models to Evaluate Constraint-satisfaction in Agent Responses to Open-ended Requests

[openreview] [pdf]

Abstract Generative AI agents are often expected to respond to complex user requests that have No One Right Answer (NORA), e.g., “design a vegetarian meal plan below 1800 calories”. Such requests may entail a set of constraints that the agent should adhere to. To successfully develop agents for NORA scenarios, an accurate automatic evaluation framework is essential, and specifically - one capable of validating the satisfaction of constraints in the agent’s response. Recently, large language models (LLMs) have been adopted as versatile evaluators for many NORA tasks, but their ability to evaluate constraint-satisfaction in generated text remains unclear. To study this, we develop and release a novel Arithmetic Constraint-Satisfaction (ACS) benchmarking dataset. The dataset consists of complex user requests with corresponding constraints, agent responses and human labels indicating each constraint’s satisfaction level in the response. A unique property of this dataset is that validating many of its constraints requires reviewing the response as a whole (in contrast to many other benchmarks that require the validation of a single independent item). Moreover, it assesses LLMs in performing reasoning, in-context data extraction, arithmetic calculations, and counting. We then benchmark both open and proprietary LLMs on evaluating constraint-satisfaction, and show that most models still have a significant headroom for improvement, and that errors primarily stem from reasoning issues. In addition, most models exhibit a skewed constraint-satisfaction prediction pattern, with higher accuracy where the ground-truth label is “satisfied”. Lastly, few-shot prompting for our task proved to be rather challenging, since many of the studied models showed a degradation in performance when it was introduced.

4972Learning Adaptive Lighting via Channel-Aware Guidance

[openreview] [pdf]

Abstract Learning lighting adaption is a key step in obtaining a good visual perception and supporting downstream vision tasks. There are multiple light-related tasks (e.g., image retouching and exposure correction) and previous studies have mainly investigated these tasks individually. However, we observe that the light-related tasks share fundamental properties: i) different color channels have different light properties, and ii) the channel differences reflected in the time and frequency domains are different. Based on the common light property guidance, we propose a Learning Adaptive Lighting Network (LALNet), a unified framework capable of processing different light-related tasks. Specifically, we introduce the color-separated features that emphasize the light difference of different color channels and combine them with the traditional color-mixed features by Light Guided Attention (LGA). The LGA utilizes color-separated features to guide color-mixed features focusing on channel differences and ensuring visual consistency across channels. We introduce dual domain channel modulation to generate color-separated features and a wavelet followed by a vision state space module to generate color-mixed features. Extensive experiments on four representative light-related tasks demonstrate that LALNet significantly outperforms state-of-the-art methods on benchmark tests and requires fewer computational resources.

4973FacLens: Transferable Probe for Foreseeing Non-Factuality in Large Language Models

[openreview] [pdf]

Abstract Despite advancements in large language models (LLMs), non-factual responses remain prevalent. Unlike extensive studies on post-hoc detection of such responses, this work studies non-factuality prediction (NFP), aiming to predict whether an LLM will generate a non-factual response to a question before the generation process. Previous efforts on NFP have demonstrated LLMs’ awareness of their internal knowledge, but they still face challenges in efficiency and transferability. In this work, we propose a lightweight NFP model named Factuality Lens (FacLens), which effectively probes hidden representations of questions for the NFP task. Besides, we discover that hidden question representations sourced from different LLMs exhibit similar NFP patterns, which enables the transferability of FacLens across LLMs to reduce development costs. Extensive experiments highlight FacLens’s superiority in both effectiveness and efficiency.

4974On the (un) interpretability of Ensembles: A Computational Analysis

[openreview] [pdf]

Abstract Despite the widespread adoption of ensemble models, it is widely acknowledged within the ML community that they offer limited interpretability. For instance, while a single decision tree is considered interpretable, ensembles of decision trees (e.g., boosted-trees) are usually regarded as black-boxes. Although this reduced interpretability is widely acknowledged, the topic has received only limited attention from a theoretical and mathematical viewpoint. In this work, we provide an elaborate analysis of the interpretability of ensemble models through the lens ofcomputational complexitytheory. In a nutshell, we explore different forms of explanations, and analyze whether obtaining explanations for ensembles is strictly computationally less tractable than for their constituent base models. We show that this is indeed the case for ensembles that consist of interpretable models, such as decision trees or linear models; but this is not the case for ensembles consisting of more complex models, such as neural networks. Next, we perform a fine-grained analysis using parameterized complexity to measure the impact of different problem parameters on an ensemble’s interpretability. Our findings reveal that even if we shrink thesizeof all base models in an ensemble substantially, the ensemble as a whole remains intractable to interpret. However, an analysis of thenumberof base models yields a surprising dynamic --- while ensembles consisting of a limited number of decision trees can be interpreted efficiently, ensembles that consist of a small (evenconstant) number of linear models are computationally intractable to interpret.

4975Realistic Evaluation of Model Merging for Compositional Generalization

[openreview] [pdf]

Abstract Merging has become a widespread way to cheaply combine individual models into a single model that inherits their capabilities and attains better performance. This popularity has spurred rapid development of many new merging methods, which are typically validated in disparate experimental settings and frequently differ in the assumptions made about model architecture, data availability, and computational budget. In this work, we characterize the relative merits of different merging methods by evaluating them in a shared experimental setting and precisely identifying the practical requirements of each method. Specifically, our setting focuses on using merging for compositional generalization\textit{compositional generalization} of capabilities in image classification, image generation, and natural language processing. Additionally, we measure the computational costs of different merging methods as well as how they perform when scaling the number of models being merged. Taken together, our results clarify the state of the field of model merging and provide a comprehensive and rigorous experimental setup to test new methods.

4976Mitigating Time Discretization Challenges with WeatherODE: A Sandwich Physics-Driven Neural ODE for Weather Forecasting

[openreview] [pdf]

Abstract In the field of weather forecasting, traditional models often grapple with discretization errors and time-dependent source discrepancies, which limit their predictive performance. In this paper, we present WeatherODE, a novel one-stage, physics-driven ordinary differential equation (ODE) model designed to enhance weather forecasting accuracy. By leveraging wave equation theory and integrating a time-dependent source model, WeatherODE effectively addresses the challenges associated with time-discretization error and dynamic atmospheric processes. Moreover, we design a CNN-ViT-CNN sandwich structure, facilitating efficient learning dynamics tailored for distinct yet interrelated tasks with varying optimization biases in advection equation estimation. Through rigorous experiments, WeatherODE demonstrates superior performance in both global and regional weather forecasting tasks, outperforming recent state-of-the-art approaches by significant margins of over 40.0% and 31.8% in root mean square error (RMSE), respectively. The source code is available athttps://anonymous.4open.science/r/WeatherODE-5C13/.

4977Model Tells You Where to Merge: Adaptive KV Cache Merging for LLMs on Long-Context Tasks

[openreview] [pdf]

Abstract How to efficiently serve Large Language Models (LLMs) has become a pressing issue because of their huge computational cost in their autoregressive generation process. To mitigate computational costs, LLMs often employ the KV Cache technique to improve the generation speed. While improving the computational efficiency, the storage requirements of the KV cache are substantial, particularly in long-context scenarios, leading to significant memory consumption. Existing KV cache eviction methods often degrade the performance of LLMs in long-context scenarios due to the information loss introduced by eviction. In this paper, we propose a novel KV cache merging approach, called KVMerger, to achieve adaptive KV cache compression for long-context tasks without significant performance degradation under constrained memory budgets. Our approach is inspired by the intriguing observation that key states exhibit high similarity at the token level within a single sequence. To facilitate merging, we develop an effective yet straightforward merging set identification algorithm to identify suitable KV states for merging. Our merging set identification algorithm stimulates the second observation that KV cache sparsity, from similarity perspective, is independent of the dataset and remains persistent at the model level. Subsequently, we propose a Gaussian kernel weighted merging algorithm to selectively merge all states within each merging set. We conduct extensive experiments to demonstrate the effectiveness of KVMerger for long-context tasks under constrained memory budgets, applying it to models including Llama2-7B/13B-chat and Mistral-7B-instruct across various tasks. We also compare our method with other KV cache compression algorithms, including H2O and CaM, showing that our method achieves superior performance across tasks with different KV cache budgets.

4978Provably Safeguarding a Classifier from OOD and Adversarial Samples: an Extreme Value Theory Approach

[openreview] [pdf]

Abstract This paper introduces a novel method, Sample-efficient Probabilistic Detection using Extreme Value Theory (SPADE), which transforms a classifier into an abstaining classifier, offering provable protection against out-of-distribution and adversarial samples. The approach is based on a Generalized Extreme Value (GEV) model of the training distribution in the classifier’s latent space, enabling the formal characterization of OOD samples. Interestingly, under mild assumptions, the GEV model also allows for a formal characterization of adversarial samples. The abstaining classifier, which rejects samples based on their assessment by the GEV model, provably avoids OOD and adversarial samples. The empirical validation of the approach, conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and tested on medium and large-sized datasets (CIFAR-10, CIFAR-100, and ImageNet), demonstrates its frugality, stability, and efficiency compared to the state of the art.

4979Agree to Disagree: Demystifying Homogeneous Deep Ensembles through Distributional Equivalence

[openreview] [pdf]

Abstract Deep ensembles improve the performance of the models by taking the average predictions of a group of ensemble members. However, the origin of these capabilities remains a mystery and deep ensembles are used as a reliable “black box” to improve the performance. Existing studies typically attribute such improvement to Jensen gaps of the deep ensemble method, where the loss of the mean does not exceed the mean of the loss for any convex loss metric. In this work, we demonstrate that Jensen’s inequality is not responsible for the effectiveness of deep ensembles, and convexity is not a necessary condition. Instead, Jensen Gap focuses on the “average loss” of individual models, which provides no practical meaning. Thus it fails to explain the core phenomena of deep ensembles such as the superiority to any single ensemble member, the decreasing loss with the number of ensemble members, etc. Regarding this mystery, we provide theoretical analysis and comprehensive empirical results from a statistical perspective that reveal the true mechanism of deep ensembles. Our results highlight that deep ensembles originate from the homogeneous output distribution across all ensemble members. Specifically, the predictions of homogeneous models (Abe et al., 2022b) have the distributional equivalence property – Although the predictions of independent ensemble members are point-wise different, they form an identical distribution. Such agreement and disagreement contribute to deep ensembles’ “magical power”. Based on this discovery, we provide rigorous proof of the effectiveness of deep ensembles and analytically quantify the extent to which ensembles improve performance. The derivations not only theoretically quantify the effectiveness of deep ensembles for the first time, but also enable estimation schemes that foresee the performance of ensembles with different capacities. Furthermore, different from existing studies, our results also point out that deep ensembles work in a different mechanism from model scaling a single model, even though significant correlations between them have been observed.

4980Different Rates for Different Weights: Decoupled Relative Learning Rate Schedules

[openreview] [pdf]

Abstract In this work, we introduce a novel approach for optimizing neural network training by adjusting learning rates across weights of different components in Transformer models. Traditional methods often apply a uniform learning rate across all network layers, potentially overlooking the unique dynamics of each part. Remarkably, our introduced Relative Learning Rate Schedules (RLRS) method accelerates the training process by 13.6%, particularly in complex models such as the Mixture of Experts (MoE). Hyperparameters of RLRS can be efficiently tuned on smaller models and then extrapolated to 27x larger ones. This simple and effective method results in a substantial reduction in training time and computational resources, offering a practical and scalable solution for optimizing large-scale neural networks.

4981A Realistic Threat Model for Large Language Model Jailbreaks

[openreview] [pdf]

Abstract A plethora of jailbreaking attacks have been proposed to obtain harmful responses from safety-tuned LLMs. In their original settings, these methods all largely succeed in coercing the target output, but their attacks vary substantially in fluency and computational effort. In this work, we propose a unified threat model for the principled comparison of these methods. Our threat model combines constraints in perplexity, measuring how far a jailbreak deviates from natural text, and computational budget, in total FLOPs. For the former, we build an N-gram model on 1T tokens, which, in contrast to model-based perplexity, allows for an LLM-agnostic and inherently interpretable evaluation. We adapt popular attacks to this new, realistic threat model, with which we, for the first time, benchmark these attacks on equal footing. After a rigorous comparison, we not only find attack success rates against safety-tuned modern models to be lower than previously presented, but also find that attacks based on discrete optimization significantly outperform recent LLM-based attacks. Further, our threat model is interpretable, thus it allows for a comprehensive analysis and comparison of jailbreak attacks. We find that effective attacks exploit and abuse infrequent N-grams, either selecting N-grams absent from real-world text or rare ones, e.g. specific to code datasets.

4982Multi-Agent Causal Discovery Using Large Language Models

[openreview] [pdf]

Abstract Large Language Models (LLMs) have demonstrated significant potential in causal discovery tasks by utilizing their vast expert knowledge from extensive text corpora. However, the multi-agent capabilities of LLMs in causal discovery remain underexplored. This paper introduces a general framework to investigate this potential. The first is the Meta Agents Model, which relies exclusively on reasoning and discussions among LLM agents to conduct causal discovery. The second is the Coding Agents Model, which leverages the agents’ ability to plan, write, and execute code, utilizing advanced statistical libraries for causal discovery. The third is the Hybrid Model, which integrates both the Meta Agents Model and Coding Agents Model approaches, combining the statistical analysis and reasoning skills of multiple agents. Our proposed framework shows promising results by effectively utilizing LLMs’ expert knowledge, reasoning capabilities, multi-agent cooperation, and statistical causal methods. By exploring the multi-agent potential of LLMs, we aim to establish a foundation for further research in utilizing LLMs multi-agent for solving causal-related problems.

4983Provable Length Generalization in Sequence Prediction via Spectral Filtering

[openreview] [pdf]

Abstract We consider the problem of length generalization in sequence prediction. We define a new metric of performance in this setting -- the Unfair-Regret -- which measures regret against a benchmark predictor with longer context length than available to the learner. We continue by studying this concept from the lens of the spectral filtering algorithm. We give a gradient-based learning algorithm that provably length generalizes for linear dynamical systems. We conclude with proof-of-concept experiments demonstrating the validity of our theory.

4984Using Stochastic Gradient Descent to Smooth Nonconvex Functions: Analysis of Implicit Graduated Optimization

[openreview] [pdf]

Abstract The graduated optimization approach is a heuristic method for finding global optimal solutions for nonconvex functions by using a function smoothing operation with stochastic noise. We show that stochastic noise in stochastic gradient descent (SGD) has the effect of smoothing the objective function, the degree of which is determined by the learning rate, batch size, and variance of the stochastic gradient. Using this finding, we propose and analyze a new graduated optimization algorithm that varies the degree of smoothing by varying the learning rate and batch size, and provide experimental results on image classification tasks with ResNets that support our theoretical findings. We further show that there is an interesting correlation between the degree of smoothing by SGD’s stochastic noise, the well-studied ``sharpness’’ indicator, and the generalization performance of the model.

4985Do We Need Domain-Specific Embedding Models? An Empirical Investigation

[openreview] [pdf]

Abstract Embedding models play a crucial role in representing and retrieving information across various NLP applications. Recent advancements in Large Language Models (LLMs) have further enhanced the performance of embedding models, which are trained on massive amounts of text covering almost every domain. These models are often benchmarked on general-purpose datasets like Massive Text Embedding Benchmark (MTEB), where they demonstrate superior performance. However, a critical question arises: Is the development of domain-specific embedding models necessary when general-purpose models are trained on vast corpora that already include specialized domain texts? In this paper, we empirically investigate this question, choosing the finance domain as an example. We introduce the Finance Massive Text Embedding Benchmark (FinMTEB), a counterpart to MTEB that consists of financial domain-specific text datasets. We evaluate the performance of seven state-of-the-art embedding models on FinMTEB and observe a significant performance drop compared to their performance on MTEB. To account for the possibility that this drop is driven by FinMTEB’s higher complexity, we propose four measures to quantify dataset complexity and control for this factor in our analysis. Our analysis provides compelling evidence that state-of-the-art embedding models struggle to capture domain-specific linguistic and semantic patterns. Moreover, we find that the performance of general-purpose embedding models on MTEB is not correlated with their performance on FinMTEB, indicating the need for domain-specific embedding benchmarks for domain-specific embedding models. This study sheds light on developing domain-specific embedding models in the LLM era.

4986PersonaMath: Enhancing Math Reasoning through Persona-Driven Data Augmentation

[openreview] [pdf]

Abstract While closed-source Large Language Models (LLMs) demonstrate strong mathematical problem-solving abilities, open-source models continue to struggle with such tasks. To bridge this gap, we propose a data augmentation approach and introduce PersonaMathQA, a dataset derived from MATH and GSM8K, on which we train the PersonaMath models. Our approach consists of two stages: the first stage is learning from Persona Diversification, and the second stage is learning from Reflection. In the first stage, we regenerate detailed chain-of-thought (CoT) solutions as instructions using a closed-source LLM and introduce a novel persona-driven data augmentation technique to enhance the dataset’s quantity and diversity. In the second stage, we incorporate reflection to fully leverage more challenging and valuable questions. Evaluation of our PersonaMath models on MATH and GSM8K reveals that the PersonaMath-7B model (based on LLaMA-2-7B) achieves an accuracy of 24.2% on MATH and 68.7% on GSM8K, surpassing all baseline methods and achieving state-of-the-art performance. Notably, our dataset contains only 70.3K data points—merely 17.8% of MetaMathQA and 27% of MathInstruct—yet our model outperforms these baselines, demonstrating the high quality and diversity of our dataset, which enables more efficient model training. We open-source the PersonaMathQA dataset, PersonaMath models, and our code for public usage.

4987Towards better generalization: Weight Decay induces low-rank bias for neural networks

[openreview] [pdf]

Abstract We study the implicit bias towards low-rank weight matrices when training neural networks (NN) with Weight Decay (WD). We prove that when a ReLU NN is sufficiently trained with Stochastic Gradient Descent (SGD) and WD, its weight matrix is approximately a rank-two matrix. Empirically, we demonstrate that WD is a necessary condition for inducing this low-rank bias across both regression and classification tasks. Our work differs from previous studies as our theoretical analysis does not rely on common assumptions regarding the training data distribution, optimality of weight matrices, or specific training procedures. Furthermore, by leveraging the low-rank bias, we derive improved generalization error bounds and provide numerical evidence showing that better generalization can be achieved. Thus, our work offers both theoretical and empirical insights into the strong generalization performance of SGD when combined with WD.

4988Augmented Flow Matching via Variance Reduction with Auxiliary Variables

[openreview] [pdf]

Abstract Flow matching is a simulation-free approach that scalably generates an ODE, in which its path traverses between two different distributions. However, conventional flow matching relies on the training pairs drawn independently, inducing high variance that might slow down training process and degrade the performance upon training. To mitigate this, we propose augmented flow matching, a simple yet efficient framework that can be ubiquitously applied to flow matching with slight modification to the models. We first find that when some auxiliary variables that are correlated to the training data, then they contribute on variance reduction of the flow matching loss estimation, when used together with the training data pair. With this observation, we construct auxiliary variables that are correlated to the training pair, which is obtained by simple and effective linear operation from the input data. Finally, we show that with this simple modification on the training phase, we achieve the improved model flexibility and performance when the ODE is applied on the learned model.

4989Multlingual Abstractive Event Extraction for the Real World

[openreview] [pdf]

Abstract Event extraction (EE) is a valuable tool for making sense of large amounts of unstructured data, with a wide range of real-world applications, from studying disease outbreaks to monitoring political violence. Current EE systems rely on cumbersome mention-level annotations, and event arguments are frequently restricted to ungrounded spans of text, which hinders the aggregation and analysis of extracted events. In this paper, we define a new abstractive event extraction (AEE) task that moves away from the surface form and instead requires a deeper wholistic understanding of the input text. To support research in this direction, we release a new multilingual, expert-annotated event dataset called Lemonade, which covers 16 languages, including several for which no event dataset currently exists. Lemonade has 41,148 events, and is based on the Armed Conflict Location and Event Data Project, which has been collecting and coding data on political violence around the globe for over a decade. We introduce a novel zero-shot AEE system Zest that achieves a score of 57.2% F1 on Lemonade. With our supervised model that achieves 71.6% F1, they represent strong baselines for this new dataset.

4990TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

[openreview] [pdf]

Abstract In the domain of complex reasoning tasks, such as mathematical reasoning, recent advancements have proposed the use of Direct Preference Optimization (DPO) to suppress output of dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employed LLMs to generate preference trees via Tree-of-thoughts (ToT) and sample the paired preference responses required by the DPO algorithm. However, the DPO algorithm based on binary preference optimization is unable to learn multiple responses with varying degrees of preference/dispreference that provided by the preference trees, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), that does not sample paired preference responses from the preference tree; instead, it directly learns from the entire preference tree during the fine-tuning. Specifically, TPO formulates the language model alignment as a Preference List Ranking problem, where the policy can potentially learn more effectively from a ranked preference list of responses given the prompt. In addition, to further assist LLMs in identifying discriminative steps within long-chain reasoning and increase the relative reward margin in the preference list, TPO utilizes Adaptive Step Reward to adjust the reward values of each step in trajectory for performing fine-grained preference optimization. We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The experimental results indicate that TPO consistently outperforms DPO across three public large language models on four datasets. The code is available onhttps://anonymous.4open.science/r/TPO.

4991Recurrent Drafter for Fast Speculative Decoding in Large Language Models

[openreview] [pdf]

Abstract We present Recurrent Drafter (ReDrafter), an advanced speculative decoding approach that achieves state-of-the-art speedup for large language models (LLMs) inference. The performance gains are driven by three key aspects: (1) leveraging a recurrent neural network (RNN) as the draft model conditioning on LLM’s hidden states, (2) applying a dynamic tree attention algorithm over beam search results to eliminate duplicated prefixes in candidate sequences, and (3) training through knowledge distillation from the LLM. ReDrafter accelerates Vicuna inference in MT-Bench by up to 3.5x with a PyTorch implementation on Nvidia H100 GPUs. To demonstrate its practicality in production environments, we integrate ReDrafter into TensorRT-LLM, reaching up to 2.5x speedup on H100 GPUs. We also validated its effectiveness for on-device applications by implementing the approach in MLX and benchmarking performance on Metal GPUs in Apple Silicon chips, achieving up to 2.3x speedup.

4992SiDyP: Simplex Diffusion with Dynamic Prior for Denoising Llama-Generated Labels

[openreview] [pdf]

Abstract The traditional process of creating labeled datasets is not only labor-intensive but also expensive. Recent breakthroughs in open-source large language models (LLMs), such as Llama-3, have opened a new avenue in generating labeled datasets automatically for various natural language processing (NLP) tasks to provide an alternative to such expensive annotation process. However, the reliability of such auto-generated labels remains a significant concern due to inherent inaccuracies. When learning from such noisy labels, the model’s generalization is likely to be harmed as it is prone to overfit those label noises. In this paper, we propose the \textbf{Si}mplex Diffusion with a \textbf{Dy}namic \textbf{P}rior (\textbf{SiDyP}) model to calibrate incorrect labels, thus enhancing classifier robustness to noisy labels. While diffusion models have largely been overshadowed by transformer architectures for NLP, our work shows that combining diffusion with transformers can further improve text-based tasks. Our framework leverages simplex diffusion model to iteratively correct noisy labels conditioned on training dynamic trajectories. The potential true label candidates, obtained by neighborhood label distribution in embedding space, progressively based on the feedback of the diffusion model. Our SiDyP model can increase the performance of the BERT classifier fine-tuned on both zero-shot and few-shot Llama-3 generated noisy label datasets by an average of 5.33% and 7.69% respectively. Our extensive experiments, which explore different LLMs, diverse noise types (real-world and synthetic), ablation studies, and multiple baselines, demonstrate the effectiveness of SiDyP across a range of NLP tasks. We will make code and data publicly (under a CC BY 4.0 license) available on GitHub upon publication of the work.

4993Energy-Based Diffusion Language Models for Text Generation

[openreview] [pdf]

Abstract Despite remarkable progress in autoregressive language models, alternative generative paradigms beyond left-to-right generation are still being actively explored. Discrete diffusion models, with the capacity for parallel generation, have recently emerged as a promising alternative. Unfortunately, these models still underperform the autoregressive counterparts, with the performance gap increasing when reducing the number of sampling steps. Our analysis reveals that this degradation is a consequence of an imperfect approximation used by diffusion models. In this work, we propose Energy-based Diffusion Language Model (EDLM), an energy-based model operating at the full sequence level for each diffusion step, introduced to improve the underlying approximation used by diffusion models. More specifically, we introduce an EBM in a residual form, and show that its parameters can be obtained by leveraging a pretrained autoregressive model or by finetuning a bidirectional transformer via noise contrastive estimation. We also propose an efficient generation algorithm via parallel important sampling. Comprehensive experiments on language modeling benchmarks show that our model can consistently outperform state-of-the-art diffusion models by a significant margin, and approaches autoregressive models’ perplexity. We further show that, without any generation performance drop, our framework offers a 1.3x sampling speedup over existing diffusion models.

4994Improving Human Pose-Conditioned Generation: Fine-tuning ControlNet Models with Reinforcement Learning

[openreview] [pdf]

Abstract Advancements in diffusion-based text-to-image generation models have made it possible to create high-quality human images. However, generating humans in desired poses using text prompts alone remains challenging. Image-to-image generation methods utilizing additional image conditions can address this issue; however, they often struggle with generating images that accurately match conditioning images. This paper proposes a new fine-tuning framework for training ControlNet models with reinforcement learning by combining ControlNet and Denoising Diffusion Policy Optimization~(DDPO) to understand pose conditioning images better. We apply a novel reward function in the proposed framework for higher pose accuracy. We demonstrate that our method effectively improves human generation by enhancing pose accuracy and the correct generation of body parts without omissions or additions. In addition, we demonstrate that the effectiveness of using a more detailed pose dataset along with our proposed reward function that directly leverages keypoints, leads to improved training results.

4995Generalization of FedAvg Under Constrained Polyak-Lojasiewicz Type Conditions: A Single Hidden Layer Neural Network Analysis

[openreview] [pdf]

Abstract In this work, we study the optimization and the generalization performance of the widely used FedAvg algorithm for solving Federated Learning (FL) problems. We analyze the generalization performance of FedAvg by handling the optimization error and the Rademacher complexity. Towards handling optimization error, we propose novel constrained Polyak-Lojasiewicz (PL)-type conditions on the objective function that ensure the existence of a global optimal to which FedAvg converges linearly after O(log(1/ϵ))\mathcal{O}( \log ({1}/{\epsilon})) rounds of communication, where ϵ\epsilon is the desired optimality gap. Importantly, we demonstrate that a class of single hidden layer neural networks satisfies the proposed constrained PL-type conditions required to establish the linear convergence of FedAvg as long as m>nK/dm > {nK}/{d}, where mm is the width of the neural network, KK is the number of clients, nn is the number of samples at each client, and dd is the feature dimension. We then bound the Rademacher complexity for this class of neural networks and establish that both Rademacher complexity and the generalization error of FedAvg decrease at an optimal rate of O(1/n)\mathcal{O}({1}/{\sqrt{n}}). We further show that increasing the number of clients KK decreases the generalization error at the rate of O(1/n+1/nK)\mathcal{O}({1}/{\sqrt{n}} + {1}/{\sqrt{nK}}).

4996Risk-Sensitive Variational Actor-Critic: A Model-Based Approach

[openreview] [pdf]

Abstract Risk-sensitive reinforcement learning (RL) with an entropic risk measure typically requires knowledge of the transition kernel or performs unstable updates w.r.t. exponential Bellman equations. As a consequence, algorithms that optimize this objective have been restricted to tabular or low-dimensional continuous environments. In this work we leverage the connection between the entropic risk measure and the RL-as-inference framework to develop a risk-sensitive variational actor-critic algorithm (rsVAC). Our work extends the variational framework to incorporate stochastic rewards and proposes a variational model-based actor-critic approach that modulates policy risk via a risk parameter. We consider, both, the risk-seeking and risk-averse regimes and present rsVAC learning variants for each setting. Our experiments demonstrate that this approach produces risk-sensitive policies and yields improvements in both tabular and risk-aware variants of complex continuous control tasks in MuJoCo.

4997Enhancement of In-Context Reasoning in LLMs through Inductive Rule Learning

[openreview] [pdf]

Abstract Currently, Large language models (LLMs) have achieved remarkable performance across various language tasks, largely due to their training on extensive datasets and their considerable model size. These models exhibit in-context learning abilities, which is to learn through few-shot learning. However, the underlying reasoning process remains ambiguous, it is unclear whether the model simply retrieves relevant information and instructions from its training data to generate similar responses, or whether it generalizes examples to form overarching rules, which are then applied to produce accurate answers. Another method for improving few-shot learning is Chain-of-Thought prompting that complement steps by steps instruction for LLMs, so they can follow this instruction to solve many reasoning tasks. Several approaches for evaluating the reasoning abilities of LLMs typically involve task-solving through code generation, which enables models to formalize problems and leverage a code compiler to solve them precisely. However, these methods are constrained to specific task types and are insufficient for a comprehensive assessment of the model’s broader reasoning capabilities. Therefore, this paper proposes a method to enhance in-context learning capabilities through two main stages: generating general rules from the provided examples and utilizing LLMs to verify these general rules, thereby aiming to improve reliability and accuracy. At the same time, this approach seeks to investigate the inductive and deductive reasoning abilities, and can improve our understanding of the model’s reasoning by generating and applying general rules to provide transparent, clearly explained responses. The proposed method demonstrates competitive performance on the 1D-ARC benchmark and several traditional language tasks, suggesting its potential for more robust evaluation of LLM reasoning abilities.

4998EC-Diffuser: Multi-Object Manipulation via Entity-Centric Behavior Generation

[openreview] [pdf]

Abstract Object manipulation is a common component of everyday tasks, but learning to manipulate objects from high-dimensional observations presents significant challenges. These challenges are heightened in multi-object environments due to the combinatorial complexity of the state space as well as of the desired behaviors. While recent approaches have utilized large-scale offline data to train models from pixel observations, achieving performance gains through scaling, these methods struggle with compositional generalization in unseen object configurations with constrained network and dataset sizes. To address these issues, we propose a novel behavioral cloning (BC) approach that leverages object-centric representations and an entity-centric Transformer with diffusion-based optimization, enabling efficient learning from offline image data. Our method first decomposes observations into Deep Latent Particles (DLP), which are then processed by our entity-centric Transformer that computes attention at the particle level, simultaneously predicting object dynamics and the agent’s actions. Combined with the ability of diffusion models to capture multi-modal behavior distributions, this results in substantial performance improvements in multi-object tasks and, more importantly, enables compositional generalization. We present BC agents capable of zero-shot generalization to perform tasks with novel compositions of objects and goals, including larger numbers of objects than seen during training. We provide video rollouts on our webpage:https://sites.google.com/view/ec-diffuser.

4999LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K

[openreview] [pdf]

Abstract State-of-the-art large language models (LLMs) are now claiming remarkable supported context lengths of 256k or even more. In contrast, the average context lengths of mainstream benchmarks are insufficient (5k-21k), and they suffer from potential knowledge leakage and inaccurate metrics, resulting in biased evaluation. This paper introduces LV-Eval, a challenging long-context benchmark with five length levels (16k, 32k, 64k, 128k, and 256k) reaching up to 256k words. LV-Eval features two main tasks, single-hop QA and multi-hop QA, comprising 11 bilingual datasets. The design of LV-Eval has incorporated three key techniques, namely confusing facts insertion, keyword and phrase replacement, and keyword-recall-based metric design. The advantages of LV-Eval include controllable evaluation across different context lengths, challenging test instances with confusing facts, mitigated knowledge leakage, and more objective evaluations. We evaluate 15 LLMs on LV-Eval and conduct ablation studies on the benchmarking techniques. The results reveal that: (i) Moonshot-v1 and recent large-scale open-source models, such as Qwen-2.5-72B and Llama-3.1-70B, achieve the highest performance on LV-Eval, particularly at lengths below 64k64k. (ii) Models exhibit distinct score trends. For example, GLM-4-9B-128k, Yi-6B-200k, and Llama3-8B-1M exhibit a relatively gentle degradation of performance, but their absolute performances may not necessarily be higher than those of LLMs with shorter context lengths. (iii) LLMs’ performances can significantly degrade in the presence of confusing information, especially in the pressure test of “needle in a haystack”. (iv) Issues related to knowledge leakage and inaccurate metrics introduce bias in evaluation, and these concerns are alleviated in LV-Eval.

5000Beyond accuracy: understanding the performance of LLMs on exams designed for humans

[openreview] [pdf]

Abstract Many recent studies of LLM performance have focused on the ability of LLMs to achieve outcomes comparable to humans on academic and professional exams. However, it is not clear whether such studies shed light on the extent to which models show reasoning ability, and there is controversy about the significance and implications of such results. We seek to look more deeply into the question of how and whether the performance of LLMs on exams designed for humans reflects true aptitude inherent in LLMs. We do so by making use of the tools of psychometrics which are designed to perform meaningful measurement in test taking. We leverage a unique dataset that captures the detailed performance of over 5M students across 8 college-entrance exams given over a span of two years in Brazil. With respect to the evaluation of LLM abilities, we show that the tools of Item Response Theory (IRT) provide a more informative evaluation of model performance than the usual accuracy metrics employed in previous studies. Digging deeper, we show that the modeling framework of IRT, by explicitly modeling the difficulty levels of questions, allows us to quantitatively distinguish between LLMs that answer questions in “human-like” patterns versus LLMs that do not. We also show how to quantitatively identify cases in which exam results are not reliable measurements of an LLM’s ability. Using the tools of IRT we can also identify specific questions that appear to be either much easier, or much harder, for machines than for humans, and we give some reasons for those differences. Overall, our study shows that the conventional focus on accuracy as the primary performance metric for LLM studies does not allow us to deeply understand the true capabilities of LLMs and compare them to that of humans. Thus, we claim that psychometric modeling should play a larger role in the evaluation of LLM capabilities on exams designed for humans.

5001Lens: Rethinking Multilingual Enhancement for Large Language Models

[openreview] [pdf]

Abstract Despite the growing global demand for large language models (LLMs) that serve users from diverse linguistic backgrounds, most cutting-edge LLMs remain predominantly English-centric. This creates a performance gap across languages, restricting access to advanced AI services for non-English speakers. Current methods to enhance multilingual capabilities largely rely on data-driven post-training techniques, such as multilingual instruction tuning or continual pre-training. However, these approaches encounter significant challenges, including the scarcity of high-quality multilingual datasets and the limited enhancement of multilingual capabilities. They often suffer from off-target issues and catastrophic forgetting of central language abilities. To this end, we propose \textsc{Lens}, a novel approach to enhance multilingual capabilities of LLMs by leveraging their internal language representation spaces. Specially, \textsc{Lens} operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces from top layers of LLMs. Using the central language as a pivot, the target language is drawn closer to it within the language-agnostic subspace, allowing it to inherit well-established semantic representations. Meanwhile, in the language-specific subspace, the representations of the target and central languages are pushed apart, enabling the target language to express itself distinctly. Extensive experiments on one English-centric and two multilingual LLMs demonstrate that \textsc{Lens} effectively improves multilingual performance without sacrificing the model’s original central language capabilities, achieving superior results with much fewer computational resources compared to existing post-training approaches.